How to Structure Your Website for AI Crawlers

AI crawlers such as GPTBot, ClaudeBot, and PerplexityBot read HTML source, not rendered JavaScript. Sites that rely on client-side rendering are largely invisible to AI search. Static HTML, a clean heading hierarchy, answer capsules, and a correctly configured robots.txt are the foundations of an AI-visible website.

The fundamental problem: JavaScript rendering

Most AI crawlers do not execute JavaScript. They read raw HTML source code. This creates a massive visibility gap:

Rendering approach	AI crawler visibility	Common platforms
Static HTML / SSG	Full visibility	Astro, Hugo, Eleventy, Jekyll
Server-side rendered (SSR)	Full visibility	Next.js (SSR mode), Nuxt, Astro
Static export from SSR	Full visibility	Next.js (static export), Gatsby
Client-side rendered (CSR)	Minimal to zero	React SPA, Vue SPA, Angular SPA
Heavy JS WordPress themes	Partial - depends on theme	WordPress with Elementor, Divi, WPBakery

If your content only appears after JavaScript runs, AI crawlers cannot see it. This affects GPTBot (ChatGPT), ClaudeBot (Claude), PerplexityBot (Perplexity), and most other AI crawlers. You need Static Site Generation (SSG) or Server-Side Rendering (SSR) for AI visibility.

How to test what AI crawlers see

View page source (not inspect element) - this is what crawlers read
Disable JavaScript in your browser and reload - this is what crawlers see
Run curl https://yoursite.com/page in a terminal - this returns the raw HTML
If your content disappears in any of these tests, AI crawlers cannot see it
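
The same check can be scripted. The sketch below is a minimal illustration rather than a complete crawler simulation: it drops script and style contents from raw HTML and reports whether a phrase you expect on the page is present without JavaScript. The names phrase_in_raw_html and fetch_raw_html, and the raw-html-check user agent string, are made up for this example.

```python
import urllib.request
from html.parser import HTMLParser


class SourceTextParser(HTMLParser):
    """Collects text from raw HTML, ignoring <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def phrase_in_raw_html(html: str, phrase: str) -> bool:
    """Return True if the phrase appears in the HTML text without running JS."""
    parser = SourceTextParser()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    return phrase.lower() in text.lower()


def fetch_raw_html(url: str, user_agent: str = "raw-html-check/0.1") -> str:
    """Download the page source, roughly what curl returns (hypothetical UA)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

For a live page, call phrase_in_raw_html(fetch_raw_html("https://yoursite.com/page"), "a sentence from your page"). A False result suggests the content only appears after JavaScript runs.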

Robots.txt configuration for AI crawlers

Your robots.txt controls which AI crawlers access your content. Many sites block them without realising it, through wildcard rules or security plugins.

Use a 3-category system: block training bots, allow search bots, and allow user-triggered bots. This protects your content from AI training while keeping full AI search visibility.

# CATEGORY 1: BLOCK TRAINING CRAWLERS
# These collect data to train AI models - block if you want to protect content
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# CATEGORY 2: ALLOW SEARCH CRAWLERS
# These power real-time AI search results - blocking removes you from AI search
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

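# Google-Extended is a robots.txt control token for Google's Gemini models, not a separate crawler;
# it does not affect inclusion in Google Search
User-agent: Google-Extended
Allow: /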

User-agent: Amazonbot
Allow: /

# CATEGORY 3: ALLOW USER-TRIGGERED CRAWLERS
# These fetch pages when users share URLs or browse via AI
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Perplexity-User
Allow: /

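# NOTE: ClaudeBot is Anthropic's training crawler, not a user-triggered fetcher.
# It is allowed here; switch to "Disallow: /" to treat it like the Category 1 bots.
User-agent: ClaudeBot
Allow: /

# anthropic-ai is an older Anthropic user-agent token
User-agent: anthropic-ai
Allow: /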

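# Block sensitive directories for every bot not named above
# (a crawler that matches one of the named groups follows only that group,
# so repeat these Disallow lines in that group if it must stay out too)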
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /staging/
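
Before deploying rules like these, it can help to check how a parser reads them. The sketch below uses Python's standard urllib.robotparser on a shortened copy of the rules (one bot per category plus the wildcard group). It approximates crawler behaviour only: each crawler has its own matcher, and SomeOtherBot and the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the rules: one bot per category plus the wildcard group.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: *
Disallow: /admin/
"""


def check_access(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Return, for each user agent, whether the rules allow fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    bots = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "SomeOtherBot"]
    for url in ("https://example.com/guide/", "https://example.com/admin/"):
        print(url, check_access(ROBOTS_TXT, bots, url))
```

In this parser a bot with its own group does not inherit the wildcard group. OAI-SearchBot is therefore still allowed into /admin/ here, while the unnamed SomeOtherBot is kept out.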

Common robots.txt mistakes

Wildcard blocking: User-agent: * / Disallow: / blocks every crawler that has no group of its own, AI crawlers included
Security plugin defaults: WordPress security plugins often block unknown user agents
All-or-nothing rules: Allowing or blocking every AI crawler at once is outdated - use the 3-category system to block training while allowing search
Forgetting search-specific bots: OAI-SearchBot and Claude-SearchBot are separate from GPTBot and ClaudeBot - blocking the training bot doesn't block the search bot, and vice versa
Missing Brave indexing for Claude: Claude uses Brave Search - ensure your site is indexed in Brave, not just Google/Bing

The answer capsule format

An answer capsule is a 40-60 word factual paragraph placed right after a heading. It gives a direct answer to the question the heading implies. AI platforms extract these as citation-ready content. Pages using this format see higher citation rates across ChatGPT, Gemini, and AI Overviews.

Answer capsule structure

Placement: Immediately after the H2 or H3 heading
Length: 40-60 words (concise enough for extraction)
Content: Direct factual answer with specific data points
Formatting: Bold the first sentence or the entire capsule
CSS class: Use .answer-capsule for Speakable schema targeting
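
The structure rules above can be sketched in code. This is an illustrative helper set, not a required implementation: it checks the 40-60 word range with a plain whitespace split, wraps the capsule in a bold paragraph with the .answer-capsule class, and emits schema.org Speakable markup pointing at that class. The function names and the example.com URL are assumptions for this example.

```python
import html
import json


def is_capsule_length_ok(text: str, min_words: int = 40, max_words: int = 60) -> bool:
    """Return True when the whitespace-separated word count is within range."""
    return min_words <= len(text.split()) <= max_words


def capsule_html(text: str, css_class: str = "answer-capsule") -> str:
    """Wrap the capsule in a bold paragraph carrying the targeting class."""
    return f'<p class="{css_class}"><strong>{html.escape(text)}</strong></p>'


def speakable_jsonld(page_url: str, css_class: str = "answer-capsule") -> str:
    """Build JSON-LD that marks elements with the class as speakable."""
    data = {
        "@context": "https://schema.org",
        "@type": "WebPage",
        "url": page_url,
        "speakable": {
            "@type": "SpeakableSpecification",
            "cssSelector": [f".{css_class}"],
        },
    }
    return json.dumps(data, indent=2)


if __name__ == "__main__":
    print(speakable_jsonld("https://example.com/pricing/"))
```

The JSON-LD output goes inside a script tag of type application/ld+json. By this whitespace count the example capsule in this section comes to 35 words, so it would need roughly one more sentence to reach the stated 40-60 range.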

Example

After a heading "How much does AI search optimisation cost?", the answer capsule would be:

"AI search optimisation costs between £500-£5,000 per month from specialist agencies. The price depends on scope, competition, and the number of AI platforms targeted. Most UK agencies charge separately for audit, implementation, and ongoing monitoring."

Heading hierarchy for AI extraction

AI crawlers use heading hierarchy to understand content structure and extract relevant sections. Follow these rules:

Rule	Why it matters
One H1 per page	Defines the primary topic for AI extraction
H2 for major sections	Each H2 should be independently answerable
H3 for subsections	Provides granular extraction targets
No skipped levels	Don't jump from H2 to H4 - breaks hierarchy logic
Declarative headings (preferred)	Declarative headings give AI a direct statement to cite; pair them with question headings to match user queries
Answer capsule after each H2	Gives AI a citation-ready extract per section
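
Two of these rules (one H1 per page, no skipped levels) are easy to check automatically. The sketch below is one possible approach using Python's built-in HTML parser; heading_problems is a name invented for this example, and it only inspects heading tags, not their content.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Records the level of every h1-h6 tag in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))


def heading_problems(html: str) -> list[str]:
    """Return a list of hierarchy problems; an empty list means none found."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    levels = collector.levels

    problems = []
    h1_count = levels.count(1)
    if h1_count != 1:
        problems.append(f"expected exactly one h1, found {h1_count}")
    # A heading may go one level deeper than the previous one, never more.
    for previous, current in zip(levels, levels[1:]):
        if current > previous + 1:
            problems.append(f"skipped level: h{previous} followed by h{current}")
    return problems
```

Run it against the page source (not the rendered DOM) so the result reflects what a crawler reads.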

Optimal section length

120-180 words per section is optimal for AI extraction. Sections in this range are long enough for a complete answer but short enough for clean extraction; much shorter or much longer sections are harder for AI to extract cleanly.

Page speed and AI citations

Fast-loading pages are easier for AI crawlers to fetch and index. AI crawlers and AI search platforms may also factor in page speed when selecting sources, so a quick First Contentful Paint and few render-blocking resources help your content stay accessible.

One idea per paragraph

AI models process content at the paragraph level. Long paragraphs with multiple ideas cause extraction confusion. Keep paragraphs focused:

One claim per paragraph - don't bundle multiple statistics or facts
2-4 sentences maximum - shorter is easier to extract
Lead with the fact - put the key information in the first sentence
Avoid transition fluff - "As we discussed earlier" adds nothing for AI crawlers

Content freshness signals

76.4% of ChatGPT-cited pages were updated within 30 days. Freshness is a real citation factor. Implement these:

dateModified in schema - update this whenever you revise content
Visible "Last updated" date on the page - AI crawlers read this
Genuine content updates - don't just change the date, actually revise the content
Regular content audits - review and update key pages at least monthly
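
The first two signals can be generated from a single date so they never drift apart. The sketch below is a minimal example, assuming an Article schema type; the function names and sample values are illustrative, and your CMS or static site generator may already provide an equivalent.

```python
import json
from datetime import date


def article_jsonld(headline: str, url: str, published: date, modified: date) -> str:
    """Build schema.org Article JSON-LD with datePublished and dateModified."""
    if modified < published:
        raise ValueError("dateModified cannot be earlier than datePublished")
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "url": url,
        "datePublished": published.isoformat(),
        "dateModified": modified.isoformat(),
    }
    return json.dumps(data, indent=2)


def visible_last_updated(modified: date) -> str:
    """Render the visible date from the same value used in the schema."""
    iso = modified.isoformat()
    return f'<time datetime="{iso}">Last updated: {iso}</time>'
```

Pass the date of the last genuine revision, not the build date, so the markup only changes when the content does.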

llms.txt - the machine-readable index

llms.txt is a proposed standard that gives AI models a readable index of your key content. Just as robots.txt tells crawlers what they can access, llms.txt tells AI models what to prioritise. Place it at your domain root next to robots.txt and sitemap.xml.

# Your Company Name

> Brief description of your company and what you do.

## Key Pages
- [Homepage](https://example.com/)
- [About Us](https://example.com/about/)
- [Services](https://example.com/services/)
- [Contact](https://example.com/contact/)

## Expertise Areas
- [Topic Area 1](https://example.com/topic-1/)
- [Topic Area 2](https://example.com/topic-2/)

## FAQs
- [Common Questions](https://example.com/faq/)
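
If your page list already lives in code or a CMS, the file can be generated rather than hand-edited. The sketch below assumes the layout described in the llms.txt proposal (a single H1 with the site name, a blockquote summary, then H2 sections of links); build_llms_txt and the sample values are illustrative.

```python
def build_llms_txt(
    name: str,
    summary: str,
    sections: dict[str, list[tuple[str, str]]],
) -> str:
    """Render llms.txt: one H1, a blockquote summary, then H2 link lists."""
    lines = [f"# {name}", "", f"> {summary}"]
    for heading, links in sections.items():
        lines += ["", f"## {heading}"]
        lines += [f"- [{title}]({url})" for title, url in links]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(
        build_llms_txt(
            "Your Company Name",
            "Brief description of your company and what you do.",
            {
                "Key Pages": [("Homepage", "https://example.com/")],
                "FAQs": [("Common Questions", "https://example.com/faq/")],
            },
        )
    )
```

Write the returned string to llms.txt at the domain root as part of your build.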

IndexNow protocol

IndexNow notifies Bing (and therefore ChatGPT) when you publish or update content. Without it, you wait for Bing to find changes through normal crawling.

Supported by: Bing, Yandex, Seznam, Naver
Not supported by: Google (uses its own systems)
Impact: Near-instant Bing indexation, which feeds ChatGPT and Copilot
Implementation: API call or plugin (WordPress, Cloudflare Workers)
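
For the API route, a submission is a single JSON POST. The sketch below builds that request with the standard library, assuming the shared api.indexnow.org endpoint and a key file hosted at the site root as {key}.txt. The key value is a placeholder you generate yourself, and the function name is illustrative. Nothing is sent until you pass the request to urllib.request.urlopen.

```python
import json
import urllib.request
from urllib.parse import urlparse

INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"


def build_indexnow_request(urls: list[str], key: str) -> urllib.request.Request:
    """Build (but do not send) an IndexNow submission for URLs on one host."""
    if not urls:
        raise ValueError("urls must not be empty")
    host = urlparse(urls[0]).netloc
    if any(urlparse(u).netloc != host for u in urls):
        raise ValueError("all URLs in one submission must share a host")
    payload = {
        "host": host,
        "key": key,
        # Assumes the key file is served from the site root.
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": urls,
    }
    return urllib.request.Request(
        INDEXNOW_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )
```

To submit, call urllib.request.urlopen(build_indexnow_request(urls, key)) from your publish hook and check the HTTP status of the response.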

Bing Webmaster Tools submission

Since ChatGPT uses Bing's index, submitting your sitemap to Bing Webmaster Tools is essential. Many businesses only submit to Google Search Console and miss Bing entirely.

Go to bing.com/webmasters
Add your site and verify ownership
Submit your XML sitemap
Enable IndexNow for instant update notifications
Monitor crawl errors and coverage
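
If your platform does not already produce a sitemap to submit, a minimal one is easy to generate. The sketch below follows the sitemaps.org format with loc and lastmod entries; build_sitemap and the sample data are illustrative, and most static site generators ship an equivalent plugin.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages: list[tuple[str, str]]) -> str:
    """Render a sitemap from (url, last-modified ISO date) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    print(build_sitemap([("https://example.com/", "2025-01-15")]))
```

Save the output as sitemap.xml at the domain root and submit that URL in Bing Webmaster Tools.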

The Astro + Cloudflare advantage

Static site generators like Astro, combined with edge deployment on Cloudflare, create the ideal architecture for AI visibility:

Pre-rendered HTML - every page is static, fully readable by all crawlers
No JavaScript dependency - content exists in the HTML source
Edge caching - fast response times from global CDN
Markdown for Agents - Cloudflare's feature that serves clean markdown to AI crawlers
Lighthouse scores 95+ - compared to WordPress average of 40-70

This site is built on Astro and deployed to Cloudflare - you can read about our methodology.

Technical checklist

Item	Priority	Status check
Static HTML or SSR rendering	Critical	View source - is content visible?
Allow AI search crawlers in robots.txt	Critical	Check for OAI-SearchBot, Claude-SearchBot, PerplexityBot
Submit sitemap to Bing	High	Bing Webmaster Tools dashboard
Implement IndexNow	High	Test with Bing URL Submission API
Answer capsules after headings	High	40-60 word factual paragraphs
Clean heading hierarchy	High	H1 > H2 > H3, no skipped levels
One idea per paragraph	Medium	2-4 sentences, lead with the fact
Schema markup	High	Google Rich Results Test
Create llms.txt	Medium	File at domain root
Content freshness dates	Medium	dateModified in schema + visible date