How to Structure Your Website for AI Crawlers
AI crawlers read HTML, not JavaScript. They need static content, clean heading hierarchy, and machine-readable structure. How to build for AI.
AI crawlers (GPTBot, ClaudeBot, PerplexityBot, Google-Extended) read HTML source, not rendered JavaScript. Sites that rely on client-side rendering are invisible to AI search. Static HTML, clean heading hierarchy, answer capsules, and proper robots.txt are the foundations of an AI-visible website.
The fundamental problem: JavaScript rendering
Most AI crawlers do not execute JavaScript. They read raw HTML source code. This creates a massive visibility gap:
| Rendering approach | AI crawler visibility | Common platforms |
|---|---|---|
| Static HTML / SSG | Full visibility | Astro, Hugo, Eleventy, Jekyll |
| Server-side rendered (SSR) | Full visibility | Next.js (SSR mode), Nuxt, Astro |
| Static export from SSR | Full visibility | Next.js (static export), Gatsby |
| Client-side rendered (CSR) | Minimal to zero | React SPA, Vue SPA, Angular SPA |
| Heavy JS WordPress themes | Partial - depends on theme | WordPress with Elementor, Divi, WPBakery |
If your content only appears after JavaScript runs, AI crawlers cannot see it. This affects GPTBot (ChatGPT), ClaudeBot (Claude), PerplexityBot (Perplexity), and most other AI crawlers. You need Static Site Generation (SSG) or Server-Side Rendering (SSR) for AI visibility.
How to test what AI crawlers see
- View page source (not inspect element) - this is what crawlers read
- Disable JavaScript in your browser and reload - this is what crawlers see
- Use `curl https://yoursite.com/page` in terminal - this returns raw HTML
- If your content disappears in any of these tests, AI crawlers cannot see it
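The same check can be scripted. The sketch below is one possible approach, not a definitive tool: it fetches a page without running JavaScript and tests whether a phrase appears in the text outside `<script>` and `<style>` tags. The function names and the `raw-html-check/0.1` user agent string are illustrative assumptions, not part of any crawler's API.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class VisibleText(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def phrase_in_raw_html(html, phrase):
    """Return True if phrase appears in the text of the un-rendered HTML."""
    parser = VisibleText()
    parser.feed(html)
    # Collapse whitespace so line breaks in the markup do not hide a match.
    text = " ".join(" ".join(parser.chunks).split())
    return phrase.lower() in text.lower()


def fetch_raw_html(url, user_agent="raw-html-check/0.1"):
    """Fetch the page with a single GET; no JavaScript is executed."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")
```

For example, `phrase_in_raw_html(fetch_raw_html("https://yoursite.com/page"), "a sentence from your main content")` returning `False` suggests that content only appears after client-side rendering.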
Robots.txt configuration for AI crawlers
Your robots.txt controls which AI crawlers access your content. Many sites block them without realising it, through wildcard rules or security plugins.
Use a 3-category system: block training bots, allow search bots, and allow user-triggered bots. This protects your content from AI training while keeping full AI search visibility.
# CATEGORY 1: BLOCK TRAINING CRAWLERS
# These collect data to train AI models - block if you want to protect content
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Bytespider
Disallow: /
# CATEGORY 2: ALLOW SEARCH CRAWLERS
# These power real-time AI search results - blocking removes you from AI search
User-agent: OAI-SearchBot
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
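# Google-Extended is a control token, not a search crawler: it governs use of your content by Gemini and does not affect Google Search
User-agent: Google-Extended
Allow: /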
User-agent: Amazonbot
Allow: /
# CATEGORY 3: ALLOW USER-TRIGGERED CRAWLERS
# These fetch pages when users share URLs or browse via AI
User-agent: ChatGPT-User
Allow: /
User-agent: Claude-User
Allow: /
User-agent: Perplexity-User
Allow: /
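# Note: ClaudeBot is Anthropic's training crawler and anthropic-ai is an older Anthropic agent name, not user-triggered fetchers.
# They are allowed here; move them to Category 1 if you want to block training.
User-agent: ClaudeBot
Allow: /
User-agent: anthropic-ai
Allow: /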
# Block sensitive directories from all bots
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /staging/
Common robots.txt mistakes
- Wildcard blocking: `User-agent: *` / `Disallow: /` blocks everything including AI crawlers
- Security plugin defaults: WordPress security plugins often block unknown user agents
- Blocking all AI crawlers: The old "allow all" or "block all" approach is outdated - use the 3-category system to block training while allowing search
- Forgetting search-specific bots: OAI-SearchBot and Claude-SearchBot are separate from GPTBot and ClaudeBot - blocking the training bot doesn't block the search bot, and vice versa
- Missing Brave indexing for Claude: Claude uses Brave Search - ensure your site is indexed in Brave, not just Google/Bing
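One way to catch the mistakes above is to parse your robots.txt and ask which agents it admits. This is a minimal sketch using Python's standard `urllib.robotparser`. The `audit_robots` helper and the `SAMPLE` rules are illustrative assumptions, and real crawlers may resolve edge cases (such as rule precedence) differently from this parser.

```python
from urllib.robotparser import RobotFileParser


def audit_robots(robots_txt, agents, url="https://example.com/"):
    """Map each user agent to whether the given rules let it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# Hypothetical rules: block one training bot, allow one search bot,
# and keep /admin/ off limits for everyone else.
SAMPLE = """\
User-agent: GPTBot
Disallow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: *
Disallow: /admin/
"""

if __name__ == "__main__":
    agents = ["GPTBot", "OAI-SearchBot", "PerplexityBot"]
    for agent, allowed in audit_robots(SAMPLE, agents).items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Running the same function against a file containing only `User-agent: *` and `Disallow: /` reports every agent as blocked, which is the wildcard mistake described above.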
See also: What should my robots.txt look like for AI search?
The answer capsule format
An answer capsule is a 40-60 word factual paragraph placed right after a heading. It gives a direct answer to the question the heading implies. AI platforms extract these as citation-ready content. Pages using this format see higher citation rates across ChatGPT, Gemini, and AI Overviews.
Answer capsule structure
- Placement: Immediately after the H2 or H3 heading
- Length: 40-60 words (concise enough for extraction)
- Content: Direct factual answer with specific data points
- Formatting: Bold the first sentence or the entire capsule
- CSS class: Use `.answer-capsule` for Speakable schema targeting
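The length rule is easy to check mechanically. The sketch below counts whitespace-separated words, which is an assumption about how "words" are measured. The helper names are illustrative, and `EXAMPLE` is the sample capsule text used in this article.

```python
def word_count(text):
    """Count whitespace-separated tokens."""
    return len(text.split())


def is_capsule_length(text, low=40, high=60):
    """True if the text falls inside the 40-60 word capsule range."""
    return low <= word_count(text) <= high


# Sample capsule text from this article.
EXAMPLE = (
    "AI search optimisation costs between £500-£5,000 per month from "
    "specialist agencies. The price depends on scope, competition, and "
    "the number of AI platforms targeted. Most UK agencies charge "
    "separately for audit, implementation, and ongoing monitoring."
)

if __name__ == "__main__":
    print(word_count(EXAMPLE), is_capsule_length(EXAMPLE))
```

By this count the sample capsule comes to 35 words, slightly under the stated range, so a check like this is worth running on drafts before publishing.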
Example
After a heading "How much does AI search optimisation cost?", the answer capsule would be:
"AI search optimisation costs between £500-£5,000 per month from specialist agencies. The price depends on scope, competition, and the number of AI platforms targeted. Most UK agencies charge separately for audit, implementation, and ongoing monitoring."
Heading hierarchy for AI extraction
AI crawlers use heading hierarchy to understand content structure and extract relevant sections. Follow these rules:
| Rule | Why it matters |
|---|---|
| One H1 per page | Defines the primary topic for AI extraction |
| H2 for major sections | Each H2 should be independently answerable |
| H3 for subsections | Provides granular extraction targets |
| No skipped levels | Don't jump from H2 to H4 - breaks hierarchy logic |
| Declarative headings (preferred) | Recent data shows declarative headings average 4.3 citations vs 3.4 for question headings |
| Answer capsule after each H2 | Gives AI a citation-ready extract per section |
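The first and fourth rules in the table can be checked automatically. This is a minimal sketch using Python's standard `html.parser`. The `heading_issues` function and its messages are illustrative, and it only looks at heading order, not at whether each section is independently answerable.

```python
import re
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Records the level of every h1-h6 tag in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        if re.fullmatch(r"h[1-6]", tag):
            self.levels.append(int(tag[1]))


def heading_issues(html):
    """Return a list of problems: wrong h1 count or skipped levels."""
    collector = HeadingCollector()
    collector.feed(html)
    levels = collector.levels
    issues = []
    h1_count = levels.count(1)
    if h1_count != 1:
        issues.append(f"expected one h1, found {h1_count}")
    # Going deeper by more than one level at a time skips a level;
    # moving back up (h3 to h2) is fine.
    for previous, current in zip(levels, levels[1:]):
        if current > previous + 1:
            issues.append(f"h{previous} followed by h{current} skips a level")
    return issues
```

An empty list means the page has exactly one H1 and never jumps more than one level deeper at a time.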
Optimal section length
120-180 words per section is optimal for AI extraction. Sections in this range deliver 70% more citations than shorter or longer ones. This is long enough for a complete answer but short enough for clean extraction.
Page speed and AI citations
FCP under 0.4 seconds correlates with 3x more citations. Fast pages average 6.7 AI citations vs 2.1 for slow pages. Both AI crawlers and AI search platforms factor in page speed when selecting sources.
One idea per paragraph
AI models process content at the paragraph level. Long paragraphs with multiple ideas cause extraction confusion. Keep paragraphs focused:
- One claim per paragraph - don't bundle multiple statistics or facts
- 2-4 sentences maximum - shorter is easier to extract
- Lead with the fact - put the key information in the first sentence
- Avoid transition fluff - "As we discussed earlier" adds nothing for AI crawlers
Content freshness signals
76.4% of ChatGPT-cited pages were updated within 30 days. Freshness is a real citation factor. Implement these:
- dateModified in schema - update this whenever you revise content
- Visible "Last updated" date on the page - AI crawlers read this
- Genuine content updates - don't just change the date, actually revise the content
- Regular content audits - review and update key pages at least monthly
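The first two items can be generated from a single date so they never drift apart. The sketch below builds a schema.org `Article` JSON-LD tag and a matching visible line. The function names and the choice of `Article` as the type are assumptions; adapt the type and properties to your own pages.

```python
import json
from datetime import date


def article_jsonld(headline, published, modified):
    """Return a JSON-LD <script> tag carrying datePublished and dateModified."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "datePublished": published.isoformat(),
        "dateModified": modified.isoformat(),
    }
    # Escape "</" so a headline can never close the script tag early.
    body = json.dumps(data).replace("</", "<\\/")
    return f'<script type="application/ld+json">{body}</script>'


def last_updated_line(modified):
    """Return the visible date line, built from the same date object."""
    return f"Last updated: {modified.day} {modified.strftime('%B')} {modified.year}"
```

Pass the same `date` object to both functions at build time, and change it only when the content itself has been revised.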
llms.txt - the machine-readable index
llms.txt is an emerging standard that gives AI models a readable index of your key content. Just as robots.txt tells crawlers what they can access, llms.txt tells AI models what to prioritise. Place it at your domain root next to robots.txt and sitemap.xml.
# Example llms.txt
# Your Company Name
# https://example.com
## About
> Brief description of your company and what you do.
## Key Pages
- [Homepage](https://example.com/)
- [About Us](https://example.com/about/)
- [Services](https://example.com/services/)
- [Contact](https://example.com/contact/)
## Expertise Areas
- [Topic Area 1](https://example.com/topic-1/)
- [Topic Area 2](https://example.com/topic-2/)
## FAQs
- [Common Questions](https://example.com/faq/)
IndexNow protocol
IndexNow notifies Bing (and therefore ChatGPT) when you publish or update content. Without it, you wait for Bing to find changes through normal crawling.
- Supported by: Bing, Yandex, Seznam, Naver
- Not supported by: Google (uses its own systems)
- Impact: Near-instant Bing indexation, which feeds ChatGPT and Copilot
- Implementation: API call or plugin (WordPress, Cloudflare Workers)
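For the API route, a submission is a single JSON POST. The sketch below builds such a request with Python's standard library, following the publicly documented IndexNow format. The host, key, and URLs are placeholders, and the key file location assumes you host `{key}.txt` at your site root.

```python
import json
from urllib.request import Request

INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"


def build_indexnow_request(host, key, urls):
    """Build (but do not send) an IndexNow submission for a list of URLs."""
    payload = {
        "host": host,
        "key": key,
        # Assumption: the key file is served from the site root.
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": list(urls),
    }
    return Request(
        INDEXNOW_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )
```

To send it, pass the result to `urllib.request.urlopen` from your publish or deploy step. Keeping the build step separate makes the payload easy to inspect before anything leaves your machine.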
Bing Webmaster Tools submission
Since ChatGPT uses Bing's index, submitting your sitemap to Bing Webmaster Tools is essential. Many businesses only submit to Google Search Console and miss Bing entirely.
- Go to bing.com/webmasters
- Add your site and verify ownership
- Submit your XML sitemap
- Enable IndexNow for instant update notifications
- Monitor crawl errors and coverage
The Astro + Cloudflare advantage
Static site generators like Astro, combined with edge deployment on Cloudflare, create the ideal architecture for AI visibility:
- Pre-rendered HTML - every page is static, fully readable by all crawlers
- No JavaScript dependency - content exists in the HTML source
- Edge caching - fast response times from global CDN
- Markdown for Agents - Cloudflare's feature that serves clean markdown to AI crawlers
- Lighthouse scores 95+ - compared to WordPress average of 40-70
This site is built on Astro and deployed to Cloudflare - you can read about our methodology.
Technical checklist
| Item | Priority | Status check |
|---|---|---|
| Static HTML or SSR rendering | Critical | View source - is content visible? |
| Allow AI search crawlers in robots.txt | Critical | Check for OAI-SearchBot, Claude-SearchBot, PerplexityBot |
| Submit sitemap to Bing | High | Bing Webmaster Tools dashboard |
| Implement IndexNow | High | Test with Bing URL Submission API |
| Answer capsules after headings | High | 40-60 word factual paragraphs |
| Clean heading hierarchy | High | H1 > H2 > H3, no skipped levels |
| One idea per paragraph | Medium | 2-4 sentences, lead with the fact |
| Schema markup | High | Google Rich Results Test |
| Create llms.txt | Medium | File at domain root |
| Content freshness dates | Medium | dateModified in schema + visible date |
Oliver Mackman
AI Search Analyst, SEOCompare
Oliver leads SEOCompare's editorial and comparison research. With over a decade in digital marketing, he oversees agency evaluation, tool testing, and AI search data analysis.
Last reviewed: 7 April 2026