Technical Guide
Last updated: May 2026

How to Structure Your Website for AI Crawlers

AI crawlers read HTML, not JavaScript. They need static content, a clean heading hierarchy, and machine-readable structure. This guide shows how to build for them.

Oliver Mackman
AI Search Analyst

AI crawlers such as GPTBot, ClaudeBot, and PerplexityBot read HTML source, not rendered JavaScript. Sites that rely on client-side rendering are largely invisible to AI search. Static HTML, a clean heading hierarchy, answer capsules, and a correctly configured robots.txt are the foundations of an AI-visible website.

The fundamental problem: JavaScript rendering

Most AI crawlers do not execute JavaScript. They read raw HTML source code. This creates a massive visibility gap:

Rendering approach         | AI crawler visibility      | Common platforms
---------------------------|----------------------------|----------------------------------------
Static HTML / SSG          | Full visibility            | Astro, Hugo, Eleventy, Jekyll
Server-side rendered (SSR) | Full visibility            | Next.js (SSR mode), Nuxt, Astro
Static export from SSR     | Full visibility            | Next.js (static export), Gatsby
Client-side rendered (CSR) | Minimal to zero            | React SPA, Vue SPA, Angular SPA
Heavy JS WordPress themes  | Partial - depends on theme | WordPress with Elementor, Divi, WPBakery

If your content only appears after JavaScript runs, AI crawlers cannot see it. This affects GPTBot (ChatGPT), ClaudeBot (Claude), PerplexityBot (Perplexity), and most other AI crawlers. You need Static Site Generation (SSG) or Server-Side Rendering (SSR) for AI visibility.

How to test what AI crawlers see

  1. View page source (not inspect element) - this is what crawlers read
  2. Disable JavaScript in your browser and reload - this is what crawlers see
  3. Use curl https://yoursite.com/page in terminal - this returns raw HTML
  4. If your content disappears in any of these tests, AI crawlers cannot see it
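
The same check can be scripted. The sketch below is a minimal illustration, not a full crawler simulation: it extracts the text present in raw HTML (ignoring anything inside script, style, and template tags) and reports whether a phrase you expect on the page is there before any JavaScript runs. The function names and the raw-html-check/0.1 user agent string are hypothetical choices made for this example.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class VisibleTextParser(HTMLParser):
    """Collects text that exists in raw HTML without running any script."""

    # Text inside these tags is code or inert markup, not page content.
    SKIPPED_TAGS = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the page text a non-JavaScript client could read."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def phrase_in_raw_html(html: str, phrase: str) -> bool:
    """True if the phrase appears in the raw HTML text (case-insensitive)."""
    normalised_page = " ".join(visible_text(html).split()).lower()
    normalised_phrase = " ".join(phrase.split()).lower()
    return normalised_phrase in normalised_page


def fetch_raw_html(url: str, user_agent: str = "raw-html-check/0.1") -> str:
    """Fetch a page the way curl does: one GET request, no script execution."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A typical use would be phrase_in_raw_html(fetch_raw_html("https://yoursite.com/page"), "a sentence from your main content"). A False result for text you can see in the browser suggests that content is injected by JavaScript.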

Robots.txt configuration for AI crawlers

Your robots.txt controls which AI crawlers access your content. Many sites block them without realising it, through wildcard rules or security plugins.

Use a 3-category system: block training bots, allow search bots, and allow user-triggered bots. This protects your content from AI training while keeping full AI search visibility.

# CATEGORY 1: BLOCK TRAINING CRAWLERS
# These collect data to train AI models - block if you want to protect content
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# CATEGORY 2: ALLOW SEARCH CRAWLERS
# These power real-time AI search results - blocking removes you from AI search
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

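# Google-Extended is a control token, not a separate crawler: it governs use of
# your content for Gemini training and grounding, and does not affect Google Search
User-agent: Google-Extended
Allow: /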

User-agent: Amazonbot
Allow: /

# CATEGORY 3: ALLOW USER-TRIGGERED CRAWLERS
# These fetch pages when users share URLs or browse via AI
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Perplexity-User
Allow: /

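# ClaudeBot is Anthropic's training crawler and anthropic-ai is an older
# Anthropic agent name; neither is user-triggered. Both are allowed here -
# change to Disallow: / to treat them like the Category 1 training crawlers
User-agent: ClaudeBot
Allow: /

User-agent: anthropic-ai
Allow: /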

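# Block sensitive directories for bots that have no group of their own above.
# A bot named above follows only its own group, so repeat these Disallow
# lines inside that group if it must stay out of these paths too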
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /staging/
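
To confirm that a configuration behaves as intended, you can evaluate it locally before deploying. The sketch below uses Python's standard urllib.robotparser on a shortened copy of the rules above. The access_report helper and the SomeOtherBot agent are hypothetical names for illustration, and real crawlers may match rules slightly differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the configuration above, one bot per category.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: *
Disallow: /admin/
"""


def access_report(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Map each user agent to whether the rules let it fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "SomeOtherBot"]
    report = access_report(ROBOTS_TXT, agents, "https://example.com/guide/")
    for agent, allowed in report.items():
        print(f"{agent:15} {'allowed' if allowed else 'blocked'}")
```

Running the same report against https://example.com/admin/login shows a detail that is easy to miss: SomeOtherBot is blocked by the wildcard group, but OAI-SearchBot is still allowed, because a bot with its own group does not inherit rules from User-agent: *.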

Common robots.txt mistakes

  • Wildcard blocking: User-agent: * followed by Disallow: / blocks every crawler that has no more specific group of its own, which usually includes AI crawlers
  • Security plugin defaults: WordPress security plugins often block unknown user agents
  • All-or-nothing rules: The old "allow all" or "block all" approach is outdated - use the 3-category system to block training while allowing search
  • Forgetting search-specific bots: OAI-SearchBot and Claude-SearchBot are separate from GPTBot and ClaudeBot - blocking the training bot doesn't block the search bot, and vice versa
  • Missing Brave indexing for Claude: Claude's web search draws on Brave Search - ensure your site is indexed in Brave, not just Google/Bing

See also: What should my robots.txt look like for AI search?

The answer capsule format

An answer capsule is a 40-60 word factual paragraph placed right after a heading. It gives a direct answer to the question the heading implies. AI platforms extract these as citation-ready content. Pages using this format see higher citation rates across ChatGPT, Gemini, and AI Overviews.

Answer capsule structure

  • Placement: Immediately after the H2 or H3 heading
  • Length: 40-60 words (concise enough for extraction)
  • Content: Direct factual answer with specific data points
  • Formatting: Bold the first sentence or the entire capsule
  • CSS class: Use a class such as .answer-capsule so Speakable schema can target the capsule through its cssSelector property
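
The length rule is easy to check automatically. The sketch below is a minimal illustration: it counts whitespace-separated words and compares the total against the 40-60 range given above. The function names are hypothetical, and counting on whitespace is an assumption (it treats a price range such as £500-£5,000 as one word).

```python
MIN_WORDS = 40  # lower bound recommended in this guide
MAX_WORDS = 60  # upper bound recommended in this guide


def word_count(text: str) -> int:
    """Count whitespace-separated tokens in the text."""
    return len(text.split())


def check_capsule(text: str) -> tuple[bool, int]:
    """Return (within_range, word_count) for a candidate answer capsule."""
    count = word_count(text)
    return MIN_WORDS <= count <= MAX_WORDS, count
```

Running check_capsule over the first paragraph under each heading gives a quick list of capsules that need trimming or expanding.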

Example

After a heading "How much does AI search optimisation cost?", the answer capsule would be:

"AI search optimisation costs between £500 and £5,000 per month when bought from specialist agencies. The price depends on the scope of work, the level of competition, and the number of AI platforms targeted. Most UK agencies charge separately for the audit, implementation, and ongoing monitoring."

Heading hierarchy for AI extraction

AI crawlers use heading hierarchy to understand content structure and extract relevant sections. Follow these rules:

Rule                             | Why it matters
---------------------------------|------------------------------------------------------------
One H1 per page                  | Defines the primary topic for AI extraction
H2 for major sections            | Each H2 should be independently answerable
H3 for subsections               | Provides granular extraction targets
No skipped levels                | Don't jump from H2 to H4 - breaks hierarchy logic
Declarative headings (preferred) | Recent data shows declarative headings average 4.3 citations vs 3.4 for question headings
Answer capsule after each H2     | Gives AI a citation-ready extract per section
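
The first and fourth rules can be verified mechanically. The sketch below is one possible audit, written with Python's standard html.parser: it collects heading levels in document order, checks for exactly one H1, and flags any step that goes deeper by more than one level. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingCollector(HTMLParser):
    """Records the level of every heading in document order."""

    def __init__(self):
        super().__init__()
        self.levels = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self.levels.append(int(tag[1]))


def audit_headings(html: str) -> list[str]:
    """Return a list of hierarchy problems; an empty list means none found."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()

    problems = []
    h1_count = collector.levels.count(1)
    if h1_count != 1:
        problems.append(f"expected exactly one h1, found {h1_count}")

    for previous, current in zip(collector.levels, collector.levels[1:]):
        # Going deeper by more than one level (h2 -> h4) skips a level.
        # Moving back up (h3 -> h2) is normal and is not flagged.
        if current > previous + 1:
            problems.append(f"h{previous} is followed by h{current}")
    return problems
```

For example, audit_headings("<h1>A</h1><h2>B</h2><h4>C</h4>") reports that h2 is followed by h4.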

Optimal section length

120-180 words per section is optimal for AI extraction. Sections in this range deliver 70% more citations than shorter or longer ones. This is long enough for a complete answer but short enough for clean extraction.

Page speed and AI citations

A First Contentful Paint (FCP) under 0.4 seconds correlates with roughly 3x more citations: fast pages average 6.7 AI citations vs 2.1 for slow pages. Both AI crawlers and AI search platforms factor in page speed when selecting sources.

One idea per paragraph

AI models process content at the paragraph level. Long paragraphs with multiple ideas cause extraction confusion. Keep paragraphs focused:

  • One claim per paragraph - don't bundle multiple statistics or facts
  • 2-4 sentences maximum - shorter is easier to extract
  • Lead with the fact - put the key information in the first sentence
  • Avoid transition fluff - "As we discussed earlier" adds nothing for AI crawlers

Content freshness signals

76.4% of ChatGPT-cited pages were updated within 30 days. Freshness is a real citation factor. Implement these:

  • dateModified in schema - update this whenever you revise content
  • Visible "Last updated" date on the page - AI crawlers read this
  • Genuine content updates - don't just change the date, actually revise the content
  • Regular content audits - review and update key pages at least monthly
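
For the first item, the date belongs in your page's JSON-LD. The sketch below is a minimal illustration that builds a schema.org Article object with datePublished and dateModified and wraps it in a script tag for the page head. The function names, the example headline, and the dates are hypothetical, and your pages may use a different schema type.

```python
import json
from datetime import date


def article_jsonld(headline: str, url: str, published: date, modified: date) -> str:
    """Build a schema.org Article JSON-LD string with both date fields."""
    if modified < published:
        raise ValueError("dateModified cannot be earlier than datePublished")
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "url": url,
        "datePublished": published.isoformat(),
        # Update this only when the content itself has been revised.
        "dateModified": modified.isoformat(),
    }
    return json.dumps(data, indent=2)


def script_tag(jsonld: str) -> str:
    """Wrap a JSON-LD string in the script element used to embed it in HTML."""
    return f'<script type="application/ld+json">\n{jsonld}\n</script>'


if __name__ == "__main__":
    markup = article_jsonld(
        "Example guide",                 # hypothetical headline
        "https://example.com/guide/",    # hypothetical URL
        published=date(2026, 1, 15),
        modified=date(2026, 2, 1),
    )
    print(script_tag(markup))
```

Generating the value from your build or CMS revision date, rather than typing it by hand, keeps the schema date and the visible "Last updated" date in step.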

llms.txt - the machine-readable index

llms.txt is a proposed standard that gives AI models a readable index of your key content. Just as robots.txt tells crawlers what they can access, llms.txt tells AI models what to prioritise. Place it at your domain root next to robots.txt and sitemap.xml.

# Your Company Name

> Brief description of your company and what you do.

## Key Pages
- [Homepage](https://example.com/)
- [About Us](https://example.com/about/)
- [Services](https://example.com/services/)
- [Contact](https://example.com/contact/)

## Expertise Areas
- [Topic Area 1](https://example.com/topic-1/)
- [Topic Area 2](https://example.com/topic-2/)

## FAQs
- [Common Questions](https://example.com/faq/)

IndexNow protocol

IndexNow notifies Bing (and, through Bing's index, ChatGPT search) when you publish or update content. Without it, you wait for Bing to find changes through normal crawling.

  • Supported by: Bing, Yandex, Seznam, Naver
  • Not supported by: Google (uses its own systems)
  • Impact: Near-instant Bing indexation, which feeds ChatGPT and Copilot
  • Implementation: API call or plugin (WordPress, Cloudflare Workers)
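
If you choose the API route, a submission is a single JSON POST. The sketch below is a minimal illustration using only Python's standard library and the shared api.indexnow.org endpoint. It assumes you have already generated an IndexNow key and host it as a text file at your site root; the function names and the key file location are assumptions for this example.

```python
import json
from urllib.parse import urlparse
from urllib.request import Request, urlopen

INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"


def build_payload(urls: list[str], key: str) -> dict:
    """Build the JSON body for a bulk IndexNow submission."""
    if not urls:
        raise ValueError("at least one URL is required")
    hosts = {urlparse(url).netloc for url in urls}
    if len(hosts) != 1:
        raise ValueError("all URLs in one submission must share a host")
    host = hosts.pop()
    return {
        "host": host,
        "key": key,
        # Assumption: the key file is hosted at the site root as <key>.txt.
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": urls,
    }


def submit(urls: list[str], key: str) -> int:
    """POST the payload and return the HTTP status code."""
    body = json.dumps(build_payload(urls, key)).encode("utf-8")
    request = Request(
        INDEXNOW_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError for 4xx and 5xx responses.
    with urlopen(request, timeout=10) as response:
        return response.status
```

Calling submit from a deploy hook, with the list of URLs changed in that release, means search engines hear about each update as it goes live.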

Bing Webmaster Tools submission

Since ChatGPT uses Bing's index, submitting your sitemap to Bing Webmaster Tools is essential. Many businesses only submit to Google Search Console and miss Bing entirely.

  1. Go to bing.com/webmasters
  2. Add your site and verify ownership
  3. Submit your XML sitemap
  4. Enable IndexNow for instant update notifications
  5. Monitor crawl errors and coverage

The Astro + Cloudflare advantage

Static site generators like Astro, combined with edge deployment on Cloudflare, create the ideal architecture for AI visibility:

  • Pre-rendered HTML - every page is static, fully readable by all crawlers
  • No JavaScript dependency - content exists in the HTML source
  • Edge caching - fast response times from global CDN
  • Markdown for Agents - Cloudflare's feature that serves clean markdown to AI crawlers
  • Lighthouse scores 95+ - compared to WordPress average of 40-70

This site is built on Astro and deployed to Cloudflare - you can read about our methodology.

Technical checklist

Item                                   | Priority | Status check
---------------------------------------|----------|--------------------------------------------------
Static HTML or SSR rendering           | Critical | View source - is content visible?
Allow AI search crawlers in robots.txt | Critical | Check for OAI-SearchBot, Claude-SearchBot, PerplexityBot
Submit sitemap to Bing                 | High     | Bing Webmaster Tools dashboard
Implement IndexNow                     | High     | Test with Bing URL Submission API
Answer capsules after headings         | High     | 40-60 word factual paragraphs
Clean heading hierarchy                | High     | H1 > H2 > H3, no skipped levels
One idea per paragraph                 | Medium   | 2-4 sentences, lead with the fact
Schema markup                          | High     | Google Rich Results Test
Create llms.txt                        | Medium   | File at domain root
Content freshness dates                | Medium   | dateModified in schema + visible date

Oliver Mackman

AI Search Analyst, SEOCompare

Oliver leads SEOCompare's editorial and comparison research. With over a decade in digital marketing, he oversees agency evaluation, tool testing, and AI search data analysis.

Last reviewed: 7 April 2026
