Answer Engine Optimization (AEO) is about whether AI engines can reach your page, read it, and cite it. AI-crawler traffic doubled in H2 2025 (3.3M → 5.6M domains served by blocked crawlers, per The Register), and the controls are widely misunderstood. This page is the no-nonsense reference for what each signal does in 2026, with the specific landmines that show up over and over in audits.

The bots that matter (and the vendor 3-bot trap)

Bot user-agent	Vendor	Purpose	Honors robots.txt?
`GPTBot`	OpenAI	Training crawler	Yes
`ChatGPT-User`	OpenAI	User-triggered fetch (Browse with Bing / share)	Yes
`ClaudeBot`	Anthropic	Training crawler	Yes
`Claude-User`	Anthropic	User-triggered fetch	Yes
`Claude-SearchBot`	Anthropic	Real-time search for Claude	Yes
`Google-Extended`	Google	Gemini training opt-out token	Yes
`Googlebot`	Google	Search index (also feeds AI Overviews)	Yes
`PerplexityBot`	Perplexity	Search-answer crawler	Mixed (see below)
`Applebot-Extended`	Apple	Apple Intelligence training opt-out	Yes
`CCBot`	Common Crawl	Public dataset (feeds training of many models)	Yes
`Bytespider`	ByteDance (TikTok)	Training crawler	Often ignored
`FacebookBot`	Meta	Llama training	Yes

# Block all three Anthropic identities
User-agent: ClaudeBot
Disallow: /

User-agent: Claude-User
Disallow: /

User-agent: Claude-SearchBot
Disallow: /

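Blocking only ClaudeBot stops training crawls but leaves Claude-User and Claude-SearchBot free to fetch your pages, so a vendor-level block needs one rule per identity. The same trap exists for OpenAI: GPTBot is the training crawler, ChatGPT-User is the fetch-on-prompt agent, and each needs its own rule.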
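A quick way to spot a partial-vendor block is to parse robots.txt and ask, per vendor, which identities are actually disallowed. The sketch below uses Python's standard `urllib.robotparser`; the `VENDOR_AGENTS` grouping and the `partial_vendor_blocks` helper are illustrative names, not part of any tool, and the standard-library matcher is only an approximation of how each vendor's crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Grouping taken from the bot table above (assumption: extend as vendors add agents).
VENDOR_AGENTS = {
    "OpenAI": ["GPTBot", "ChatGPT-User"],
    "Anthropic": ["ClaudeBot", "Claude-User", "Claude-SearchBot"],
}


def partial_vendor_blocks(robots_txt: str, url: str = "https://example.com/") -> dict:
    """Return vendors where some, but not all, identities are disallowed for url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    findings = {}
    for vendor, agents in VENDOR_AGENTS.items():
        blocked = [a for a in agents if not parser.can_fetch(a, url)]
        if blocked and len(blocked) < len(agents):
            allowed = [a for a in agents if a not in blocked]
            findings[vendor] = {"blocked": blocked, "allowed": allowed}
    return findings


if __name__ == "__main__":
    sample = "User-agent: ClaudeBot\nDisallow: /\n"
    print(partial_vendor_blocks(sample))
```

An empty result means every vendor is either fully allowed or fully blocked, which is usually what you intended.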

The Google-Extended myth

This is the single most-confused control in technical SEO in 2026.

If you set…	Result
`User-agent: Google-Extended` + `Disallow: /`	Page excluded from Gemini training data. Search ranking unchanged. AI Overview citation unchanged - the page can still be cited.
`User-agent: Googlebot` + `Disallow: /`	Page removed from Google Search index. Site-killing if applied to production.
`<meta name="robots" content="nosnippet">`	Page can be indexed but no snippet shown in regular SERP and no AI Overview citation. This is the actual AI Overview opt-out.
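The three rows above map to three independent checks. As a rough sketch, the Python below reads a robots.txt body and a page's HTML and reports each control separately, so a Google-Extended block is never mistaken for an AI Overview opt-out. The function name `google_controls` and the result keys are assumptions made for illustration; only standard-library parsers are used.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class _RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {k: (v or "") for k, v in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def google_controls(robots_txt: str, html: str, url: str = "https://example.com/") -> dict:
    """Report the three Google-facing controls independently of one another."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())

    meta = _RobotsMetaParser()
    meta.feed(html)

    return {
        # Gemini training opt-out only; no effect on Search or AI Overviews.
        "gemini_training_blocked": not robots.can_fetch("Google-Extended", url),
        # Blocks the Search crawler itself; dangerous on production.
        "googlebot_blocked": not robots.can_fetch("Googlebot", url),
        # The snippet-level opt-out described in the table.
        "nosnippet_set": "nosnippet" in meta.directives,
    }
```

A page that blocks Google-Extended but has no nosnippet directive comes back with `gemini_training_blocked` true and `nosnippet_set` false, which is exactly the setup that can still be cited.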

llms.txt: validate, don’t worship

The proposal at llmstxt.org defines a markdown index of a site’s pages for LLM consumption.

Reality check, May 2026:

Gary Illyes confirmed Google does not use llms.txt.
SE Ranking studied 300,000 domains: zero measurable lift in AI citations after adding llms.txt.
Anthropic and Perplexity do honor it. OpenAI’s behavior is undocumented.
Adoption is low: only ~0.1% of AI crawler requests touch /llms.txt (OtterlyAI, 90-day window).

Metaspry validates llms.txt when present, but flags the honest context - present is fine, absent is fine, Google does not care. Tools that show “MISSING llms.txt” as a red error are participating in a misinformation loop.
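For a sense of what "validating" llms.txt can mean, here is a minimal shape check in Python. It is a sketch based on the structure described at llmstxt.org (an H1 title, then `##` sections whose list items are `[name](url)` links with optional notes), not Metaspry's actual validator; `check_llms_txt` is a hypothetical helper, and the two rules shown are a deliberately small subset.

```python
import re

# A list item such as "- [Guide](https://example.com/guide.md): optional notes"
LINK_ITEM = re.compile(r"^[-*+]\s+\[[^\]]+\]\(\S+?\)(\s*:.*)?$")
LIST_ITEM = re.compile(r"^[-*+]\s+")


def check_llms_txt(text: str) -> list[str]:
    """Return a list of problems; an empty list means the basic shape looks right."""
    problems = []
    lines = text.splitlines()

    # Rule 1: the first non-empty line should be the H1 title.
    first = next((ln for ln in lines if ln.strip()), "")
    if not first.startswith("# "):
        problems.append("first non-empty line should be an H1 title ('# Name')")

    # Rule 2: inside '##' sections, list items should be markdown links.
    in_file_list = False
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if stripped.startswith("## "):
            in_file_list = True
        elif in_file_list and LIST_ITEM.match(stripped) and not LINK_ITEM.match(stripped):
            problems.append(f"line {number}: list item is not a [name](url) link")
    return problems
```

Consistent with the point above, a missing file is not a problem to report; this check only applies when the file exists.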

13% of AI bots ignore robots.txt entirely

Bytespider, several specialized scrapers, and the long tail of fly-by-night LLM trainers do not respect `Disallow:` rules. If you have a publisher-grade business reason to block training, robots.txt is not enough. You need:

Cloudflare’s “Block AI bots” rule (free tier).
Or per-vendor IP / user-agent blocks at your CDN / origin.
Or paid services like DarkVisitors / TollBit.

The extension flags this with a “robots.txt alone is insufficient - consider server-level rules” hint when you’ve disallowed AI bots.
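If you control the origin, a user-agent block can be as small as a piece of middleware. The Python sketch below wraps any WSGI application and returns 403 for requests whose User-Agent contains a blocked token. The token list and the `block_ai_bots` name are assumptions for illustration; user agents can be spoofed, so treat this as a complement to CDN or IP-level rules rather than a replacement.

```python
# Lowercase substrings to match against the User-Agent header (illustrative list).
BLOCKED_AGENT_TOKENS = ("bytespider", "gptbot", "claudebot", "ccbot")


def block_ai_bots(app, blocked_tokens=BLOCKED_AGENT_TOKENS):
    """Wrap a WSGI app so that matching user agents receive 403 Forbidden."""

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in blocked_tokens):
            body = b"Forbidden"
            start_response(
                "403 Forbidden",
                [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))],
            )
            return [body]
        # Everyone else passes through to the wrapped application untouched.
        return app(environ, start_response)

    return middleware
```

Unlike robots.txt, this is enforced by your server, so it applies to bots that never read or honor the file.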

Cross-checks the audit runs

When you run an audit on a page, the AI-crawler row surfaces:

Partial-vendor blocks. “You block GPTBot but allow ChatGPT-User” or the Anthropic 3-bot pattern.
Goal-conflicting setups. “You block Google-Extended (Gemini training) but the page is set up to be cited in AI Overviews - Google-Extended does not change that. Use nosnippet if you want out.”
SPA traps. If the meta tags are injected via JavaScript, AI crawlers that fetch static HTML only (GPTBot, ClaudeBot, PerplexityBot per Cloudflare Q1 2026 data) will see an empty page. The audit flags this when it detects a pre-JS vs rendered-DOM diff.
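The SPA check boils down to comparing head signals in the raw HTML response against the same signals after rendering. A minimal sketch, assuming you already have both versions as strings (obtaining the rendered DOM needs a headless browser and is out of scope here); `head_signals` and `js_only_signals` are hypothetical helper names, not the audit's real implementation.

```python
from html.parser import HTMLParser


class _HeadSignalParser(HTMLParser):
    """Collects <meta name|property> values and the canonical link."""

    def __init__(self):
        super().__init__()
        self.signals = {}

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta":
            key = attr.get("name") or attr.get("property")
            if key:
                self.signals[f"meta:{key.lower()}"] = attr.get("content") or ""
        elif tag == "link" and (attr.get("rel") or "").lower() == "canonical":
            self.signals["link:canonical"] = attr.get("href") or ""


def head_signals(html: str) -> dict:
    parser = _HeadSignalParser()
    parser.feed(html)
    return parser.signals


def js_only_signals(static_html: str, rendered_html: str) -> dict:
    """Signals that appear (or change) only after JavaScript has run."""
    static = head_signals(static_html)
    rendered = head_signals(rendered_html)
    return {key: value for key, value in rendered.items() if static.get(key) != value}
```

Anything this returns is invisible to a crawler that reads static HTML only, so a nosnippet or canonical tag listed here is not doing what you think it is for those bots.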

What to do this week

If you’ve never touched your AI-crawler policy:

Run Metaspry on your home page. Check the Site tab → AI crawlers row.
Decide your stance: full-allow, allow-search-block-training, or full-block.
Update robots.txt accordingly - covering all bot identities per vendor.
Re-run Metaspry. Confirm the row reflects what you intended.
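To make the three stances concrete, the Python sketch below generates a robots.txt body for each one, with a separate group for every bot identity. The split into training and search/fetch agents follows the table at the top of this page; the function name `build_robots_txt` is an assumption, and Googlebot is deliberately never blocked, since that would remove the site from Search.

```python
# Classification follows the bot table above; adjust to your own policy.
TRAINING_AGENTS = [
    "GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended",
    "CCBot", "Bytespider", "FacebookBot",
]
SEARCH_AGENTS = ["ChatGPT-User", "Claude-User", "Claude-SearchBot", "PerplexityBot"]


def build_robots_txt(stance: str) -> str:
    """Return a robots.txt body for one of the three stances."""
    if stance == "full-allow":
        blocked = []
    elif stance == "allow-search-block-training":
        blocked = TRAINING_AGENTS
    elif stance == "full-block":
        blocked = TRAINING_AGENTS + SEARCH_AGENTS
    else:
        raise ValueError(f"unknown stance: {stance!r}")

    # One group per identity, so no vendor ends up partially blocked by accident.
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in blocked]
    groups.append("User-agent: *\nAllow: /")
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    print(build_robots_txt("allow-search-block-training"))
```

Bots that ignore robots.txt are unaffected by any of these outputs, so a full-block stance still needs the server-level rules described earlier.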
