AI crawler signals (GPTBot, ClaudeBot, Google-Extended, llms.txt)
Which AI crawlers fetch your page, what each control actually does, and where the universally-misunderstood landmines hide.
AI-crawler traffic doubled in H2 2025 (3.3M → 5.6M domains served by blocked crawlers, per The Register), and the controls are universally misunderstood. This page is the no-nonsense reference for what each signal does in 2026, with the specific landmines that show up over and over in audits.
The bots that matter (and the vendor 3-bot trap)
| Bot user-agent | Vendor | Purpose | Honors robots.txt? |
|---|---|---|---|
| GPTBot | OpenAI | Training crawler | Yes |
| ChatGPT-User | OpenAI | User-triggered fetch (Browse with Bing / share) | Yes |
| ClaudeBot | Anthropic | Training crawler | Yes |
| Claude-User | Anthropic | User-triggered fetch | Yes |
| Claude-SearchBot | Anthropic | Real-time search for Claude | Yes |
| Google-Extended | Google | Gemini training opt-out token | Yes |
| Googlebot | Google | Search index (also feeds AI Overviews) | Yes |
| PerplexityBot | Perplexity | Search-answer crawler | Mixed (see below) |
| Applebot-Extended | Apple | Apple Intelligence training opt-out | Yes |
| CCBot | Common Crawl | Public dataset (feeds training of many models) | Yes |
| Bytespider | ByteDance (TikTok) | Training crawler | Often ignored |
| FacebookBot | Meta | Llama training | Yes |
# Block all three Anthropic identities
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-User
Disallow: /
User-agent: Claude-SearchBot
Disallow: /
The same trap exists for OpenAI: GPTBot is the training crawler and ChatGPT-User is the fetch-on-prompt agent, so blocking only one leaves the other allowed.
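A partial-vendor block is easy to detect mechanically. The sketch below uses Python's standard `urllib.robotparser`; the `VENDOR_BOTS` mapping and the `partial_vendor_blocks` helper are illustrative names, not part of any tool described here, and the parser's matching rules may differ in detail from what each vendor's crawler actually implements.

```python
from urllib.robotparser import RobotFileParser

# Vendor -> user-agent tokens, copied from the table above (illustrative subset).
VENDOR_BOTS = {
    "OpenAI": ["GPTBot", "ChatGPT-User"],
    "Anthropic": ["ClaudeBot", "Claude-User", "Claude-SearchBot"],
}


def partial_vendor_blocks(robots_txt: str, path: str = "/") -> dict[str, list[str]]:
    """Per vendor, list the bots still allowed when at least one sibling is blocked."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    findings = {}
    for vendor, bots in VENDOR_BOTS.items():
        allowed = [bot for bot in bots if parser.can_fetch(bot, path)]
        # Report only mixed states: some identities blocked, some not.
        if allowed and len(allowed) < len(bots):
            findings[vendor] = allowed
    return findings


# A robots.txt that blocks only the training crawlers.
example = """\
User-agent: ClaudeBot
Disallow: /

User-agent: GPTBot
Disallow: /
"""

print(partial_vendor_blocks(example))
```

For the example file, the helper reports ChatGPT-User under OpenAI and Claude-User plus Claude-SearchBot under Anthropic, which is exactly the gap the three-identity block above closes.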
The Google-Extended myth
This is the single most-confused control in technical SEO in 2026.
| If you set… | Result |
|---|---|
| `User-agent: Google-Extended` + `Disallow: /` | Page excluded from Gemini training data. Search ranking unchanged. AI Overview citation unchanged: the page can still be cited. |
| `User-agent: Googlebot` + `Disallow: /` | Page removed from Google Search index. Site-killing if applied to production. |
| `<meta name="robots" content="nosnippet">` | Page can be indexed, but no snippet is shown in the regular SERP and there is no AI Overview citation. This is the actual AI Overview opt-out. |
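The three rows can be read as three independent checks. The following is a minimal sketch of that logic using only the Python standard library; the function name `google_ai_exposure` and the result keys are hypothetical, and the mapping from signals to outcomes simply mirrors the table above rather than any documented Google API.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives |= {d.strip().lower() for d in content.split(",") if d.strip()}


def google_ai_exposure(robots_txt: str, html: str, path: str = "/") -> dict[str, bool]:
    """Simplified reading of the table: which Google uses remain open for this page."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    meta = RobotsMetaParser()
    meta.feed(html)
    googlebot_ok = robots.can_fetch("Googlebot", path)
    return {
        # Google-Extended only governs Gemini training.
        "gemini_training_allowed": robots.can_fetch("Google-Extended", path),
        # Googlebot governs Search itself.
        "googlebot_crawl_allowed": googlebot_ok,
        # Per the table, nosnippet is the AI Overview opt-out.
        "ai_overview_citation_possible": googlebot_ok and "nosnippet" not in meta.directives,
    }
```

Feeding it a robots.txt that disallows only Google-Extended and a page without `nosnippet` shows the myth directly: training is off, but the AI Overview flag stays on.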
llms.txt: validate, don’t worship
The proposal at llmstxt.org defines a markdown index of a site’s pages for LLM consumption.
Reality check, May 2026:
- Gary Illyes confirmed Google does not use llms.txt.
- SE Ranking studied 300,000 domains: zero measurable lift in AI citations after adding llms.txt.
- Anthropic and Perplexity do honor it. OpenAI’s behavior is undocumented.
- Adoption is low: roughly 0.1% of AI crawler requests touch `/llms.txt` (OtterlyAI, 90-day window).
Metaspry validates llms.txt when present, but flags the honest context — present is fine, absent is fine, Google does not care. Tools that show “MISSING llms.txt” as a red error are participating in a misinformation loop.
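For a sense of what "validating" an llms.txt can mean, here is a minimal structural check based on one reading of the llmstxt.org proposal: an H1 title first, then H2 sections whose list items are markdown links. It is a sketch under that assumption, not Metaspry's validator, and `check_llms_txt` is a hypothetical name.

```python
import re

# A list item of the form "- [name](url)" with optional ": notes" after it.
LINK_ITEM = re.compile(r"^[-*]\s+\[[^\]]+\]\([^)\s]+\)(:\s*.*)?$")


def check_llms_txt(text: str) -> list[str]:
    """Return a list of structural problems; an empty list means none were found."""
    problems = []
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith("# "):
        problems.append("first non-blank line should be an H1 title")
    in_section = False
    for line in lines:
        if line.startswith("## "):
            in_section = True
        elif in_section and line.startswith(("- ", "* ")) and not LINK_ITEM.match(line):
            problems.append(f"list item without a [name](url) link: {line}")
    return problems


sample = """\
# Example Site

> Short summary of the site.

## Docs

- [Getting started](https://example.com/start): first steps
- not a link
"""

for problem in check_llms_txt(sample):
    print(problem)
```

A file that passes this check is well-formed, nothing more. It says nothing about whether any crawler reads it.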
13% of AI bots ignore robots.txt entirely
Bytespider, several specialized scrapers, and the long tail of fly-by-night LLM trainers do not respect Disallow:. If you have a publisher-grade business reason to block training, robots.txt is not enough. You need:
- Cloudflare’s “Block AI bots” rule (free tier).
- Or per-vendor IP / user-agent blocks at your CDN / origin.
- Or paid services like DarkVisitors / TollBit.
The extension flags this with a “robots.txt alone is insufficient — consider server-level rules” hint when you’ve disallowed AI bots.
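If you block at the origin rather than the CDN, a user-agent rule can be as small as the WSGI middleware sketched below. The class name `BlockAIBots` and the token list are illustrative assumptions; user-agent strings are trivially spoofed, so treat this as a first filter that IP-based rules would need to back up.

```python
# Lowercase substrings to look for in the User-Agent header (illustrative list).
BLOCKED_UA_TOKENS = ("bytespider", "ccbot", "gptbot", "claudebot")


class BlockAIBots:
    """WSGI middleware that answers 403 to requests from listed crawlers."""

    def __init__(self, app, tokens=BLOCKED_UA_TOKENS):
        self.app = app
        self.tokens = tuple(token.lower() for token in tokens)

    def __call__(self, environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in self.tokens):
            body = b"Forbidden\n"
            start_response(
                "403 Forbidden",
                [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))],
            )
            return [body]
        # Everything else passes through to the wrapped application untouched.
        return self.app(environ, start_response)
```

Wrap any WSGI application with `app = BlockAIBots(app)`. Unlike robots.txt, the request is refused whether or not the crawler chooses to cooperate.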
Cross-checks the audit runs
When you run an audit on a page, the AI-crawler row surfaces:
- Partial-vendor blocks. “You block GPTBot but allow ChatGPT-User” or the Anthropic 3-bot pattern.
- Goal-conflicting setups. “You block Google-Extended (Gemini training) but the page is set up to be cited in AI Overviews. Google-Extended does not change that. Use `nosnippet` if you want out.”
- SPA traps. If the meta tags are injected via JavaScript, AI crawlers that fetch static HTML only (GPTBot, ClaudeBot, PerplexityBot per Cloudflare Q1 2026 data) will see an empty page. The audit flags this when it detects a pre-JS vs rendered-DOM diff.
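The SPA check boils down to comparing meta tags in the raw HTML response with those in the rendered DOM. The sketch below assumes you already have both as strings (the rendered one would come from a headless browser, which is out of scope here); `js_only_meta` is a hypothetical helper, not the audit's actual implementation.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects <meta name=...> and <meta property=...> tags into a dict."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        key = attr.get("name") or attr.get("property")
        if tag == "meta" and key:
            self.meta[key.lower()] = attr.get("content") or ""


def collect_meta(html: str) -> dict[str, str]:
    collector = MetaCollector()
    collector.feed(html)
    return collector.meta


def js_only_meta(static_html: str, rendered_html: str) -> dict[str, str]:
    """Meta tags that appear, or change value, only after JavaScript has run."""
    before = collect_meta(static_html)
    after = collect_meta(rendered_html)
    return {key: value for key, value in after.items() if before.get(key) != value}
```

Any key this returns is invisible to a crawler that reads only the static response, which is the situation worth flagging.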
What to do this week
If you’ve never touched your AI-crawler policy:
- Run Metaspry on your home page. Check the Site tab → AI crawlers row.
- Decide your stance: full-allow, allow-search-block-training, or full-block.
- Update robots.txt accordingly — covering all bot identities per vendor.
- Re-run Metaspry. Confirm the row reflects what you intended.
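The three stances translate into robots.txt fairly directly. Below is a minimal generator sketch; the split of bots into training and search/fetch groups follows the Purpose column of the table at the top of this page, while `build_robots_txt` and the list names are illustrative. Googlebot is deliberately left out of every block list, since disallowing it removes the site from Search.

```python
# Grouping follows the "Purpose" column of the bot table (an interpretation).
TRAINING_BOTS = [
    "GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended",
    "CCBot", "Bytespider", "FacebookBot",
]
SEARCH_AND_FETCH_BOTS = ["ChatGPT-User", "Claude-User", "Claude-SearchBot", "PerplexityBot"]


def build_robots_txt(stance: str) -> str:
    """Render a robots.txt body for one of the three stances."""
    if stance == "full-allow":
        blocked = []
    elif stance == "allow-search-block-training":
        blocked = TRAINING_BOTS
    elif stance == "full-block":
        blocked = TRAINING_BOTS + SEARCH_AND_FETCH_BOTS
    else:
        raise ValueError(f"unknown stance: {stance!r}")
    # One group per bot identity, so no vendor is left half-blocked.
    groups = [f"User-agent: {bot}\nDisallow: /" for bot in blocked]
    groups.append("User-agent: *\nAllow: /")
    return "\n\n".join(groups) + "\n"


print(build_robots_txt("allow-search-block-training"))
```

Merge the output with any rules you already have rather than overwriting the file, and remember that bots which ignore robots.txt still need server-level rules.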
Related
- robots.txt + sitemap.xml + llms.txt - how Metaspry fetches the three files
- Privacy - what we send, where, and why
- Indexability conflicts - the noindex / canonical / sitemap pile-up