AI crawler signals (GPTBot, ClaudeBot, Google-Extended, llms.txt)

Which AI crawlers fetch your page, what each control actually does, and where the widely misunderstood landmines hide.

AI-crawler traffic grew roughly 70% in H2 2025 (3.3M → 5.6M domains served by blocked crawlers, per The Register), and the controls are routinely misapplied. This page is the no-nonsense reference for what each signal does in 2026, with the specific landmines that recur in audits.

The bots that matter (and the vendor 3-bot trap)

| Bot user-agent | Vendor | Purpose | Honors robots.txt? |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Training crawler | Yes |
| ChatGPT-User | OpenAI | User-triggered fetch (Browse with Bing / share) | Yes |
| ClaudeBot | Anthropic | Training crawler | Yes |
| Claude-User | Anthropic | User-triggered fetch | Yes |
| Claude-SearchBot | Anthropic | Real-time search for Claude | Yes |
| Google-Extended | Google | Gemini training opt-out token | Yes |
| Googlebot | Google | Search index (also feeds AI Overviews) | Yes |
| PerplexityBot | Perplexity | Search-answer crawler | Mixed (see below) |
| Applebot-Extended | Apple | Apple Intelligence training opt-out | Yes |
| CCBot | Common Crawl | Public dataset (feeds training of many models) | Yes |
| Bytespider | ByteDance (TikTok) | Training crawler | Often ignored |
| FacebookBot | Meta | Llama training | Yes |
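
Anthropic crawls under three separate user-agents, so a rule for ClaudeBot alone leaves Claude-User and Claude-SearchBot free to fetch. To block the vendor entirely, name all three:

# Block all three Anthropic identities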
User-agent: ClaudeBot
Disallow: /

User-agent: Claude-User
Disallow: /

User-agent: Claude-SearchBot
Disallow: /

The same trap exists for OpenAI (GPTBot is training, ChatGPT-User is fetch-on-prompt).
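
The partial-block trap can be checked mechanically. The following is a minimal sketch, assuming Python's standard `urllib.robotparser` is a close enough stand-in for how each crawler matches its own token; the `VENDOR_AGENTS` mapping and the `partial_blocks` helper are illustrative names, not part of any tool described here.

```python
from urllib.robotparser import RobotFileParser

# Vendor -> user-agent tokens, copied from the table above (illustrative subset).
VENDOR_AGENTS = {
    "Anthropic": ["ClaudeBot", "Claude-User", "Claude-SearchBot"],
    "OpenAI": ["GPTBot", "ChatGPT-User"],
}


def partial_blocks(robots_txt: str, path: str = "/") -> dict[str, dict[str, bool]]:
    """Return vendors for which some, but not all, agents are blocked on `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    findings = {}
    for vendor, agents in VENDOR_AGENTS.items():
        # True means the agent is disallowed from fetching `path`.
        blocked = {agent: not parser.can_fetch(agent, path) for agent in agents}
        if any(blocked.values()) and not all(blocked.values()):
            findings[vendor] = blocked
    return findings
```

Feeding it a robots.txt that names only ClaudeBot and GPTBot reports both vendors as partially blocked; the three-group Anthropic file above produces no Anthropic finding.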

The Google-Extended myth

This is the single most-confused control in technical SEO in 2026.

| If you set… | Result |
| --- | --- |
| User-agent: Google-Extended + Disallow: / | Page excluded from Gemini training data. Search ranking unchanged. AI Overview citation unchanged: the page can still be cited. |
| User-agent: Googlebot + Disallow: / | Page removed from Google Search index. Site-killing if applied to production. |
| <meta name="robots" content="nosnippet"> | Page can be indexed but no snippet shown in regular SERP and no AI Overview citation. This is the actual AI Overview opt-out. |
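
To see which of these three controls a given page actually has in place, something like the sketch below can help. It only reports what is set; interpreting the result follows the table above. The `google_controls` function and `RobotsMetaParser` class are hypothetical names, and the robots.txt matching relies on Python's standard `urllib.robotparser`, which may differ in edge cases from Google's own parser.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for part in attr.get("content", "").split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def google_controls(robots_txt: str, html: str, path: str = "/") -> dict[str, bool]:
    """Report which of the three Google-related controls are set for a page."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    meta = RobotsMetaParser()
    meta.feed(html)
    return {
        # Gemini training opt-out token; per the table, no effect on AI Overviews.
        "google_extended_blocked": not robots.can_fetch("Google-Extended", path),
        # Blocking Googlebot affects Search itself.
        "googlebot_blocked": not robots.can_fetch("Googlebot", path),
        # The control the table names as the AI Overview opt-out.
        "nosnippet": "nosnippet" in meta.directives,
    }
```

A result with `google_extended_blocked` true and `nosnippet` false is the goal-conflicting setup: training is opted out, but the page remains eligible for AI Overview citation.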

llms.txt: validate, don’t worship

The proposal at llmstxt.org defines a markdown index of a site’s pages for LLM consumption.

Reality check, May 2026:

  • Gary Illyes confirmed Google does not use llms.txt.
  • SE Ranking studied 300,000 domains: zero measurable lift in AI citations after adding llms.txt.
  • Anthropic and Perplexity do honor it. OpenAI’s behavior is undocumented.
  • Adoption is negligible: only ~0.1% of AI crawler requests touch /llms.txt (OtterlyAI, 90-day window).

Metaspry validates llms.txt when present, but flags the honest context — present is fine, absent is fine, Google does not care. Tools that show “MISSING llms.txt” as a red error are participating in a misinformation loop.
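
As an illustration of what "validating" can mean here, the sketch below checks two structural points from the llmstxt.org proposal: the file opens with an H1 title, and list items under H2 sections are markdown links with optional notes after a colon. This is an assumption-laden sketch, not Metaspry's validator; `check_llms_txt` and `LINK_ITEM` are hypothetical names, and the regular expression is deliberately loose.

```python
import re

# "- [name](url)" optionally followed by ": notes"
LINK_ITEM = re.compile(r"^[-*]\s+\[([^\]]+)\]\(([^)\s]+)\)(:\s*.*)?$")


def check_llms_txt(text: str) -> list[str]:
    """Return structural problems found; an empty list means none were detected."""
    problems = []
    lines = [line.rstrip() for line in text.splitlines()]
    non_empty = [line for line in lines if line.strip()]
    # The H1 title is the one section the proposal marks as required.
    if not non_empty or not re.match(r"^# \S", non_empty[0]):
        problems.append("first non-empty line should be an H1 title ('# Name')")
    in_section = False
    for number, line in enumerate(lines, start=1):
        if line.startswith("## "):
            in_section = True
        elif in_section and re.match(r"^[-*]\s", line) and not LINK_ITEM.match(line):
            problems.append(
                f"line {number}: list item is not '[name](url)' or '[name](url): notes'"
            )
    return problems
```

An absent file is not an error under this approach; the check only applies when /llms.txt exists.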

13% of AI bots ignore robots.txt entirely

Bytespider, several specialized scrapers, and the long tail of fly-by-night LLM trainers do not respect Disallow:. If you have a publisher-grade business reason to block training, robots.txt is not enough. You need:

  • Cloudflare’s “Block AI bots” rule (free tier).
  • Or per-vendor IP / user-agent blocks at your CDN / origin.
  • Or paid services like DarkVisitors / TollBit.

The extension flags this with a “robots.txt alone is insufficient — consider server-level rules” hint when you’ve disallowed AI bots.
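
For the origin-level option, a user-agent block can be as small as a WSGI middleware. The sketch below assumes a Python WSGI application; the token list and the `block_ai_bots` name are illustrative only. User-agent strings are trivially spoofed, so crawlers that already ignore robots.txt may slip past a check like this, which is why IP-based or CDN-level rules are listed alongside it.

```python
# Illustrative tokens only; choose the list that matches your own policy.
BLOCKED_AGENT_TOKENS = ("bytespider", "gptbot", "claudebot", "ccbot")


def block_ai_bots(app, tokens=BLOCKED_AGENT_TOKENS):
    """Wrap a WSGI app; answer 403 when the User-Agent contains a blocked token."""
    lowered = tuple(token.lower() for token in tokens)

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in lowered):
            start_response(
                "403 Forbidden", [("Content-Type", "text/plain; charset=utf-8")]
            )
            return [b"Forbidden\n"]
        # Everything else passes through to the wrapped application untouched.
        return app(environ, start_response)

    return middleware
```

Wrapping is a single call, for example `application = block_ai_bots(application)`, where `application` stands for whatever WSGI callable the site already exposes.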

Cross-checks the audit runs

When you run an audit on a page, the AI-crawler row surfaces:

  1. Partial-vendor blocks. “You block GPTBot but allow ChatGPT-User” or the Anthropic 3-bot pattern.
  2. Goal-conflicting setups. “You block Google-Extended (Gemini training) but the page is set up to be cited in AI Overviews — Google-Extended does not change that. Use nosnippet if you want out.”
  3. SPA traps. If the meta tags are injected via JavaScript, AI crawlers that fetch static HTML only (GPTBot, ClaudeBot, PerplexityBot per Cloudflare Q1 2026 data) will see an empty page. The audit flags this when it detects a pre-JS vs rendered-DOM diff.
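
The SPA check in item 3 amounts to diffing meta tags between the raw HTML response and the rendered DOM. A rough sketch of that comparison follows, assuming both HTML strings are already in hand (obtaining the rendered DOM, for example from a headless browser, is out of scope). `MetaCollector`, `collect_meta`, and `js_only_meta` are hypothetical names, not the audit's actual implementation.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name=...> and <meta property=...> tags as key -> content."""

    def __init__(self) -> None:
        super().__init__()
        self.meta: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        key = attr.get("name") or attr.get("property")
        if key:
            self.meta[key.lower()] = attr.get("content", "")


def collect_meta(html: str) -> dict[str, str]:
    collector = MetaCollector()
    collector.feed(html)
    return collector.meta


def js_only_meta(static_html: str, rendered_html: str) -> dict[str, str]:
    """Meta tags that exist (or differ) only after JavaScript has run."""
    before = collect_meta(static_html)
    after = collect_meta(rendered_html)
    return {key: value for key, value in after.items() if before.get(key) != value}
```

Any key this returns is a tag that a static-HTML-only crawler would not see in the form the rendered page presents.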

What to do this week

If you’ve never touched your AI-crawler policy:

  1. Run Metaspry on your home page. Check the Site tab → AI crawlers row.
  2. Decide your stance: full-allow, allow-search-block-training, or full-block.
  3. Update robots.txt accordingly — covering all bot identities per vendor.
  4. Re-run Metaspry. Confirm the row reflects what you intended.
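
Steps 2 and 3 can be sketched as a small generator that turns a stance into robots.txt groups. The split of agents into "training" and "fetch/search" below is one reading of the Purpose column in the bot table, not an official classification, and `build_rules` is a hypothetical helper. Googlebot is deliberately left out of every list, since blocking it affects Search itself.

```python
# Assumed grouping, based on the Purpose column of the bot table.
TRAINING_AGENTS = [
    "GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended",
    "CCBot", "Bytespider", "FacebookBot",
]
FETCH_AND_SEARCH_AGENTS = [
    "ChatGPT-User", "Claude-User", "Claude-SearchBot", "PerplexityBot",
]


def build_rules(stance: str) -> str:
    """Return robots.txt groups for one of the three stances from step 2."""
    if stance == "full-allow":
        blocked = []
    elif stance == "allow-search-block-training":
        blocked = TRAINING_AGENTS
    elif stance == "full-block":
        blocked = TRAINING_AGENTS + FETCH_AND_SEARCH_AGENTS
    else:
        raise ValueError(f"unknown stance: {stance!r}")
    # One group per agent, so every identity of every vendor is named explicitly.
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in blocked]
    return "\n\n".join(groups) + ("\n" if groups else "")
```

The output is meant to be appended to an existing robots.txt and then verified with a re-run, as in step 4; as noted earlier, crawlers that ignore robots.txt need server-level rules regardless.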