Most “page disappeared from Google” cases come from contradictions across four signals that nobody inspects together: <meta robots>, X-Robots-Tag, rel="canonical", and sitemap inclusion. The audit’s Indexability check is built around that pile-up.

The four signals

Signal	Set in	What it tells Google
`<meta name="robots">`	HTML `<head>`	Per-page index/follow rules
`X-Robots-Tag` HTTP header	Server response	Same directives as meta robots, but usable on any response, including non-HTML resources (PDFs, images). If it conflicts with the HTML, the most restrictive directive applies
`<link rel="canonical">`	HTML `<head>`	“Treat this other URL as the version of record”
Sitemap inclusion	sitemap.xml	“This URL is important; please crawl it”

When they agree, indexing works. When they disagree, Google has to guess - and the guess is often not the one you wanted.
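To compare the signals, you first have to collect them in one place. The sketch below pulls the two HTML-side signals (meta robots and canonical) out of a page using only the Python standard library. The names `HeadSignals` and `parse_head` are illustrative, not part of any tool described here, and the parser does not restrict itself to the `<head>` element.

```python
from html.parser import HTMLParser


class HeadSignals(HTMLParser):
    """Collects meta robots directives and the canonical href from HTML."""

    def __init__(self):
        super().__init__()
        self.robots = set()    # e.g. {"noindex", "follow"}
        self.canonical = None  # first rel="canonical" href seen

    def handle_starttag(self, tag, attrs):
        a = {name: (value or "") for name, value in attrs}
        if tag == "meta" and a.get("name", "").lower() == "robots":
            # content is a comma-separated, case-insensitive directive list
            for directive in a.get("content", "").split(","):
                if directive.strip():
                    self.robots.add(directive.strip().lower())
        elif tag == "link" and "canonical" in a.get("rel", "").lower().split():
            if self.canonical is None:
                self.canonical = a.get("href")


def parse_head(html):
    """Return (robots_directives, canonical_href) for an HTML string."""
    parser = HeadSignals()
    parser.feed(html)
    return parser.robots, parser.canonical
```

The header and sitemap signals come from the HTTP response and sitemap.xml respectively, so they have to be gathered separately.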

The four most-frequent conflicts

1. noindex + sitemap inclusion

The page is in your sitemap (asking Google to crawl it) but carries <meta name="robots" content="noindex"> (telling Google not to index it). Result: wasted crawl budget, slow deindexing, and a contradictory signal that can make Google trust the rest of the sitemap less.

Fix: Pick one. If the page should be hidden, remove it from the sitemap and serve noindex. If it should be indexed, drop the noindex.
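One way to spot this conflict yourself, as a rough sketch: parse the sitemap and intersect its URLs with the set of pages you already know serve noindex. The function names are hypothetical, and the `noindex_urls` set is assumed to come from your own crawl.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the set of <loc> URLs listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }


def noindex_in_sitemap(xml_text, noindex_urls):
    """Return sitemap URLs that are also known to serve noindex."""
    return sorted(sitemap_urls(xml_text) & set(noindex_urls))
```

This handles a single urlset file only; sitemap index files that point to other sitemaps would need an extra step.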

2. Canonical pointing to a noindex page

Page A’s canonical points to Page B. Page B has noindex. Net result: nothing indexed. Google honors the canonical, then honors the noindex.

Fix: Either remove noindex from the canonical target, or change the canonical to point to a real indexable version.
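A minimal sketch of this check, assuming you already have crawl results in a dictionary keyed by URL. The `pages` structure and the function name are made up for illustration.

```python
def canonical_to_noindex(pages):
    """Find pages whose canonical target is marked noindex.

    pages: {url: {"canonical": str or None, "noindex": bool}}
    Returns a list of (page_url, canonical_target) pairs.
    """
    findings = []
    for url, info in pages.items():
        target = info.get("canonical")
        # Only look at canonicals that point to a different, known page
        if target and target != url and pages.get(target, {}).get("noindex"):
            findings.append((url, target))
    return findings
```

Canonical targets that were never crawled are skipped here, since their noindex state is unknown.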

3. HTML says index, X-Robots-Tag says noindex

Your <head> has <meta name="robots" content="index, follow">. Your server returns X-Robots-Tag: noindex in the response headers. Most restrictive wins - and devs only ever inspect HTML.

Fix: Audit your server config and your edge layer. The header is invisible in browser dev tools’ Elements panel - only the Network tab shows it.
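The “most restrictive wins” rule can be sketched as a union of both directive sets. This is a simplification: it ignores user-agent-scoped header values such as `googlebot: noindex`, and the function names are illustrative.

```python
def parse_directives(value):
    """Split a comma-separated robots value into lowercase directives."""
    return {d.strip().lower() for d in (value or "").split(",") if d.strip()}


def effective_noindex(meta_content, header_value):
    """True if either the meta tag or the X-Robots-Tag header blocks indexing."""
    combined = parse_directives(meta_content) | parse_directives(header_value)
    # "none" is shorthand for "noindex, nofollow"
    return bool(combined & {"noindex", "none"})
```

For the case above, `effective_noindex("index, follow", "noindex")` returns `True`: the explicit "index" in the HTML does not cancel the header.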

4. Self-redirecting canonical chain

<link rel="canonical" href="https://example.com/page"> where /page returns a 301 to /page/. Canonical → redirect → final. Google may pick the wrong canonical or ignore the hint entirely.

Fix: The canonical should always point to the final 200-status URL.
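The check can be sketched offline by following redirects through recorded responses. The `responses` mapping is a hypothetical stand-in for real crawl data, and `max_hops` is an arbitrary safety limit.

```python
REDIRECT_CODES = (301, 302, 307, 308)


def resolve_canonical(canonical, responses, max_hops=5):
    """Follow redirects from a canonical URL.

    responses: {url: (status_code, location_or_None)}
    Returns (final_url, final_status, hops). A healthy canonical
    yields hops == 0 and final_status == 200.
    """
    url, hops = canonical, 0
    while hops < max_hops:
        status, location = responses.get(url, (None, None))
        if status in REDIRECT_CODES and location:
            url, hops = location, hops + 1
            continue
        return url, status, hops
    # Gave up: too many redirects (possibly a loop)
    return url, None, hops
```

Any result with `hops > 0` means the canonical should be updated to the returned `final_url`.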

What Metaspry checks per page

When you run the audit, the Indexability rule combines all four signals into a single verdict:

Indexable - all signals agree, page is eligible.
Conflict (mixed) - one or more signals disagree. The card shows which.
Excluded - page is intentionally blocked (noindex, canonical to off-site, or all four agree on hidden).

Each conflict is named, not just flagged:

“Canonical → noindex page”
“Self-redirecting canonical (301 chain)”
“X-Robots-Tag header overrides meta robots”
“noindex + in sitemap”
“Cross-domain canonical (likely staging leak)”
“Canonical missing (auto-canonical drift risk)”
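To make the idea of a single verdict with named conflicts concrete, here is a toy sketch. It is not Metaspry’s implementation: the function, its boolean inputs, and the decision order are all assumptions, and it covers only three of the conflict names listed above. A missing meta robots tag is treated the same as an explicit “index”.

```python
def verdict(meta_noindex, header_noindex, in_sitemap, canonical_target_noindex):
    """Combine per-page signals into (verdict, [named conflicts])."""
    conflicts = []
    if header_noindex and not meta_noindex:
        conflicts.append("X-Robots-Tag header overrides meta robots")
    if (meta_noindex or header_noindex) and in_sitemap:
        conflicts.append("noindex + in sitemap")
    if canonical_target_noindex:
        conflicts.append("Canonical → noindex page")

    if conflicts:
        return "Conflict (mixed)", conflicts
    if meta_noindex or header_noindex:
        # Blocked, and nothing contradicts the block
        return "Excluded", []
    return "Indexable", []
```

Returning the conflict names alongside the verdict is what lets a report say which signals disagree rather than only that something is wrong.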

Staging leaks to production

A specific case worth its own callout because it has caused multiple documented deindexings: cross-domain canonical.

After a migration, the production site’s <head> ends up with <link rel="canonical" href="https://staging.example.com/page">. Google believes you. The whole production site deindexes.

The audit flags any canonical whose hostname differs from the page’s hostname. If the canonical points to staging. / dev. / localhost / a different TLD, it’s surfaced as a critical alert.
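A hostname comparison along these lines could look like the sketch below. The severity labels and the prefix list are assumptions for illustration, and the “different TLD” case is left out because detecting it reliably needs a public-suffix list.

```python
from urllib.parse import urlsplit

# Hostname prefixes treated as non-production (assumed list)
SUSPICIOUS_PREFIXES = ("staging.", "dev.")


def canonical_host_issue(page_url, canonical_url):
    """Return None, "cross-domain", or "critical" for a canonical URL."""
    page_host = (urlsplit(page_url).hostname or "").lower()
    canon_host = (urlsplit(canonical_url).hostname or "").lower()
    # Relative canonicals have no hostname and resolve to the page's own host
    if not canon_host or canon_host == page_host:
        return None
    if canon_host == "localhost" or canon_host.startswith(SUSPICIOUS_PREFIXES):
        return "critical"
    return "cross-domain"
```

Note that a strict comparison like this also flags `www.example.com` versus `example.com`, which may or may not be what you want.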

robots.txt vs noindex

Worth restating because it confuses people: robots.txt blocking does not noindex.

Disallow: in robots.txt = don’t crawl. But Google may still index the URL based on inbound links, showing a title-only result with no snippet.
<meta name="robots" content="noindex"> or X-Robots-Tag: noindex = don’t index. The crawler still needs to fetch the page to see the directive.

If you want a page fully out of the index, you must allow it in robots.txt and serve a noindex directive. Blocking in robots.txt prevents Google from ever seeing the noindex.
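A quick way to test whether a noindex can even be seen is to check the URL against robots.txt first. The sketch below uses Python’s standard `urllib.robotparser`; its matching rules are simpler than Google’s (for example, no wildcard support), so treat the result as an approximation. The function name is illustrative.

```python
from urllib.robotparser import RobotFileParser


def noindex_is_visible(robots_txt, url, user_agent="Googlebot"):
    """True if robots.txt lets the crawler fetch the URL.

    If this returns False, any noindex on the page cannot be seen,
    because the page is never fetched.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A page that serves noindex but returns `False` here is the trap described above: the block in robots.txt hides the directive that would remove the page from the index.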

AI crawler signals - GPTBot, ClaudeBot, Google-Extended
Audit rules - full rule list
Site files - robots.txt + sitemap.xml + llms.txt parsing