Why are AI crawlers blocked on enterprise websites?

Enterprise Web Application Firewalls (WAFs) from Cloudflare, Akamai, and Imperva are typically configured to block unrecognized bots as a security measure. AI crawlers from OpenAI (GPTBot), Anthropic (anthropic-ai), and Perplexity (PerplexityBot) fall into this unrecognized category by default. Adding these user agents to an explicit allow list before bot challenge rules fire resolves the issue without impacting security posture.

What is llms.txt and how does it help with GEO?

An llms.txt file is a plain text file hosted at yourdomain.com/llms.txt that provides AI scrapers with a curated list of your most important public content. It acts as a table of contents for your expertise, helping ChatGPT, Perplexity, and other generative AI tools find and prioritize your authoritative content. It contains no proprietary code or sensitive data — only public URLs and brief descriptions.

How does JSON-LD schema help with Google AI Overviews?

JSON-LD schema provides machine-readable context for your content. FAQPage schema directly signals to Google's systems that a section contains structured question-and-answer pairs eligible for AI Overview inclusion. TechArticle and Article schema with author markup strengthens E-E-A-T signals that Google uses to evaluate content quality for AI Overviews. Service and Product schema helps Google understand what your company offers and surfaces it in relevant generative responses.

What is information gain and why does it matter for GEO?

Information gain is the degree to which a piece of content adds new, verifiable knowledge that AI models cannot synthesize from their existing training data. Content that restates widely known information is paraphrased or ignored by LLMs. Content containing proprietary data points, specific case study outcomes, named methodologies, or expert perspectives that differ from the consensus earns direct citations because the AI must attribute unique claims it cannot generate on its own.

How does IndexNow connect to ChatGPT?

Microsoft is the largest investor in OpenAI. ChatGPT's real-time web search runs on Bing's index — when a user asks ChatGPT a question with web search enabled, results are sourced from Bing. A page not in Bing's index is not retrievable by ChatGPT. IndexNow is the fastest path to Bing indexing because it pushes new and updated URLs to search engines at the moment of publish, rather than waiting for a passive crawl. Implementing IndexNow via the IndexNow API or Cloudflare's Crawler Hints feature is therefore the fastest path into ChatGPT's retrieval pipeline.


The Technical SVO Playbook for Marketing & Engineering Teams

May 7, 2026 • 13 minutes to read

Search Visibility Optimization (SVO) is the discipline of building a technical foundation that makes your site discoverable, readable, and citable across Google's ranking algorithm, Google AI Overviews, and generative AI models like ChatGPT and Perplexity — simultaneously.

All three engines depend on the same underlying infrastructure. Build it correctly and rankings, AI Overview appearances, and LLM citations compound together. Leave a gap and every engine that depends on it is affected.

This guide is built for marketing leaders and their engineering teams to use together. Every item includes the technical reason it belongs on the list. For the strategic layer — positioning, messaging, and building content that earns citations — see The Ultimate Guide to SVO.

Why the technical foundation matters across all three engines

According to AI Rank Lab, 40–60% of US informational searches now trigger a Google AI Overview. Generative models like ChatGPT and Perplexity also crawl the web independently through their own bots — and most enterprise web firewalls are blocking them without anyone realizing it. Every item below has a direct, traceable effect on whether a specific engine can find, read, and cite your content.

Phase 1: The access layer

Before any engine can surface your content, it needs to be able to reach it. This section covers the foundational prerequisites first, then the SVO-specific additions.

Foundation: prerequisites every site needs

These are table stakes for any crawlable site. If they are not already in place, nothing else in this guide will have full effect.

XML sitemap

Without a sitemap, crawlers discover pages through link traversal only. Pages with few internal links — new content, deep content, campaign landing pages — may never be found organically, or may take weeks longer to index than pages a crawler arrives at via links.

Submit your sitemap to both Google Search Console and Bing Webmaster Tools. Bing's index feeds ChatGPT's real-time web search — a page not in Bing's index cannot be retrieved by ChatGPT regardless of how well it is optimized.
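As an illustration, a sitemap can be generated from a list of URLs with a few lines of Python. This is a minimal sketch rather than a production generator: the function name `build_sitemap`, the sample URL, and the choice to emit only `loc` and `lastmod` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML for (url, lastmod) pairs.

    lastmod is expected as an ISO date string such as "2026-05-07".
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

Calling `build_sitemap([("https://yourdomain.com/guide/", "2026-05-07")])` returns a string that can be written to `sitemap.xml` and submitted to both consoles.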

robots.txt

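robots.txt is a voluntary directive: well-behaved crawlers fetch it first and honor it before requesting any other URL. A Disallow rule that catches AI crawler user agents will therefore keep compliant crawlers out regardless of what the firewall permits, and a firewall that blocks those crawlers makes a permissive robots.txt irrelevant. The two configurations must be consistent.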

Audit your robots.txt for wildcard Disallow rules. A single Disallow: / under User-agent: * blocks every compliant crawler that has no more specific group of its own, including Googlebot, GPTBot, and PerplexityBot.
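The audit can be partly automated with Python's standard `urllib.robotparser` module. The sketch below is one possible approach, not a complete audit: the helper name `blocked_crawlers` and the short crawler list are assumptions for illustration, and the parser only models crawlers that follow the robots.txt convention.

```python
from urllib.robotparser import RobotFileParser

# Illustrative subset of crawlers named in this guide.
CRAWLERS = ["Googlebot", "GPTBot", "OAI-SearchBot", "PerplexityBot", "CCBot"]


def blocked_crawlers(robots_txt, url, user_agents=CRAWLERS):
    """Return the user agents that robots_txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [ua for ua in user_agents if not parser.can_fetch(ua, url)]
```

Passing the text `User-agent: *` followed by `Disallow: /` returns every crawler in the list. Adding a `User-agent: GPTBot` group with `Allow: /` removes GPTBot from the result, because a crawler-specific group takes precedence over the wildcard.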

Canonical tags

Without canonical tags on duplicate or near-duplicate URLs, crawl budget is split across all variants. The intended page receives a fraction of the crawl attention it would otherwise get, and crawlers may index the wrong version.

Canonical tags are especially critical on enterprise sites with faceted navigation, parameter-based filtering, campaign UTM variants, and translated or region-specific content. Every URL that should not be independently indexed needs a canonical pointing to the authoritative version.
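For campaign UTM variants, the canonical target can often be derived by stripping tracking parameters. The following is a sketch under stated assumptions: which parameters count as tracking (`utm_*`, `gclid`, `fbclid`) is an assumption, the function names are hypothetical, and faceted or filter parameters need site-specific rules that this example does not cover.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed set of click-tracking parameters; adjust for your own campaigns.
TRACKING_KEYS = {"gclid", "fbclid"}


def is_tracking_param(key):
    """True for utm_* parameters and the click IDs listed above."""
    key = key.lower()
    return key.startswith("utm_") or key in TRACKING_KEYS


def canonical_url(url):
    """Drop tracking parameters and the fragment, lowercase the host."""
    parts = urlsplit(url)
    kept = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not is_tracking_param(k)
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


def canonical_link_tag(url):
    """Render the <link rel="canonical"> tag for a requested URL."""
    return f'<link rel="canonical" href="{escape(canonical_url(url))}">'
```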

Hreflang (international and multi-region sites only)

Without hreflang, Google's international serving infrastructure selects language and region variants by content inference. This results in wrong-language pages appearing in wrong-market SERPs — and AI models trained on those signals may retrieve and cite the wrong content variant for a given audience.

Hreflang is only relevant if you serve multiple regions or languages. If you do, incorrect or missing hreflang is one of the most common sources of international ranking failures.
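A common source of hreflang errors is a variant page that omits the return links to its siblings. One way to avoid that is to render the same full set of tags on every variant from a single mapping. The sketch below assumes a hypothetical `hreflang_tags` helper and example URLs; the language codes and the `x-default` fallback choice are illustrative.

```python
from html import escape


def hreflang_tags(variants, default_code):
    """Return <link rel="alternate"> tags for every language/region variant.

    variants maps an hreflang code (for example "en-us") to that variant's URL.
    The same list should be emitted on every variant page so links are reciprocal.
    """
    tags = [
        f'<link rel="alternate" hreflang="{escape(code)}" href="{escape(url)}">'
        for code, url in sorted(variants.items())
    ]
    # x-default tells search engines which page to serve when nothing matches.
    fallback = escape(variants[default_code])
    tags.append(f'<link rel="alternate" hreflang="x-default" href="{fallback}">')
    return tags
```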

Redirect hygiene

Each redirect hop is a separate HTTP request and consumes crawl budget. A chain of three redirects means four requests to reach one page: three redirect responses plus the final page. PageRank and crawl priority dilute at each step, and long redirect chains are a common cause of indexing delays for new and updated content.

Audit for redirect chains (A → B → C) and redirect loops. Direct all internal links to the final destination URL. Chains commonly accumulate after site migrations and rebrands.
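If the redirect rules can be exported as a source-to-target mapping, chains and loops can be found without crawling. This is a minimal sketch assuming such an export exists; the function names and the `max_hops` safety limit are hypothetical.

```python
def trace_redirects(start, redirects, max_hops=10):
    """Follow start through the redirects map.

    Returns (path, is_loop), where path lists every URL visited in order.
    """
    path = [start]
    seen = {start}
    current = start
    while current in redirects and len(path) <= max_hops:
        current = redirects[current]
        path.append(current)
        if current in seen:
            return path, True  # revisited a URL: this is a loop
        seen.add(current)
    return path, False


def audit_redirects(redirects):
    """Report every source URL that loops or needs more than one hop."""
    report = {}
    for source in redirects:
        path, is_loop = trace_redirects(source, redirects)
        hops = len(path) - 1
        if is_loop or hops > 1:
            report[source] = {"path": path, "hops": hops, "loop": is_loop}
    return report
```

For a map such as `{"/a": "/b", "/b": "/c"}`, the report flags `/a` with two hops and the path `/a`, `/b`, `/c`, which tells you to point `/a` and any internal links directly at `/c`.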

Diagnose before you proceed: three GSC checks

Run these before implementing anything below. You do not need an SEO background to do this — you are looking at traffic lights: green is fine, red needs attention.

1. Check which pages Google has skipped

In Google Search Console: go to Indexing → Pages in the left sidebar. You will see a list of reasons why pages were not indexed.

Look for two categories: "Discovered — currently not indexed" and "Crawled — currently not indexed."

A handful of pages in either category is normal — new pages always take time. If you see dozens or hundreds, that is a signal that Google is deliberately skipping pages, usually because it considers them low-quality, near-duplicate, or not worth the crawl time. IndexNow and the other steps below will not fix that. It requires a different conversation about content consolidation.

2. Check whether your pages look the same to Google as they do to you

In Google Search Console: paste any high-priority page URL into the search bar at the top and hit enter. On the results screen, click "View Crawled Page."

You will see two tabs: Screenshot and HTML.

The Screenshot shows what the page looks like visually to Google.
The HTML tab shows the underlying code Google actually received.

What you are looking for: does the HTML tab contain your actual page content — headings, body copy, product descriptions? Or is it mostly empty, with just a few lines of code and no readable text?

If the screenshot looks like your full page but the HTML tab is nearly empty, your page is built in a way where the content only appears after a visitor's browser runs additional code. Google often cannot do that on its first crawl. The content it is trying to rank is invisible to it. This is an engineering fix — flag it with the path to that page and share this guide.

If both tabs show your content, you are fine.
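Engineering teams can run a rough version of this check outside GSC by counting the readable words in the raw HTML a server returns before any JavaScript runs. The sketch below uses only Python's standard library. The 50-word threshold and the function names are assumptions, and it does not replace the GSC view of what Google actually received.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect words that appear outside script, style, and similar tags."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words.extend(data.split())


def visible_word_count(html):
    """Count readable words in raw, unrendered HTML."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return len(parser.words)


def looks_client_rendered(html, min_words=50):
    """Heuristic: very little readable text suggests content is injected by JS."""
    return visible_word_count(html) < min_words
```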

3. Check Core Web Vitals (CWV)

In Google Search Console: go to Experience → Core Web Vitals in the left sidebar. You are looking for pages marked in red ("Poor") or orange ("Needs Improvement").

Core Web Vitals measure how fast and stable your pages feel to a real user. Chronic failures here do not directly block crawlers, but they often indicate underlying infrastructure problems — slow servers, oversized images, bloated code — that affect everything including how often Google decides to crawl the site.

Bot Deny Listing — The Silent Killer

Your site's web firewall is configured to protect against malicious bots by blocking unrecognized traffic. AI crawlers are unrecognized by default: they send the same type of request as any other bot, identified only by their user agent string. Most enterprise web firewalls block them silently. The crawlers are challenged or refused at the edge, the requests never reach your origin, and nothing in your server logs or analytics indicates it is happening.

This is the most commonly missed issue on enterprise sites. The goal for your engineering team is straightforward: ensure known AI crawlers are explicitly allowed before any blocking rules apply to them.

The broadest and most future-proof approach is to allow all verified good-bot traffic by policy — rather than maintaining a manual list that grows stale as new AI models launch. At minimum, confirm these are explicitly permitted:

User Agent	Company	What It Powers
`GPTBot`	OpenAI	Crawling for OpenAI model training
`OAI-SearchBot`	OpenAI	ChatGPT real-time search
`anthropic-ai`	Anthropic	Claude
`PerplexityBot`	Perplexity	Perplexity AI
`CCBot`	Common Crawl	Training data source for many LLMs
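`Google-Extended`	Google	Gemini AI training (a robots.txt control token; the crawling itself uses Google's existing user agents)
`Applebot-Extended`	Apple	Apple Intelligence (a robots.txt control token; the crawling itself uses Applebot)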
`DeepSeekBot`	DeepSeek	DeepSeek AI
`MoonshotBot`	Moonshot AI	Kimi AI assistant
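The rule ordering described above, allow known crawlers first and challenge everything else afterwards, can be modeled in a few lines. This is an illustrative sketch only, not the configuration syntax of any WAF vendor: the function names are hypothetical, the tokens are copied from the table, and user agent strings can be spoofed, so a real deployment would normally rely on the vendor's verified-bot or IP verification features as well.

```python
# Tokens copied from the table above.
ALLOWED_AI_CRAWLERS = (
    "GPTBot",
    "OAI-SearchBot",
    "anthropic-ai",
    "PerplexityBot",
    "CCBot",
    "Google-Extended",
    "Applebot-Extended",
    "DeepSeekBot",
    "MoonshotBot",
)


def matched_crawler(user_agent):
    """Return the allowlisted token found in the user agent, or None."""
    ua = user_agent.lower()
    for token in ALLOWED_AI_CRAWLERS:
        if token.lower() in ua:
            return token
    return None


def firewall_action(user_agent):
    """Evaluate the allow rule before the generic bot challenge."""
    if matched_crawler(user_agent):
        return "allow"
    return "challenge"
```

The same functions can be run over exported edge logs to list which allowlisted crawlers are currently being challenged.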

IndexNow

Standard crawl-and-index is passive — search engines check for new content on their own schedule. IndexNow is a push protocol that notifies search engines at the moment of publish, eliminating the lag between when content exists and when it can be indexed.

Microsoft is the largest investor in OpenAI. ChatGPT's real-time web search runs on Bing's index — when a user asks ChatGPT a question with web search enabled, results are sourced from Bing. A page not in Bing's index is not retrievable by ChatGPT. IndexNow is the fastest path to Bing indexing, and by extension the fastest path into ChatGPT's retrieval pipeline.

Implementation varies by stack. The IndexNow API documentation covers the protocol in full. If your site runs behind Cloudflare, a native Crawler Hints integration may be available — check Cloudflare's documentation for the current configuration option. When enabled, it handles the push automatically whenever a cached URL is updated.
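For teams integrating directly, an IndexNow submission is a single JSON POST. The sketch below assumes the shared `api.indexnow.org` endpoint and a key file hosted at the site root as `<key>.txt`; the host, key, and function names are placeholders, and the protocol documentation remains the authority on current requirements.

```python
import json
import urllib.request
from urllib.parse import urlsplit

INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"


def build_payload(host, key, urls):
    """Build the JSON body for a bulk IndexNow submission."""
    foreign = [u for u in urls if urlsplit(u).netloc != host]
    if foreign:
        raise ValueError(f"URLs outside {host}: {foreign}")
    return {
        "host": host,
        "key": key,
        # Assumes the key file is served from the site root.
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": list(urls),
    }


def submit(payload, endpoint=INDEXNOW_ENDPOINT):
    """POST the payload and return the HTTP status code."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

A publish hook in the CMS would call `submit(build_payload(host, key, [new_url]))` whenever a page goes live or is updated.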

Phase 2: The understanding layer

Accessing a page and understanding it are two different problems. This phase covers the structured data layer that tells every engine what type of entity a page represents, who created it, and what it is about.

JSON-LD schema

AI models are increasingly capable of reading and interpreting web pages on their own. Capable, however, is not the same as certain. Without structured data, a model reading a case study page might identify a client's logo as your company logo, misattribute authorship, or misclassify what the page is entirely. Structured data removes that ambiguity — it provides a deterministic source of truth that engines use instead of inference. Every schema type below tells a specific engine something it cannot reliably determine on its own.

JSON-LD is the structured data format Google recommends, and the one this guide assumes throughout. It lives in a <script type="application/ld+json"> tag, usually in the document <head>, and requires no changes to the visual design of the page.

Most enterprise sites have a basic Organization block. The real work is templating schema at the content type level so every page automatically declares what it is.

Organization — the entity foundation

Every site needs this. The sameAs attribute is particularly important for AI models: linking to your LinkedIn company page, G2 profile, and Crunchbase entry creates entity graph connections that AI models use to disambiguate your brand from similarly named organizations when generating responses.

Here is what a complete Organization block looks like, using MKG Marketing as an example:

{
  "@context": "https://schema.org",
  "@type": "Organization",
  "@id": "https://mkgmarketinginc.com/#organization",
  "name": "MKG Marketing, Inc.",
  "url": "https://mkgmarketinginc.com/",
  "logo": {
    "@type": "ImageObject",
    "url": "https://mkgmarketinginc.com/images/mkg-logo.png"
  },
  "description": "B2B digital marketing agency specializing in Search Visibility Optimization across SEO, AEO, and GEO.",
  "sameAs": [
    "https://www.linkedin.com/company/mkg-marketing",
    "https://twitter.com/mkgmarketinginc",
    "https://www.facebook.com/mkgmarketinginc",
    "https://www.instagram.com/mkgmarketinginc",
    "https://www.crunchbase.com/organization/mkg-marketing"
  ]
}

Schema by page type

TechArticle or Article for blog posts and editorial content. Include author markup — pages with verifiable authorship are treated differently in AI Overview candidate selection than anonymous or organization-attributed content.
Service or Product for solution and product pages. Include description and offers.
FAQPage for any section containing questions and answers. This explicitly marks Q&A pairs as structured, citable content units that help AI systems identify and extract answers from your pages. Note: As of May 2026, Google no longer generates FAQ rich results in search — the markup does not produce a visual enhancement in the SERP, but Google continues to use it for page understanding and AI response generation.
HowTo for process-oriented content. Same logic as FAQPage for step-by-step queries.
Person for team bio pages. Include sameAs pointing to each person's LinkedIn profile.
BreadcrumbList on all page templates. Of the types listed here, it changes the search result display most consistently, replacing the raw URL with a readable path. It also communicates site hierarchy to crawlers, helping them understand the relationship between parent and child pages.

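After implementation, validate every schema type with Google's Rich Results Test. For types that have no rich result, the Schema Markup Validator at validator.schema.org checks the syntax instead. If a validator cannot parse a schema block, assume AI Overview systems cannot use it either.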
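Templating schema at the content type level usually means generating the JSON-LD from data the CMS already holds. As a sketch of that idea for FAQPage, the helpers below turn question-and-answer pairs into a script tag. The function names are hypothetical, and the output should still be run through a validator.

```python
import json


def faq_jsonld(pairs):
    """Build a schema.org FAQPage object from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


def script_tag(data):
    """Serialize data as a JSON-LD script tag.

    "</" is written as "<\\/" (a valid JSON escape) so that answer text
    can never close the script element early.
    """
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'
```

For example, `script_tag(faq_jsonld([("What is llms.txt and how does it help with GEO?", "An llms.txt file is a plain text file hosted at yourdomain.com/llms.txt.")]))` produces a block ready to place in the page template.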

Phase 3: The prioritization layer

Phases 1 and 2 ensure engines can find and understand the site. Phase 3 covers what you surface to them once they arrive.

llms.txt

llms.txt is a static text file served at yourdomain.com/llms.txt with Content-Type: text/plain. It needs no server-side code, no database dependency, and no dynamic generation. It is comparable to robots.txt in implementation complexity: a read hint for AI scrapers, not executable code.

llms.txt provides AI scrapers with a curated list of your most important public content. Without it, scrapers must infer priority from link structure and crawl frequency. With it, you control which pages they reach first.

Structure your llms.txt like this:

# Your Company Name

> One sentence describing what you do and who you serve.

## Documentation
- [Product Docs Title](https://yourdomain.com/docs/): Brief description.

## Resources
- [Guide Title](https://yourdomain.com/resources/guide/): What a reader learns from this.

## Case Studies
- [Client Result Title](https://yourdomain.com/case-studies/client/): The core outcome demonstrated.

Include your most authoritative technical content, original research, methodology pages, and case studies. Leave out generic pages that restate widely known information — AI models already have that content and will not prioritize citing it.
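Although the served file is static, it can be produced at build time from a curated list so it stays in sync with the site. The sketch below renders the structure shown above. The function name and the input shape (a dictionary of section headings to link tuples) are assumptions for illustration.

```python
def render_llms_txt(name, summary, sections):
    """Render llms.txt content.

    sections maps a heading to a list of (title, url, description) tuples.
    """
    lines = [f"# {name}", "", f"> {summary}"]
    for heading, links in sections.items():
        lines += ["", f"## {heading}"]
        lines += [f"- [{title}]({url}): {desc}" for title, url, desc in links]
    return "\n".join(lines) + "\n"
```

The returned string is written to `llms.txt` at the web root during the build or deploy step.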

Answer-first content structure

Google AI Overviews extract answer content from a specific structural pattern: a direct, standalone response to the query in the first 40–60 words of a section, under a heading that matches the question format. Pages that bury the answer in paragraph three are not extracted regardless of their ranking position.

For every major solution page and pillar content piece:

H1 answers or closely mirrors the primary query the page targets
First paragraph after each H2 provides a complete, standalone answer — not a teaser
FAQ sections use real buyer questions
Definition pages lead with "Term is precise definition" in the first sentence
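A rough editorial check for the second item can be scripted: measure the opening paragraph of each section and flag those outside the 40–60 word range. This is only a heuristic sketch with hypothetical function names. It measures length, not whether the paragraph actually answers the question, so flagged sections still need a human read.

```python
def opening_word_count(section_body):
    """Count words in the first non-empty paragraph of a section body."""
    for paragraph in section_body.split("\n\n"):
        if paragraph.strip():
            return len(paragraph.split())
    return 0


def opens_with_answer(section_body, min_words=40, max_words=60):
    """True when the opening paragraph falls in the target word range."""
    return min_words <= opening_word_count(section_body) <= max_words
```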

Phase 4: The differentiation layer

This phase is for the marketing leader, not engineering. No implementation required — this is about what the content itself contains.

The technical infrastructure in Phases 1–3 determines whether engines can find and read your content. Phase 4 determines whether they cite you or a competitor.

Information gain

Every piece of content falls into one of two categories as far as an LLM is concerned.

Repackaged knowledge summarizes what is already in the model's training data. The model will paraphrase it or ignore it — citing a more authoritative source achieves the same result for the AI without attributing anything to you.

Information gain adds something the model cannot synthesize from existing data: a proprietary data point, a specific client outcome, a named methodology, a contrarian position backed by evidence. This content earns citations because the AI must attribute claims it cannot generate on its own.

The information gain audit

Before technical optimization, bring subject matter experts (SMEs) to a content review with one question: "If an AI researcher asked Claude to summarize everything publicly known about this topic, does our page add anything new to that summary?"

If not, add at least one of the following before the page goes into the technical pipeline:

A data point from your own client work: "Based on our analysis of X companies over Y period..."
A specific, named case study outcome — mechanism and result, not category and vague improvement
A proprietary framework or methodology with a name that belongs to your company
A direct SME perspective that differs from the prevailing consensus

Complete implementation checklist

Work top to bottom. Each layer depends on the one before it.

Foundation

XML sitemap exists, is current, and is submitted to Google Search Console and Bing Webmaster Tools
robots.txt audited — no wildcard Disallow rules that catch AI crawler user agents
Canonical tags implemented on all duplicate, near-duplicate, and parameter-based URLs
Hreflang implemented correctly for all region/language variants (international sites only)
Redirect audit complete — no chains longer than one hop, no loops

Access

GSC Pages report (Indexing → Pages) reviewed: no unexplained spikes in pages discovered or crawled but not indexed
GSC URL Inspection run on key pages — View Crawled Page → HTML tab contains actual page content (not empty)
Web firewall allowlist includes: GPTBot, OAI-SearchBot, anthropic-ai, PerplexityBot, CCBot, Google-Extended, Applebot-Extended, DeepSeekBot, MoonshotBot
IndexNow active and firing on publish — via API integration or Cloudflare Crawler Hints (faster Bing indexing = faster ChatGPT retrieval)

Understanding

Organization schema includes sameAs links to LinkedIn, G2, Crunchbase
Blog posts and articles use TechArticle or Article schema with author markup
FAQ sections use FAQPage schema
Solution and product pages use Service or Product schema
Team bio pages use Person schema with sameAs LinkedIn URLs
All page templates include BreadcrumbList schema
All schema types validated via Google's Rich Results Test

Prioritization

llms.txt exists and is publicly accessible at yourdomain.com/llms.txt
llms.txt lists 10–15 most authoritative public pages with descriptions
Key solution pages lead with a direct answer in the first 40–60 words of each section

Differentiation

Each pillar page contains at least one proprietary data point, named methodology, or specific case study outcome
SMEs have reviewed high-priority pages for information gain before technical optimization

How to know if it's working

SEO (Google Search Console): Impressions and clicks for target queries, Rich Results status confirming valid enhanced markup, Coverage health showing no new excluded page spikes.

AEO: AI Overview appearances in GSC Performance data under "Search type: AI Overviews." Featured snippet positions for question-format queries. An increase in impressions without a proportional increase in clicks often indicates AI Overviews are surfacing your content.

GEO: Run target queries in ChatGPT, Perplexity, Claude, and Gemini monthly and document whether you are cited. Third-party monitoring tools — Profound, Goodie AI, and AI Rank Lab — provide automated tracking. There is no equivalent of GSC for LLM citations yet.

If Phases 1–3 are implemented correctly: GSC Rich Results validation improves within weeks, AEO impressions increase within 60–90 days, GEO citation frequency becomes measurable within a quarter. Phase 4 — information gain — compounds over time as proprietary content gets ingested and attributed across training and retrieval datasets.

For the strategic SVO framework across all three pillars, see The Ultimate Guide to Search Visibility Optimization (SVO). For AEO-specific content and schema strategy, see What is AEO — A Complete Guide.