robots.txt is a text file in your site’s root directory that tells crawlers which pages they can access. One wrong line can make your entire site vanish from AI search — without you even knowing.
Why robots.txt Is GEO’s First Checkpoint
You can spend three months optimizing content, crafting Answer Blocks, and building semantic field coverage — but if robots.txt blocks AI crawlers, all that work is wasted. This is a three-minute check, but getting it wrong means game over.
AI Crawlers You Need to Allow
Search/Retrieval Crawlers (RAG Channel Entry — Must Allow)
| User-Agent | Product | Purpose |
|---|---|---|
| OAI-SearchBot | ChatGPT web search | ChatGPT’s real-time search citations |
| ClaudeBot | Claude | Claude search citations |
| PerplexityBot | Perplexity | Perplexity AI search retrieval |
| Googlebot | Google (incl. AI Overviews) | Google Search and AI Overviews |
Training Crawlers (Decide Based on Your Strategy)
| User-Agent | Company | Purpose | Consideration |
|---|---|---|---|
| GPTBot | OpenAI | Training data collection | Allow = chance to enter parametric memory; Block = protect IP |
| Google-Extended | Google | Gemini training data | Same trade-off |
| CCBot | Common Crawl | Open training datasets | Same trade-off |
Key distinction: OAI-SearchBot (retrieval) and GPTBot (training) are two different OpenAI crawlers. Most businesses want to be cited by AI but don’t want content used for training — configure them separately.
Chinese AI Crawlers (Important for Global Sites)
If your site targets Chinese-speaking audiences or operates in the Chinese market:
| User-Agent | Product | Purpose |
|---|---|---|
| Baiduspider | Baidu AI Search | China’s largest AI search — critical for Chinese market GEO |
| Bytespider | ByteDance (Doubao) | ByteDance’s AI products data collection |
| DeepSeekBot | DeepSeek | DeepSeek AI retrieval |
For global sites with a Chinese audience, blocking Baiduspider means giving up visibility in Baidu's AI search, the largest AI search channel in the Chinese market.
Common Fatal Misconfigurations
Mistake 1: Blanket blocking all crawlers
User-agent: *
Disallow: /
Blocks all search engines AND all AI crawlers. Your site is invisible to everyone.
Mistake 2: Security plugins silently blocking AI crawlers
WordPress security plugins (Wordfence, iThemes Security, etc.) may auto-add blocking rules. You might not know GPTBot or ClaudeBot has been blocked — regularly check your actual robots.txt content.
Mistake 3: Only allowing Googlebot
User-agent: Googlebot
Allow: /
User-agent: *
Disallow: /
Google can crawl you, but ChatGPT, Claude, Perplexity, and all other AI crawlers are blocked.
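To see the effect of Mistake 3 without waiting for real crawlers, the rules can be fed to a robots.txt parser. The following is a minimal sketch using Python's standard-library `urllib.robotparser`; the helper name `allowed_agents` is made up for illustration. Python's parser applies the first matching rule rather than the longest-match rule that RFC 9309 specifies, which should make no difference for whole-site rules like these, and real crawlers may still interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The "Mistake 3" rules from above, reproduced verbatim.
MISTAKE_3 = """\
User-agent: Googlebot
Allow: /
User-agent: *
Disallow: /
"""


def allowed_agents(robots_text, agents, path="/"):
    """Return {agent: bool} saying whether each agent may fetch `path`.

    Hypothetical helper for illustration; not part of any crawler's tooling.
    """
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in agents}


result = allowed_agents(
    MISTAKE_3,
    ["Googlebot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot"],
)
for agent, ok in result.items():
    print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

Under this parser, Googlebot is reported as allowed and the other three agents as blocked, because they fall through to the `User-agent: *` group.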
Recommended Configuration
# Retrieval crawlers — must allow
User-agent: OAI-SearchBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
# Chinese AI crawlers (if you have a Chinese audience)
User-agent: Baiduspider
Allow: /
# Training crawlers — decide per your IP strategy
# To allow training (benefits parametric memory):
User-agent: GPTBot
Allow: /
# To block training (protects intellectual property):
# User-agent: GPTBot
# Disallow: /
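The "allow retrieval, block training" variant can be sanity-checked before deploying it. This sketch again assumes Python's standard-library `urllib.robotparser` as a stand-in for how a compliant crawler reads the file; the path `/pricing` is a hypothetical example page, and the rule text is the configuration above with the GPTBot block uncommented.

```python
from urllib.robotparser import RobotFileParser

# Retrieval crawler allowed, training crawler blocked.
SPLIT_POLICY = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(SPLIT_POLICY.splitlines())

# "/pricing" is a made-up path used only for this check.
search_ok = parser.can_fetch("OAI-SearchBot", "/pricing")
training_ok = parser.can_fetch("GPTBot", "/pricing")
# Agents with no matching group and no "User-agent: *" group are unrestricted.
other_ok = parser.can_fetch("PerplexityBot", "/pricing")

print(f"OAI-SearchBot allowed: {search_ok}")
print(f"GPTBot allowed: {training_ok}")
print(f"PerplexityBot allowed: {other_ok}")
```

Because there is no `User-agent: *` group in this file, any crawler that is not named explicitly stays unrestricted, which is why only GPTBot ends up blocked.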
How to Check
- Visit `https://yourdomain.com/robots.txt` in a browser
- Search for OAI-SearchBot, GPTBot, ClaudeBot, PerplexityBot, Baiduspider
- Check if `User-agent: *` has broad Disallow rules
- If you find blocking rules, fix them immediately. AI crawlers read the updated rules on their next visit, typically within days
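The manual check above can also be scripted. The sketch below is one possible approach using only the Python standard library: `audit_robots` and `fetch_robots` are hypothetical helper names, the agent list is the one from the checklist, and the result reflects how Python's parser reads the rules, which may not match every crawler exactly.

```python
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

# The user agents named in the checklist above.
AI_AGENTS = ("OAI-SearchBot", "GPTBot", "ClaudeBot", "PerplexityBot", "Baiduspider")


def audit_robots(robots_text, agents=AI_AGENTS, path="/"):
    """Return {agent: bool} saying whether each agent may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in agents}


def fetch_robots(domain):
    """Download https://<domain>/robots.txt; raises on network or HTTP errors."""
    with urlopen(f"https://{domain}/robots.txt", timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


# Offline demonstration with the blanket-block rules from Mistake 1.
blanket_block = "User-agent: *\nDisallow: /\n"
for agent, ok in audit_robots(blanket_block).items():
    print(f"{agent}: {'allowed' if ok else 'BLOCKED'}")

# Against a live site (requires network access):
#   report = audit_robots(fetch_robots("yourdomain.com"))
```

Keeping the parsing step separate from the download makes it possible to test a draft robots.txt locally before publishing it.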
Server Log Verification
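After modifying robots.txt, verify with server logs (the command assumes the combined log format, where the ninth whitespace-separated field is the HTTP status code):
grep -E 'GPTBot|ClaudeBot|PerplexityBot|Baiduspider|OAI-SearchBot' access.log | awk '{print $9}' | sort | uniq -c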
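Requests from these crawlers returning 200 confirm they can reach your pages. Note that robots.txt itself never produces a 403: a compliant crawler that is disallowed simply stops requesting pages, so persistent 403 responses point to a firewall, CDN, or security plugin blocking the bot at the server level.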
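For a per-crawler breakdown rather than one combined count, the same log check can be sketched in Python. This assumes the combined log format (request, status, bytes, referer, user agent); the helper name `status_by_bot` and the sample log lines, including their IP addresses and user-agent strings, are made up for illustration.

```python
import re
from collections import Counter, defaultdict

BOTS = ("OAI-SearchBot", "GPTBot", "ClaudeBot", "PerplexityBot", "Baiduspider")

# Combined log format tail: "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(r'"[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"')


def status_by_bot(lines):
    """Count HTTP status codes per AI crawler found in the user-agent field."""
    counts = defaultdict(Counter)
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # Skip lines that are not in the assumed format.
        for bot in BOTS:
            if bot in match.group("ua"):
                counts[bot][match.group("status")] += 1
                break
    return counts


# Fabricated sample lines, for demonstration only.
SAMPLE = [
    '203.0.113.7 - - [01/Jan/2025:00:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.8 - - [01/Jan/2025:00:00:02 +0000] "GET /guide HTTP/1.1" 403 199 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '203.0.113.9 - - [01/Jan/2025:00:00:03 +0000] "GET / HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

for bot, statuses in status_by_bot(SAMPLE).items():
    print(bot, dict(statuses))
```

To run it on a real log, pass an open file instead of the sample list, for example `status_by_bot(open("access.log", encoding="utf-8", errors="replace"))`.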
What This Means for GEO
robots.txt is covered in Get AI to Speak for You: The Definitive Guide to GEO, Chapter 4, Section 4.5. It’s the first gate of “Crawlability” in Formula 3 (Latent Authority ≈ Entity Salience × (Crawlability + Extractability)). Wrong robots.txt = zero crawlability = everything built on top collapses.
Further Reading
- Get AI to Speak for You: The Definitive Guide to GEO, Chapter 4, Sections 4.5 and 4.6
- Free GEOBOK tool: AI Crawlability Detection
