What Is Robots.txt for AI Crawlers?
The robots.txt format has barely changed since 1994. What has changed is who reads it. The same text file that has always told Googlebot what to crawl now also controls whether OpenAI, Anthropic, Perplexity, and dozens of other AI companies can access your content for model training, AI-generated responses, and citation databases. Understanding how robots.txt applies to AI crawlers is no longer optional for site owners who care about where their content ends up and who gets to use it. The complexity is real: there are now more than 15 documented AI user-agent strings, each from a different company with different policies, different purposes, and different levels of transparency about what they do with what they crawl. According to Google’s crawlers documentation, Google alone operates multiple crawlers with distinct user-agents, and Google-Extended is a separate user-agent token from Googlebot with a separate purpose. This post is part of the full guide on AI for technical SEO.
What Is Robots.txt for AI Crawlers: The Syntax That Works
Direct answer: robots.txt for AI crawlers means writing a User-agent group for each AI crawler’s documented user-agent token, followed by Disallow or Allow rules for the paths you want to control. The wildcard User-agent: * is not a reliable AI policy on its own: it cannot distinguish training crawlers from search crawlers, it stops applying to any crawler that has its own group in the file, and some AI crawlers may not apply wildcard rules the same way traditional search bots do.
The basic syntax for AI crawler control:
# Block OpenAI's GPTBot from the entire site
User-agent: GPTBot
Disallow: /
# Block Google's AI training crawler but allow regular Googlebot
User-agent: Google-Extended
Disallow: /
# Allow Perplexity to crawl but not your private data directory
User-agent: PerplexityBot
Disallow: /internal/
Disallow: /members/
# Block Anthropic's training crawler
User-agent: ClaudeBot
Disallow: /
# Block Common Crawl (primary source for many open AI training datasets)
User-agent: CCBot
Disallow: /
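Rules like these can be sanity-checked offline before they are deployed. The sketch below uses Python’s standard-library urllib.robotparser; the paths being tested are hypothetical, and the result only shows how this one parser reads the file, not how any particular AI crawler will behave.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above, with the comments removed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /internal/
Disallow: /members/

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text offline (no network request is made)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Hypothetical paths, chosen only to exercise each rule.
checks = [
    ("GPTBot", "/blog/post"),
    ("PerplexityBot", "/blog/post"),
    ("PerplexityBot", "/members/area"),
    ("Googlebot", "/blog/post"),  # no group names Googlebot, so it is allowed
]
for agent, path in checks:
    verdict = "allowed" if rules.can_fetch(agent, path) else "blocked"
    print(f"{agent:15} {path:16} {verdict}")
```

Because the file has no wildcard group, any crawler that is not named, Googlebot included, is allowed everywhere.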
The complete list of major AI crawler user-agent strings (as of mid-2026):
| Crawler | Company | Purpose | User-Agent String |
|---|---|---|---|
| GPTBot | OpenAI | Model training (OpenAI documents a separate OAI-SearchBot for ChatGPT search) | GPTBot |
| Google-Extended | Google | Gemini AI training and features | Google-Extended |
| ClaudeBot | Anthropic | Claude model training | ClaudeBot |
| PerplexityBot | Perplexity | Perplexity AI search responses | PerplexityBot |
| CCBot | Common Crawl | Open training dataset | CCBot |
| Applebot-Extended | Apple | Apple AI features | Applebot-Extended |
| Amazonbot | Amazon | Alexa AI features | Amazonbot |
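User-agent matching is meant to be case-insensitive: RFC 9309, the robots.txt standard, requires crawlers to compare product tokens without regard to case, so “gptbot” and “GPTBot” should select the same group. Not every parser is guaranteed to follow the standard, though, so use the exact string each company documents. Paths in Allow and Disallow rules are case-sensitive, so /Members/ and /members/ are different rules.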
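Case handling varies by parser, so it is worth testing against whichever one you rely on. As one data point, the sketch below shows how Python’s standard-library parser treats case in agent names versus paths; the /Members/ path is a made-up example.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: GPTBot",
    "Disallow: /Members/",
])

# Agent names: lowercase and uppercase spellings select the same group.
blocked_lower = not parser.can_fetch("gptbot", "/Members/page")
blocked_upper = not parser.can_fetch("GPTBOT", "/Members/page")

# Paths: a differently cased path is a different path, so the rule misses it.
lowercase_path_allowed = parser.can_fetch("GPTBot", "/members/page")

print(blocked_lower, blocked_upper, lowercase_path_allowed)
```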
The Contrarian View: Blocking AI Crawlers Is Not Obviously Right
Most robots.txt guides for AI crawlers frame the question as “how do I block AI crawlers?” That framing assumes blocking is the default correct choice. It is not.
Robots.txt for AI crawlers comes down to a choice between two different types of visibility. Blocking a training crawler like GPTBot keeps your content out of future training data, which may protect your intellectual property but does not prevent ChatGPT from generating content on your topic using data it has already ingested. Blocking an AI search crawler like PerplexityBot prevents your content from being retrieved and cited in AI-generated search responses, which actively reduces the chances of your site being cited as an authoritative source in those responses. Google-Extended sits on the training side of this line: Google documents it as a control over Gemini training and grounding, and states that it does not affect inclusion in Google Search.
The decision matrix:
GOAL: Prevent AI training data use
→ Block: GPTBot, CCBot, ClaudeBot, Google-Extended
→ Allow: PerplexityBot (search retrieval, not training)
GOAL: Maximize AI search visibility and citation
→ Allow: PerplexityBot, Googlebot (AI Overviews draw on the regular Search index), OAI-SearchBot (ChatGPT search)
→ Consider: Block CCBot (pure training, no search benefit)
GOAL: Block all AI access entirely
→ Block all: use individual User-agent blocks for each crawler
→ Caution: this reduces AI citation visibility across all AI search platforms
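Once a goal is chosen, the matching file can be generated rather than hand-typed. This is a minimal sketch, assuming the policy is kept as a mapping from user-agent token to disallowed path prefixes; render_robots_txt and the example policy are illustrative names, not part of any tool.

```python
def render_robots_txt(policy: dict[str, list[str]]) -> str:
    """Render one User-agent group per crawler.

    policy maps a user-agent token to the path prefixes to disallow.
    An empty list yields an empty Disallow line, which permits everything.
    """
    groups = []
    for agent, paths in policy.items():
        lines = [f"User-agent: {agent}"]
        lines += [f"Disallow: {path}" for path in paths] or ["Disallow:"]
        groups.append("\n".join(lines))
    # Blank lines between groups keep the file readable.
    return "\n\n".join(groups) + "\n"


# Example for the "maximize AI search visibility" goal:
# block the training-only crawler, explicitly allow a search crawler.
visibility_policy = {
    "CCBot": ["/"],
    "PerplexityBot": [],
}
print(render_robots_txt(visibility_policy))
```

Keeping the policy as data makes it easy to review in version control and to regenerate the file when the list of crawlers changes.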
For most publishers with informational content, allowing AI search crawlers and selectively blocking training-only crawlers is the better default. For how AI citation visibility connects to search performance, see how to track AI Overview impressions in GSC.
Step-by-Step: Setting Up AI Crawler Control in robots.txt
Step 1: Audit your current robots.txt.
Open yourdomain.com/robots.txt and note what it currently specifies. On most sites the answer for AI crawlers is “nothing specific”: there is only a User-agent: * block with no AI-specific directives, so every AI crawler operates under the general rules, which may not match the site owner’s intent.
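The audit can be scripted. The sketch below takes the robots.txt text as a string (fetching it is left out so the example runs offline) and reports which AI crawlers have no group of their own. The AI_AGENTS list is an assumption taken from the table in this post and will need updating over time.

```python
# Assumed list of tokens to look for; extend it as new crawlers are documented.
AI_AGENTS = [
    "GPTBot", "Google-Extended", "ClaudeBot", "PerplexityBot",
    "CCBot", "Applebot-Extended", "Amazonbot",
]


def declared_agents(robots_txt: str) -> set[str]:
    """Return the lowercase user-agent tokens that have a group in the file."""
    agents = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents


def missing_ai_agents(robots_txt: str) -> list[str]:
    """List AI crawlers that are governed only by the wildcard group."""
    declared = declared_agents(robots_txt)
    return [agent for agent in AI_AGENTS if agent.lower() not in declared]


sample = "User-agent: *\nDisallow: /admin/\n\nUser-agent: GPTBot\nDisallow: /\n"
print(missing_ai_agents(sample))
```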
Step 2: Decide your policy per crawler type. Before writing any directives, answer two questions per crawler category: Do you want to be cited in this AI platform’s search responses? Do you want this company using your content for model training? For how to use AI tools to audit your technical site settings systematically, see how to automate technical SEO audits with AI.
Step 3: Write individual User-agent blocks. Write a separate block for each AI crawler you want to treat differently from the default. Do not rely on wildcards to handle AI crawlers: write explicit blocks.
User-agent: Googlebot
Allow: /
User-agent: Google-Extended
Disallow: /
User-agent: GPTBot
Allow: /blog/
Disallow: /
User-agent: CCBot
Disallow: /
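Note how Allow and Disallow combine in the GPTBot block: under RFC 9309 the longest matching path wins, so Allow: /blog/ overrides Disallow: / for blog URLs regardless of the order in which the two lines appear. Separate User-agent blocks are independent of each other, and a crawler follows only the block that names it.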
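How Allow and Disallow combine inside one block is easier to see in code. The following is a simplified sketch of the longest-match rule from RFC 9309. It handles plain path prefixes only and ignores the * and $ wildcards, so treat it as an illustration rather than a full parser.

```python
def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Apply longest-match precedence to (directive, prefix) pairs.

    The rule with the longest matching prefix wins; on a tie, Allow wins.
    A path that matches no rule is allowed.
    """
    best_len = -1
    allowed = True
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# The GPTBot block from the example above.
gptbot_rules = [("Allow", "/blog/"), ("Disallow", "/")]

print(is_allowed(gptbot_rules, "/blog/some-post"))  # /blog/ is the longer match
print(is_allowed(gptbot_rules, "/pricing"))         # only / matches
```

Some parsers apply the first matching rule in file order instead. Listing Allow: /blog/ before Disallow: /, as the example does, gives the same result under both approaches.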
Step 4: Validate in Google Search Console. Google has retired its standalone robots.txt Tester; the robots.txt report in Search Console now shows which robots.txt files Google found, when they were last fetched, and any parsing warnings or errors. The report does not test third-party AI crawlers, but it confirms that the file is reachable and flags syntax problems that would affect all crawlers.
Step 5: Monitor crawl behavior in server logs. After updating robots.txt, check server access logs for each AI crawler’s user-agent string. Confirming that GPTBot requests stop appearing in logs after a Disallow: / directive verifies that the policy is being honored. This is the only verification method available for non-Google crawlers.
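A log check along these lines can be automated. The sketch below assumes the common “combined” access log format and uses invented sample lines. Google-Extended and Applebot-Extended are left out on the understanding that they are control tokens rather than crawlers that send their own requests, and requests for /robots.txt itself are skipped because compliant crawlers keep fetching that file.

```python
import re
from collections import Counter

# Tokens expected to appear in request user-agent headers (assumed list).
LOGGED_AGENTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "CCBot", "Amazonbot")

# Combined log format: "METHOD path protocol" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'"[A-Z]+ (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def count_ai_hits(log_lines) -> Counter:
    """Count content requests per AI crawler, ignoring robots.txt fetches."""
    hits = Counter()
    for line in log_lines:
        match = LINE_RE.search(line)
        if not match or match["path"] == "/robots.txt":
            continue
        ua = match["ua"].lower()
        for agent in LOGGED_AGENTS:
            if agent.lower() in ua:
                hits[agent] += 1
    return hits


# Invented sample lines for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/May/2026:10:00:00 +0000] "GET /blog/post HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.5 - - [10/May/2026:10:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 310 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [10/May/2026:10:02:00 +0000] "GET /pricing HTTP/1.1" 200 2048 "-" "CCBot/2.0"',
    '192.0.2.44 - - [10/May/2026:10:03:00 +0000] "GET / HTTP/1.1" 200 900 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]
print(count_ai_hits(SAMPLE_LOG))
```

User-agent headers can be spoofed, so a non-zero count after a Disallow rule is a prompt to investigate rather than proof that a named company ignored the file.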
Where AI Crawler Control Fails
Failure 1: Assuming User-agent: * blocks all AI crawlers. The most common misunderstanding about robots.txt for AI crawlers is that the wildcard rule blocks all bots equally. It does not. Under RFC 9309, the wildcard group applies only to crawlers that have no group of their own; as soon as the file names a crawler, that crawler follows its own group and ignores the wildcard rules entirely. Some AI crawlers may also not apply wildcard Disallow rules the same way Googlebot does. Always write explicit User-agent blocks for each AI crawler you want to control. The wildcard is a default fallback, not a comprehensive AI blocker.
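One concrete way the wildcard falls short: a crawler that has its own group stops reading the wildcard group altogether. The sketch below demonstrates this with Python’s standard-library parser; the /private/ and /drafts/ paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: GPTBot
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# CCBot has no group of its own, so it falls back to the wildcard rules.
ccbot_private = parser.can_fetch("CCBot", "/private/report")

# GPTBot has its own group, so the wildcard Disallow no longer applies to it.
gptbot_private = parser.can_fetch("GPTBot", "/private/report")
gptbot_drafts = parser.can_fetch("GPTBot", "/drafts/post")

print(ccbot_private, gptbot_private, gptbot_drafts)
```

If /private/ should stay off limits to GPTBot as well, the Disallow: /private/ line has to be repeated inside the GPTBot group.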
Failure 2: Confusing Google-Extended with Googlebot.
Blocking User-agent: Google-Extended does not affect Googlebot. Your pages continue to be crawled and indexed in standard Google Search. Google-Extended is a control token rather than a separate crawler: Google documents it as governing whether crawled content may be used to train Gemini models and for grounding in Gemini features. Google also states that it does not affect inclusion or ranking in Google Search, so AI Overviews, which are a Search feature, remain governed by Googlebot and the usual Search controls. Site owners who block Google-Extended expecting to disappear from AI Overviews, or who block Googlebot when they only meant to opt out of Gemini training, end up with a policy that does not do what they intended. For how AI Overview impressions in GSC connect to site visibility, see does AI affect Core Web Vitals.
Failure 3: Setting access control in robots.txt and expecting legal protection. Robots.txt is a technical advisory document, not a legal instrument. A crawler that ignores your robots.txt has not necessarily violated any law by doing so. Legal protection for content requires different instruments: Terms of Service, copyright notices, and in some jurisdictions emerging AI-specific legislation. Robots.txt is effective against compliant crawlers and ineffective against non-compliant ones. For sites where content protection is a primary concern, robots.txt is one layer of a multi-layer approach, not a standalone solution.
Failure 4: Forgetting to update robots.txt when site structure changes.
A Disallow rule for /private/ stops protecting content when the private section moves to /members/ after a site redesign. AI crawler directives require the same maintenance as any technical SEO element. After any significant URL restructure, audit your robots.txt against the current site architecture. For how redirect management interacts with crawler access, see how to use AI for redirect management.
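A stale-rule check can be part of that audit. This is a minimal sketch, assuming you can export the site’s current URL paths (for example from a sitemap or crawl); it flags Disallow prefixes that no longer match any live path. It ignores * and $ wildcards, and stale_disallow_rules is a hypothetical helper name.

```python
def stale_disallow_rules(robots_txt: str, live_paths: list[str]) -> list[str]:
    """Return Disallow prefixes that match none of the current site paths."""
    stale = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        key, sep, value = line.partition(":")
        prefix = value.strip()
        if not sep or key.strip().lower() != "disallow":
            continue
        if prefix in ("", "/"):
            continue  # site-wide rules cannot go stale
        if not any(path.startswith(prefix) for path in live_paths):
            if prefix not in stale:
                stale.append(prefix)
    return stale


robots_txt = """\
User-agent: GPTBot
Disallow: /private/
Disallow: /blog/drafts/
"""
# Hypothetical paths after a redesign moved /private/ to /members/.
live_paths = ["/members/account", "/blog/drafts/idea", "/blog/launch-post"]

print(stale_disallow_rules(robots_txt, live_paths))
```

A flagged prefix is not always a mistake, since a rule may protect a section that is not linked or listed anywhere, but each one deserves a second look after a restructure.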
Frequently Asked Questions
Four questions on robots.txt for AI crawlers, answered directly, covering compliance, blocking syntax, Googlebot differences, and whether to block at all:
- Do AI crawlers respect robots.txt?
- How do I block GPTBot from crawling my site?
- What is the difference between blocking AI crawlers and blocking Googlebot?
- Should I block AI crawlers from my site?
Do AI crawlers respect robots.txt?
Most major AI crawlers respect robots.txt directives as a stated policy. Compliance is policy-based, not technically enforced: a crawler can ignore the file. OpenAI (GPTBot), Anthropic (ClaudeBot), Google (Google-Extended), and Perplexity (PerplexityBot) all publish documentation stating that they honor Disallow directives. The major platforms comply because violating robots.txt creates legal and reputational risk for companies with publicly known products. Smaller, undisclosed AI crawlers carry no such accountability. Monitoring server logs is the only way to confirm actual compliance.
How do I block GPTBot from crawling my site?
Add User-agent: GPTBot followed by Disallow: / to your robots.txt file. This blocks the entire site. To block only specific directories, use Disallow: /path/ for each path you want excluded. Apply the same approach for other AI crawlers using their published user-agent strings. Verify the directives are working by checking server access logs for GPTBot requests after the change: if requests continue appearing, the crawler is not honoring the directive.
What is the difference between blocking AI crawlers and blocking Googlebot?
Blocking Googlebot removes pages from Google Search indexing. Blocking AI crawlers prevents content use for AI training or AI-generated responses, while leaving Google Search indexing unaffected. Google-Extended is Google’s separate AI control token: per Google’s documentation, blocking it opts your content out of Gemini training and grounding without touching standard search inclusion or ranking. In practice, robots.txt for AI crawlers means making separate decisions for each crawler type based on the specific platform and purpose, not treating all bots as equivalent.
Should I block AI crawlers from my site?
The right answer depends on your goals. If citation visibility in AI search platforms matters (ChatGPT Search, Perplexity, Google AI Overviews), allowing those platforms’ crawlers is beneficial. If preventing AI training data use is the priority, blocking training-specific crawlers like CCBot and GPTBot is the correct choice. The distinction between training crawlers and search crawlers is the most important frame for this decision. Blocking all AI crawlers equally trades citation visibility for content protection, which is the right trade for some publishers and the wrong one for others.
Audit your robots.txt right now: navigate to yourdomain.com/robots.txt in your browser and look for any User-agent directives that mention AI crawlers. If you see none, your AI crawler access policy is entirely determined by the wildcard rule, which may not match what you intend. Add explicit directives for GPTBot, Google-Extended, and CCBot as a minimum, based on your specific goals. This takes under 10 minutes and is the simplest technical control you have over AI crawler access to your site. If you want help auditing your full crawler control setup and aligning it with your AI visibility strategy, my AI SEO services cover the technical and strategic layer together. That is robots.txt for AI crawlers in practice: a deliberate access policy per crawler type, not a single wildcard that leaves AI access decisions to chance.