Content Signals Policy: robots.txt & llms.txt Guide

AI is crawling your site right now. The question is whether you're telling it what to do — or letting it decide for itself. Here's how robots.txt and llms.txt give you back control.

GPTBot (OpenAI's crawler) has grown from 5% of all AI crawler traffic to 30% in a single year. PerplexityBot saw a 157,490% increase in raw requests over the same period. These aren't edge cases. They're the new normal. And most brands haven't thought seriously about what this means for their content.

That's where Content Signals Policy comes in — and why two files, robots.txt and llms.txt, are becoming essential tools for any content team that wants to stay in control.

What Is a Content Signals Policy?

A Content Signals Policy is your brand's deliberate stance on how AI systems can access, use and cite your content. It answers three questions:

Which AI crawlers are allowed to access your site?
Can they use your content to train their models?
How should they represent your content when answering user queries?

Without a policy, you're not neutral — you're opted in by default. Silence is permission.

robots.txt: The First Line of Defence

robots.txt has been around since 1994. It's a plain text file that sits in your website's root directory and tells crawlers what they can and can't access. Until recently, it was mostly a conversation between your site and Googlebot.

That's changed. Now it's the primary tool for managing a growing list of AI bots — each with different purposes, different compliance records and different implications for your content.

Here's who's crawling and what you can do about it:

Bot	Company	Purpose	User Agent Token
GPTBot	OpenAI	Model training	`GPTBot`
ClaudeBot	Anthropic	Model training	`ClaudeBot`
Google-Extended	Google	Gemini AI training	`Google-Extended`
PerplexityBot	Perplexity	AI search indexing	`PerplexityBot`
Meta-ExternalAgent	Meta	AI model training	`Meta-ExternalAgent`
Applebot-Extended	Apple	Apple Intelligence training	`Applebot-Extended`
CCBot	Common Crawl	Open LLM training datasets	`CCBot`

To block a bot from your entire site, add this to your robots.txt:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
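Before deploying rules like these, one way to sanity-check them is to parse them with Python's standard-library `urllib.robotparser`. The sketch below is illustrative only: `is_allowed` is a hypothetical helper, the example.com URL is a placeholder, and the result reflects how Python's parser reads the rules, which may differ from how an individual crawler interprets them.

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the snippet above.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder URL, used only for illustration.
PAGE = "https://example.com/blog/post"

for bot in ("GPTBot", "ClaudeBot", "PerplexityBot"):
    verdict = "allowed" if is_allowed(ROBOTS_TXT, bot, PAGE) else "blocked"
    print(f"{bot}: {verdict}")
```

The two named bots are reported as blocked, while PerplexityBot, which has no matching group, falls through to allowed.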

To allow everything, you don't need to do anything — bots treat silence as permission. But if you want to be explicit:

User-agent: GPTBot
Allow: /

The Google-Extended exception you need to understand

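Google-Extended doesn't behave like a traditional crawler. It's a control token rather than a separate crawler: it has no HTTP user-agent string of its own, and Google's existing crawlers do the fetching. When you block it, you're not blocking Googlebot. You're telling Google not to use your content to train Gemini. Your site still gets indexed normally, and your AI Overviews visibility is unaffected, because that is governed by Googlebot as part of Search.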

User-agent: Google-Extended
Disallow: /

This single directive keeps your content out of Google's AI training pipeline while leaving your search presence intact. It's one of the most important and underused signals available.
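To see that a rule naming Google-Extended does not match Googlebot, the directive can be run through Python's standard-library `urllib.robotparser`. This is a minimal sketch with a placeholder URL; it shows how the two tokens are matched by Python's parser and says nothing about what Google does internally.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: Google-Extended
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Placeholder URL for illustration.
url = "https://example.com/pricing"

# The rule names Google-Extended only, so Googlebot has no matching group.
google_extended_ok = parser.can_fetch("Google-Extended", url)
googlebot_ok = parser.can_fetch("Googlebot", url)

print("Google-Extended allowed:", google_extended_ok)
print("Googlebot allowed:", googlebot_ok)
```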

A word of caution

robots.txt is a gentleman's agreement: compliance is voluntary, and nothing in the file technically prevents a request. Legitimate AI companies such as OpenAI, Google and Anthropic honour it. Others don't. Research suggests it stops approximately 40–60% of AI bots. For serious enforcement, you'll need server-level rules or a Web Application Firewall.
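As a rough illustration of what a server-level rule can look like, here is a minimal WSGI middleware sketch that returns a 403 to requests whose User-Agent header contains a blocklisted token. Everything named here (`BLOCKED_AGENTS`, `block_ai_bots`, `site`, `status_for`) is hypothetical, and the approach only catches bots that identify themselves honestly. A WAF or reverse-proxy rule would usually be the production choice.

```python
from typing import Callable, Iterable

# Hypothetical blocklist: substrings matched case-insensitively
# against the User-Agent header.
BLOCKED_AGENTS = ("gptbot", "ccbot")


def block_ai_bots(app: Callable) -> Callable:
    """Wrap a WSGI app so that blocklisted user agents receive a 403."""

    def middleware(environ: dict, start_response: Callable) -> Iterable[bytes]:
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in BLOCKED_AGENTS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)

    return middleware


def site(environ: dict, start_response: Callable) -> Iterable[bytes]:
    """Stand-in for the real application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


application = block_ai_bots(site)


def status_for(user_agent: str) -> str:
    """Return the status line the wrapped app sends for a User-Agent."""
    captured = []
    application(
        {"HTTP_USER_AGENT": user_agent},
        lambda status, headers: captured.append(status),
    )
    return captured[0]


print(status_for("Mozilla/5.0 (compatible; GPTBot/1.1)"))
print(status_for("Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"))
```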

llms.txt: The Other Side of the Equation

While robots.txt is about restriction, llms.txt is about opportunity.

Proposed by Jeremy Howard in September 2024, llms.txt is a Markdown file that lives at the root of your site. Its purpose is to help AI models understand your content — quickly, accurately and in context — at the point when a user is actively searching for information.

Think of it as a curated briefing document for AI. Rather than letting a model parse your entire site (which it probably can't fit into its context window anyway), you give it exactly what it needs: a structured summary, links to key pages, and clean Markdown versions of your most important content.
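To make the shape of the file concrete, the sketch below assembles a small llms.txt in the layout the proposal describes: an H1 title, a blockquote summary, then H2 sections containing Markdown link lists. The `build_llms_txt` helper, the "Example Co" site and every URL are hypothetical placeholders, not part of the proposal.

```python
def build_llms_txt(
    title: str,
    summary: str,
    sections: dict[str, list[tuple[str, str, str]]],
) -> str:
    """Render an llms.txt document as Markdown.

    sections maps a heading to (link name, URL, optional note) tuples.
    """
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for name, url, note in links:
            suffix = f": {note}" if note else ""
            lines.append(f"- [{name}]({url}){suffix}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical site content, for illustration only.
llms_txt = build_llms_txt(
    title="Example Co",
    summary="Example Co sells project-planning software for small teams.",
    sections={
        "Guides": [
            ("Getting started", "https://example.com/start.md", "Setup in five steps"),
            ("Pricing", "https://example.com/pricing.md", "Plans and costs"),
        ],
        "Optional": [
            ("Company history", "https://example.com/about.md", ""),
        ],
    },
)
print(llms_txt)
```

The resulting text would be saved as `llms.txt` at the site root, alongside the Markdown versions of the pages it links to.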

Why this matters for GEO

Generative Engine Optimisation — GEO — is how you get cited by AI tools like ChatGPT, Claude and Perplexity, not just ranked by Google. And getting cited requires more than good content. It requires that AI models can actually find, read and understand your content clearly.

A well-structured llms.txt file makes your site dramatically easier for AI to process. It's one of the most practical GEO moves you can make right now.

The Strategic Decision: Block, Allow or Optimise?

There's no universal right answer. Here's how to think about it.

Block AI training crawlers if you're concerned about your content being absorbed into models without attribution or compensation. Your long-form research, original data, proprietary methodology — that's IP worth protecting.

Allow AI search crawlers if you want referral traffic and citations from tools like Perplexity and ChatGPT. There's a real difference between GPTBot (training) and OAI-SearchBot (search indexing). You can block one and allow the other.
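The split between a training crawler and a search crawler can be written down as a small policy and rendered into robots.txt groups. The sketch below assumes a whole-site allow or block per bot; `POLICY` and `render_robots_txt` are hypothetical names, and the check at the end uses Python's standard-library parser with a placeholder URL.

```python
from urllib.robotparser import RobotFileParser

# True = allow the whole site, False = block the whole site.
POLICY = {
    "GPTBot": False,        # training crawler: blocked
    "OAI-SearchBot": True,  # search crawler: allowed
}


def render_robots_txt(policy: dict[str, bool]) -> str:
    """Render one robots.txt group per user-agent token."""
    groups = []
    for agent, allowed in policy.items():
        rule = "Allow: /" if allowed else "Disallow: /"
        groups.append(f"User-agent: {agent}\n{rule}\n")
    return "\n".join(groups)


robots_txt = render_robots_txt(POLICY)
print(robots_txt)

# Confirm the rendered rules express the intended policy.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
for agent in POLICY:
    print(agent, parser.can_fetch(agent, "https://example.com/"))
```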

Optimise with llms.txt if your goal is visibility in AI-generated answers. This is the GEO play. It works alongside robots.txt — one controls access, the other shapes how AI uses what it can access.

The most sophisticated approach combines all three: selective robots.txt directives, a well-structured llms.txt, and clear Markdown versions of your key pages.

What This Means for Your Content Programme

Content Signals Policy isn't a technical task to hand off to a developer and forget. It's a content strategy decision. What you allow, block or surface to AI has direct implications for your brand's visibility in the places where your buyers are increasingly starting their research.

Right now, only around 14% of major domains have any specific AI directives in their robots.txt. That means most of your competitors have no policy at all. Getting this right now is a genuine competitive advantage.
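A quick way to find out whether a given robots.txt carries any AI-specific directives is to scan its User-agent lines for the tokens in the table earlier in this guide. This is a minimal sketch: `AI_TOKENS` covers only those seven tokens, `ai_tokens_mentioned` is a hypothetical helper, and the sample file is invented for illustration.

```python
# Tokens taken from the bot table earlier in this guide.
AI_TOKENS = (
    "GPTBot",
    "ClaudeBot",
    "Google-Extended",
    "PerplexityBot",
    "Meta-ExternalAgent",
    "Applebot-Extended",
    "CCBot",
)


def ai_tokens_mentioned(robots_txt: str) -> list[str]:
    """List the AI tokens that appear in a User-agent line of robots_txt."""
    named = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        field, _, value = line.partition(":")
        if field.strip().lower() == "user-agent":
            named.add(value.strip().lower())
    return [token for token in AI_TOKENS if token.lower() in named]


# Invented sample file, for illustration only.
sample = """\
User-agent: *
Disallow: /admin/

User-agent: CCBot
Disallow: /
"""

print(ai_tokens_mentioned(sample))
```

An empty list means the file has no group aimed at any of the listed AI crawlers, which is the "no policy" situation described above.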

At Content Gurus, we build robots.txt and llms.txt strategy into every content programme we run. Because controlling how AI finds and uses your content isn't optional anymore — it's part of what it means to do content properly in 2026.

Want to know how your site is currently signalling to AI crawlers? Contact Content Gurus today for a free content audit — we'll tell you exactly what the bots are seeing, and what to do about it.
