llms.txt and crawler access
Schema and entity work only matter if AI crawlers can actually reach your pages. llms.txt, modern robots.txt, and markdown variants are the technical layer underneath everything.
What llms.txt is and why it matters
llms.txt is a plain-text file you place at the root of your domain (yourbrand.com/llms.txt) that gives AI crawlers a curated, machine-friendly map of your site. Think of it as a robots.txt + sitemap.xml hybrid built specifically for large language models.
The format was proposed in late 2024 and has been informally adopted across the AI search industry. Anthropic, OpenAI, and Perplexity have all referenced it as a signal they consider. As of mid-2026, having llms.txt is not strictly required — but the brands deploying it are starting to see citation gains as AI engines lean on it for high-confidence content selection.
The file does three things:
- Curates your best content — instead of letting AI engines guess which of your 200 blog posts matter, you point to the 15 they should prioritize.
- Provides clean context — markdown summaries strip away navigation, ads, and footers that confuse content extraction.
- Declares intent — your llms.txt signals you're a site that wants to be cited correctly, which is itself a trust signal.
The llms.txt format
The file is plain markdown, served as text/plain. The structure is an H1 title, a one-line blockquote summary, and H2 sections of annotated links:
```markdown
# Your Brand Name

> One-sentence description of what your brand does. Concrete, specific, fact-dense.

## About

Two-to-four sentence paragraph expanding on what you do, who you serve, and what
makes you the authoritative source on your topics. This is the most-cited
section — AI engines often pull this directly for "what is X" queries.

## Pillar content

- [GEO Foundations: complete guide](https://yourbrand.com/guide-1.md): One-line description.
- [Schema markup playbook](https://yourbrand.com/guide-2.md): One-line description.
- [Entity SEO reference](https://yourbrand.com/guide-3.md): One-line description.

## Tools

- [Free audit tool](https://yourbrand.com/audit): What it does.
- [Pricing](https://yourbrand.com/pricing): What you charge.

## Optional

- [Blog archive](https://yourbrand.com/blog): Updates and tutorials.
- [Changelog](https://yourbrand.com/changelog): Product updates.
```
Pillar content links should point to markdown versions of your top pages, not HTML. Ship a .md variant of each pillar page alongside its HTML version — same content, no chrome. AI crawlers fetch the markdown when available because it's cheaper and cleaner to extract.
Generating markdown variants
Three approaches, ordered by effort:
- Manual. For 5-15 pillar pages, write a clean markdown version by hand. Strip navigation, sidebars, and footers; keep H2s, paragraphs, code blocks, and tables. Save as /guide.md next to /guide.
- Build-time generator. Add a build step that converts your top pages from HTML to markdown automatically. Tools like turndown (Node) or html2text (Python) do this in 20 lines of code.
- Runtime conversion. Add a content-negotiated endpoint that serves the same URL as markdown when the request sends Accept: text/markdown or the path ends in .md. Most cleanly handled at the framework level.
The 5-15 manual versions of your most-cited pages will give you 80% of the value. Don't over-engineer.
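If you do go the build-time route, here is a minimal sketch of the html2text approach. The page list and build directory are assumptions for illustration; in practice you'd also isolate the page's main content element first (for example with BeautifulSoup) so navigation and footers don't leak into the output:

```python
# Build step: emit a .md variant next to each built HTML pillar page.
# PAGES and BUILD_DIR are hypothetical -- point them at your own build output.
from pathlib import Path

import html2text  # pip install html2text

PAGES = ["guide-1", "guide-2", "guide-3"]  # your pillar page slugs
BUILD_DIR = Path("dist")

converter = html2text.HTML2Text()
converter.ignore_links = False   # keep links; AI engines follow them
converter.ignore_images = True   # images are noise in a text-only variant
converter.body_width = 0         # disable hard line-wrapping

for slug in PAGES:
    html = (BUILD_DIR / f"{slug}.html").read_text(encoding="utf-8")
    markdown = converter.handle(html)
    (BUILD_DIR / f"{slug}.md").write_text(markdown, encoding="utf-8")
    print(f"wrote {slug}.md ({len(markdown)} chars)")
```

The runtime approach is the same conversion behind a route: check for an Accept: text/markdown header or a .md suffix, run the converter, and return the result as text/markdown.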
robots.txt and the crawler access question
Your robots.txt file controls which crawlers can access your site. Most existing robots.txt files were written for Googlebot, Bingbot, and a handful of older crawlers. AI crawlers are newer and announce themselves with different user agents.
The seven AI crawlers that matter most as of mid-2026:
| User-agent | Owner | Purpose |
|---|---|---|
| GPTBot | OpenAI | Training data for GPT models |
| ChatGPT-User | OpenAI | Real-time fetch when ChatGPT browses |
| OAI-SearchBot | OpenAI | ChatGPT search index |
| ClaudeBot | Anthropic | Claude training + web search |
| PerplexityBot | Perplexity | Perplexity search index |
| Google-Extended | Google | Gemini training (separate from Googlebot) |
| Applebot-Extended | Apple | Apple Intelligence training |
If your robots.txt is "allow Googlebot, block everything else" (a common over-defensive pattern), you're invisible to every AI engine that matters. Default to allowing all seven of the above.
The minimum AI-friendly robots.txt
```
User-agent: *
Allow: /

User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

Sitemap: https://yourbrand.com/sitemap.xml
```
If you have private or sensitive paths (admin dashboards, gated content, internal tools), add Disallow: entries for those specifically — not blanket disallows.
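To confirm the rules parse the way you intend, a small standard-library sketch (the domain is a placeholder) can evaluate your live robots.txt against each AI user agent:

```python
# Ask the live robots.txt whether each AI crawler may fetch the homepage.
# Standard library only; replace SITE with your own domain.
from urllib.robotparser import RobotFileParser

SITE = "https://yourbrand.com"
AI_CRAWLERS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
    "PerplexityBot", "Google-Extended", "Applebot-Extended",
]

rp = RobotFileParser()
rp.set_url(f"{SITE}/robots.txt")
rp.read()

for agent in AI_CRAWLERS:
    verdict = "allowed" if rp.can_fetch(agent, f"{SITE}/") else "BLOCKED"
    print(f"{agent}: {verdict}")
```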
Should you ever block AI crawlers?
Three legitimate reasons to block specific AI crawlers:
- You're a paid content business (newsletters, paywalled journalism, premium databases) where AI summarization would destroy your revenue model. Block the GPTBot and ClaudeBot training crawlers; allow search-only bots like OAI-SearchBot and PerplexityBot so you can still get cited (see the robots.txt sketch after this list).
- You have legal or regulatory exposure (medical, legal, financial advice) where AI synthesis could create liability. Use selective blocking plus a Terms of Use page declaring a training-data prohibition.
- You're a competitor's training-data target and they're directly scraping you. That's a different problem: blocking AI crawlers won't stop them; you need rate limiting, IP blocking, and DMCA enforcement.
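For the first case, the selective pattern is just per-agent rules in robots.txt. A sketch that blocks the training crawlers while keeping the search-index bots:

```
# Block training crawlers; keep search bots so citations still work.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
```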
For most businesses, the answer is allow everything. AI citation is a customer acquisition channel. Blocking the crawlers blocks the channel.
ai.txt vs llms.txt vs other proposed standards
Several competing proposals emerged in 2024-2025: ai.txt, llms.txt and its companion llms-full.txt (which inlines full page content rather than links), and tdm-policy.json. As of mid-2026, llms.txt has the strongest informal adoption among the major AI companies.
The pragmatic stance: ship llms.txt now (it's free and adopted), watch the others, add a tdm-policy.json only if you have specific data-mining restrictions to declare.
Implementation: shipping crawler access this week
- Day 1. Audit your current robots.txt. Note which AI crawlers it blocks (often inadvertently, via blanket disallows).
- Day 2. Update robots.txt to explicitly allow the seven AI crawlers above. Test with each crawler's user agent to verify access.
- Day 3. Identify your 5-15 most important pages (pillar content, product pages, top blog posts).
- Days 4-5. Generate clean markdown versions of those pages. Manual is fine. Save at parallel paths (/guide and /guide.md).
- Day 6. Write llms.txt using the format above. Link to the markdown versions you just created.
- Day 7. Deploy llms.txt at your domain root. Verify it's accessible at yourbrand.com/llms.txt with curl -A "ChatGPT-User"; a repeatable check follows below.
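To make the Day 7 check repeatable, here is a minimal standard-library sketch (the URL is a placeholder) that fetches the file the way an AI crawler would:

```python
# Post-deploy sanity check: llms.txt should return 200 as plain text,
# even when fetched with an AI crawler's user agent.
import urllib.request

URL = "https://yourbrand.com/llms.txt"  # placeholder domain
req = urllib.request.Request(URL, headers={"User-Agent": "ChatGPT-User"})

with urllib.request.urlopen(req) as resp:
    body = resp.read().decode("utf-8")
    content_type = resp.headers.get("Content-Type", "")
    assert resp.status == 200, f"unexpected status {resp.status}"
    assert content_type.startswith("text/plain"), f"got {content_type!r}"
    assert body.lstrip().startswith("#"), "file should open with an H1 title"
    print(f"llms.txt looks good: {len(body)} bytes")
```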
What comes next
Lesson 1.4 covers the on-page pattern that Princeton's research found drives 40% citation lift: question-formatted H2 headings followed by 120-180 word answer blocks. This is the single biggest content-structure intervention you can make.