llms.txt and crawler access
Schema and entity work only matter if AI crawlers can actually reach your pages. llms.txt, modern robots.txt, and markdown variants are the technical layer underneath everything.
What llms.txt is and why it matters
llms.txt is a plain-text file you place at the root of your domain (yourbrand.com/llms.txt) that gives AI crawlers a curated, machine-friendly map of your site. Think of it as a robots.txt + sitemap.xml hybrid built specifically for large language models.
The format was proposed in late 2024 and has been informally adopted across the AI search industry. Anthropic, OpenAI, and Perplexity have all referenced it as a signal they consider. As of mid-2026, having llms.txt is not strictly required — but the brands deploying it are starting to see citation gains as AI engines lean on it for high-confidence content selection.
The file does three things:
- Curates your best content — instead of letting AI engines guess which of your 200 blog posts matter, you point to the 15 they should prioritize.
- Provides clean context — markdown summaries strip away navigation, ads, and footers that confuse content extraction.
- Declares intent — your llms.txt signals you're a site that wants to be cited correctly, which is itself a trust signal.
The llms.txt format
The file is plain markdown, served as text/plain. The structure is an H1 title, a one-line blockquote summary, and H2 sections of annotated links:
```markdown
# Your Brand Name

> One-sentence description of what your brand does. Concrete, specific, fact-dense.

## About

Two-to-four sentence paragraph expanding on what you do, who you serve, and what
makes you the authoritative source on your topics. This is the most-cited
section — AI engines often pull this directly for "what is X" queries.

## Pillar content

- [GEO Foundations: complete guide](https://yourbrand.com/guide-1.md): One-line description.
- [Schema markup playbook](https://yourbrand.com/guide-2.md): One-line description.
- [Entity SEO reference](https://yourbrand.com/guide-3.md): One-line description.

## Tools

- [Free audit tool](https://yourbrand.com/audit): What it does.
- [Pricing](https://yourbrand.com/pricing): What you charge.

## Optional

- [Blog archive](https://yourbrand.com/blog): Updates and tutorials.
- [Changelog](https://yourbrand.com/changelog): Product updates.
```
Pillar content links should point to markdown versions of your top pages, not HTML. Ship a .md variant of each pillar page alongside its HTML version — same content, no chrome. AI crawlers fetch the markdown when available because it's cheaper and cleaner to extract.
Generating markdown variants
Three approaches, ordered by effort:
- Manual. For 5-15 pillar pages, write a clean markdown version by hand. Strip navigation, sidebars, and footers; keep H2s, paragraphs, code blocks, and tables. Save as /guide.md next to /guide.
- Build-time generator. Add a build step that converts your top pages from HTML to markdown automatically. Tools like turndown (Node) or html2text (Python) do this in 20 lines of code.
- Runtime conversion. Add a content-negotiated endpoint that serves the same URL as markdown when the request sends Accept: text/markdown or the path ends in .md. Most cleanly handled at the framework level.
The 5-15 manual versions of your most-cited pages will give you 80% of the value. Don't over-engineer.
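If you do go the build-time route, here is a minimal sketch of the html2text approach. The page list and build directory are assumptions for illustration; in practice you'd also isolate the page's main content element first (for example with BeautifulSoup) so navigation and footers don't leak into the output:

```python
# Build step: emit a .md variant next to each built HTML pillar page.
# PAGES and BUILD_DIR are hypothetical -- point them at your own build output.
from pathlib import Path

import html2text  # pip install html2text

PAGES = ["guide-1", "guide-2", "guide-3"]  # your pillar page slugs
BUILD_DIR = Path("dist")

converter = html2text.HTML2Text()
converter.ignore_links = False   # keep links; AI engines follow them
converter.ignore_images = True   # images are noise in a text-only variant
converter.body_width = 0         # disable hard line-wrapping

for slug in PAGES:
    html = (BUILD_DIR / f"{slug}.html").read_text(encoding="utf-8")
    markdown = converter.handle(html)
    (BUILD_DIR / f"{slug}.md").write_text(markdown, encoding="utf-8")
    print(f"wrote {slug}.md ({len(markdown)} chars)")
```

The runtime approach is the same conversion behind a route: check for an Accept: text/markdown header or a .md suffix, run the converter, and return the result as text/markdown.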
robots.txt and the crawler access question
Your robots.txt file controls which crawlers can access your site. Most existing robots.txt files were written for Googlebot, Bingbot, and a handful of older crawlers. AI crawlers are newer and announce themselves with different user agents.
The seven AI crawlers that matter most as of mid-2026:
| User-agent | Owner | Purpose |
|---|---|---|
| GPTBot | OpenAI | Training data for GPT models |
| ChatGPT-User | OpenAI | Real-time fetch when ChatGPT browses |
| OAI-SearchBot | OpenAI | ChatGPT search index |
| ClaudeBot | Anthropic | Claude training + web search |
| PerplexityBot | Perplexity | Perplexity search index |
| Google-Extended | Google | Gemini training (separate from Googlebot) |
| Applebot-Extended | Apple | Apple Intelligence training |
If your robots.txt is "allow Googlebot, block everything else" (a common over-defensive pattern), you're invisible to every AI engine that matters. Default to allowing all seven of the above.
The minimum AI-friendly robots.txt
```
User-agent: *
Allow: /

User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

Sitemap: https://yourbrand.com/sitemap.xml
```
If you have private or sensitive paths (admin dashboards, gated content, internal tools), add Disallow: entries for those specifically — not blanket disallows.
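To confirm the rules parse the way you intend, a small standard-library sketch (the domain is a placeholder) can evaluate your live robots.txt against each AI user agent:

```python
# Ask the live robots.txt whether each AI crawler may fetch the homepage.
# Standard library only; replace SITE with your own domain.
from urllib.robotparser import RobotFileParser

SITE = "https://yourbrand.com"
AI_CRAWLERS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
    "PerplexityBot", "Google-Extended", "Applebot-Extended",
]

rp = RobotFileParser()
rp.set_url(f"{SITE}/robots.txt")
rp.read()

for agent in AI_CRAWLERS:
    verdict = "allowed" if rp.can_fetch(agent, f"{SITE}/") else "BLOCKED"
    print(f"{agent}: {verdict}")
```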
Should you ever block AI crawlers?
Three legitimate reasons to block specific AI crawlers:
- You're a paid content business (newsletters, paywalled journalism, premium databases) where AI summarization would destroy your revenue model. Block the GPTBot and ClaudeBot training crawlers; allow search-only bots like OAI-SearchBot and PerplexityBot so you can still get cited (see the robots.txt sketch after this list).
- You have legal or regulatory exposure (medical, legal, financial advice) where AI synthesis could create liability. Use selective blocking plus a Terms of Use page declaring a training-data prohibition.
- You're a competitor's training-data target and they're directly scraping you. That's a different problem: blocking AI crawlers won't stop them; you need rate limiting, IP blocking, and DMCA enforcement.
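For the first case, the selective pattern is just per-agent rules in robots.txt. A sketch that blocks the training crawlers while keeping the search-index bots:

```
# Block training crawlers; keep search bots so citations still work.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
```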
For most businesses, the answer is allow everything. AI citation is a customer acquisition channel. Blocking the crawlers blocks the channel.
ai.txt vs llms.txt vs other proposed standards
Several competing proposals emerged in 2024-2025: ai.txt, llms.txt and its companion llms-full.txt (which inlines full page content rather than links), and tdm-policy.json. As of mid-2026, llms.txt has the strongest informal adoption among the major AI companies.
The pragmatic stance: ship llms.txt now (it's free and adopted), watch the others, add a tdm-policy.json only if you have specific data-mining restrictions to declare.
Implementation: shipping crawler access this week
- Day 1. Audit your current robots.txt. Note which AI crawlers it blocks (often inadvertently, via blanket disallows).
- Day 2. Update robots.txt to explicitly allow the seven AI crawlers above. Test with each crawler's user agent to verify access.
- Day 3. Identify your 5-15 most important pages (pillar content, product pages, top blog posts).
- Days 4-5. Generate clean markdown versions of those pages. Manual is fine. Save at parallel paths (/guide and /guide.md).
- Day 6. Write llms.txt using the format above. Link to the markdown versions you just created.
- Day 7. Deploy llms.txt at your domain root. Verify it's accessible at yourbrand.com/llms.txt with curl -A "ChatGPT-User"; a repeatable check follows below.
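To make the Day 7 check repeatable, here is a minimal standard-library sketch (the URL is a placeholder) that fetches the file the way an AI crawler would:

```python
# Post-deploy sanity check: llms.txt should return 200 as plain text,
# even when fetched with an AI crawler's user agent.
import urllib.request

URL = "https://yourbrand.com/llms.txt"  # placeholder domain
req = urllib.request.Request(URL, headers={"User-Agent": "ChatGPT-User"})

with urllib.request.urlopen(req) as resp:
    body = resp.read().decode("utf-8")
    content_type = resp.headers.get("Content-Type", "")
    assert resp.status == 200, f"unexpected status {resp.status}"
    assert content_type.startswith("text/plain"), f"got {content_type!r}"
    assert body.lstrip().startswith("#"), "file should open with an H1 title"
    print(f"llms.txt looks good: {len(body)} bytes")
```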
What comes next
Lesson 1.4 covers the on-page pattern that Princeton's research found drives 40% citation lift: question-formatted H2 headings followed by 120-180 word answer blocks. This is the single biggest content-structure intervention you can make.