Guide · 2026-04-27

What Is llms.txt and Why You Need It — AEO Primer (2026)

In the AI-search (AEO) era, llms.txt helps LLMs cite your site correctly. This primer covers how to write it and how it differs from robots.txt / sitemap.xml.

SEO is becoming AEO — the rules just changed

In classical search, users land on your page. With AI search (ChatGPT, Perplexity, Claude, Gemini), users never land; they read the synthesized answer and leave. Cloudflare's 2025 Radar report estimated that 24% of US mobile queries already terminate in an AI answer. The same pattern is taking hold in Korea with Naver Cue, SKT A., and Kakao's ChatGPT integrations.

The connective tissue here is llms.txt. As search engine optimization (SEO) gives way to answer engine optimization (AEO), sites need a standardized file that tells LLMs what they're about — fast, compact, and parseable.

  • robots.txt → crawl permission rules (1994 standard)
  • sitemap.xml → page index (Google adopted in 2005)
  • llms.txt → semantic summary for LLMs (proposed 2024, de-facto standard by 2026)

Where the llms.txt standard came from

llms.txt was proposed by Jeremy Howard of Answer.AI in September 2024. The core observation was simple: sitemap.xml lists URLs but no meaning, so an LLM has to fetch each page and infer context — burning tokens. A one-line markdown summary cuts that cost by an order of magnitude.

Through 2025, Anthropic, Cloudflare, Vercel, and Netlify shipped llms.txt on their root domains. As of January 2026, roughly 18% of the Tranco top-1M domains serve a valid llms.txt. Korean domains lag at around 4%, which means deploying one now buys you 6–12 months of meaningful AEO advantage.

The format — one markdown file at the root

Place a markdown file at /llms.txt on your root domain.

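```
# Site name

> One-line description

## Key pages

- [Page title](https://example.com/page): one-line note

## Section 2

- [Page title](https://example.com/other-page): one-line note
```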

LLMs use this as RAG (Retrieval-Augmented Generation) context and cite real URLs in their answers. When a user asks ChatGPT "how to save on Korean car tax," the model can pull the line "/post/cartax-savings-5-tips-2026" directly from your llms.txt and cite it in the response.
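To make the citation step concrete, here is a minimal sketch of how a retrieval layer might pull citable entries out of an llms.txt. The function name `parseLlmsTxt`, the entry shape `- [Title](URL): note`, and the example.com URL are illustrative assumptions; answer engines do not publish their parsers, so treat this as one plausible reading of the format rather than how any engine works.

```javascript
// Hypothetical helper: extract the title, summary, and link entries from llms.txt text.
function parseLlmsTxt(text) {
  const result = { title: null, summary: null, sections: [] };
  let current = null;
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim();
    if (line.startsWith('# ') && result.title === null) {
      result.title = line.slice(2).trim();
    } else if (line.startsWith('> ') && result.summary === null) {
      result.summary = line.slice(2).trim();
    } else if (line.startsWith('## ')) {
      current = { heading: line.slice(3).trim(), links: [] };
      result.sections.push(current);
    } else {
      // Assumed entry shape: - [Title](https://example.com/page): optional note
      const m = line.match(/^- \[([^\]]+)\]\(([^)\s]+)\)(?::\s*(.*))?$/);
      if (m && current) {
        current.links.push({ title: m[1], url: m[2], note: m[3] || '' });
      }
    }
  }
  return result;
}

// Example input; the site name, URL, and note are made up for illustration.
const sample = [
  '# Example Site',
  '',
  '> Calculators for Korean taxes and customs',
  '',
  '## Guides',
  '',
  '- [Car tax savings](https://example.com/post/cartax-savings-5-tips-2026): Five ways to reduce Korean car tax',
].join('\n');

const parsed = parseLlmsTxt(sample);
console.log(parsed.sections[0].links[0].url);
```

Entries that appear before the first H2 are ignored in this sketch, which keeps the parser short at the cost of strictness.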

llms.txt vs llms-full.txt — ship both when you scale

| File | Purpose | Size | Refresh |
| --- | --- | --- | --- |
| llms.txt | Site map + one-line summaries | 10–50KB | Weekly |
| llms-full.txt | Full concatenated content | 1–50MB | Daily |

llms.txt is the short version that fits in an LLM's context window. llms-full.txt is the long version for RAG indexing. For sites with 10k+ pages, ship both. For a 100-tool hub like bal.pe.kr, llms.txt alone is sufficient.
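As a rough illustration of the long file, the sketch below concatenates page bodies into one markdown document. There is no strict layout for llms-full.txt that this guide relies on, so the page shape `{ title, url, markdown }`, the `Source:` line, and the function name `buildLlmsFull` are all assumptions.

```javascript
// Hypothetical page shape: { title, url, markdown }.
function buildLlmsFull(siteName, pages) {
  const parts = [`# ${siteName}`];
  for (const page of pages) {
    // One H2 per page, with the source URL kept next to the content so it can be cited.
    parts.push(`## ${page.title}`, `Source: ${page.url}`, page.markdown.trim());
  }
  // Blank line between blocks, single trailing newline at the end of the file.
  return parts.join('\n\n') + '\n';
}

const full = buildLlmsFull('Example Site', [
  { title: 'Page A', url: 'https://example.com/a', markdown: 'Body of page A.\n' },
]);
console.log(full);
```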

Writing checklist — the practical version

  • [ ] /llms.txt returns 200 from your root (no subpath)
  • [ ] H1 is the site name; > blockquote one-line description
  • [ ] 5–30 key pages with URL + one-line note
  • [ ] Absolute HTTPS URLs only — no relative paths
  • [ ] UTF-8 markdown, ≤ 50KB
  • [ ] Auto-regenerate weekly via build script (sitemap → markdown)
  • [ ] CORS header: Access-Control-Allow-Origin: * (some bots require it)
  • [ ] Content-Type: text/markdown; charset=utf-8

The last two are the most commonly missed. CloudFront and Vercel default to text/plain, which some bots silently skip. Set the headers explicitly.
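The checklist can be turned into an automated check. The sketch below is one way to do that, assuming the response has already been fetched and flattened into a plain object `{ status, headers, body }` with lowercase header names; `auditLlmsTxt` is a hypothetical name, and the thresholds simply mirror the checklist above.

```javascript
// Hypothetical audit of a fetched /llms.txt response.
// `response` is a plain object: { status, headers, body }, with lowercase header names.
function auditLlmsTxt(response, maxBytes = 50 * 1024) {
  const problems = [];
  const { status, headers, body } = response;
  // Non-empty lines only, so lines[0] is the first line with content.
  const lines = body.split(/\r?\n/).map((l) => l.trim()).filter(Boolean);

  if (status !== 200) problems.push(`expected HTTP 200, got ${status}`);
  if (!lines[0] || !lines[0].startsWith('# ')) problems.push('first line is not an H1 site name');
  if (!lines.some((l) => l.startsWith('> '))) problems.push('missing blockquote description');

  // Collect every markdown link target and flag anything that is not absolute HTTPS.
  const urls = [...body.matchAll(/\]\(([^)\s]+)\)/g)].map((m) => m[1]);
  if (urls.length < 5 || urls.length > 30) problems.push(`expected 5-30 links, found ${urls.length}`);
  for (const url of urls) {
    if (!url.startsWith('https://')) problems.push(`not an absolute HTTPS URL: ${url}`);
  }

  // Byte length, not character count: Korean text is multi-byte in UTF-8.
  const bytes = new TextEncoder().encode(body).length;
  if (bytes > maxBytes) problems.push(`file is ${bytes} bytes, over ${maxBytes}`);

  const type = (headers['content-type'] || '').toLowerCase();
  if (!type.startsWith('text/markdown')) problems.push(`content-type is "${type}"`);
  if (headers['access-control-allow-origin'] !== '*') problems.push('missing CORS header');

  return problems;
}
```

An empty array means every check passed. Running this against the deployed URL in CI catches the header regressions that a local file check would miss.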

What actually happens — the 5-step flow

  1. A user asks ChatGPT: "what's the Korean duty-free threshold for personal imports?"
  2. ChatGPT's web tool runs a Bing search and pulls 30 candidate domains
  3. For each domain, it issues a HEAD request to /llms.txt; missing → full page fetch
  4. Sites with llms.txt get their one-line summaries injected into the answer-generation context first
  5. The final answer cites URLs straight from the llms.txt

Cloudflare's Q4 2025 report measured a 27% lift in AI-answer citation rate for llms.txt-equipped sites. Citation, not pageview, is becoming the AEO KPI.
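Steps 3 and 4 of the flow can be sketched as follows. This follows the flow as this guide describes it and is not the implementation of any specific answer engine; `collectContext` is a hypothetical name, and the fetch function is injectable so the logic can be exercised without network access.

```javascript
// Sketch of steps 3-4 as described above, under the assumptions stated in the lead-in.
// `fetchImpl` defaults to the global fetch available in recent Node.js versions.
async function collectContext(domains, fetchImpl = globalThis.fetch) {
  const context = [];
  for (const domain of domains) {
    const url = `https://${domain}/llms.txt`;
    // Step 3: cheap existence probe before downloading anything.
    const head = await fetchImpl(url, { method: 'HEAD' });
    if (head.ok) {
      const res = await fetchImpl(url);
      context.push({ domain, source: 'llms.txt', text: await res.text() });
    } else {
      // Missing file: fall back to fetching the page itself.
      const res = await fetchImpl(`https://${domain}/`);
      context.push({ domain, source: 'page', text: await res.text() });
    }
  }
  // Step 4: llms.txt summaries go first in the answer-generation context.
  const rank = (entry) => (entry.source === 'llms.txt' ? 0 : 1);
  return context.sort((a, b) => rank(a) - rank(b));
}
```

The sort is stable, so domains keep their search-result order within each group.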

Three traps Korean (and bilingual) sites fall into

Trap 1 — Listing only the English pages

About 70% of Korean domains list only their /en/ variants in llms.txt. ChatGPT and Claude prefer same-language URLs in same-language answers — so a Korean-language query expects Korean URLs in the citation. Include both.
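A generator can emit both language variants per post. The sketch below assumes a post shape of `{ slug, title: { ko, en }, description: { ko, en } }` and a URL layout where Korean pages live at `/post/<slug>` and English pages at `/en/post/<slug>`; both are assumptions to adapt to your own routes.

```javascript
// Hypothetical post shape: { slug, title: { ko, en }, description: { ko, en } }.
// The URL layout (/post/... for Korean, /en/post/... for English) is an assumption.
function bilingualEntries(post, origin) {
  return [
    `- [${post.title.ko}](${origin}/post/${post.slug}): ${post.description.ko}`,
    `- [${post.title.en}](${origin}/en/post/${post.slug}): ${post.description.en}`,
  ];
}

// Example data, made up for illustration.
const entries = bilingualEntries(
  {
    slug: 'cartax-savings-5-tips-2026',
    title: { ko: '자동차세 절약', en: 'Car tax savings' },
    description: { ko: '자동차세를 줄이는 방법', en: 'Ways to reduce Korean car tax' },
  },
  'https://example.com',
);
console.log(entries.join('\n'));
```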

Trap 2 — Missing dynamic routes

Next.js patterns like /post/[slug] show up in sitemap.xml automatically but get dropped from llms.txt if the script is naive. Feed the output of getStaticPaths straight into your generator.
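One way to do that is to expand the route pattern with the `paths` array that getStaticPaths returns. The sketch below handles only the `{ params: { ... } }` entry form and single-segment placeholders like `[slug]`; catch-all routes such as `[...slug]` are out of scope. `expandRoute` is a hypothetical helper name.

```javascript
// Expand a dynamic route pattern using entries shaped like { params: { slug: '...' } }.
function expandRoute(pattern, paths, origin) {
  return paths.map((entry) => {
    // Replace each [key] placeholder with the matching, URL-encoded param value.
    const path = pattern.replace(/\[([^\]]+)\]/g, (_, key) => encodeURIComponent(entry.params[key]));
    return origin + path;
  });
}

const urls = expandRoute(
  '/post/[slug]',
  [{ params: { slug: 'cartax-savings-5-tips-2026' } }],
  'https://example.com',
);
console.log(urls[0]);
```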

Trap 3 — Reusing the SEO description as the one-liner

An llms.txt line and an SEO description have different jobs. SEO is "earn the click"; llms.txt is a factual statement so an LLM can classify the page. Drop CTAs like "check it now!" and write something like "Korean cross-border customs threshold (US $200, others $150) with bundling rules" instead.
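A build step can flag one-liners that still read like ad copy. The check below is an illustrative heuristic only: the phrase list is an assumption, not a standard, and it will miss plenty of marketing language.

```javascript
// Illustrative heuristic only: the phrase list is an assumption, not a standard.
const CTA_PATTERNS = [/check (it )?(out )?now/i, /click here/i, /don'?t miss/i, /!\s*$/];

// Returns true when a description matches any call-to-action pattern.
function looksLikeCta(description) {
  return CTA_PATTERNS.some((pattern) => pattern.test(description));
}

console.log(looksLikeCta('Save on customs, check it now!'));
console.log(looksLikeCta('Korean cross-border customs threshold (US $200, others $150) with bundling rules'));
```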

Auto-generation script — Node.js, 30 lines

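```javascript
import fs from 'fs';
import { posts } from './data/posts.js';

// Absolute origin for every link; llms.txt entries should not use relative paths.
const ORIGIN = 'https://bal.pe.kr';

// Header: H1 site name, blockquote description, then the first H2 section.
const lines = [
  '# bal.pe.kr',
  '',
  '> Korean micro-SaaS hub by a solo developer (100+ tools)',
  '',
  '## Guides',
  '',
];

for (const p of posts) {
  // Assumes each post also exposes a `slug` and is served at /post/<slug>; adjust to your routes.
  lines.push(`- [${p.title.en}](${ORIGIN}/post/${p.slug}): ${p.description.en}`);
}

// Next.js serves files in public/ from the site root, so this becomes /llms.txt.
fs.writeFileSync('public/llms.txt', lines.join('\n') + '\n');
```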

Run this in the build step of your CI/CD pipeline and it stays current automatically.

FAQ

Q. Isn't robots.txt enough?

robots.txt is permission ("you may crawl this"). llms.txt is semantics ("here's what we have"). Different jobs. Ship both.

Q. I already have sitemap.xml. Do I really need this?

sitemap.xml lists URLs with zero meaning, forcing the LLM to spend more tokens inferring context. A one-line note in llms.txt delivers the same signal at roughly one-tenth the cost.

Q. I have 10,000+ dynamic pages. What now?

Put your top 30 category/landing pages in llms.txt and dump everything into llms-full.txt (or a separate RSS). Don't bloat the short file.
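The split can be a one-function step in the generator. The sketch below assumes each page carries a numeric `priority` (higher is more important); how you derive that number, for example from traffic or from a hand-curated list, is up to you.

```javascript
// Hypothetical page shape: { url, title, note, priority }, where higher priority is more important.
function splitForLlmsFiles(pages, limit = 30) {
  // Copy before sorting so the caller's array is left untouched.
  const ranked = [...pages].sort((a, b) => b.priority - a.priority);
  return {
    short: ranked.slice(0, limit), // goes into llms.txt
    full: ranked,                  // goes into llms-full.txt
  };
}
```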

Q. I don't want my content used for AI training.

llms.txt is for answer-time context, not training. Block training crawlers separately in robots.txt: User-agent: GPTBot / Disallow: /.

Q. How often should I refresh it?

Match your content cadence. A weekly-publishing site can rebuild weekly. A news site changing throughout the day should rebuild daily as part of CI.

Six months of running llms.txt — measured data

We've been running llms.txt on bal.pe.kr since September 2025. Here's what changed across the first six months.

| Period | ChatGPT citations / mo | Perplexity citations / mo | Claude citations / mo |
| --- | --- | --- | --- |
| Sep 2025 (pre-deploy) | 3 | 1 | 0 |
| Dec 2025 | 18 | 7 | 4 |
| Mar 2026 | 41 | 22 | 14 |

The most interesting finding: citations went up but pageviews barely moved. ChatGPT users read the answer and don't click the source URL. So the real KPI for llms.txt is not traffic — it's "brand mentions inside AI answers." Perplexity Stats, ChatGPT Search Console (beta, April 2026), and Anthropic's Citations API are all moving toward measuring this directly.

On Korean-language tool queries ("Korean car-tax calculator," "Korean cross-border customs"), bal.pe.kr's share of ChatGPT citations rose from 12% to 38% over six months. The lesson: in a language where AI-friendly content is still scarce, shipping llms.txt early can take share quickly.

How big tech actually uses it

Anthropic — claude.ai/llms.txt

Anthropic's llms.txt lists API references, model cards, and pricing pages. When users ask ChatGPT "what does Claude API cost," the goal is for the answer to cite Anthropic's official pricing page rather than a third-party summary.

Stripe — stripe.com/llms.txt

Stripe's file lists ~80 URLs for payment integration guides. When developers ask LLMs for code samples, Stripe's official docs get cited first.

Cloudflare — cloudflare.com/llms.txt

Cloudflare runs both llms.txt and an llms-full.txt of ~40MB containing all docs concatenated. RAG systems can grab the entire corpus in one download instead of crawling thousands of pages.

The pattern worth borrowing from Stripe is clean categorization. On bal.pe.kr we group 100+ tools into 9 hubs (life, car, shopping, etc.) and reflect those hubs as H2 sections in llms.txt.
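Grouping by hub is straightforward to automate. The sketch below assumes each tool record looks like `{ hub, title, url, note }`; the field names and `renderHubSections` are hypothetical.

```javascript
// Hypothetical tool shape: { hub, title, url, note }.
function renderHubSections(tools) {
  const byHub = new Map();
  for (const tool of tools) {
    if (!byHub.has(tool.hub)) byHub.set(tool.hub, []);
    byHub.get(tool.hub).push(tool);
  }
  const lines = [];
  // Map preserves insertion order, so hubs appear in first-seen order.
  for (const [hub, items] of byHub) {
    lines.push(`## ${hub}`, '');
    for (const t of items) lines.push(`- [${t.title}](${t.url}): ${t.note}`);
    lines.push('');
  }
  return lines.join('\n').trimEnd();
}
```

The returned string can be appended after the H1 and blockquote lines of the generated file.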

Audit tools

Once llms.txt is live, validate it. In particular, paste the entire llms.txt into the aitoken tool and keep it under 4,000 GPT-4 tokens; move anything beyond that into llms-full.txt.
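For a quick pre-check in a build script, a character-based estimate can flag files that are clearly over budget. This is a rough heuristic, not a tokenizer: about 4 characters per token is a common rule of thumb for English, and Korean text usually tokenizes less efficiently, so the result should be read as a lower bound and confirmed with a real token counter.

```javascript
// Rough heuristic only: ~4 characters per token is a rule of thumb for English text.
// Non-Latin scripts such as Korean typically produce more tokens per character.
function estimateTokens(text, charsPerToken = 4) {
  return Math.ceil(text.length / charsPerToken);
}

// True when the estimate is within the budget (4,000 tokens by default).
function fitsBudget(text, budget = 4000) {
  return estimateTokens(text) <= budget;
}
```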

Platform-specific recipes

A quick guide for the platforms Korean (and global) solo operators actually use.

WordPress (self-hosted)

Do not put llms.txt under /wp-content/uploads/ — it must be at root (yoursite.com/llms.txt). In nginx or Apache, add an explicit location = /llms.txt rule pointing to a static file, or use a "Headless Mode" plugin's static-file feature.

Tistory (Korea)

Tistory blocks root-file uploads, so direct llms.txt deployment is impossible. The workaround is to enrich your RSS feed's description fields — ChatGPT and Perplexity also consume RSS as answer context.

Naver Blog (Korea)

Same constraint as Tistory. The practical workaround is putting a tight summary plus a link to your own domain in the first paragraph of every post. The cleanest answer is running your own domain alongside Naver and putting llms.txt there.

Next.js / Gatsby / Hugo (static sites)

Place the file at public/llms.txt (Next.js) or static/llms.txt (Gatsby, Hugo). It is copied to the site root at build time. For dynamic regeneration, plug a generator into getStaticProps or your build script.

The other AEO signals to ship alongside

llms.txt alone is not enough. For an LLM to trust your site as a citation-worthy source, you also want:

  1. schema.org JSON-LD — Article, HowTo, and FAQPage for structured semantics
  2. Author info (E-E-A-T) — author, publisher, and dateModified explicit
  3. Canonical tags — clean up duplicates
  4. Mobile-responsive, Lighthouse 90+ — slow pages get skipped by bots
  5. HTTPS + fast response — TTFB under 800ms

On bal.pe.kr we validate all five in CI. That's the single biggest reason the AI-answer share moved from 12% to 38% in six months — llms.txt was the unlock, but the supporting signals did the heavy lifting.
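For the first two signals, a small helper can generate Article JSON-LD with explicit author, publisher, and dateModified fields. The post shape `{ title, url, authorName, publisherName, dateModified }` and both function names are assumptions; the property names are standard schema.org Article properties.

```javascript
// Hypothetical post shape: { title, url, authorName, publisherName, dateModified }.
function articleJsonLd(post) {
  return {
    '@context': 'https://schema.org',
    '@type': 'Article',
    headline: post.title,
    mainEntityOfPage: post.url,
    author: { '@type': 'Person', name: post.authorName },
    publisher: { '@type': 'Organization', name: post.publisherName },
    dateModified: post.dateModified,
  };
}

// Serialize for embedding in the page head.
function jsonLdScriptTag(data) {
  // Escape "<" so a stray "</script>" inside a string cannot close the tag early.
  const json = JSON.stringify(data).replace(/</g, '\\u003c');
  return `<script type="application/ld+json">${json}</script>`;
}
```

Generating the tag from the same post data that feeds llms.txt keeps titles and dates consistent across both signals.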

Bottom line — 30 minutes of work, months of compounding upside

In 2026, roughly 30% of search responses embed AI answers. Sites without llms.txt slowly lose that 30%. The good news: it takes 30 minutes to set up, and the upside compounds.

Three things to do today:

  1. Deploy a /llms.txt at your root (30 min)
  2. Add an auto-generation step to your CI (1 hour)
  3. Validate token length with aitoken (5 min)

The Korean-domain adoption rate of 4% will be 10% in a month and 30% in six. The sites that ship before the standard hardens win the compounding citations.
