How RAG Pipelines Decide Which Sites To Cite (And Why Most Sites Get Skipped)

Most advice on getting cited by AI engines treats the model as a black box. Write good content, the advice goes. Add some schema. Hope for the best. That advice would have been roughly correct in 2023, when GPT-3.5 answered from a fixed training corpus. In 2026, it is wrong in ways that actively hurt sites trying to win citations.

The reason is RAG — retrieval-augmented generation. ChatGPT, Claude, Perplexity, Google AI Overviews, Gemini, and Microsoft Copilot all answer real-time queries by retrieving fresh content from the live web (or a near-live index), feeding it to the language model as context, and asking the model to synthesize an answer from that retrieved material. The model isn't recalling your site. It's reading your site, right now, alongside a handful of others, and choosing which to quote.

If you don't get pulled into the retrieval step, nothing else about your content matters. The model never sees you. This is the single biggest reason "great content" doesn't translate to AI citations: the content never makes it into the working set.

What RAG actually is

Retrieval-augmented generation is the architecture that turns a language model into a search-answering system. It has four stages, and each stage filters out the majority of candidate sites. By the time the language model writes its answer, the field has narrowed from billions of pages to roughly five to fifteen.

The four stages, in order:

Query decomposition — the system breaks the user's question into several sub-queries.
Retrieval — for each sub-query, the system pulls a ranked set of candidate documents from one or more search indexes.
Reranking and chunking — the system reduces those documents to a smaller set of high-relevance passages.
Synthesis — the language model writes an answer using those passages as the source material, citing the ones it actually quotes.

A site needs to survive every stage to be cited. Most sites fail at stage 2 — they aren't retrieved at all. The next biggest failure mode is stage 3 — they're retrieved but reranked out. Few sites that survive stages 2 and 3 get filtered at stage 4. The leverage is in the first half of the pipeline.

Stage 1: query decomposition

When a user asks "what's the best CRM for a small B2B SaaS team," the model doesn't search the web for that exact string. It decomposes the question into a fan-out of related searches — typically three to seven sub-queries. The fan-out might include "best B2B SaaS CRM 2026," "CRM features small team," "Hubspot vs Pipedrive vs Close," "CRM pricing under $50 per user," and so on.

This decomposition is silent. The user never sees it. But it determines which sub-queries your site needs to surface for. If your content only matches one of seven sub-queries, you'll appear in retrieval for one — and probably get outranked by sites that match three or four. The model preferentially cites sources that appear consistently across sub-queries, because that consistency reads as authority on the topic, not just one keyword.

The implication for content strategy is concrete. A single page targeting "best CRM" is structurally weaker than three pages covering "best CRM for B2B SaaS," "CRM pricing comparison," and "Hubspot vs Pipedrive" — even if the standalone page is more thorough. The fan-out rewards topical coverage, not single-page depth.
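One rough way to see this for your own site is to list a plausible fan-out and count how many sub-queries at least one of your pages matches. The sketch below is an illustration only: the word-overlap test, the `min_overlap` threshold, and the example URLs are assumptions, not how any engine actually matches queries.

```python
def tokens(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for real query matching."""
    return {w.strip(".,?!:\"'").lower() for w in text.split()} - {""}


def coverage(sub_queries, pages, min_overlap=0.5):
    """Map each sub-query to the page URLs sharing at least
    `min_overlap` of the sub-query's words (hypothetical threshold)."""
    matched = {}
    for query in sub_queries:
        q_words = tokens(query)
        matched[query] = [
            url for url, text in pages.items()
            if q_words and len(q_words & tokens(text)) / len(q_words) >= min_overlap
        ]
    return matched


# Example fan-out taken from the CRM question above; page titles are made up.
sub_queries = [
    "best B2B SaaS CRM 2026",
    "CRM pricing under $50 per user",
    "Hubspot vs Pipedrive vs Close",
]
pages = {
    "/best-crm-b2b-saas": "The best CRM for a B2B SaaS team in 2026",
    "/crm-pricing": "CRM pricing comparison: plans under $50 per user",
}
result = coverage(sub_queries, pages)
covered = sum(1 for urls in result.values() if urls)
print(f"{covered} of {len(sub_queries)} sub-queries covered")
```

In this toy example the comparison sub-query has no matching page, which is the gap a third page would fill.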

Stage 2: retrieval

For each sub-query in the fan-out, the system queries one or more search indexes. Which index depends on the engine. ChatGPT primarily uses Bing's web index. Perplexity uses its own crawl plus Bing fallback. Claude (via the Claude search tool) uses a combination of Brave Search and direct fetching. Google AI Overviews uses Google's own index. Microsoft Copilot uses Bing. Gemini uses Google.

This is where most sites disappear. Retrieval at this stage is essentially traditional search: BM25-style keyword matching, semantic embedding similarity, link-graph authority, and freshness signals. If you don't rank in the top 30 or so for the sub-query in the underlying index, you won't be pulled into the candidate set. The 2023 advice that "schema and content quality" alone determine AI citation was always partially wrong, but in 2026 it's mostly wrong — because the retrieval bottleneck happens before the model evaluates anything semantic about your site.
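To make the cutoff concrete, here is a minimal sketch of the keyword part of retrieval only: a simplified BM25 score followed by a top-k cut. Real engines blend this with embeddings, link authority, and freshness, none of which is modeled here. The parameter values (`k1`, `b`), the default `top_k=30`, and the example documents are assumptions for illustration.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: dict[str, str], k1: float = 1.5, b: float = 0.75) -> dict[str, float]:
    """Score every document against the query with a simplified BM25."""
    tokenized = {url: text.lower().split() for url, text in docs.items()}
    if not tokenized:
        return {}
    n_docs = len(tokenized)
    avg_len = (sum(len(t) for t in tokenized.values()) / n_docs) or 1.0
    doc_freq = Counter()
    for toks in tokenized.values():
        doc_freq.update(set(toks))  # count each term once per document
    scores = {}
    for url, toks in tokenized.items():
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = term_freq[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores[url] = score
    return scores


def candidate_set(query: str, docs: dict[str, str], top_k: int = 30) -> list[str]:
    """Keep only the top_k documents with a non-zero score."""
    scores = bm25_scores(query, docs)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [url for url in ranked if scores[url] > 0][:top_k]


docs = {
    "/a": "crm pricing for small teams",
    "/b": "crm pricing comparison under 50 per user",
    "/c": "how to bake sourdough bread",
}
print(candidate_set("crm pricing", docs))
```

Anything below the cut, or with no matching terms at all, never reaches the later stages, however good the page is.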

Two specific implications follow. First, indexing in Bing matters far more than most SEOs assume. Bing's index feeds ChatGPT (87% citation overlap with Bing's top results), Microsoft Copilot, and part of Perplexity's retrieval. A site that's well-indexed in Google but invisible in Bing is invisible to roughly half of AI citation traffic. Second, freshness matters more than it does for traditional search: pages updated in the last three to six months retrieve preferentially in most engines' RAG pipelines.
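The freshness point is easy to check against your own sitemap. A minimal sketch, assuming you already have a last-modified date per URL (for example from your sitemap's lastmod values) and treating 183 days as the six-month cutoff; both the threshold and the sample dates are illustrative.

```python
from datetime import date, timedelta


def stale_pages(last_modified: dict[str, date], today: date, max_age_days: int = 183) -> list[str]:
    """Return URLs whose last-modified date is older than the cutoff."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(url for url, modified in last_modified.items() if modified < cutoff)


# Made-up dates for illustration.
last_modified = {
    "/crm-pricing": date(2026, 4, 1),
    "/best-crm-b2b-saas": date(2025, 10, 1),
}
print(stale_pages(last_modified, today=date(2026, 6, 1)))
```

Pages on the resulting list are candidates for a review, not an automatic date bump: whether to refresh or simply re-date depends on whether the information has changed.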

Stage 3: reranking and chunking

Once a candidate set is retrieved (typically 30-100 documents per sub-query), the pipeline reranks them and breaks them into chunks. Reranking is usually done by a smaller, faster language model — a cross-encoder that scores how well each candidate document actually answers the sub-query, given the full text rather than just keywords.

Reranking penalizes pages that are technically about the topic but don't answer the question directly. A 4,000-word "ultimate guide to CRMs" that buries its actual recommendations in section twelve will rerank below a 1,200-word page that opens with "for a small B2B SaaS team, we recommend Hubspot, Close, or Pipedrive — here's the trade-off." The cross-encoder is looking for the answer, not the topic.

Chunking matters because the model doesn't actually see your full page. It sees 200-800 token chunks pulled from your page, selected by the reranker. If your chunk is the part that contains your business name, your concrete recommendation, and a piece of supporting evidence, you're a citation candidate. If your chunk is the part that says "let's first understand what a CRM is," you're not. Write so that any 300-word window from your page could stand alone as an answer. That structural requirement is invisible if you only think about page-level optimization.
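You can approximate the "any window could stand alone" test yourself. The sketch below slides overlapping word windows across a page and reports which ones contain every term you consider essential. It counts words rather than tokens, and the window size, step, and required terms are assumptions; real chunkers differ by engine.

```python
def windows(text: str, size: int = 300, step: int = 150) -> list[str]:
    """Split text into overlapping word windows that cover the whole page."""
    words = text.split()
    if len(words) <= size:
        return [" ".join(words)]
    return [" ".join(words[i:i + size]) for i in range(0, len(words) - size + step, step)]


def standalone_windows(text: str, required_terms: list[str], size: int = 300, step: int = 150) -> list[bool]:
    """For each window, report whether it contains every required term."""
    return [
        all(term.lower() in chunk.lower() for term in required_terms)
        for chunk in windows(text, size, step)
    ]


page = (
    "Let us first understand what a CRM is before anything else. "
    "For a small B2B SaaS team we recommend Hubspot or Pipedrive."
)
# Tiny window size so the two-sentence example splits into two chunks.
flags = standalone_windows(page, ["recommend", "Hubspot"], size=11, step=11)
print(flags)
```

A window flagged False is one that, if selected, would represent your page without its answer.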

Stage 4: synthesis

The remaining chunks — usually 5-15 from across multiple sites — are passed to the main language model along with the original user question and instructions to synthesize an answer. The model then writes the response, citing the specific chunks it draws claims from.

Three things determine which chunks get cited rather than just used for background:

Specificity of claim — chunks with concrete numbers, named entities, or precise statements get cited preferentially over generic claims. "Hubspot's free tier supports 1 million contacts" gets cited; "Hubspot has a generous free tier" gets paraphrased without attribution.
Source diversity — the model tends to cite sources from different domains, not multiple chunks from one site. Three citations to one domain look like bias; one citation each to three domains looks like research.
Conflict resolution — when two retrieved sources disagree, the model often cites the one whose framing it adopts in its synthesis. If you're the source that defines a category or coins a term, you get cited disproportionately compared to sources that use your term without attribution.
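The source-diversity tendency can be pictured as a selection rule. This is a toy model, not any engine's actual logic: it assumes each chunk already carries a relevance score and simply takes the best chunk per domain. The function name, the scores, and the example.com-style URLs are hypothetical.

```python
from urllib.parse import urlparse


def pick_citations(chunks: list[tuple[str, float]], limit: int = 3) -> list[str]:
    """Take the highest-scoring chunk from each domain, up to `limit` citations."""
    seen_domains = set()
    picked = []
    for url, _score in sorted(chunks, key=lambda c: c[1], reverse=True):
        domain = urlparse(url).netloc
        if domain in seen_domains:
            continue  # a better chunk from this domain was already chosen
        seen_domains.add(domain)
        picked.append(url)
        if len(picked) == limit:
            break
    return picked


chunks = [
    ("https://a.example/crm-guide", 0.9),
    ("https://a.example/crm-pricing", 0.8),
    ("https://b.example/hubspot-review", 0.7),
    ("https://c.example/crm-comparison", 0.6),
]
print(pick_citations(chunks))
```

Under a rule like this, a second strong chunk from the same domain loses to weaker chunks from other domains, which is why one excellent chunk per page matters more than several good ones.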

Most sites optimize for the model layer (stage 4) when the leverage is in stages 2 and 3. That's a real mistake. Stage 4 is where citation tone is decided, but stages 2 and 3 are where you stop being a candidate.

Five technical realities most advice ignores

Working back through the pipeline, here are the things that actually move citation share in 2026 — and the things most SEO advice still hasn't internalized:

1. Bing index health is foundational, not optional. If your site isn't fully indexed in Bing, ChatGPT can't see it for most queries. Verify your site in Bing Webmaster Tools, submit a sitemap, and watch the AI Performance dashboard (which launched in February 2026) for actual citation counts. Most sites we audit have 30-60% of their pages missing from Bing's index. Each missing page is a hole the retrieval step falls through.

2. Freshness is a stronger signal than backlinks for AI retrieval. The recency bias in RAG retrieval is significant: in our audits, citations drop sharply for content older than six months. This doesn't mean rewriting everything every quarter. It means updating the date on pages where the information is still valid, and substantively refreshing pages where the information has changed. The "last modified" signal feeds both Bing and Google retrieval scoring.

3. Schema does help, but not how most articles claim. The benefit isn't that the AI engine "reads" your schema as a special citation source — it's that schema makes your page easier to chunk correctly. FAQ schema and structured data make it more likely that the relevant 300-word window the reranker pulls actually contains your answer rather than your navigation. Schema is chunk-shaping, not citation magic.

4. Site-internal entity consistency outranks raw authority. AI engines build a mental model of "what your site is" by reading multiple pages and looking for consistency in the entities you reference, the categories you place yourself in, and the people/companies you name. A site that calls itself a "CRM" on one page and a "sales platform" on another splits its entity signal. A site that consistently identifies as one entity, in one category, with consistent supporting facts, builds entity weight even if it has fewer backlinks than competitors.

5. The answer must be in the first half of the page. The reranker selects chunks based on how directly they answer the sub-query. Burying your answer below 1,500 words of setup means the chunks selected from your page won't contain the answer. They'll contain the setup. The first 600 words of any page should be self-contained enough that the page could be cited from those alone.
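Point 5 is the easiest of the five to test mechanically. A minimal sketch, assuming you can name a phrase that marks where your answer starts (here "we recommend") and using the 600-word limit from above; the helper names are hypothetical, and the match is exact on whitespace-separated words.

```python
from typing import Optional


def answer_position(page_text: str, answer_phrase: str) -> Optional[int]:
    """Return the 1-based word index where the phrase starts, or None if absent."""
    words = page_text.lower().split()
    target = answer_phrase.lower().split()
    if not target:
        return None
    for i in range(len(words) - len(target) + 1):
        if words[i:i + len(target)] == target:
            return i + 1
    return None


def answers_early(page_text: str, answer_phrase: str, limit: int = 600) -> bool:
    """True if the answer phrase starts within the first `limit` words."""
    position = answer_position(page_text, answer_phrase)
    return position is not None and position <= limit


buried = "setup " * 1500 + "we recommend Hubspot, Close, or Pipedrive"
direct = "For a small B2B SaaS team, we recommend Hubspot, Close, or Pipedrive."
print(answers_early(buried, "we recommend"), answers_early(direct, "we recommend"))
```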

What this means for content strategy

If the pipeline is real, the implications are concrete. Write more pages, not longer pages, when targeting a fan-out topic. Put your answer first, not last. Use schema to make your answer easier to extract. Keep your entity signals consistent across pages. Verify and monitor your Bing index status weekly. Update the dates on pages whose content is still valid. Refresh pages substantively, not cosmetically, every six months.
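Entity consistency in particular lends itself to a quick spot check. The sketch below counts how many pages use each category label you might describe yourself with. It is a plain substring count under assumed labels and made-up page text, not a model of how any engine resolves entities.

```python
from collections import Counter


def category_labels(pages: dict[str, str], labels: list[str]) -> Counter:
    """Count the pages on which each candidate category label appears."""
    counts = Counter()
    for text in pages.values():
        lowered = text.lower()
        for label in labels:
            if label.lower() in lowered:
                counts[label] += 1
    return counts


def is_split(counts: Counter) -> bool:
    """True when more than one label is in use across the site."""
    return sum(1 for n in counts.values() if n > 0) > 1


# Hypothetical site copy for illustration.
site_pages = {
    "/": "Acme is a CRM for small teams",
    "/pricing": "Acme CRM pricing",
    "/about": "Acme is a sales platform",
}
label_counts = category_labels(site_pages, ["CRM", "sales platform"])
print(label_counts, is_split(label_counts))
```

Here the "/about" page is the outlier: rewording it to match the other two would remove the split.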

None of this contradicts traditional SEO. Most of it overlaps with what good SEO has always been. The difference is that for AI citation, the cost of failing at retrieval is total — you don't get cited at all — while in traditional SEO the cost is graceful degradation in rankings. The new game is much more all-or-nothing, and the threshold to get pulled into retrieval is the lever that matters most.

How to know if your site is making it through retrieval

The Bing Webmaster Tools AI Performance dashboard (launched February 10, 2026) is the only first-party citation tracking that currently exists. It shows total citations, average cited pages, grounding queries (sub-queries Bing surfaced your site for), and per-page citation data. If a page shows zero grounding queries over 30 days, the retrieval step isn't finding it. That's diagnostic.

For the engines that don't have first-party tools — Claude, Perplexity, ChatGPT (directly), Gemini, Google AI Overviews — you have to run audit prompts manually or use a tool that runs them at scale. Either approach starts with the same step: pick the 20 sub-queries most relevant to your business, run each against each engine, and note whether you're cited. The pattern of misses tells you exactly which stage of the pipeline you're failing at.
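If you run the prompts by hand, a small tally keeps the results comparable across engines. The sketch below assumes you record one True/False per engine and sub-query pair; the engine names come from the list above, while the recorded values and the function name are placeholders. It reports the citation rate per engine and the sub-queries that no engine cited you for.

```python
from collections import Counter


def summarize_audit(results: dict[tuple[str, str], bool]) -> tuple[dict[str, float], list[str]]:
    """results maps (engine, sub_query) to whether your site was cited."""
    totals, hits = Counter(), Counter()
    cited_anywhere: dict[str, bool] = {}
    for (engine, query), cited in results.items():
        totals[engine] += 1
        hits[engine] += int(cited)
        cited_anywhere[query] = cited_anywhere.get(query, False) or cited
    rates = {engine: hits[engine] / totals[engine] for engine in totals}
    missed_everywhere = sorted(q for q, cited in cited_anywhere.items() if not cited)
    return rates, missed_everywhere


# Placeholder observations, not real audit data.
audit_results = {
    ("ChatGPT", "crm pricing"): True,
    ("ChatGPT", "hubspot vs pipedrive"): False,
    ("Perplexity", "crm pricing"): False,
    ("Perplexity", "hubspot vs pipedrive"): False,
}
rates, missed = summarize_audit(audit_results)
print(rates, missed)
```

Sub-queries missed on every engine are the first place to look, since a miss that is consistent across different indexes is unlikely to be an indexing quirk of one of them.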

Where Reffed fits

Reffed runs the audit step automatically. We crawl your site, check your Bing and Google index health, run a set of category-specific prompts against the six major AI engines, and report which prompts cite you, which cite competitors, and which return nothing useful. The audit also flags structural issues — answer-first failures, chunk-killing layouts, entity inconsistencies, schema gaps — that map directly to the four pipeline stages above. You can run a free audit on any site, no signup.

If you want to dig deeper, the free Foundations course (5 lessons, 45 minutes) walks through the RAG pipeline and the 8 citation prompts every business should win. The paid Quickstart ($147 founding) covers the full execution playbook across all six AI engines.

Try the audit

See exactly which RAG stages your site fails at. Free, 60 seconds, no signup.
