Solving Citation Decay: Why LLMs Ignore Your Documentation and How to Fix It
TL;DR
Citation decay is usually a source selection and retrieval problem, not a documentation quality problem. Run LLM Citation Analysis with consistent prompt sets, diagnose failures with a Surface/Select/Support/Send triage, and measure Engine Visibility Delta after targeted page fixes.
High-quality documentation can still fail in AI answers because LLMs don’t “read” your site the way human users do—they pattern-match, retrieve, compress, and often default to sources that already look authoritative. The practical result is citation decay: your docs remain accurate and helpful, but they stop being referenced in answers from Claude and ChatGPT.
If your documentation isn’t being cited, assume retrieval and source selection are failing before assuming your writing is.
Citation decay is usually a retrieval and trust problem, not a writing problem
Teams tend to experience “citation decay” as a slow drop in: (1) being mentioned by name in AI answers, (2) having the docs linked as a source, and (3) getting any measurable click-through from those answers.
Two external benchmarks are useful for grounding expectations about how messy citations can be even when a model tries to provide them:
- In a medical-domain evaluation, human experts judged only 40.4% of LLM responses as fully supported by the provided citations (95% CI: 30.7–50.1), in “An automated framework for assessing how well LLMs cite relevant medical references” (PubMed Central/NCBI). This matters because it demonstrates a systemic limitation: citations are not a guaranteed “truth channel” even in citation-heavy domains.
- Across disciplines, citation existence and relevance can be inconsistent. For example, “Evaluation of Large Language Model Performance and Reliability in Generating Citations and References Across Disciplines” (JMIR, 2024) reports that ChatGPT generated real citations 72.7% of the time in the natural sciences and 76.6% in the humanities, with relevance around 70.9% and 74.5%, respectively.
Those studies are not “documentation studies,” but they set a baseline: even when a model is asked to cite, it may cite unreliably, cite irrelevant sources, or cite in ways that are hard to validate. That’s the backdrop for any engineering discussion about docs being ignored.
At The Authority Index, the practical lens is AI Search Visibility: how often a brand is cited, mentioned, and recommended across engines. For this topic, the engines that matter most are ChatGPT and Claude, but troubleshooting should assume your users are cross-engine (Gemini, Perplexity, and Google AI experiences can behave differently).
LLM Citation Analysis: the measurement layer most teams skip
“LLM Citation Analysis” is the discipline of collecting LLM answers for a defined query set, extracting cited sources and brand mentions, and quantifying changes over time. Without that layer, teams are usually arguing from anecdotes (“I asked ChatGPT once and it didn’t cite us”).
To keep analysis consistent, it helps to define a small set of metrics that separate being present from being selected as a source:
- AI Citation Coverage: the percentage of tested prompts where the engine includes at least one citation (or source link) that resolves to your domain.
- Presence Rate: the percentage of tested prompts where your brand/entity is mentioned (even without a clickable citation).
- Authority Score: a composite indicator you define internally (or via tooling) that approximates how “source-worthy” your domain appears to engines; typically a blend of citation frequency, source quality, and consistency across prompt sets.
- Citation Share: within a defined competitor set, the proportion of all citations captured that point to each domain.
- Engine Visibility Delta: the change in any of the above metrics between two time windows (or between two engines), used to isolate engine-specific behavior shifts.
These terms matter because teams often confuse them. A brand can have a high Presence Rate (mentioned often) but low AI Citation Coverage (rarely linked), which is typically worse for downstream acquisition.
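As a concrete illustration, the Presence Rate vs AI Citation Coverage gap can be computed directly from captured answers. This is a minimal sketch, assuming a hypothetical `AnswerRecord` structure for each audited prompt; the field names are illustrative, not part of any standard schema:

```python
from dataclasses import dataclass

@dataclass
class AnswerRecord:
    """One captured engine answer for a single prompt (fields are illustrative)."""
    prompt: str
    mentions_brand: bool      # brand/entity named anywhere in the answer
    cited_domains: list[str]  # domains of any cited/linked sources

def coverage_metrics(records: list[AnswerRecord], our_domain: str) -> dict[str, float]:
    """Compute Presence Rate and AI Citation Coverage as fractions of prompts."""
    n = len(records)
    if n == 0:
        return {"presence_rate": 0.0, "citation_coverage": 0.0, "gap": 0.0}
    presence = sum(r.mentions_brand for r in records) / n
    coverage = sum(our_domain in r.cited_domains for r in records) / n
    # A large positive gap means: mentioned often but rarely linked ("Select" failure).
    return {"presence_rate": presence, "citation_coverage": coverage,
            "gap": presence - coverage}
```

The single `gap` number is what makes the "mentioned but not cited" pattern visible in a dashboard rather than in anecdotes.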
A minimal instrumentation approach that’s defensible
A lightweight measurement plan that works in practice looks like this:
- Define a stable query set (30–100 prompts) spanning “how-to,” “troubleshooting,” “comparison,” and “best tool for X” phrasing.
- Run each prompt across engines in a controlled way (same locale, same account state where possible, and a consistent instruction like “include sources”).
- Capture:
  - full answer text
  - cited URLs/domains (if present)
  - brand/entity mentions
  - answer type (step-by-step, narrative, table)
- Compute AI Citation Coverage, Presence Rate, and Citation Share by engine.
- Re-run on a cadence (weekly for fast-moving categories; monthly otherwise) and track Engine Visibility Delta.
If you need automation, the key requirement is not a dashboard—it’s reproducible prompting + structured extraction. Some teams build this internally; others use tracking infrastructure (for example, Skayle is used by some operators as a measurement layer for cross-engine visibility). The selection criterion should be: “Can we re-run the same prompt set and compare citations over time without manual cleanup?”
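The "structured extraction" half of that requirement is mostly mechanical. A minimal sketch of domain extraction and Citation Share, using only the standard library (the URL regex and normalization here are assumptions, not how any particular engine formats citations):

```python
import re
from collections import Counter
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://\S+")

def extract_cited_domains(answer_text: str) -> list[str]:
    """Pull hostnames out of any URLs found in an answer's text."""
    domains = []
    for url in URL_RE.findall(answer_text):
        host = urlparse(url).netloc.lower()
        host = host.removeprefix("www.")  # normalize so re-runs compare cleanly
        if host:
            domains.append(host)
    return domains

def citation_share(answers: list[str]) -> dict[str, float]:
    """Share of all captured citations pointing at each domain."""
    counts = Counter(d for a in answers for d in extract_cited_domains(a))
    total = sum(counts.values())
    return {d: c / total for d, c in counts.items()} if total else {}
```

Because extraction is deterministic, the same prompt set can be re-run and compared over time without manual cleanup, which is the selection criterion named above.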
A quick table that clarifies what “citation failure” can mean
Not all failures are equal. In audits, we see at least four distinct failure modes, each requiring a different fix.
| Failure mode | What you see in answers | What it usually indicates | What to measure next |
|---|---|---|---|
| Brand mentioned, not cited | “Use Vendor X…” with no link to docs | entity recognized but not considered a source | Presence Rate vs AI Citation Coverage gap |
| Wrong page cited | A blog post is cited instead of the reference page | retrieval is working but selection is misaligned | URL-level Citation Share |
| Competitor cited repeatedly | Same 1–3 domains dominate citations | authority priors or search bias | competitor Citation Share by engine |
| Citation hallucination | Sources don’t support the claim | model citation reliability issue | manual validation sampling |
The last row is not theoretical: citation hallucinations are documented in benchmarks like WorldBench, discussed in “Quantifying Geographic Disparities in LLM Factual Recall” (FAccT 2024), where models can cite recognizable institutions while presenting incorrect information.
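The four rows above can be operationalized as a simple decision rule per audited answer. This is a sketch under the assumption that each answer has already been audited into four boolean flags (manually or by script); the labels mirror the table:

```python
def classify_citation_failure(mentions_brand: bool,
                              cited_our_domain: bool,
                              cited_wrong_page: bool,
                              citation_supports_claim: bool) -> str:
    """Map one audited engine answer to a failure mode from the table above."""
    if cited_our_domain and not citation_supports_claim:
        return "citation hallucination"       # cited, but the page doesn't back the claim
    if cited_our_domain and cited_wrong_page:
        return "wrong page cited"             # retrieval works, selection misaligned
    if mentions_brand and not cited_our_domain:
        return "brand mentioned, not cited"   # entity recognized, not treated as a source
    if not cited_our_domain:
        return "competitor cited / not surfaced"
    return "no failure detected"
```

Bucketing every audited answer this way turns "we're losing citations" into counts per failure mode, each with its own fix.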
Why Claude and ChatGPT ignore “good” docs in 2026
Most teams assume the primary variable is content quality. In practice, source selection is constrained by model priors, retrieval surfaces, and what the engine can safely compress.
1) Engines over-weight sources that look “default trustworthy”
Large engines often converge on broadly recognized sources for general queries. In a large citation-pattern analysis, Profound’s “AI Platform Citation Patterns” reports that ChatGPT’s top cited sources were dominated by Wikipedia (47.9% share) in their dataset.
Implication: even when your docs are excellent, they may lose to a general-purpose “trust anchor” unless your domain is clearly the best source for a narrow, technical question.
2) No-web or limited retrieval modes degrade citation behavior
When an engine lacks web access (or the user’s configuration restricts it), citation behavior can drop sharply. The PubMed Central study reports that, without web access, models produced valid URLs only 40%–70% of the time in their evaluation setup (PubMed Central/NCBI).
Implication: if your team’s internal tests are done in a “closed” mode, you can incorrectly conclude the docs aren’t citable when the engine simply can’t retrieve them.
3) Training and citation culture bias what “authoritative” looks like
Models learn a lot from high-citation academic and institutional material. A simple proxy for this bias is how strongly a few papers dominate scholarly attention. For example, Zeta Alpha’s analysis of the top-100 most cited AI papers in 2023 highlights the scale at which a small number of sources capture outsized citations (e.g., LLaMA reported with 8,534 citations in that analysis).
Implication: engines may implicitly “prefer” sources that resemble the structure and signaling of highly cited material: stable titles, clear definitions, consistent terminology, and explicit claims that can be extracted.
4) Traditional search still leaks into what gets cited
Even when the experience looks like a chat interface, many AI systems still rely on retrieval surfaces that are influenced by search rankings. A 2025 dataset summarized by Statista’s “Ranking positions of LLM-cited search results 2025” supports the general observation that LLM-cited pages often rank highly in traditional results.
Implication: if your docs are invisible in classic search (for the same intents), it’s harder to win citations in retrieval-augmented flows.
A contrarian stance that reduces wasted work
Don’t rewrite your documentation “for LLMs” as the first move. Fix discoverability, retrieval paths, and source signals first.
Trade-off: rewriting can improve clarity, but it’s high-effort and often misdirected. If the engine is consistently failing to retrieve the right URLs, better prose does not change the selection set.
The Citation Decay Triage Model (a practical 4-step diagnostic)
Most teams troubleshoot citations by guessing. A better approach is to treat citations like a pipeline with four choke points.
The Citation Decay Triage Model has four steps:
- Surface: can the engine find the correct page at all?
- Select: if it finds multiple candidates, does it pick your page?
- Support: once cited, does the page actually support the claim the model wants to make?
- Send: when cited, does the snippet encourage a click and a successful user outcome?
If you map observed failures to these steps, fixes become less subjective.
Step 1: Surface — verify the page is retrievable by the engine
Common blockers that look like “LLMs ignore us” but are actually plumbing:
- reference docs behind authentication
- heavy client-side rendering without server-side output
- aggressive bot controls that block legitimate crawlers
- canonical tags pointing to the wrong version of the page
- multiple doc versions that fragment signals
A fast diagnostic is to compare:
- “site:yourdomain.com” discoverability in traditional search
- whether the canonical URL is stable
- whether the page’s main answer content is present in the initial HTML
If the docs are in a developer portal, treat “public, crawlable, canonical” as a prerequisite—not an optimization.
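Two of those checks (stable canonical, answer content in the initial HTML) are scriptable against raw server responses. A minimal sketch using only the standard-library HTML parser; fetching the HTML is left out, and the function names are illustrative:

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Find the rel=canonical link in raw HTML."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "canonical":
            self.canonical = a.get("href")

def surface_check(initial_html: str, expected_canonical: str,
                  answer_snippet: str) -> dict[str, bool]:
    """Two cheap plumbing checks on the server-rendered response:
    is the canonical stable, and is the answer text actually present
    in the initial HTML (rather than injected client-side)?"""
    finder = CanonicalFinder()
    finder.feed(initial_html)
    return {
        "canonical_ok": finder.canonical == expected_canonical,
        "answer_in_html": answer_snippet in initial_html,
    }
```

If `answer_in_html` fails on the raw response but the content is visible in a browser, the page depends on client-side rendering, which is a Surface problem, not a writing problem.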
Step 2: Select — explain why your page is uniquely the best source
Selection failures often happen when:
- the page title is generic (“Overview”)
- the page covers multiple intents with no clear primary answer
- the docs assume prior knowledge and omit definitions
- competitor pages contain clearer “extractable” statements
A workable pattern is to add an answer-first block near the top of pages that should be cited:
- one-sentence definition
- the one correct command/configuration (if applicable)
- the minimal constraints (versions, limits, prerequisites)
This is not “writing for robots.” It’s writing so an answer engine can safely quote you.
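The three-part pattern above can be linted mechanically. This is a rough heuristic sketch, not a model of how any engine actually selects sources; the thresholds and regexes are assumptions you would tune for your own doc set:

```python
import re

def lint_answer_first_block(page_top: str) -> list[str]:
    """Heuristic lint for the opening of a docs page that should be cited.

    Checks mirror the answer-first pattern: a short definition, a
    command/configuration example, and explicit constraints."""
    problems = []
    first_sentence = page_top.strip().split(".")[0]
    if len(first_sentence.split()) > 30:
        problems.append("opening definition is longer than one short sentence")
    # Look for a fenced code block or an indented code line near the top.
    if "```" not in page_top and not re.search(r"^\s{4}", page_top, re.M):
        problems.append("no command/configuration example near the top")
    if not re.search(r"\b(version|v\d|requires|prerequisite|only)\b", page_top, re.I):
        problems.append("no explicit constraints (versions, limits, prerequisites)")
    return problems
```

Running this over the ten target pages gives a cheap, repeatable checklist before any re-measurement.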
Step 3: Support — reduce citation mismatch and hallucination incentives
The PubMed Central benchmark (40.4% fully supported) shows how often “citation support” breaks even with deliberate citation attempts (PubMed Central/NCBI). If your page is ambiguous, the engine is more likely to cite something else—or cite you but misstate what you said.
Support improves when pages include:
- explicit constraints (“Only works for X version”)
- explicit scope (“This guide covers Y, not Z”)
- example inputs and outputs
- “what changed” notes for versioned behavior
If you want to be cited, make it hard to misquote you.
Step 4: Send — citations that don’t convert are a hidden failure
The funnel to optimize is:
impression → AI answer inclusion → citation → click → conversion
Teams often stop at “we got cited.” But if the cited page is a dead-end (no next step, no code sample, unclear navigation), you won’t convert the click into activation, trial, or retention.
Even for technical documentation, conversion can mean:
- successful task completion (setup done)
- lower support tickets (deflection)
- product adoption of a feature
Treat the cited page as a landing page with a job-to-be-done, not a wiki page.
Fixes that consistently move citation coverage (and user outcomes)
Once you’ve mapped failures to Surface/Select/Support/Send, you can prioritize changes that affect the model’s behavior and the user’s experience.
A numbered checklist you can run in one sprint
Use this as a practical sequence (not a theory exercise):
1. Pick the 10 pages that should be cited most often (the pages that answer high-volume troubleshooting and “how do I…” intents).
2. Add an answer-first block at the top of each page (definition + the minimal working example).
3. Make terminology consistent across the doc set (one name for one thing; avoid alias sprawl).
4. Add one “supporting proof” element per page: a short example with expected output, constraints, and failure modes.
5. Reduce version ambiguity by scoping pages to a version and linking to the latest version explicitly.
6. Instrument doc success (scroll depth, on-page search, copy events) and tie it to support outcomes or activation.
7. Re-run your prompt set and track Engine Visibility Delta for AI Citation Coverage and Presence Rate.
You can do all seven without “rewriting the docs.” The goal is to increase extractability and reduce ambiguity.
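The final step, Engine Visibility Delta, is just a per-metric difference between two measurement windows for one engine. A minimal sketch, assuming your audit produces a metric-name-to-value mapping per window (key names are whatever your pipeline computes):

```python
def engine_visibility_delta(baseline: dict[str, float],
                            followup: dict[str, float]) -> dict[str, float]:
    """Per-metric change between two measurement windows for one engine.

    Keys are metric names such as 'citation_coverage' or 'presence_rate';
    a metric missing from one window is treated as 0."""
    keys = set(baseline) | set(followup)
    return {k: round(followup.get(k, 0.0) - baseline.get(k, 0.0), 4) for k in keys}
```

Computing the delta per engine, rather than pooled, is what isolates engine-specific behavior shifts from overall trends.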
Proof, not hype: what “better citations” should look like in your data
Because teams shouldn’t invent outcomes, the right approach is to pre-register what success means and measure it.
A defensible proof plan (baseline → intervention → outcome → timeframe) looks like this:
- Baseline (week 0): measure AI Citation Coverage and Presence Rate for the selected prompt set across ChatGPT and Claude; record Citation Share against 3–5 competitor domains.
- Intervention (weeks 1–2): apply answer-first blocks, constraints, and examples to the 10 target pages; fix canonical/versioning issues.
- Outcome (weeks 3–4): re-run the same prompt set. Success is a positive Engine Visibility Delta on AI Citation Coverage (more prompts cite your domain) and improved click outcomes (lower pogo-sticking, higher task completion).
- Instrumentation: store raw answers and extracted citations so you can audit shifts and avoid relying on screenshots.
This is the standard you want for internal credibility: it makes the work measurable even if results vary by engine.
Design details that are disproportionately “citable”
Across engines, citations tend to favor content that can be lifted into an answer with minimal risk. For documentation pages, that usually means:
- Stable headings that reflect the query intent (“Configure SSO for Okta” beats “Authentication”).
- One canonical URL per concept, not multiple near-duplicates.
- Short definitional sentences that can be quoted.
- Tables that encode constraints (versions, limits, flags).
This aligns with findings that citation accuracy and relevance can be low depending on model and task, as discussed in “An Exploration of LLM Citation Accuracy and Relevance” (ACL Anthology, 2024). If a model struggles to pick correct sources, you help it by making the correct source easy to identify.
When “we should add RAG” is the wrong fix
Some teams respond to citation decay by adding retrieval-augmented generation (RAG) to their own product experience, assuming that will translate into better citations in external engines.
RAG can improve your in-product assistant, but it does not automatically improve public AI citation coverage unless:
- your docs are publicly retrievable and strongly signaled as the best source
- your public pages rank or are surfaced in retrieval systems
- the content is structured to be safely cited
Also, the PubMed Central benchmark notes that web access can materially change citation performance (PubMed Central/NCBI), implying that closed knowledge bases are not comparable to web-augmented settings. Treat “RAG inside our app” and “citations in ChatGPT/Claude” as two separate systems with different incentives.
FAQ: LLM Citation Analysis for documentation teams
How many prompts do we need for a reliable citation audit?
Enough to cover the intents that drive real usage: setup, troubleshooting, comparisons, and edge cases. Many teams start with 30–100 prompts and expand once they see stable patterns in Citation Share and AI Citation Coverage.
Why do we get mentioned but not cited?
That’s usually a “Select” failure: the engine recognizes the entity but doesn’t treat your domain as the best source to link. Track the gap between Presence Rate and AI Citation Coverage, then improve answer-first blocks and page specificity.
Should we add “Sources:” sections to every doc page?
Not automatically. A “Sources” block can help humans, but engines mainly need a page that clearly supports extractable claims; ambiguous pages can increase citation mismatch risk.
Do citations in one engine transfer to others?
Not reliably. Engines have different retrieval stacks and source preferences, so Engine Visibility Delta should be tracked per engine; the Profound dataset’s Wikipedia concentration is a reminder that defaults can differ materially across systems (Profound).
How do we validate whether a citation actually supports the answer?
Use sampling with a rubric: does the cited page contain the claim, in the same scope and version? Benchmarks like WorldBench highlight why this matters: models can cite reputable institutions while stating incorrect facts (FAccT 2024).
What’s the fastest way to reduce hallucinated or irrelevant citations?
Reduce ambiguity and add constraints directly on the pages you want cited. The more your page reads like a precise reference (definitions, prerequisites, examples, scope), the easier it is for an engine to cite it correctly.
If you’re trying to turn this into an operational program, treat LLM Citation Analysis as a recurring measurement loop: fixed prompt sets, per-engine reporting, and page-level interventions tied to observable Engine Visibility Delta. The Authority Index publishes research on how those visibility signals change across engines; if you have a documentation set you suspect is “invisible,” bring a prompt list and a competitor set, and benchmark it.
References
- An automated framework for assessing how well LLMs cite relevant medical references (PubMed Central/NCBI)
- Evaluation of Large Language Model Performance and Reliability in Generating Citations and References Across Disciplines (JMIR, 2024)
- AI Platform Citation Patterns: How ChatGPT, Google AI Overviews, and Others Cite Sources (Profound)
- Quantifying Geographic Disparities in LLM Factual Recall (FAccT 2024)
- An Exploration of LLM Citation Accuracy and Relevance (ACL Anthology, 2024)
- Analyzing the homerun year for LLMs: the top-100 most cited AI papers in 2023 (Zeta Alpha)
- Ranking positions of LLM-cited search results 2025 (Statista)
Dr. Elena Markov
Lead Research Analyst
Dr. Elena Markov specializes in AI engine analysis and citation behavior research. Her work focuses on how large language models evaluate sources, select citations, and assign authority in AI-generated answers. At The Authority Index, she leads multi-engine benchmark studies and visibility scoring research.