Product Deep Dive


Admin
2/6/2026
Inside the SOAR-V Pipeline, Part 2: The Citation Resolution Service — From Link to Proof

 

This is Part 2 of our five-part series on the SOAR-V verification pipeline. In Part 1, we covered the Claim Extraction Engine — the module that identifies every verifiable assertion in a report. Today we're examining what happens next: the Citation Resolution Service, or CRS.

 

The CRS has a deceptively simple job description: take a citation reference attached to a claim and retrieve the actual source content. In practice, this is one of the most technically challenging modules in the pipeline — and the one where most "fact-checking" systems quietly fail.

 

What Is the Citation Resolution Service?

 

The CRS is the bridge between a claim's citation marker and the real-world content that citation points to. When a report says "Revenue grew 23% year-over-year (Source: Acme Corp Q3 2025 Earnings Report)," the CRS is the module that actually retrieves Acme Corp's Q3 2025 earnings report, extracts the relevant text, and packages it for the Content Grounding Analyzer to evaluate.

 

This matters because a citation existing is not the same as a citation being correct. The HalluHard benchmark drew a sharp distinction between two types of hallucination failures:

 

Reference-grounding failures: The source doesn't exist at all. The URL leads nowhere, the paper was never published, the report was fabricated. Web search integration has reduced these significantly — when a model can search, it's less likely to invent sources entirely.

 

Content-grounding failures: The source exists, but the AI misrepresents what it says. This is the more dangerous and more persistent failure mode. The link works. The document is real. But the claim either overstates, misattributes, decontextualizes, or subtly distorts what the source actually contains.

 

The CRS exists to enable detection of content-grounding failures. You can't check whether a claim accurately represents a source if you don't have the source text. And getting that source text — reliably, at scale, across dozens of formats and access conditions — is harder than it looks.

 

Why Most Fact-Checking Systems Fail Here

 

The typical AI fact-checking approach works like this: the model includes a URL in its response, and the fact-checker visits that URL to see if it exists. If the URL resolves to a real page, the citation is considered "verified."

 

This catches reference-grounding failures. It completely misses content-grounding failures. The URL being real tells you nothing about whether the claim accurately represents what's on the page.

 

A more sophisticated approach fetches the page and does a keyword match — does the source page contain the same terms as the claim? But keyword matching can't detect exaggeration, misattribution, or decontextualization. The words might all be there while the meaning is completely different.

 

The CRS goes further. It doesn't just check that a source exists or contains matching keywords. It retrieves the full source text, identifies the specific section most relevant to the claim, and packages both the full text and the targeted excerpt for deep semantic analysis in the next module.
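Concretely, the CRS's contract with the downstream module can be pictured as a small record type. This is a hypothetical sketch; the field names and statuses are assumptions, not the actual SOAR-V schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ResolvedSource:
    """Illustrative shape of a CRS result handed to the Content Grounding Analyzer."""
    claim_id: str
    url: str
    status: str                 # e.g. "resolved", "dead_link", "inaccessible"
    full_text: Optional[str]    # cleaned full source text, kept for context
    excerpt: Optional[str]      # the section most relevant to the claim
    excerpt_score: float = 0.0  # similarity between claim and excerpt

resolved = ResolvedSource(
    claim_id="c-101",
    url="https://example.com/acme-q3-2025",
    status="resolved",
    full_text="... full earnings report text ...",
    excerpt="Revenue grew 23% year-over-year in Q3 2025.",
    excerpt_score=0.91,
)
```

The key design point is that both `full_text` and `excerpt` travel together: the excerpt focuses the comparison, while the full text lets the analyzer check surrounding context.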

 

How the CRS Works

 

Step 1: Citation Parsing

 

Every claim extracted by the CEE includes a citation reference — a pointer to a source in the report's research corpus. The CRS parses this reference and matches it to a source record in our database. Each source record contains the URL, title, author, publication date, and any metadata collected during the research phase.
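The parsing-and-lookup step can be sketched in a few lines. The inline marker format (`[src-7]`) and the record fields here are invented for illustration; the real corpus schema is internal to the pipeline:

```python
import re
from typing import Optional

# Toy source database keyed by source id; in practice these records
# are collected during the research phase.
SOURCES = {
    "src-7": {
        "url": "https://example.com/acme-q3-2025",
        "title": "Acme Corp Q3 2025 Earnings Report",
        "date": "2025-10-28",
    },
}

def parse_citation(claim_text: str) -> Optional[str]:
    """Extract the citation id from an inline marker like '[src-7]'."""
    m = re.search(r"\[(src-\d+)\]", claim_text)
    return m.group(1) if m else None

def resolve_record(claim_text: str) -> Optional[dict]:
    """Match a claim's citation marker to its source record, if any."""
    sid = parse_citation(claim_text)
    return SOURCES.get(sid) if sid else None

record = resolve_record("Revenue grew 23% year-over-year. [src-7]")
```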

 

Step 2: Source Retrieval

 

This is where the complexity lives. Sources come in many formats and access conditions, and the CRS must handle all of them.

 

Web pages (HTML): The most common source type. The CRS fetches the URL and applies a readability algorithm to strip navigation bars, advertisements, sidebars, cookie banners, and other chrome — leaving only the article content. This cleaned text is what gets evaluated, not the raw HTML. Without this cleaning step, a verification check might match against ad copy or navigation text instead of the actual article content.
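A minimal version of this cleaning step can be built on the standard library's HTML parser: skip everything inside chrome tags, keep the rest. The tag list here is an illustrative subset; a production readability pass is considerably more involved:

```python
from html.parser import HTMLParser

# Tags whose content is page chrome, not article text (illustrative subset).
CHROME_TAGS = {"nav", "aside", "footer", "header", "script", "style"}

class ArticleTextExtractor(HTMLParser):
    """Collect text outside chrome tags; drop everything inside them."""
    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside chrome tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in CHROME_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in CHROME_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

html = """<html><body>
<nav>Home | Products | Pricing</nav>
<article>Revenue grew 23% year-over-year.</article>
<footer>Cookie notice</footer>
</body></html>"""

parser = ArticleTextExtractor()
parser.feed(html)
article_text = " ".join(parser.chunks)
# article_text now contains only the article sentence, not the nav or footer.
```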

 

PDF documents: Academic papers, annual reports, regulatory filings, and technical whitepapers are frequently published as PDFs. The CRS uses layout-aware PDF parsing that handles multi-column layouts, embedded tables, figures with captions, headers and footers, and footnotes. A naive PDF-to-text conversion produces garbled output from a two-column academic paper; our parser preserves the logical reading order.
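The reading-order problem is easy to see with toy data. Word extractors return bounding boxes; a naive top-to-bottom sort interleaves the two columns, while grouping by column first preserves the logical order. This sketch uses invented coordinates and a fixed column split; a real layout-aware parser also handles tables, captions, headers, and footnotes:

```python
# Each word from a PDF extractor carries a bounding-box position.
words = [
    {"text": "Methods",  "x": 50,  "y": 100},
    {"text": "Results",  "x": 320, "y": 100},
    {"text": "were",     "x": 50,  "y": 120},
    {"text": "showed",   "x": 320, "y": 120},
    {"text": "applied.", "x": 50,  "y": 140},
    {"text": "gains.",   "x": 320, "y": 140},
]

def reading_order(words, column_split=300):
    """Read the left column fully, then the right, top-to-bottom within each."""
    left = [w for w in words if w["x"] < column_split]
    right = [w for w in words if w["x"] >= column_split]
    key = lambda w: (w["y"], w["x"])
    return [w["text"] for w in sorted(left, key=key) + sorted(right, key=key)]

text = " ".join(reading_order(words))
# A naive sort by y alone would interleave the columns:
# "Methods Results were showed applied. gains."
```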

 

Financial filings: SEC filings (10-K, 10-Q, 8-K), earnings transcripts, and investor presentations follow specific formatting conventions. The CRS has specialized parsers for these document types that extract structured data — revenue figures, growth rates, segment breakdowns — in a way that preserves the numerical context needed for verification.

 

Regulatory documents: FDA guidance documents, EU regulations, compliance frameworks, and similar government publications often use nested section numbering, cross-references, and defined terms. The CRS preserves this structure because a regulatory claim often references a specific section or subsection, and verification requires matching the claim against the correct subsection, not just the document at large.
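The subsection-matching requirement can be sketched as an exact lookup against the document's section index. The section numbers and text here are invented, not from any real guidance document; the point is that "Section 4.2.3" must resolve to 4.2.3 itself, not its parent 4.2:

```python
import re
from typing import Optional

# A regulatory document flattened into (section number -> text) pairs.
SECTIONS = {
    "4.2":   "General customer due diligence requirements.",
    "4.2.3": "Enhanced due diligence for high-risk correspondent accounts.",
    "4.3":   "Recordkeeping obligations.",
}

def section_for_claim(claim: str) -> Optional[str]:
    """Pull a dotted section number out of the claim and look it up exactly."""
    m = re.search(r"[Ss]ection\s+(\d+(?:\.\d+)*)", claim)
    return SECTIONS.get(m.group(1)) if m else None

excerpt = section_for_claim("Per Section 4.2.3 of the 2025 guidance, ...")
```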

 

Cached content: During the research phase, our specialist agents access and cache the full text of every source they use. If a source becomes unavailable at verification time — the page goes behind a paywall, the URL changes, the server is temporarily down — the CRS falls back to the cached version from research time. This ensures that verification isn't blocked by transient access issues.
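The fallback logic amounts to a simple try-live-then-cache chain. This sketch stubs out the live fetch to simulate an outage; in production the first branch would be an HTTP request, and the cache would be the research-time snapshot store:

```python
from typing import Tuple

# Research-time cache keyed by URL (contents invented for illustration).
RESEARCH_CACHE = {
    "https://example.com/report": "Cached full text captured at research time.",
}

def fetch_live(url: str) -> str:
    """Stub standing in for an HTTP fetch; here it always fails."""
    raise ConnectionError("server temporarily down")

def resolve_text(url: str) -> Tuple[str, str]:
    """Return (text, provenance), where provenance is 'live' or 'cache'."""
    try:
        return fetch_live(url), "live"
    except (ConnectionError, TimeoutError):
        if url in RESEARCH_CACHE:
            return RESEARCH_CACHE[url], "cache"
        raise

text, provenance = resolve_text("https://example.com/report")
```

Recording the provenance matters: downstream consumers can distinguish "verified against the live page" from "verified against the research-time snapshot."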

 

Step 3: Failure Handling

 

Not every source resolves successfully, and the CRS categorizes failures into distinct types:

 

404/Dead link: The URL no longer exists. This is a reference-grounding failure — the claim cited a source that can't be found. The claim is flagged immediately.
Paywall/Authentication: The content exists but requires paid access. The CRS attempts cached content first, then tries alternative access paths (Google Scholar cached versions, Internet Archive snapshots). If all fail, the claim is flagged as "source inaccessible" rather than "source doesn't exist."
Parse failure: The document was retrieved but couldn't be meaningfully parsed (corrupted PDF, JavaScript-heavy page that didn't render, binary file). Flagged for manual review.
Format mismatch: The URL points to something that isn't a document — an image, a download link, a video. Flagged as an unexpected source type.

 

Each failure type routes differently in the pipeline. Reference failures go directly to the Researcher Escalation Queue. Access failures trigger re-research to find alternative sources. Parse failures get manual attention.
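The routing described above is essentially a lookup from failure category to destination. The queue names below are assumptions based on this description, not the pipeline's actual identifiers:

```python
# Map each CRS failure category to its downstream destination.
ROUTES = {
    "dead_link":       "researcher_escalation_queue",  # reference failure
    "inaccessible":    "re_research",                  # find an alternative source
    "parse_failure":   "manual_review",
    "format_mismatch": "manual_review",
}

def route(failure_category: str) -> str:
    """Unknown categories default to manual review rather than being dropped."""
    return ROUTES.get(failure_category, "manual_review")
```

Defaulting unknown categories to manual review is the conservative choice: a failure mode the system has never seen should surface to a human, not disappear.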

 

Step 4: Excerpt Identification

 

Once the full source text is retrieved, the CRS identifies the most relevant excerpt — the specific section that the claim is most likely drawing from.

 

This uses embedding similarity. The CRS generates a vector embedding of the claim text and compares it against embeddings of each paragraph or section in the source document. The section with the highest cosine similarity is tagged as the primary excerpt.
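The selection step reduces to computing cosine similarity between the claim embedding and each candidate paragraph embedding, then taking the maximum. The three-dimensional vectors below are toy values; production embeddings come from an embedding model and have hundreds of dimensions:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

claim_vec = [0.9, 0.1, 0.0]   # embedding of the claim text
paragraphs = [
    ("Q3 revenue grew 23% year-over-year.",        [0.85, 0.15, 0.05]),
    ("The company opened a new office in Austin.", [0.10, 0.90, 0.10]),
    ("Forward-looking statements disclaimer.",     [0.05, 0.05, 0.95]),
]

# Tag the paragraph with the highest similarity as the primary excerpt.
best_text, best_score = max(
    ((text, cosine(claim_vec, vec)) for text, vec in paragraphs),
    key=lambda pair: pair[1],
)
```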

 

This excerpt identification is critical for the Content Grounding Analyzer. The CGA needs to compare the specific claim against the specific part of the source that's relevant. Checking a claim about Q3 revenue against the full text of a 40-page annual report would produce noise. Checking it against the specific paragraph that discusses Q3 revenue produces a precise evaluation.

 

The CRS outputs both the full source text (for context) and the targeted excerpt (for focused verification). The CGA uses both.

 

How This Exceeds Expectations

 

No dead-link surprises. Every source cited in a PromptReports deliverable has been resolved and retrieved at verification time. Users will never click a citation and hit a 404, because the CRS catches dead links before the report is delivered. If a source couldn't be resolved, the claim is either re-researched with an alternative source or flagged visibly — never silently left with a broken link.

 

Format-agnostic verification. Users requesting reports about regulated industries (healthcare, finance, legal) frequently need sources from government databases, regulatory filings, and academic papers — all of which are commonly published as PDFs. The CRS handles these with the same rigor as web articles, so verification quality doesn't degrade when sources are in complex formats.

 

Excerpt precision. When a user clicks through a verification result, they don't see the entire source document and have to hunt for the relevant passage. They see the specific excerpt that the claim draws from, highlighted and contextualized. This transforms the verification experience from "here's the source, good luck finding the relevant part" to "here's exactly what the source says about this specific assertion."

 

Resilience against link rot. The internet changes constantly. Pages move, get restructured, go behind paywalls, or disappear. Because the CRS caches source content during research and falls back to cached versions when needed, verification results remain stable even if the original source URL changes after the report is generated.

 

Real-World Use Cases

 

Use Case 1: Academic Research Synthesis. A pharmaceutical company needs a report synthesizing recent clinical trial data for a therapeutic area. Many sources are academic papers published as PDFs on journal websites. The CRS retrieves each paper, parses multi-column layouts with embedded statistical tables, and identifies the specific sections containing the efficacy data cited in the report. The CGA can then verify that the report's claim about "72% response rate in the Phase III trial" actually matches the 72% figure in Table 3 of the original paper — not an adjacent figure from a different study arm.

 

Use Case 2: Competitive Intelligence with Pricing Data. A sales operations team needs verified competitor pricing data. The report cites competitor pricing pages, published rate cards, and press releases about pricing changes. The CRS fetches each pricing page (often JavaScript-heavy and dynamically loaded) and extracts the structured pricing data. When a competitor later changes their pricing page, the cached version preserves what the page said at the time the report was generated — creating a verifiable snapshot.

 

Use Case 3: Regulatory Compliance Research. A fintech company needs a report on anti-money laundering regulations across multiple jurisdictions. Sources include FinCEN guidance documents, EU regulatory texts, and UK FCA publications — many of which are lengthy PDFs with nested section numbering. The CRS retrieves each document, identifies the specific section referenced by each claim (e.g., "Section 4.2.3 of the 2025 FinCEN guidance"), and packages the exact subsection for CGA evaluation. This precision is essential because regulatory claims must reference the correct section, not just the correct document.

 

Use Case 4: Technology Vendor Assessment. A CTO needs a report evaluating cloud infrastructure vendors. Sources span vendor documentation (often sprawling, multi-page technical docs), third-party benchmark reports (sometimes behind registration walls), and community discussions. The CRS handles the breadth of source types: parsing structured documentation, accessing cached versions of gated content, and extracting relevant technical specifications from dense feature comparison tables.

 

Use Case 5: M&A Due Diligence Support. An investment bank needs a market landscape report for a potential acquisition target. Sources include SEC filings (10-K, proxy statements), investor presentations, earnings call transcripts, and news coverage. The CRS uses specialized financial document parsers to extract precise figures from the standardized tables in SEC filings — ensuring that a claim about "operating margin of 23.4%" can be verified against the exact line item in the 10-K, not against a press release summary that might round differently.

 

What Comes Next

 

The CRS's output — resolved source text with targeted excerpts for every claim — feeds directly into Module 3: the Content Grounding Analyzer. The CGA is where actual verification happens: checking whether the claim accurately represents what the source says through three stages of analysis (Relevance, Support, and Fidelity).

 

The CRS makes the CGA possible. Without reliably retrieved, cleanly parsed, precisely excerpted source content, the CGA would have nothing meaningful to analyze. The quality of verification is directly proportional to the quality of source resolution.

 

In Part 3, we'll break down the Content Grounding Analyzer — the three-stage engine at the heart of the verification pipeline.

 

Verified claims start with verified sources. [Generate your first report →](/register)