Product Deep Dive

Inside the SOAR-V Pipeline, Part 1: The Claim Extraction Engine — Finding Every Assertion That Matters

Admin
2/6/2026

 

This is the first post in a five-part series exploring each module of PromptReports.ai's SOAR-V verification pipeline — the system that checks every factual claim in every report we deliver. Today we're looking at the first module in the chain: the Claim Extraction Engine, or CEE.

 

Before you can verify a report, you have to know what to verify. That sounds obvious, but it's the step that every other AI research tool skips entirely — and it's the step that makes everything else in the pipeline possible.

 

What Is the Claim Extraction Engine?

 

The Claim Extraction Engine is the module that reads a completed research report and identifies every atomic factual assertion that can be independently checked against a source. It doesn't verify anything itself. It creates the map of everything that needs to be verified.

 

Think of it this way: a research report isn't one big claim. It's dozens — sometimes hundreds — of individual assertions woven together into a narrative. Some of those assertions are factual statements that are either right or wrong. Others are analytical conclusions, hedged speculation, structural transitions, or opinion. The CEE's job is to distinguish between these categories and pull out exactly the assertions that have verifiable answers.

 

A single paragraph in a market analysis might contain:

 

"The global observability market reached $41 billion in 2025 and is projected to grow at a 15.8% CAGR through 2028, according to Gartner. Cribl has emerged as a disruptive force in this segment, with its data routing platform now processing over 2 petabytes of daily data for enterprise customers. This growth trajectory suggests the market is entering a consolidation phase, particularly as legacy SIEM vendors struggle to compete on data pipeline flexibility."

 

A human reading this sees one paragraph. The CEE sees six distinct items:

 

| Extracted Item | Classification | Verifiable? |
| --- | --- | --- |
| "The global observability market reached $41 billion in 2025" | Statistical claim | Yes — Critical priority |
| "Projected to grow at a 15.8% CAGR through 2028" | Statistical claim | Yes — Critical priority |
| "According to Gartner" | Attribution | Yes — High priority |
| "Processing over 2 petabytes of daily data" | Statistical claim | Yes — Critical priority |
| "The market is entering a consolidation phase" | Analytical conclusion | No — this is synthesis, not a verifiable fact |
| "Legacy SIEM vendors struggle to compete on data pipeline flexibility" | Analytical opinion | No — hedged analytical statement |

 

The CEE extracts the first four items for verification. The last two are analytical conclusions — they represent the report author's interpretation, not factual claims that can be checked against a source. Sending analytical statements through the verification pipeline would produce false failures and waste processing resources.

 

This distinction is critical. Verification should target facts, not opinions. The CEE ensures the pipeline focuses its verification energy on the claims that actually have right-or-wrong answers.
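To make the fact-versus-opinion split concrete, here is a minimal sketch of the kind of record the CEE might emit per assertion and how the verifiable subset gets filtered. The field names and class are illustrative assumptions, not the actual PromptReports schema.

```python
from dataclasses import dataclass

# Hypothetical per-assertion record; field names are illustrative,
# not the real PromptReports.ai schema.
@dataclass
class ExtractedClaim:
    text: str         # the atomic assertion, verbatim from the report
    claim_type: str   # e.g. "statistical", "attribution", "analytical"
    verifiable: bool  # analytical items are tagged but never verified
    priority: str     # "critical", "high", "medium", or "low"

claims = [
    ExtractedClaim("The global observability market reached $41 billion in 2025",
                   "statistical", True, "critical"),
    ExtractedClaim("The market is entering a consolidation phase",
                   "analytical", False, "low"),
]

# Only claims with right-or-wrong answers move on down the pipeline.
to_verify = [c for c in claims if c.verifiable]
```

Filtering at this boundary is what prevents analytical statements from producing false failures downstream.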

 

Why PromptReports.ai Needs This

 

Without the CEE, verification is impossible at scale. You have two alternatives, and neither works.

 

Manual extraction is what a careful human reader does: scan the report, mentally flag claims that seem important, and spot-check a few of them against sources. This catches some errors, but it's inconsistent, incomplete, and heavily biased toward whatever the reader happens to notice. In our testing, human readers typically identify 30-40% of the verifiable claims in a 3,000-word report. The other 60-70% go unchecked.

 

Whole-document verification treats the entire report as a single unit and asks "is this document accurate?" This is what most AI fact-checking tools attempt, and it produces vague, unreliable results. A report can be "mostly accurate" while containing three fabricated statistics and a misattributed quote. Whole-document approaches lack the granularity to find specific failures.

 

The CEE solves both problems. It systematically identifies every verifiable claim — not just the ones that seem suspicious or important — and tags each one with enough metadata for the downstream modules to verify efficiently. Nothing slips through because the extraction is exhaustive. Nothing wastes resources because analytical statements are correctly excluded.

 

How It Works Under the Hood

 

The CEE operates in four stages.

 

Stage 1: Section Parsing

 

The report is split into its structural sections: executive summary, market analysis, competitive landscape, regulatory environment, risk assessment, methodology, and so on. Each section is processed independently because different sections tend to contain different types of claims. An executive summary is dense with statistical claims. A methodology section contains few verifiable assertions. A risk assessment is heavy on analytical conclusions that shouldn't be extracted.
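A heading-based split is enough to illustrate the idea. The sketch below assumes a markdown-style report; the real parser presumably handles richer structure, and the function name is our own.

```python
import re

def parse_sections(report: str) -> dict[str, str]:
    """Split a markdown-style report into {heading: body} sections.

    Simplified sketch: splits on '#'-style headings only.
    """
    sections: dict[str, str] = {}
    current = "preamble"
    buf: list[str] = []
    for line in report.splitlines():
        m = re.match(r"#+\s+(.*)", line)
        if m:
            sections[current] = "\n".join(buf).strip()
            current, buf = m.group(1), []
        else:
            buf.append(line)
    sections[current] = "\n".join(buf).strip()
    return sections

report = "## Executive Summary\nRevenue grew 23%.\n## Methodology\nWe surveyed 40 firms."
sections = parse_sections(report)
```

Each resulting section can then be handed to the extraction step with section-appropriate expectations.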

 

Stage 2: Atomic Claim Identification

 

Within each section, the CEE uses a language model with a specialized prompt to identify atomic claims — the smallest individual assertions that can be independently verified. This is the core extraction step.

 

The prompt is engineered to be precise about what constitutes a verifiable claim versus what doesn't:

 

Extract: Specific numbers, dates, percentages, market sizes, growth rates. Named attributions ("According to Gartner," "McKinsey reports that"). Comparative statements with specific metrics ("40% higher than competitor X"). Temporal assertions ("acquired in March 2024"). Existence claims ("The company offers a free tier").

 

Do not extract: Hedged language ("may," "could," "suggests," "is likely to"). Analytical conclusions ("This trend indicates market maturation"). Structural transitions ("In the following section, we examine"). Definitions and explanations. Opinions and recommendations.
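The real extraction step is an LLM with an engineered prompt, but the extract / do-not-extract rules above can be caricatured as a toy pre-filter. This keyword heuristic is purely illustrative of the hedged-language exclusion; it is not how the CEE actually decides.

```python
# Hedge words that mark analytical, non-verifiable language
# (mirrors the "do not extract" rules above; list is illustrative).
HEDGE_MARKERS = {"may", "could", "suggests", "likely", "appears", "indicates"}

def looks_verifiable(sentence: str) -> bool:
    """Toy stand-in for the LLM extraction prompt."""
    words = {w.strip(".,").lower() for w in sentence.split()}
    if words & HEDGE_MARKERS:
        return False  # hedged analytical language: do not extract
    # crude signal for a factual claim: a number, date, or percentage
    return any(ch.isdigit() for ch in sentence)

looks_verifiable("Revenue grew 23% year-over-year")      # extract
looks_verifiable("This trend suggests market maturation") # skip
```

A production system needs the LLM's judgment because hedging and factuality are contextual, but the toy version shows why the prompt enumerates both sides of the boundary.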

 

Stage 3: Claim Classification

 

Each extracted claim is classified along two dimensions.

 

Claim type determines how the downstream verification modules will evaluate it:

 

Statistical: Numbers, percentages, market sizes, growth rates. These require exact-match or close-match verification against source data. Example: "Revenue grew 23% year-over-year."
Attributive: Statements attributed to a specific person, organization, or publication. Verification checks whether the attribution is accurate. Example: "Gartner recommends Cribl as a Visionary."
Causal: Claims about cause-and-effect relationships. Verification checks whether the cited source actually supports the causal link. Example: "Increased regulatory pressure drove cloud adoption."
Comparative: Claims that compare two or more entities on a specific metric. These are high-risk because slight numerical errors create misleading comparisons. Example: "Datadog's pricing is approximately 40% higher than open-source alternatives."
Temporal: Claims about when something happened. Verification checks dates, quarters, and sequences of events. Example: "The acquisition closed in Q3 2025."
Existential: Claims about whether something exists or is true. Example: "The platform supports SOC 2 Type II compliance."
Analytical: Interpretive statements that represent synthesis rather than fact. These are tagged but excluded from verification. Example: "The market appears to be entering a consolidation phase."

 

Priority level determines how heavily each claim weighs in the report's overall Verification Score:

 

Critical: Statistical and comparative claims. These are the most commonly fabricated and the most damaging when wrong. A wrong market size number in an investor presentation is a serious problem.
High: Attributive, causal, and temporal claims. Misattributing a finding to the wrong source or getting a date wrong undermines credibility.
Medium: Existential claims. Getting a product feature wrong is problematic but usually less consequential than a wrong number.
Low: Analytical claims that made it through extraction despite being borderline. These receive light verification and minimal weight in the overall score.
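Priority weighting can be sketched as a weighted pass rate. The weights below are assumptions for illustration only; the actual Verification Score formula is not published here.

```python
# Illustrative weights, not the real scoring formula.
PRIORITY_WEIGHT = {"critical": 4.0, "high": 2.0, "medium": 1.0, "low": 0.25}

def verification_score(results: list[tuple[str, bool]]) -> float:
    """Weighted pass rate over (priority, passed) pairs:
    heavier claims move the score more."""
    total = sum(PRIORITY_WEIGHT[p] for p, _ in results)
    passed = sum(PRIORITY_WEIGHT[p] for p, ok in results if ok)
    return round(passed / total, 2) if total else 1.0

score = verification_score([("critical", True), ("critical", False), ("medium", True)])
```

Under any scheme of this shape, failing one Critical statistical claim costs far more than failing one Medium existential claim, which is exactly the behavior the priority levels are meant to encode.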

 

Stage 4: Deduplication and Citation Mapping

 

Reports often state the same fact in multiple places — once in the executive summary and again in the detailed analysis. The CEE deduplicates these so the verification pipeline doesn't waste resources checking the same claim twice. It also maps each claim to its inline citation reference, creating the link between claim and source that the Citation Resolution Service needs in the next step.
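A minimal deduplication pass might key claims on a normalized form. This is a sketch under the assumption of near-exact restatement; real matching is presumably fuzzier (paraphrase-aware).

```python
import re

def normalize(claim: str) -> str:
    """Canonical form for duplicate detection: lowercase,
    collapse whitespace, strip punctuation."""
    return re.sub(r"[^\w\s%$]", "", " ".join(claim.lower().split()))

def dedupe(claims: list[str]) -> list[str]:
    seen: set[str] = set()
    unique: list[str] = []
    for c in claims:
        key = normalize(c)
        if key not in seen:
            seen.add(key)
            unique.append(c)   # keep the first occurrence verbatim
    return unique

claims = [
    "Revenue grew 23% year-over-year.",
    "Revenue grew 23% year-over-year",  # restated in the detailed analysis
    "The acquisition closed in Q3 2025.",
]
unique = dedupe(claims)
```

Keeping the first occurrence preserves a stable anchor back to where the claim appears in the report.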

 

The output is a structured claim manifest: a list of every verifiable claim in the report, classified by type and priority, deduplicated, and linked to its citation. This manifest is what drives the rest of the SOAR-V pipeline.
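A claim manifest of this kind might look like the JSON below. The shape and field names are hypothetical; the post does not publish the real schema.

```python
import json

# Hypothetical manifest shape; field names are illustrative.
manifest = {
    "report_id": "rpt-001",
    "claims": [
        {
            "id": "c1",
            "text": "The global observability market reached $41 billion in 2025",
            "type": "statistical",
            "priority": "critical",
            "citation_ref": "[1]",  # consumed by the Citation Resolution Service
        },
        {
            "id": "c2",
            "text": "According to Gartner",
            "type": "attributive",
            "priority": "high",
            "citation_ref": "[1]",
        },
    ],
}

payload = json.dumps(manifest, indent=2)
```

Every downstream module reads from this one artifact, which is what makes the pipeline's behavior auditable.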

 

How This Exceeds Expectations for Users

 

Most users don't think about claim extraction as a feature — they think about the end result: a report with green verification checkmarks. But the CEE is what makes the difference between superficial fact-checking and genuine verification.

 

Exhaustive coverage. Users don't have to wonder which claims were checked and which were quietly skipped. Every verifiable assertion in the report goes through the pipeline. When you see a Verification Score of 0.92, that number reflects every factual claim, not a cherry-picked subset.

 

Intelligent prioritization. Not all claims are equally important. The CEE's classification system ensures that the claims most likely to be wrong (statistical, comparative) and most damaging if wrong (numbers going into business decisions) receive the most rigorous verification and the heaviest weight in the overall score. A report doesn't get a high score by verifying 20 easy existential claims while glossing over 3 critical statistical claims.

 

No false alarms on analysis. By correctly excluding analytical conclusions from verification, the CEE prevents the frustrating experience of seeing an analytical insight flagged as "unverified" simply because it's an interpretation rather than a fact. Users can trust that red flags mean actual factual problems, not stylistic disagreements with the verification system.

 

Transparency. Users can see the full claim manifest — every extracted claim, its type, its priority, and its verification result. This isn't a black box that says "verified" or "not verified." It's a detailed ledger of exactly what was checked and how it scored.

 

Real-World Use Cases

 

Use Case 1: Investor Due Diligence Report. An investment firm requests a deep research report on a SaaS company's competitive position. The CEE extracts 34 claims from the resulting report, including 12 Critical statistical claims about revenue, growth rates, market share, and customer metrics. These 12 claims get the highest verification rigor. The firm's analysts can click through each one and see exactly what the source says — no more taking the AI's word for numbers that will influence a $50M investment decision.

 

Use Case 2: Regulatory Landscape Analysis. A healthcare technology company needs to understand FDA guidance on AI-powered diagnostic tools. The report contains numerous attributive claims ("The FDA's 2025 guidance states that...") and temporal claims about regulatory timelines. The CEE classifies each attribution as High priority and maps it to the specific FDA document cited. The Content Grounding Analyzer (CGA) will verify that the report accurately represents what the FDA document actually says — critical when regulatory misinterpretation can delay a product launch by months.

 

Use Case 3: Competitive Battlecard for Sales Teams. A sales team needs accurate competitive positioning data to use in customer conversations. The report is dense with comparative claims: pricing comparisons, feature comparisons, and market positioning statements. The CEE flags every comparative claim as Critical because a sales rep quoting an inaccurate price comparison in a customer meeting is both embarrassing and potentially illegal. Each comparison is verified against the competitor's actual pricing page or published documentation.

 

Use Case 4: Market Entry Strategy. A technology company exploring a new geographic market needs a report covering local regulations, competitive dynamics, and market sizing. The CEE extracts a diverse mix of claim types: statistical (market size), regulatory (compliance requirements), temporal (when regulations take effect), and existential (which competitors operate in the market). The claim manifest ensures that every dimension of the market entry analysis is verified, not just the headline market size figure.

 

Use Case 5: Board-Ready Technology Assessment. A CTO needs a technology assessment report that will be presented to the board. The CEE identifies every claim that board members might question: vendor market positions, technology adoption statistics, cost comparisons, and implementation timeline estimates. By verifying all of these before the report reaches the CTO's hands, PromptReports ensures that no board member can poke a hole in the data that the CTO can't immediately answer with click-through source proof.

 

What Happens Next

 

The CEE's output — the structured claim manifest — feeds directly into Module 2: the Citation Resolution Service. The CRS takes each claim's citation reference and retrieves the actual source content so the Content Grounding Analyzer can check whether the source really says what the claim asserts.

 

Without the CEE, the CRS has nothing to resolve. Without the CRS, the CGA has nothing to analyze. The pipeline is sequential and each module's output is the next module's input. The CEE is the foundation that makes the entire verification system possible.
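The sequential hand-off can be sketched as plain function composition. All three functions below are stubs with invented names standing in for the CEE, CRS, and CGA; only the chaining pattern is the point.

```python
def extract_claims(report: str) -> list[str]:
    """Stub CEE: keep sentences containing a number."""
    return [s for s in report.split(". ") if any(ch.isdigit() for ch in s)]

def resolve_citations(claims: list[str]) -> list[dict]:
    """Stub CRS: attach source text to each claim."""
    return [{"claim": c, "source": "stub source text"} for c in claims]

def ground_check(pairs: list[dict]) -> list[dict]:
    """Stub CGA: mark each claim-source pair as verified."""
    return [{**p, "verified": True} for p in pairs]

def pipeline(report: str) -> list[dict]:
    # Each module's output is the next module's input.
    return ground_check(resolve_citations(extract_claims(report)))

results = pipeline("Revenue grew 23% in 2025. The outlook is positive.")
```

Because each stage consumes exactly what the previous stage produces, an empty claim manifest short-circuits everything downstream, which is why extraction is the load-bearing first step.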

 

In the next post in this series, we'll dive into the Citation Resolution Service — the module that goes from a citation reference to actual retrievable source text, handling PDFs, paywalls, dead links, and everything in between.

 

Every claim. Every report. Every time. [See verification in action →](/register)