
# Source Tracing Benchmark

Evaluates whether AI models can find and cite authoritative, live, domain-appropriate sources — measured by URL validity, RSI source authority scoring, freshness, and citation alignment.

## What It Measures

The Source Tracing benchmark tests a specific, high-value capability: given a domain-specific research question, can the model find and cite sources that are (1) real and reachable, (2) authoritative in the relevant domain, (3) recent, and (4) actually aligned with the claims made?

Many models can produce plausible-looking URLs. Most of those URLs are fabricated, dead, or point to irrelevant pages. Source Tracing validates every URL in the response and applies a structured authority scoring framework — the RSI (Reference Source Index) — to measure genuine citation quality.

Live URL Validation

Every extracted URL receives a HEAD request to verify that it resolves to a real, reachable page.

RSI Authority Scoring

Valid URLs are scored using the Reference Source Index — a composite of domain authority, source type, and topical relevance.

Freshness Weighting

Recency multipliers penalize outdated sources and reward recent publications.

Citation Alignment

An LLM judge scores whether citations actually support the claims made in the response (0–100).

## Test Domains

Prompts are drawn from five domain categories, each requiring domain-appropriate authoritative sources. Models that cite Wikipedia for a clinical trial result or a blog post for a regulatory filing score lower than models that cite primary sources.

| Domain | Expected Source Types | Example Prompt Focus |
|---|---|---|
| Legal | Court opinions, statutes, regulatory filings, official government publications | Case law, regulatory compliance, legal precedent |
| Scientific | PubMed, Nature, The Lancet, peer-reviewed journals, preprint servers | Clinical trials, research findings, meta-analyses |
| Financial | SEC filings, earnings reports, Bloomberg, Reuters, FT, WSJ | Market data, company financials, economic indicators |
| Medical | WHO, CDC, NIH, clinical practice guidelines, medical journals | Treatment protocols, epidemiology, drug approvals |
| Policy | Congressional records, UN documents, government white papers, think tanks | Legislation, international agreements, policy analysis |

## Scoring Pipeline

Each benchmark run follows this sequence for a single model + domain prompt combination:

1. **LLM Call with Citation Instruction.** The model receives the domain prompt with an explicit instruction: cite 3–6 authoritative sources with full URLs inline in the response. Temperature is set to 0.3 for consistent citation behavior. Max tokens: 2,000.
2. **URL Extraction.** All URLs matching the pattern https?://... are extracted from the response using a strict regex. Trailing punctuation (commas, periods, parentheses) is stripped, duplicates are removed, and up to 15 URLs are evaluated.
3. **URL Validation.** Each URL receives a HEAD request with a 10-second timeout. Results are classified as "valid" (2xx/3xx response), "invalid" (4xx/5xx), or "unreachable" (timeout or DNS failure). Up to 15 URLs are validated in parallel.
4. **RSI Scoring.** For every valid URL, the Reference Source Index (RSI) is computed: a composite score on a [0.25, 1.5] scale based on domain authority type, source category, and topical relevance.
5. **Freshness Calculation.** A recency multiplier is calculated for each valid source based on its source type; academic journals, government data, and legal databases receive different baseline freshness curves.
6. **Source Authority Score.** The final Source Authority Score (0–100) is computed as a blend: 70% normalized RSI + 30% URL validity rate.
7. **Citation Alignment Judgment.** A lightweight LLM judge receives the original prompt, the first 2,000 characters of the response, and up to 10 valid URLs, and scores citation alignment on a 0–100 scale.
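Steps 2 and 3 can be sketched as below. The regex, the punctuation-stripping set, and the helper names (`extract_urls`, `classify_status`) are illustrative assumptions, not the benchmark's actual implementation; real validation issues parallel HEAD requests with a 10-second timeout, which is omitted here.

```python
import re

# Assumed pattern: match https?:// runs, excluding whitespace and common
# delimiters; the benchmark's "strict regex" may differ.
URL_PATTERN = re.compile(r"https?://[^\s<>'\")\]]+")
TRAILING_PUNCT = ".,;:!?)'\"]"
MAX_URLS = 15


def extract_urls(response_text):
    """Step 2 sketch: extract up to 15 unique URLs, stripping trailing punctuation."""
    seen, urls = set(), []
    for match in URL_PATTERN.findall(response_text):
        url = match.rstrip(TRAILING_PUNCT)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls[:MAX_URLS]


def classify_status(status_code):
    """Step 3 sketch: map a HEAD-request outcome to the three documented buckets.

    ``None`` stands in for a timeout or DNS failure.
    """
    if status_code is None:
        return "unreachable"
    if 200 <= status_code < 400:  # 2xx/3xx count as valid
        return "valid"
    return "invalid"
```

For example, `extract_urls("See https://example.gov/report, then https://example.gov/report.")` deduplicates the two mentions into a single cleaned URL.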

## The RSI Score

The Reference Source Index (RSI) is a composite authority metric that evaluates how credible and appropriate a source is for the domain being tested. RSI values fall on a [0.25, 1.5] scale where 0.25 represents the lowest quality sources and 1.5 represents the most authoritative.

| RSI Component | What It Measures | Score Contribution |
|---|---|---|
| Authority Score | Domain-level reputation: .gov, .edu, peer-reviewed publishers, and major news organizations score higher than blogs or forums. | Primary driver (largest weight) |
| Source Type | Tier classification: primary sources (GOVERNMENT, ACADEMIC_JOURNAL, COURT) > secondary (NEWS_MAJOR, THINK_TANK) > tertiary (BLOG, FORUM). | Categorical multiplier |
| Topical Relevance | How relevant the source type is to the domain being tested: a medical journal scores higher for a medical prompt than for a financial one. | Domain alignment bonus |
| Recency Multiplier | Freshness adjustment based on source type. Recent journal publications score near 1.0; outdated or undated sources receive a recency penalty. | Applied after composite calculation |
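One way the components above could combine, as a rough sketch: the weights, the tier multipliers, and the additive relevance bonus are all hypothetical, the real weighting is internal to the benchmark. Only the [0.25, 1.5] output scale and the "recency applied after the composite" ordering come from the documentation.

```python
# Hypothetical tier multipliers; the benchmark's actual values are not published.
SOURCE_TYPE_MULTIPLIER = {
    "GOVERNMENT": 1.0, "ACADEMIC_JOURNAL": 1.0, "COURT": 1.0,  # primary
    "NEWS_MAJOR": 0.85, "THINK_TANK": 0.85,                    # secondary
    "BLOG": 0.6, "FORUM": 0.6,                                 # tertiary
}

RSI_MIN, RSI_MAX = 0.25, 1.5


def rsi(authority, source_type, relevance_bonus, recency_multiplier):
    """Sketch of an RSI composite on the documented [0.25, 1.5] scale.

    ``authority`` and ``recency_multiplier`` are assumed to lie in [0, 1];
    ``relevance_bonus`` is a small additive domain-alignment term.
    """
    base = authority * SOURCE_TYPE_MULTIPLIER.get(source_type, 0.6)
    composite = RSI_MIN + (RSI_MAX - RSI_MIN) * base + relevance_bonus
    composite *= recency_multiplier  # applied after the composite, per the table
    return max(RSI_MIN, min(RSI_MAX, composite))
```

Under these assumptions, a fully authoritative, fresh government source hits the 1.5 ceiling, while an unknown low-authority source floors at 0.25.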

## Source Authority Score Formula

The Source Authority Score is the primary leaderboard metric for this benchmark. It combines RSI quality (how authoritative the valid sources are) with URL validity rate (how many of the cited URLs actually resolve).
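A minimal sketch of the documented 70/30 blend. The linear normalization of the [0.25, 1.5] RSI scale to [0, 1] is an assumption; the documentation states only that the score is 70% normalized RSI plus 30% URL validity rate, on a 0–100 scale.

```python
def source_authority_score(mean_rsi, url_validity_rate):
    """Blend mean RSI (on [0.25, 1.5]) with validity rate (on [0, 1]) into 0-100.

    Assumes linear min-max normalization of the RSI scale.
    """
    normalized_rsi = (mean_rsi - 0.25) / (1.5 - 0.25)
    return 100.0 * (0.7 * normalized_rsi + 0.3 * url_validity_rate)
```

For instance, a run whose valid sources average the RSI midpoint (0.875) with half its URLs resolving would land at 50 under this sketch.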

## Citation Alignment Judge

After URL validation, a citation alignment judge evaluates whether the model's citations actually support the specific claims made in the response. A model can cite six authoritative, valid URLs that are completely unrelated to its claims — the alignment judge catches this.

| Score Range | Meaning |
|---|---|
| 0–20 | Citations absent, irrelevant, or clearly fabricated |
| 21–40 | Citations exist but don't support the specific claims made |
| 41–60 | Citations partially support some claims |
| 61–80 | Most citations clearly support specific claims |
| 81–100 | All citations precisely support claims and are authoritative and properly contextualized |
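The judge's input, assembling the pieces listed in the pipeline (original prompt, first 2,000 characters of the response, up to 10 valid URLs), might be built like this. The function name and rubric wording are illustrative, not the benchmark's exact judge prompt.

```python
def build_alignment_judge_prompt(prompt, response, valid_urls):
    """Assemble a citation-alignment judge input: prompt, truncated response,
    and at most 10 valid URLs, per the documented pipeline step.
    """
    urls = "\n".join(f"- {u}" for u in valid_urls[:10])
    return (
        "Score citation alignment from 0 to 100 using the rubric "
        "(0-20 absent/fabricated ... 81-100 precise and authoritative).\n\n"
        f"Question:\n{prompt}\n\n"
        f"Response (first 2,000 chars):\n{response[:2000]}\n\n"
        f"Valid cited URLs:\n{urls}\n"
    )
```

Truncating the response and capping the URL list keeps the judge call cheap and its context bounded regardless of how verbose the evaluated model was.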

## All Metrics Explained

| Metric | Definition | Range |
|---|---|---|
| Source Authority Score | Blended RSI quality and URL validity (70/30 blend). Primary leaderboard metric. | 0–100 |
| URL Validity Rate | Fraction of extracted URLs that successfully resolved to a live page. | 0.0–1.0 |
| Source Freshness | Average recency multiplier across valid URLs. Higher = more current sources. | 0.0–1.0 |
| Citation Alignment Score | LLM judge score: how well citations support the specific claims made. | 0–100 |
| Total Citations | Number of URLs extracted from the response. | Count |
| Valid Citations | Number of URLs that passed HEAD validation. | Count |

## Learning Signals

The benchmark emits BenchmarkLearningSignal records for failures:

LOW_URL_VALIDITY — fired when urlValidityRate < 0.70. Indicates the model is fabricating or recycling dead URLs. The learning loop uses these signals to decrement the model's routing weight in the relevant domain.

LOW_AUTHORITY_SCORE — fired when sourceAuthorityScore < 40. Indicates the model is citing low-quality sources. The learning loop uses a spike in these signals to raise the grounding_threshold in VerificationConfig — requiring stronger evidence before accepting a source citation in live reports.
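The two trigger conditions above can be sketched as a small pure function. Only the signal names and thresholds come from the documentation; the shape of a BenchmarkLearningSignal record (and what the learning loop does with it) is not reproduced here.

```python
def learning_signals(url_validity_rate, source_authority_score):
    """Return the documented failure-signal names fired by a benchmark run.

    Thresholds per the docs: validity below 0.70, authority below 40.
    """
    signals = []
    if url_validity_rate < 0.70:
        signals.append("LOW_URL_VALIDITY")
    if source_authority_score < 40:
        signals.append("LOW_AUTHORITY_SCORE")
    return signals
```

A run with half its URLs dead and an authority score of 35 would fire both signals; a clean run fires none.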

## Reading the Leaderboard

View the live Source Tracing leaderboard →