
# Source Tracing Benchmark

Evaluates whether AI models can find and cite authoritative, live, domain-appropriate sources — measured by URL validity, RSI source authority scoring, freshness, and citation alignment.

## What It Measures

The Source Tracing benchmark tests a specific, high-value capability: given a domain-specific research question, can the model find and cite sources that are (1) real and reachable, (2) authoritative in the relevant domain, (3) recent, and (4) actually aligned with the claims made?

Many models can produce plausible-looking URLs. Most of those URLs are fabricated, dead, or point to irrelevant pages. Source Tracing validates every URL in the response and applies a structured authority scoring framework — the RSI (Reference Source Index) — to measure genuine citation quality.

Live URL Validation

Every extracted URL receives a HEAD request to verify that it resolves to a real, reachable page.

RSI Authority Scoring

Valid URLs are scored using the Reference Source Index — a composite of domain authority, source type, and topical relevance.

Freshness Weighting

Recency multipliers penalize outdated sources and reward recent publications.

Citation Alignment

An LLM judge scores whether citations actually support the claims made in the response (0–100).

## Test Domains

Prompts are drawn from five domain categories, each requiring domain-appropriate authoritative sources. Models that cite Wikipedia for a clinical trial result or a blog post for a regulatory filing score lower than models that cite primary sources.

| Domain | Expected Source Types | Example Prompt Focus |
|---|---|---|
| Legal | Court opinions, statutes, regulatory filings, official government publications | Case law, regulatory compliance, legal precedent |
| Scientific | PubMed, Nature, The Lancet, peer-reviewed journals, preprint servers | Clinical trials, research findings, meta-analyses |
| Financial | SEC filings, earnings reports, Bloomberg, Reuters, FT, WSJ | Market data, company financials, economic indicators |
| Medical | WHO, CDC, NIH, clinical practice guidelines, medical journals | Treatment protocols, epidemiology, drug approvals |
| Policy | Congressional records, UN documents, government white papers, think tanks | Legislation, international agreements, policy analysis |

## Scoring Pipeline

Each benchmark run follows this sequence for a single model + domain prompt combination:

1. **LLM Call with Citation Instruction.** The model receives the domain prompt with an explicit instruction: cite 3–6 authoritative sources with full URLs inline in the response. Temperature is set to 0.3 for consistent citation behavior. Max tokens: 2,000.
2. **URL Extraction.** All URLs matching the pattern https?://... are extracted from the response using a strict regex. Trailing punctuation (commas, periods, parentheses) is stripped, duplicates are removed, and up to 15 URLs are evaluated.
3. **URL Validation.** Each URL receives a HEAD request with a 10-second timeout. Results are classified as "valid" (2xx/3xx response), "invalid" (4xx/5xx), or "unreachable" (timeout or DNS failure). Up to 15 URLs are validated in parallel.
4. **RSI Scoring.** For every valid URL, the Reference Source Index (RSI) is computed: a composite score on a [0.25, 1.5] scale based on domain authority type, source category, and topical relevance.
5. **Freshness Calculation.** A recency multiplier is calculated for each valid source based on its source type; academic journals, government data, and legal databases receive different baseline freshness curves.
6. **Source Authority Score.** The final Source Authority Score (0–100) is computed as a blend: 70% normalized RSI + 30% URL validity rate.
7. **Citation Alignment Judgment.** A lightweight LLM judge receives the original prompt, the first 2,000 characters of the response, and up to 10 valid URLs, and scores citation alignment on a 0–100 scale.
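Steps 2 and 3 can be sketched as below. The regex, the punctuation-stripping set, and the helper names (`extract_urls`, `classify_status`) are illustrative assumptions, not the benchmark's actual implementation; real validation issues parallel HEAD requests with a 10-second timeout, which is omitted here.

```python
import re

# Assumed pattern: match https?:// runs, excluding whitespace and common
# delimiters; the benchmark's "strict regex" may differ.
URL_PATTERN = re.compile(r"https?://[^\s<>'\")\]]+")
TRAILING_PUNCT = ".,;:!?)'\"]"
MAX_URLS = 15


def extract_urls(response_text):
    """Step 2 sketch: extract up to 15 unique URLs, stripping trailing punctuation."""
    seen, urls = set(), []
    for match in URL_PATTERN.findall(response_text):
        url = match.rstrip(TRAILING_PUNCT)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls[:MAX_URLS]


def classify_status(status_code):
    """Step 3 sketch: map a HEAD-request outcome to the three documented buckets.

    ``None`` stands in for a timeout or DNS failure.
    """
    if status_code is None:
        return "unreachable"
    if 200 <= status_code < 400:  # 2xx/3xx count as valid
        return "valid"
    return "invalid"
```

For example, `extract_urls("See https://example.gov/report, then https://example.gov/report.")` deduplicates the two mentions into a single cleaned URL.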

## The RSI Score

The Reference Source Index (RSI) is a composite authority metric that evaluates how credible and appropriate a source is for the domain being tested. RSI values fall on a [0.25, 1.5] scale where 0.25 represents the lowest quality sources and 1.5 represents the most authoritative.

| RSI Component | What It Measures | Score Contribution |
|---|---|---|
| Authority Score | Domain-level reputation: .gov, .edu, peer-reviewed publishers, and major news organizations score higher than blogs or forums. | Primary driver (largest weight) |
| Source Type | Tier classification: primary sources (GOVERNMENT, ACADEMIC_JOURNAL, COURT) > secondary (NEWS_MAJOR, THINK_TANK) > tertiary (BLOG, FORUM). | Categorical multiplier |
| Topical Relevance | How relevant the source type is to the domain being tested: a medical journal scores higher for a medical prompt than for a financial one. | Domain alignment bonus |
| Recency Multiplier | Freshness adjustment based on source type. Recent journal publications score near 1.0; outdated or undated sources receive a recency penalty. | Applied after composite calculation |
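One way the components above could combine, as a rough sketch: the weights, the tier multipliers, and the additive relevance bonus are all hypothetical, the real weighting is internal to the benchmark. Only the [0.25, 1.5] output scale and the "recency applied after the composite" ordering come from the documentation.

```python
# Hypothetical tier multipliers; the benchmark's actual values are not published.
SOURCE_TYPE_MULTIPLIER = {
    "GOVERNMENT": 1.0, "ACADEMIC_JOURNAL": 1.0, "COURT": 1.0,  # primary
    "NEWS_MAJOR": 0.85, "THINK_TANK": 0.85,                    # secondary
    "BLOG": 0.6, "FORUM": 0.6,                                 # tertiary
}

RSI_MIN, RSI_MAX = 0.25, 1.5


def rsi(authority, source_type, relevance_bonus, recency_multiplier):
    """Sketch of an RSI composite on the documented [0.25, 1.5] scale.

    ``authority`` and ``recency_multiplier`` are assumed to lie in [0, 1];
    ``relevance_bonus`` is a small additive domain-alignment term.
    """
    base = authority * SOURCE_TYPE_MULTIPLIER.get(source_type, 0.6)
    composite = RSI_MIN + (RSI_MAX - RSI_MIN) * base + relevance_bonus
    composite *= recency_multiplier  # applied after the composite, per the table
    return max(RSI_MIN, min(RSI_MAX, composite))
```

Under these assumptions, a fully authoritative, fresh government source hits the 1.5 ceiling, while an unknown low-authority source floors at 0.25.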

## Source Authority Score Formula

The Source Authority Score is the primary leaderboard metric for this benchmark. It combines RSI quality (how authoritative the valid sources are) with URL validity rate (how many of the cited URLs actually resolve).
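A minimal sketch of the documented 70/30 blend. The linear normalization of the [0.25, 1.5] RSI scale to [0, 1] is an assumption; the documentation states only that the score is 70% normalized RSI plus 30% URL validity rate, on a 0–100 scale.

```python
def source_authority_score(mean_rsi, url_validity_rate):
    """Blend mean RSI (on [0.25, 1.5]) with validity rate (on [0, 1]) into 0-100.

    Assumes linear min-max normalization of the RSI scale.
    """
    normalized_rsi = (mean_rsi - 0.25) / (1.5 - 0.25)
    return 100.0 * (0.7 * normalized_rsi + 0.3 * url_validity_rate)
```

For instance, a run whose valid sources average the RSI midpoint (0.875) with half its URLs resolving would land at 50 under this sketch.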

## Citation Alignment Judge

After URL validation, a citation alignment judge evaluates whether the model's citations actually support the specific claims made in the response. A model can cite six authoritative, valid URLs that are completely unrelated to its claims — the alignment judge catches this.

| Score Range | Meaning |
|---|---|
| 0–20 | Citations absent, irrelevant, or clearly fabricated |
| 21–40 | Citations exist but don't support the specific claims made |
| 41–60 | Citations partially support some claims |
| 61–80 | Most citations clearly support specific claims |
| 81–100 | All citations precisely support claims and are authoritative and properly contextualized |
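The judge's input, assembling the pieces listed in the pipeline (original prompt, first 2,000 characters of the response, up to 10 valid URLs), might be built like this. The function name and rubric wording are illustrative, not the benchmark's exact judge prompt.

```python
def build_alignment_judge_prompt(prompt, response, valid_urls):
    """Assemble a citation-alignment judge input: prompt, truncated response,
    and at most 10 valid URLs, per the documented pipeline step.
    """
    urls = "\n".join(f"- {u}" for u in valid_urls[:10])
    return (
        "Score citation alignment from 0 to 100 using the rubric "
        "(0-20 absent/fabricated ... 81-100 precise and authoritative).\n\n"
        f"Question:\n{prompt}\n\n"
        f"Response (first 2,000 chars):\n{response[:2000]}\n\n"
        f"Valid cited URLs:\n{urls}\n"
    )
```

Truncating the response and capping the URL list keeps the judge call cheap and its context bounded regardless of how verbose the evaluated model was.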

## All Metrics Explained

| Metric | Definition | Range |
|---|---|---|
| Source Authority Score | Blended RSI quality and URL validity (70/30 blend). Primary leaderboard metric. | 0–100 |
| URL Validity Rate | Fraction of extracted URLs that successfully resolved to a live page. | 0.0–1.0 |
| Source Freshness | Average recency multiplier across valid URLs. Higher = more current sources. | 0.0–1.0 |
| Citation Alignment Score | LLM judge score: how well citations support the specific claims made. | 0–100 |
| Total Citations | Number of URLs extracted from the response. | Count |
| Valid Citations | Number of URLs that passed HEAD validation. | Count |

## Learning Signals

The benchmark emits BenchmarkLearningSignal records for failures:

LOW_URL_VALIDITY — fired when urlValidityRate < 0.70. Indicates the model is fabricating or recycling dead URLs. The learning loop uses these signals to decrement the model's routing weight in the relevant domain.

LOW_AUTHORITY_SCORE — fired when sourceAuthorityScore < 40. Indicates the model is citing low-quality sources. The learning loop uses a spike in these signals to raise the grounding_threshold in VerificationConfig — requiring stronger evidence before accepting a source citation in live reports.
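The two trigger conditions above can be sketched as a small pure function. Only the signal names and thresholds come from the documentation; the shape of a BenchmarkLearningSignal record (and what the learning loop does with it) is not reproduced here.

```python
def learning_signals(url_validity_rate, source_authority_score):
    """Return the documented failure-signal names fired by a benchmark run.

    Thresholds per the docs: validity below 0.70, authority below 40.
    """
    signals = []
    if url_validity_rate < 0.70:
        signals.append("LOW_URL_VALIDITY")
    if source_authority_score < 40:
        signals.append("LOW_AUTHORITY_SCORE")
    return signals
```

A run with half its URLs dead and an authority score of 35 would fire both signals; a clean run fires none.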

## Reading the Leaderboard

View the live Source Tracing leaderboard →