Source Tracing Benchmark
Evaluates whether AI models can find and cite authoritative, live, domain-appropriate sources — measured by URL validity, RSI source authority scoring, freshness, and citation alignment.
What It Measures#
The Source Tracing benchmark tests a specific, high-value capability: given a domain-specific research question, can the model find and cite sources that are (1) real and reachable, (2) authoritative in the relevant domain, (3) recent, and (4) actually aligned with the claims made?
Many models can produce plausible-looking URLs. Most of those URLs are fabricated, dead, or point to irrelevant pages. Source Tracing validates every URL in the response and applies a structured authority scoring framework — the RSI (Reference Source Index) — to measure genuine citation quality.
Live URL Validation
Every extracted URL receives a HEAD request to verify that it resolves to a real, reachable page.
RSI Authority Scoring
Valid URLs are scored using the Reference Source Index — a composite of domain authority, source type, and topical relevance.
Freshness Weighting
Recency multipliers penalize outdated sources and reward recent publications.
Citation Alignment
An LLM judge scores whether citations actually support the claims made in the response (0–100).
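The extraction and validation checks above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual implementation; the function names, the URL regex, and the 10-second timeout are assumptions.

```python
import re
import urllib.request
from urllib.parse import urlparse

URL_PATTERN = re.compile(r"https?://[^\s)\"'>\]]+")

def extract_urls(text: str) -> list[str]:
    """Pull candidate URLs out of a model response."""
    return URL_PATTERN.findall(text)

def is_well_formed(url: str) -> bool:
    """Cheap structural check before any network call."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

def validate_url(url: str, timeout: float = 10.0) -> bool:
    """Send a HEAD request; treat any 2xx/3xx response as reachable."""
    if not is_well_formed(url):
        return False
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except Exception:
        return False
```

HEAD is used rather than GET so the validator fetches headers only, which keeps validation cheap across many citations per run.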
Test Domains#
Prompts are drawn from five domain categories, each requiring domain-appropriate authoritative sources. Models that cite Wikipedia for a clinical trial result or a blog post for a regulatory filing score lower than models that cite primary sources.
| Domain | Expected Source Types | Example Prompt Focus |
|---|---|---|
| Legal | Court opinions, statutes, regulatory filings, official government publications | Case law, regulatory compliance, legal precedent |
| Scientific | PubMed, Nature, The Lancet, peer-reviewed journals, preprint servers | Clinical trials, research findings, meta-analyses |
| Financial | SEC filings, earnings reports, Bloomberg, Reuters, FT, WSJ | Market data, company financials, economic indicators |
| Medical | WHO, CDC, NIH, clinical practice guidelines, medical journals | Treatment protocols, epidemiology, drug approvals |
| Policy | Congressional records, UN documents, government white papers, think tanks | Legislation, international agreements, policy analysis |
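The domain table above is effectively a configuration mapping. A sketch of how it might be expressed in code follows; the domain names and source labels come from the table, but the dict structure itself is an assumption, not the benchmark's schema.

```python
# Expected source types per test domain (labels taken from the table above).
EXPECTED_SOURCES: dict[str, list[str]] = {
    "legal":      ["court opinions", "statutes", "regulatory filings",
                   "official government publications"],
    "scientific": ["PubMed", "Nature", "The Lancet",
                   "peer-reviewed journals", "preprint servers"],
    "financial":  ["SEC filings", "earnings reports", "Bloomberg",
                   "Reuters", "FT", "WSJ"],
    "medical":    ["WHO", "CDC", "NIH",
                   "clinical practice guidelines", "medical journals"],
    "policy":     ["congressional records", "UN documents",
                   "government white papers", "think tanks"],
}
```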
Scoring Pipeline#
Each benchmark run follows this sequence for a single model + domain prompt combination:
1. LLM Call with Citation Instruction
2. URL Extraction
3. URL Validation
4. RSI Scoring
5. Freshness Calculation
6. Source Authority Score
7. Citation Alignment Judgment
The RSI Score#
The Reference Source Index (RSI) is a composite authority metric that evaluates how credible and appropriate a source is for the domain being tested. RSI values fall on a [0.25, 1.5] scale where 0.25 represents the lowest quality sources and 1.5 represents the most authoritative.
| RSI Component | What It Measures | Score Contribution |
|---|---|---|
| Authority Score | Domain-level reputation: .gov, .edu, peer-reviewed publishers, major news organizations score higher than blogs or forums. | Primary driver (largest weight) |
| Source Type | Tier classification: Primary sources (GOVERNMENT, ACADEMIC_JOURNAL, COURT) > Secondary (NEWS_MAJOR, THINK_TANK) > Tertiary (BLOG, FORUM). | Categorical multiplier |
| Topical Relevance | How relevant is this source type to the domain being tested? A medical journal scores higher for a medical prompt than for a financial one. | Domain alignment bonus |
| Recency Multiplier | Freshness adjustment based on source type. Recent publications from journals score near 1.0. Outdated or undated sources receive a recency penalty. | Applied after composite calculation |
RSI Normalization
For reporting, the RSI composite is mapped from the [0.25, 1.5] scale onto 0–100: Normalized RSI = ((RSI − 0.25) ÷ 1.25) × 100.
Source Authority Score Formula#
The Source Authority Score is the primary leaderboard metric for this benchmark. It combines RSI quality (how authoritative the valid sources are) with URL validity rate (how many of the cited URLs actually resolve).
Formula
Source Authority Score = (Normalized RSI × 0.7) + (URL Validity Rate × 100 × 0.3)
Where:
• Normalized RSI = average RSI composite across valid URLs, normalized to 0–100
• URL Validity Rate = valid URL count ÷ total extracted URL count (0.0–1.0)
Example: a model cites 8 URLs and 6 resolve (validity rate = 0.75). The valid URLs average an RSI composite of 1.1 → Normalized RSI = ((1.1 − 0.25) ÷ 1.25) × 100 = 68. Source Authority Score = (68 × 0.7) + (0.75 × 100 × 0.3) = 47.6 + 22.5 = 70.1.
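The normalization and the 70/30 blend translate directly into code. This is a sketch under the definitions given in this section, not the benchmark's source; the function names are illustrative.

```python
# RSI composites fall on a [0.25, 1.5] scale per the section above.
RSI_MIN, RSI_MAX = 0.25, 1.5

def normalize_rsi(rsi_composite: float) -> float:
    """Map an RSI composite from [0.25, 1.5] onto a 0-100 scale."""
    return (rsi_composite - RSI_MIN) / (RSI_MAX - RSI_MIN) * 100

def source_authority_score(avg_rsi: float, validity_rate: float) -> float:
    """70/30 blend of normalized RSI quality and URL validity rate."""
    return normalize_rsi(avg_rsi) * 0.7 + validity_rate * 100 * 0.3

# Worked example from the text: 6 of 8 URLs resolve, average RSI 1.1.
score = source_authority_score(avg_rsi=1.1, validity_rate=6 / 8)  # ≈ 70.1
```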
Citation Alignment Judge#
After URL validation, a citation alignment judge evaluates whether the model's citations actually support the specific claims made in the response. A model can cite six authoritative, valid URLs that are completely unrelated to its claims — the alignment judge catches this.
| Score Range | Meaning |
|---|---|
| 0–20 | Citations absent, irrelevant, or clearly fabricated |
| 21–40 | Citations exist but don't support the specific claims made |
| 41–60 | Citations partially support some claims |
| 61–80 | Most citations clearly support specific claims |
| 81–100 | All citations precisely support claims, authoritative, and properly contextualized |
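The rubric above can be expressed as a simple band lookup. The band labels below paraphrase the table; the function and its names are illustrative, not part of the benchmark.

```python
# Upper bound of each band paired with a short label from the rubric table.
ALIGNMENT_BANDS = [
    (20, "absent, irrelevant, or fabricated"),
    (40, "citations exist but don't support the claims"),
    (60, "partial support for some claims"),
    (80, "most citations clearly support specific claims"),
    (100, "precise, authoritative, properly contextualized"),
]

def alignment_band(score: int) -> str:
    """Return the rubric band label for a 0-100 alignment score."""
    if not 0 <= score <= 100:
        raise ValueError("alignment score must be in [0, 100]")
    for upper, label in ALIGNMENT_BANDS:
        if score <= upper:
            return label
    raise AssertionError("unreachable")
```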
Alignment vs Authority
Alignment is scored independently of authority: a response can earn a high Source Authority Score from valid, authoritative URLs while its alignment score stays low because those sources don't back the claims made. Read the two metrics together.
All Metrics Explained#
| Metric | Definition | Range |
|---|---|---|
| Source Authority Score | Blended RSI quality + URL validity (70/30 blend). Primary leaderboard metric. | 0–100 |
| URL Validity Rate | Fraction of extracted URLs that successfully resolved to a live page. | 0.0–1.0 |
| Source Freshness | Average recency multiplier across valid URLs. Higher = more current sources. | 0.0–1.0 |
| Citation Alignment Score | LLM judge score: how well citations support the specific claims made. | 0–100 |
| Total Citations | Number of URLs extracted from the response. | Count |
| Valid Citations | Number of URLs that passed HEAD validation. | Count |
Learning Signals#
The benchmark emits BenchmarkLearningSignal records for failures:
LOW_URL_VALIDITY — fired when urlValidityRate < 0.70. Indicates the model is fabricating or recycling dead URLs. The learning loop uses these signals to decrement the model's routing weight in the relevant domain.
LOW_AUTHORITY_SCORE — fired when sourceAuthorityScore < 40. Indicates the model is citing low-quality sources. The learning loop uses a spike in these signals to raise the grounding_threshold in VerificationConfig — requiring stronger evidence before accepting a source citation in live reports.
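The two thresholds above can be checked in a few lines. The signal names and thresholds match the text; the record structure and function name are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkLearningSignal:
    signal: str   # e.g. "LOW_URL_VALIDITY"
    value: float  # the metric value that tripped the threshold

def collect_signals(url_validity_rate: float,
                    source_authority_score: float) -> list[BenchmarkLearningSignal]:
    """Emit one learning signal per threshold breach."""
    signals = []
    if url_validity_rate < 0.70:
        signals.append(BenchmarkLearningSignal("LOW_URL_VALIDITY", url_validity_rate))
    if source_authority_score < 40:
        signals.append(BenchmarkLearningSignal("LOW_AUTHORITY_SCORE", source_authority_score))
    return signals
```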
Reading the Leaderboard#
What Good Looks Like
- Source Authority Score > 70: Model consistently cites authoritative, live sources. Reliable for domain research.
- URL Validity Rate > 0.80: 80%+ of cited URLs resolve. Low fabrication rate.
- Citation Alignment > 70: Most citations actually support the claims made.
Warning Signs
- URL Validity < 0.50: More than half of cited URLs are dead or fabricated. Avoid for source-critical research.
- Authority/Alignment gap: The model finds authoritative sources but doesn't use them correctly; compare the two scores.