# General Intelligence Benchmark
Daily evaluation of AI model quality across 50 curated prompts — knowledge, reasoning, mathematics, coding, analysis, and creativity — judged by an LLM panel and scored on five dimensions.
## What It Measures
The General Intelligence benchmark answers one question: how well does a model respond to a broad, open-ended research or reasoning prompt? It does not test citation grounding or hallucination detection — those are handled by the Claim Verification and Hallucination Detection suites. It tests the model's raw intelligence: depth of reasoning, factual accuracy, coherent structure, and appropriate completeness.
- **50 Prompts** — Comprehensive prompt library covering 9 categories across multiple difficulty levels.
- **Daily Rotation** — One prompt is selected each day using a Fisher-Yates shuffle across a 50-day cycle.
- **Panel Scoring** — Responses are judged by a single LLM or a multi-judge panel, with scores averaged.
## The 50-Prompt Library
The prompt library is organized into 9 categories, each designed to test a distinct capability. Prompts are shuffled and rotated so no model is tested on the same prompt twice within a 50-day cycle.
| Category | What It Tests | Example Difficulty |
|---|---|---|
| Knowledge & Recall | Factual accuracy across domains — science, history, geography | Medium |
| Reasoning & Logic | Step-by-step deduction, causal chains, logical puzzles | Medium–Hard |
| Mathematics | Numerical computation, algebra, statistics, probability | Medium–Hard |
| Coding | Code generation, debugging, algorithm design | Medium–Hard |
| Analysis & Synthesis | Comparing multiple perspectives, summarizing complex topics | Hard |
| Creativity & Writing | Open-ended composition, narrative, persuasion | Medium |
| Instruction Following | Multi-step instructions with specific formatting constraints | Medium |
| Data Interpretation | Reading and reasoning over structured data or charts | Hard |
| Edge Cases | Ambiguous, trick, or adversarial prompts | Hard |
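The shuffle-and-rotate scheme can be sketched as follows. The seed handling and prompt IDs here are illustrative assumptions, not the benchmark's actual implementation; Python's `random.shuffle` uses the Fisher-Yates algorithm internally.

```python
import random

def build_cycle(prompt_ids, seed):
    """Shuffle the prompt library once per cycle (Fisher-Yates via random.shuffle)."""
    order = list(prompt_ids)
    random.Random(seed).shuffle(order)  # deterministic for a given cycle seed
    return order

def prompt_for_day(prompt_ids, day_index, seed=0):
    """Pick the prompt for a given day; each prompt appears once per 50-day cycle."""
    cycle_length = len(prompt_ids)
    cycle_number = day_index // cycle_length
    order = build_cycle(prompt_ids, seed + cycle_number)  # reshuffle each new cycle
    return order[day_index % cycle_length]

prompts = [f"prompt-{i:02d}" for i in range(50)]
picks = [prompt_for_day(prompts, d) for d in range(50)]
assert sorted(picks) == sorted(prompts)  # no prompt repeats within one cycle
```

Because the shuffle is a permutation, every prompt is used exactly once before any repeats, which is what guarantees the "no model is tested on the same prompt twice within a 50-day cycle" property.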
## Scoring Pipeline
At midnight GMT, the benchmark runs the following steps for every active model:
1. Prompt Selection
2. Model Execution
3. LLM Judge Evaluation
4. Pairwise Tournament (optional)
5. Rolling Averages
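The steps above can be sketched as a single nightly loop. Everything here is a toy stand-in: the model and judge callables, the function names, and the scoring scale are illustrative assumptions, not the system's real API, and the optional tournament and rolling-average steps are omitted for brevity.

```python
def run_daily_benchmark(models, prompts, judges, day_index):
    """Minimal sketch of the nightly pipeline; all names are illustrative."""
    prompt = prompts[day_index % len(prompts)]      # 1. Prompt Selection
    results = {}
    for name, generate in models.items():
        response = generate(prompt)                 # 2. Model Execution
        scores = [judge(prompt, response) for judge in judges]  # 3. Judge Evaluation
        results[name] = sum(scores) / len(scores)   # panel mean (0-100 scale assumed)
    return results  # steps 4-5 (tournament, rolling averages) omitted here

# Toy stand-ins for real models and judges:
models = {"model-a": lambda p: p.upper(), "model-b": lambda p: p}
judges = [lambda p, r: 80.0, lambda p, r: 90.0]
run_daily_benchmark(models, ["explain entropy"], judges, 0)
# each model's score is the judge mean: 85.0
```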
## LLM-as-Judge Rubric
The judge evaluates each response against the following rubric. Each dimension is scored on a 1–5 scale. The overall quality score is the weighted average of all five dimensions, normalized to 0–100.
| Dimension | Weight | What the Judge Looks For | 5 = Excellent |
|---|---|---|---|
| Relevance | 25% | Does the response directly and completely address the prompt? | Fully on-topic, no tangents, every sentence adds value. |
| Accuracy | 30% | Are the factual claims correct? No fabrications or misattributions. | All statements verifiable; no invented statistics, names, or dates. |
| Coherence | 20% | Is the response logically structured and easy to follow? | Clear flow, ideas connect, no contradictions or non-sequiturs. |
| Completeness | 15% | Does the response cover all key aspects the prompt requires? | Nothing important omitted; depth appropriate to difficulty level. |
| Depth | 10% | Does the response demonstrate genuine understanding beyond surface-level facts? | Nuanced analysis, original synthesis, or expert-level reasoning present. |
**Quality Score Formula**
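A plausible reading of the rubric's "weighted average, normalized to 0–100" is sketched below. The ×20 scaling is an assumption, it simply maps a perfect weighted average of 5.0 to 100; the page does not show the exact normalization used.

```python
# Weights from the rubric table; they sum to 1.0.
WEIGHTS = {"relevance": 0.25, "accuracy": 0.30, "coherence": 0.20,
           "completeness": 0.15, "depth": 0.10}

def quality_score(dimension_scores):
    """Weighted average of the five 1-5 rubric scores, scaled to 0-100.

    The x20 scaling is an assumption: it maps a perfect 5.0 average to 100.
    """
    weighted = sum(WEIGHTS[d] * s for d, s in dimension_scores.items())
    return weighted * 20.0

quality_score({"relevance": 5, "accuracy": 5, "coherence": 5,
               "completeness": 5, "depth": 5})   # → 100.0
```

Note that under this scaling a response scoring 1 on every dimension still maps to 20, not 0; a variant that maps the 1–5 range onto 0–100 exactly would subtract 1 and scale by 25 instead.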
## Multi-Judge Panel
When a JudgePanelConfig is active, responses are evaluated by multiple judge models in parallel. Each judge independently scores the response against the same rubric, and the final score is the mean across judges, which reduces individual model bias and produces more stable rankings.
| Mode | How It Works | When Used |
|---|---|---|
| Single Judge | One judge model evaluates all responses. Fast and consistent within a day. | Default when no panel config is active. |
| Multi-Judge Panel | Two or three judge models each score independently. Scores are averaged. | When a JudgePanelConfig is configured in admin settings. |
| Pairwise Tournament | Responses are compared head-to-head (A vs B). Win rates produce ELO-style rankings. | When usePairwise is enabled in the panel config (up to 3 judge models). |
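The pairwise mode's "ELO-style rankings" can be illustrated with the standard Elo update rule. The K-factor of 32 and the 1000-point starting rating are assumptions for illustration; the benchmark's actual parameters are not documented here.

```python
def elo_update(rating_a, rating_b, a_won, k=32):
    """One head-to-head comparison: winner gains rating, loser loses the same amount."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Two models start at 1000; model A wins one pairwise judgment:
a, b = elo_update(1000.0, 1000.0, a_won=True)
# a rises and b falls by the same amount, so total rating is conserved
```

Running this update over every A-vs-B judgment in a tournament converges toward a ranking where rating gaps reflect win rates, which is why head-to-head comparisons can produce a leaderboard without absolute scores.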
## Metrics Explained
The General Intelligence leaderboard surfaces five metrics per model:
| Metric | Definition | Why It Matters |
|---|---|---|
| Quality Score | Weighted average of the 5-dimension rubric, normalized to 0–100. | Primary signal for response quality. Higher = better responses. |
| Latency (ms) | Time from request sent to first response token received. | Critical for streaming and interactive applications. |
| Tokens / Second | Output generation speed after first token. | Determines how quickly a full response is delivered. |
| Cost per 1K | Blended input + output cost per 1,000 tokens at OpenRouter pricing. | Enables cost-adjusted quality comparison across models. |
| Sample Count | Number of daily benchmarks included in the rolling average. | Higher counts = more statistically reliable averages. |
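The blended Cost per 1K metric can be computed as shown below. Treating "blended" as total spend divided by total kilotokens is an assumption about how the leaderboard combines input and output pricing; the prices here are made-up examples, not real OpenRouter rates.

```python
def cost_per_1k(input_tokens, output_tokens, input_price_per_1k, output_price_per_1k):
    """Blended cost per 1,000 tokens: total spend divided by total kilotokens."""
    total_cost = (input_tokens / 1000) * input_price_per_1k \
               + (output_tokens / 1000) * output_price_per_1k
    total_kilotokens = (input_tokens + output_tokens) / 1000
    return total_cost / total_kilotokens

# 500 input tokens at $0.002/1K plus 1,500 output tokens at $0.006/1K:
cost_per_1k(500, 1500, 0.002, 0.006)  # → 0.005
```

Because output tokens usually cost more than input tokens, the blended figure shifts toward the output price for long responses, which is worth remembering when comparing chatty and terse models.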
## Rolling Averages
Every time a new benchmark completes, rolling averages are recomputed for all 8 time windows. Rolling averages smooth out single-day outliers (e.g. provider API issues) and reveal genuine trends in model quality over time. A model with a consistently high 30-day average is more reliable than one with a high single-day score.
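A windowed average over daily samples can be sketched as follows. The sample format (day number, score) and the example windows are illustrative assumptions; this page does not enumerate all 8 time windows, though the leaderboard views mentioned below include 1-, 7-, and 30-day periods.

```python
def rolling_average(samples, today, window_days):
    """Mean quality score over the last `window_days` days.

    `samples` is a list of (day_number, score) pairs; returns None when
    the window contains no samples.
    """
    in_window = [score for day, score in samples
                 if today - window_days < day <= today]
    return sum(in_window) / len(in_window) if in_window else None

samples = [(1, 70.0), (5, 80.0), (29, 90.0), (30, 100.0)]
rolling_average(samples, today=30, window_days=7)   # → 95.0 (days 29 and 30)
rolling_average(samples, today=30, window_days=30)  # → 85.0 (all four samples)
```

The example shows why longer windows are smoother: the 7-day view reacts to the two most recent samples, while the 30-day view dilutes any single outlier across the whole period.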
## Reading the Leaderboard
**Best Practices**
- Start with the 30-day view for a stable ranking that reflects recent model updates.
- Use the 1-day view to see yesterday's specific prompt and compare responses directly.
- A high Quality Score combined with a low Cost per 1K indicates the best-value model.
- Models with fewer than 5 samples in a period have unreliable averages — treat with caution.
- Watch for sudden drops in the 7-day view — these often signal a provider update.