General Intelligence Benchmark

Daily evaluation of AI model quality across 50 curated prompts — knowledge, reasoning, mathematics, coding, analysis, and creativity — judged by an LLM panel and scored on five dimensions.

What It Measures

The General Intelligence benchmark answers one question: how well does a model respond to a broad, open-ended research or reasoning prompt? It does not test citation grounding or hallucination detection — those are handled by the Claim Verification and Hallucination Detection suites. It tests the model's raw intelligence: depth of reasoning, factual accuracy, coherent structure, and appropriate completeness.

50 Prompts

Comprehensive prompt library covering 9 categories across multiple difficulty levels.

Daily Rotation

One prompt is selected each day using a Fisher-Yates shuffle across a 50-day cycle.

Panel Scoring

Responses are judged by a single LLM judge or a multi-judge panel, with scores averaged across judges.

The 50-Prompt Library

The prompt library is organized into 9 categories, each designed to test a distinct capability. Prompts are shuffled and rotated so no model is tested on the same prompt twice within a 50-day cycle.
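The rotation described above can be sketched as follows. This is an illustrative sketch, not the benchmark's actual implementation: the function name and seeding scheme are assumptions, and Python's `random.shuffle` is used because it implements a Fisher-Yates shuffle.

```python
import random

def daily_prompt_index(prompt_count: int, day_number: int, cycle_seed: int = 0) -> int:
    """Pick today's prompt from a shuffled cycle of `prompt_count` days.

    Each cycle is shuffled once with a seed derived from the cycle number,
    so every prompt appears exactly once per cycle and no model sees the
    same prompt twice within a cycle.
    """
    cycle = day_number // prompt_count       # which 50-day cycle we are in
    position = day_number % prompt_count     # day within the current cycle
    order = list(range(prompt_count))
    rng = random.Random(cycle_seed + cycle)  # deterministic per cycle
    rng.shuffle(order)                       # random.shuffle is a Fisher-Yates shuffle
    return order[position]
```

Because the shuffle is seeded per cycle, every day of a cycle maps to a distinct prompt, and the ordering changes when the next cycle begins.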

| Category | What It Tests | Example Difficulty |
| --- | --- | --- |
| Knowledge & Recall | Factual accuracy across domains — science, history, geography | Medium |
| Reasoning & Logic | Step-by-step deduction, causal chains, logical puzzles | Medium–Hard |
| Mathematics | Numerical computation, algebra, statistics, probability | Medium–Hard |
| Coding | Code generation, debugging, algorithm design | Medium–Hard |
| Analysis & Synthesis | Comparing multiple perspectives, summarizing complex topics | Hard |
| Creativity & Writing | Open-ended composition, narrative, persuasion | Medium |
| Instruction Following | Multi-step instructions with specific formatting constraints | Medium |
| Data Interpretation | Reading and reasoning over structured data or charts | Hard |
| Edge Cases | Ambiguous, trick, or adversarial prompts | Hard |

Scoring Pipeline

At midnight GMT, the benchmark runs the following steps for every active model:

1. **Prompt Selection.** The day's prompt is selected from the rotating cycle. The same prompt text is sent to every model with identical parameters (temperature, max tokens).
2. **Model Execution.** Each model receives the prompt via OpenRouter. Response text, latency to first token, total tokens, and cost are recorded.
3. **LLM Judge Evaluation.** The response is passed to a judge model (or panel) along with the prompt, category, and difficulty level. The judge scores the response on five dimensions.
4. **Pairwise Tournament (optional).** When a multi-judge panel is active, responses are also compared head-to-head in a pairwise tournament to produce Elo-style rankings.
5. **Rolling Averages.** Scores are aggregated into rolling averages across eight time windows (1d, 7d, 30d, 60d, 90d, 180d, 365d, all-time) and the leaderboard is updated.
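How the steps compose can be sketched as a simple loop. The callables (`select_prompt`, `execute`, `judge_scores`, `update_averages`) are stand-ins for the real selection, OpenRouter execution, judge-panel, and storage layers, which are not documented here.

```python
import statistics

def run_daily_benchmark(models, select_prompt, execute, judge_scores, update_averages):
    """Sketch of the nightly pipeline: one prompt, every model, judged and aggregated."""
    prompt = select_prompt()                      # step 1: today's prompt
    results = {}
    for model in models:
        response = execute(model, prompt)         # step 2: identical params per model
        scores = judge_scores(prompt, response)   # step 3: one score per judge
        results[model] = statistics.mean(scores)  # panel scores are averaged
        update_averages(model, results[model])    # step 5: fold into rolling windows
    return results
```

The optional pairwise tournament (step 4) would slot in after all responses are collected, since it compares models head-to-head rather than scoring them in isolation.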

LLM-as-Judge Rubric

The judge evaluates each response against the following rubric. Each dimension is scored on a 1–5 scale. The overall quality score is the weighted average of all five dimensions, normalized to 0–100.

| Dimension | Weight | What the Judge Looks For | 5 = Excellent |
| --- | --- | --- | --- |
| Relevance | 25% | Does the response directly and completely address the prompt? | Fully on-topic, no tangents, every sentence adds value. |
| Accuracy | 30% | Are the factual claims correct? No fabrications or misattributions. | All statements verifiable; no invented statistics, names, or dates. |
| Coherence | 20% | Is the response logically structured and easy to follow? | Clear flow, ideas connect, no contradictions or non-sequiturs. |
| Completeness | 15% | Does the response cover all key aspects the prompt requires? | Nothing important omitted; depth appropriate to difficulty level. |
| Depth | 10% | Does the response demonstrate genuine understanding beyond surface-level facts? | Nuanced analysis, original synthesis, or expert-level reasoning present. |
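The weighted average above can be computed as follows. The weights come from the rubric table; the rescaling shown (all-1s maps to 0, all-5s to 100) is one natural reading of "normalized to 0–100", and the benchmark's exact normalization may differ.

```python
RUBRIC_WEIGHTS = {
    "relevance": 0.25,
    "accuracy": 0.30,
    "coherence": 0.20,
    "completeness": 0.15,
    "depth": 0.10,
}

def quality_score(dimension_scores: dict) -> float:
    """Weighted average of 1-5 dimension scores, mapped onto 0-100.

    The 1-5 weighted mean is rescaled so that all-1s -> 0 and all-5s -> 100.
    """
    weighted = sum(RUBRIC_WEIGHTS[d] * s for d, s in dimension_scores.items())
    return (weighted - 1.0) / 4.0 * 100.0
```

Because Accuracy carries the largest weight (30%), a single fabricated fact costs more than an equivalent lapse in any other dimension.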

Multi-Judge Panel

When a JudgePanelConfig is active in the system, responses are evaluated by multiple judge models simultaneously. Each judge independently scores the response on the same rubric. The final score is the mean of all judge scores, which reduces individual model bias and produces more stable rankings.

| Mode | How It Works | When Used |
| --- | --- | --- |
| Single Judge | One judge model evaluates all responses. Fast and consistent within a day. | Default when no panel config is active. |
| Multi-Judge Panel | Two or three judge models each score independently. Scores are averaged. | When a JudgePanelConfig is configured in admin settings. |
| Pairwise Tournament | Responses are compared head-to-head (A vs B). Win rates produce Elo-style rankings. | When usePairwise is enabled in the panel config (up to 3 judge models). |
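A single head-to-head comparison feeds an Elo-style update like the one below. This is the standard Elo formula; the K-factor and starting ratings the benchmark actually uses are not documented here, so the values shown are illustrative.

```python
def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """One Elo update from a single A-vs-B judge verdict.

    Expected score follows the logistic Elo curve; the winner gains what
    the loser gives up, so updates are zero-sum.
    """
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b
```

An upset win against a higher-rated model moves both ratings further than a win over an equal, which is why pairwise rankings stabilize quickly even with few daily comparisons.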

Metrics Explained

The General Intelligence leaderboard surfaces five metrics per model:

| Metric | Definition | Why It Matters |
| --- | --- | --- |
| Quality Score | Weighted average of the 5-dimension rubric, normalized to 0–100. | Primary signal for response quality. Higher = better responses. |
| Latency (ms) | Time from request sent to first response token received. | Critical for streaming and interactive applications. |
| Tokens / Second | Output generation speed after the first token. | Determines how quickly a full response is delivered. |
| Cost per 1K | Blended input + output cost per 1,000 tokens at OpenRouter pricing. | Enables cost-adjusted quality comparison across models. |
| Sample Count | Number of daily benchmarks included in the rolling average. | Higher counts = more statistically reliable averages. |
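One plausible reading of the blended Cost per 1K figure is total spend divided by total tokens, scaled to a per-1K rate. Whether the leaderboard blends by the observed token mix (as below) or by a fixed input/output ratio is an assumption.

```python
def blended_cost_per_1k(input_tokens: int, output_tokens: int,
                        input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Blended input+output cost per 1,000 tokens.

    Total cost of the request divided by total tokens, scaled back to a
    per-1K figure, weighting each price by the observed token mix.
    """
    total_cost = (input_tokens / 1000.0) * input_price_per_1k \
               + (output_tokens / 1000.0) * output_price_per_1k
    total_tokens = input_tokens + output_tokens
    return total_cost / total_tokens * 1000.0
```

For example, 1,000 input tokens at $1/1K plus 1,000 output tokens at $3/1K costs $4 across 2,000 tokens, a blended $2 per 1K.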

Rolling Averages

Every time a new benchmark completes, rolling averages are recomputed for all 8 time windows. Rolling averages smooth out single-day outliers (e.g. provider API issues) and reveal genuine trends in model quality over time. A model with a consistently high 30-day average is more reliable than one with a high single-day score.
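The recomputation over the eight windows can be sketched as below; the `history` shape and the all-time sentinel are assumptions about how the real storage layer exposes past scores.

```python
from datetime import date, timedelta

# The eight windows from the pipeline; None means all-time.
WINDOWS = {"1d": 1, "7d": 7, "30d": 30, "60d": 60,
           "90d": 90, "180d": 180, "365d": 365, "all": None}

def rolling_averages(history, today):
    """Recompute per-window mean quality scores.

    `history` is a list of (date, quality_score) pairs for one model.
    Windows with no samples yield None rather than a misleading zero.
    """
    out = {}
    for name, days in WINDOWS.items():
        if days is None:
            scores = [s for _, s in history]
        else:
            cutoff = today - timedelta(days=days)
            scores = [s for d, s in history if d > cutoff]
        out[name] = sum(scores) / len(scores) if scores else None
    return out
```

Short windows react quickly to regressions while long windows damp them, which is why the 30-day average is a better reliability signal than any single day.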

Reading the Leaderboard

View the live General Intelligence leaderboard →