# General Intelligence Benchmark
Daily evaluation of AI model quality across 50 curated prompts — knowledge, reasoning, mathematics, coding, analysis, and creativity — judged by an LLM panel and scored on five dimensions.
## What It Measures
The General Intelligence benchmark answers one question: how well does a model respond to a broad, open-ended research or reasoning prompt? It does not test citation grounding or hallucination detection — those are handled by the Claim Verification and Hallucination Detection suites. It tests the model's raw intelligence: depth of reasoning, factual accuracy, coherent structure, and appropriate completeness.
- **50 Prompts** — Comprehensive prompt library covering 9 categories across multiple difficulty levels.
- **Daily Rotation** — One prompt is selected each day using a Fisher-Yates shuffle across a 50-day cycle.
- **Panel Scoring** — Responses are judged by a single LLM or a multi-judge panel, with scores averaged.
## The 50-Prompt Library
The prompt library is organized into 9 categories, each designed to test a distinct capability. Prompts are shuffled and rotated so no model is tested on the same prompt twice within a 50-day cycle.
| Category | What It Tests | Example Difficulty |
|---|---|---|
| Knowledge & Recall | Factual accuracy across domains — science, history, geography | Medium |
| Reasoning & Logic | Step-by-step deduction, causal chains, logical puzzles | Medium–Hard |
| Mathematics | Numerical computation, algebra, statistics, probability | Medium–Hard |
| Coding | Code generation, debugging, algorithm design | Medium–Hard |
| Analysis & Synthesis | Comparing multiple perspectives, summarizing complex topics | Hard |
| Creativity & Writing | Open-ended composition, narrative, persuasion | Medium |
| Instruction Following | Multi-step instructions with specific formatting constraints | Medium |
| Data Interpretation | Reading and reasoning over structured data or charts | Hard |
| Edge Cases | Ambiguous, trick, or adversarial prompts | Hard |
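The shuffle-and-rotate scheme can be sketched as follows. The seed handling and prompt IDs here are illustrative assumptions, not the benchmark's actual implementation; Python's `random.shuffle` uses the Fisher-Yates algorithm internally.

```python
import random

def build_cycle(prompt_ids, seed):
    """Shuffle the prompt library once per cycle (Fisher-Yates via random.shuffle)."""
    order = list(prompt_ids)
    random.Random(seed).shuffle(order)  # deterministic for a given cycle seed
    return order

def prompt_for_day(prompt_ids, day_index, seed=0):
    """Pick the prompt for a given day; each prompt appears once per 50-day cycle."""
    cycle_length = len(prompt_ids)
    cycle_number = day_index // cycle_length
    order = build_cycle(prompt_ids, seed + cycle_number)  # reshuffle each new cycle
    return order[day_index % cycle_length]

prompts = [f"prompt-{i:02d}" for i in range(50)]
picks = [prompt_for_day(prompts, d) for d in range(50)]
assert sorted(picks) == sorted(prompts)  # no prompt repeats within one cycle
```

Because the shuffle is a permutation, every prompt is used exactly once before any repeats, which is what guarantees the "no model is tested on the same prompt twice within a 50-day cycle" property.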
## Scoring Pipeline
At midnight GMT, the benchmark runs the following steps for every active model:
1. Prompt Selection
2. Model Execution
3. LLM Judge Evaluation
4. Pairwise Tournament (optional)
5. Rolling Averages
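The steps above can be sketched as a single nightly loop. Everything here is a toy stand-in: the model and judge callables, the function names, and the scoring scale are illustrative assumptions, not the system's real API, and the optional tournament and rolling-average steps are omitted for brevity.

```python
def run_daily_benchmark(models, prompts, judges, day_index):
    """Minimal sketch of the nightly pipeline; all names are illustrative."""
    prompt = prompts[day_index % len(prompts)]      # 1. Prompt Selection
    results = {}
    for name, generate in models.items():
        response = generate(prompt)                 # 2. Model Execution
        scores = [judge(prompt, response) for judge in judges]  # 3. Judge Evaluation
        results[name] = sum(scores) / len(scores)   # panel mean (0-100 scale assumed)
    return results  # steps 4-5 (tournament, rolling averages) omitted here

# Toy stand-ins for real models and judges:
models = {"model-a": lambda p: p.upper(), "model-b": lambda p: p}
judges = [lambda p, r: 80.0, lambda p, r: 90.0]
run_daily_benchmark(models, ["explain entropy"], judges, 0)
# each model's score is the judge mean: 85.0
```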
## LLM-as-Judge Rubric
The judge evaluates each response against the following rubric. Each dimension is scored on a 1–5 scale. The overall quality score is the weighted average of all five dimensions, normalized to 0–100.
| Dimension | Weight | What the Judge Looks For | 5 = Excellent |
|---|---|---|---|
| Relevance | 25% | Does the response directly and completely address the prompt? | Fully on-topic, no tangents, every sentence adds value. |
| Accuracy | 30% | Are the factual claims correct? No fabrications or misattributions. | All statements verifiable; no invented statistics, names, or dates. |
| Coherence | 20% | Is the response logically structured and easy to follow? | Clear flow, ideas connect, no contradictions or non-sequiturs. |
| Completeness | 15% | Does the response cover all key aspects the prompt requires? | Nothing important omitted; depth appropriate to difficulty level. |
| Depth | 10% | Does the response demonstrate genuine understanding beyond surface-level facts? | Nuanced analysis, original synthesis, or expert-level reasoning present. |
**Quality Score Formula**
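A plausible reading of the rubric's "weighted average, normalized to 0–100" is sketched below. The ×20 scaling is an assumption, it simply maps a perfect weighted average of 5.0 to 100; the page does not show the exact normalization used.

```python
# Weights from the rubric table; they sum to 1.0.
WEIGHTS = {"relevance": 0.25, "accuracy": 0.30, "coherence": 0.20,
           "completeness": 0.15, "depth": 0.10}

def quality_score(dimension_scores):
    """Weighted average of the five 1-5 rubric scores, scaled to 0-100.

    The x20 scaling is an assumption: it maps a perfect 5.0 average to 100.
    """
    weighted = sum(WEIGHTS[d] * s for d, s in dimension_scores.items())
    return weighted * 20.0

quality_score({"relevance": 5, "accuracy": 5, "coherence": 5,
               "completeness": 5, "depth": 5})   # → 100.0
```

Note that under this scaling a response scoring 1 on every dimension still maps to 20, not 0; a variant that maps the 1–5 range onto 0–100 exactly would subtract 1 and scale by 25 instead.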
## Multi-Judge Panel
When a JudgePanelConfig is active, responses are evaluated by multiple judge models in parallel. Each judge independently scores the response against the same rubric, and the final score is the mean across judges, which reduces individual model bias and produces more stable rankings.
| Mode | How It Works | When Used |
|---|---|---|
| Single Judge | One judge model evaluates all responses. Fast and consistent within a day. | Default when no panel config is active. |
| Multi-Judge Panel | Two or three judge models each score independently. Scores are averaged. | When a JudgePanelConfig is configured in admin settings. |
| Pairwise Tournament | Responses are compared head-to-head (A vs B). Win rates produce ELO-style rankings. | When usePairwise is enabled in the panel config (up to 3 judge models). |
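The pairwise mode's "ELO-style rankings" can be illustrated with the standard Elo update rule. The K-factor of 32 and the 1000-point starting rating are assumptions for illustration; the benchmark's actual parameters are not documented here.

```python
def elo_update(rating_a, rating_b, a_won, k=32):
    """One head-to-head comparison: winner gains rating, loser loses the same amount."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Two models start at 1000; model A wins one pairwise judgment:
a, b = elo_update(1000.0, 1000.0, a_won=True)
# a rises and b falls by the same amount, so total rating is conserved
```

Running this update over every A-vs-B judgment in a tournament converges toward a ranking where rating gaps reflect win rates, which is why head-to-head comparisons can produce a leaderboard without absolute scores.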
## Metrics Explained
The General Intelligence leaderboard surfaces five metrics per model:
| Metric | Definition | Why It Matters |
|---|---|---|
| Quality Score | Weighted average of the 5-dimension rubric, normalized to 0–100. | Primary signal for response quality. Higher = better responses. |
| Latency (ms) | Time from request sent to first response token received. | Critical for streaming and interactive applications. |
| Tokens / Second | Output generation speed after first token. | Determines how quickly a full response is delivered. |
| Cost per 1K | Blended input + output cost per 1,000 tokens at OpenRouter pricing. | Enables cost-adjusted quality comparison across models. |
| Sample Count | Number of daily benchmarks included in the rolling average. | Higher counts = more statistically reliable averages. |
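The blended Cost per 1K metric can be computed as shown below. Treating "blended" as total spend divided by total kilotokens is an assumption about how the leaderboard combines input and output pricing; the prices here are made-up examples, not real OpenRouter rates.

```python
def cost_per_1k(input_tokens, output_tokens, input_price_per_1k, output_price_per_1k):
    """Blended cost per 1,000 tokens: total spend divided by total kilotokens."""
    total_cost = (input_tokens / 1000) * input_price_per_1k \
               + (output_tokens / 1000) * output_price_per_1k
    total_kilotokens = (input_tokens + output_tokens) / 1000
    return total_cost / total_kilotokens

# 500 input tokens at $0.002/1K plus 1,500 output tokens at $0.006/1K:
cost_per_1k(500, 1500, 0.002, 0.006)  # → 0.005
```

Because output tokens usually cost more than input tokens, the blended figure shifts toward the output price for long responses, which is worth remembering when comparing chatty and terse models.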
## Rolling Averages
Every time a new benchmark completes, rolling averages are recomputed for all 8 time windows. Rolling averages smooth out single-day outliers (e.g. provider API issues) and reveal genuine trends in model quality over time. A model with a consistently high 30-day average is more reliable than one with a high single-day score.
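A windowed average over daily samples can be sketched as follows. The sample format (day number, score) and the example windows are illustrative assumptions; this page does not enumerate all 8 time windows, though the leaderboard views mentioned below include 1-, 7-, and 30-day periods.

```python
def rolling_average(samples, today, window_days):
    """Mean quality score over the last `window_days` days.

    `samples` is a list of (day_number, score) pairs; returns None when
    the window contains no samples.
    """
    in_window = [score for day, score in samples
                 if today - window_days < day <= today]
    return sum(in_window) / len(in_window) if in_window else None

samples = [(1, 70.0), (5, 80.0), (29, 90.0), (30, 100.0)]
rolling_average(samples, today=30, window_days=7)   # → 95.0 (days 29 and 30)
rolling_average(samples, today=30, window_days=30)  # → 85.0 (all four samples)
```

The example shows why longer windows are smoother: the 7-day view reacts to the two most recent samples, while the 30-day view dilutes any single outlier across the whole period.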
## Reading the Leaderboard
**Best Practices**
- Start with the 30-day view for a stable ranking that reflects recent model updates.
- Use the 1-day view to see yesterday's specific prompt and compare responses directly.
- A high Quality Score combined with a low Cost per 1K indicates the best-value model.
- Models with fewer than 5 samples in a period have unreliable averages — treat with caution.
- Watch for sudden drops in the 7-day view — these often signal a provider update.