# Writing Quality Benchmark
Measures AI model prose quality using the stop-slop detector — PromptReports.ai's production filter that identifies writing patterns that erode trust and readability in research outputs.
## What It Measures
The Writing Quality benchmark tests whether AI-generated text reads like analyst-grade prose or generic AI output. It runs 50 fixed text samples through the stop-slop scoring engine — the same system PromptReports.ai applies to every report section before delivery. Samples include intentionally bad prose across 7 violation categories and clean analyst-style controls. Models pass when their output avoids these patterns and scores above 70/100.
- **50 Fixed Prompts**: The same 50 text samples run every time, split between intentionally bad examples and clean controls. No model sees different prompts.
- **Consistent Scoring**: The stop-slop detector is deterministic; the same input always produces the same score. No LLM judge variance.
- **5 Dimensions**: Directness, Rhythm, Trust, Authenticity, and Information Density, each scored separately.
- **Pass/Fail by Category**: Each of the 50 prompts has a known correct answer (should pass or should fail). Accuracy is measured against ground truth.
## Why Prose Quality Matters
Research reports are only as valuable as the trust readers place in them. AI writing patterns like excessive hedging ("might arguably suggest"), filler phrases ("it is important to note that"), and generic adjectives ("significant", "notable", "key") are signals that the content was generated by a machine following a template — not by an expert who knows what they are talking about. These patterns reduce perceived credibility even when the underlying analysis is correct.
PromptReports.ai applies the stop-slop filter to every report section. Models that consistently fail Writing Quality benchmarks produce reports that need more editorial cleanup before delivery. This benchmark identifies which models generate the cleanest prose out of the box.
## The 50-Prompt Test Set
The prompt set is fixed: every benchmark run uses the same 50 samples. Each sample is labeled with a ground truth of shouldPass: true (clean text that should score ≥ 70) or shouldPass: false (text with deliberate violations that should score below 70).
| Category | Count | Ground Truth | What It Contains |
|---|---|---|---|
| Hedging violations | 5 | Should FAIL | "arguably", "might", "could potentially", "seems to suggest" |
| Filler phrases | 5 | Should FAIL | "It is important to note", "at the end of the day", "in today's landscape" |
| Generic adjectives | 3 | Should FAIL | "significant", "notable", "key", "groundbreaking", "major", "important" |
| Passive voice | 3 | Should FAIL | "was published by", "were confirmed by", "has been validated" |
| WH-openers | 3 | Should FAIL | Sentences starting with What/Where/When/Why/How/Who as rhetorical devices |
| Binary contrasts | 2 | Should FAIL | "Not a setback. A strategic pivot." sentence fragment pairs |
| Em-dash overuse | 2 | Should FAIL | Three or more em-dashes per sentence or paragraph |
| Mixed violations | 3 | Should FAIL | Multiple violation types combined in one sample |
| Analyst clean | 5 | Should PASS | Data-driven, direct, specific — Goldman upgrades, Fed decisions, M&A multiples |
| Narrative clean | 4 | Should PASS | Event-driven narrative without hedging — IPO prices, leadership changes |
| Edge cases | 3 | Should PASS | Short, numeric, or list-format text with no violations |
| Clean controls | 10 | Should PASS | Specific, quantified, analyst-grade sentences with no AI slop patterns |
| TOTAL | 50 | — | — |
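Internally, each sample pairs text with its ground-truth label. A minimal sketch of how a labeled sample might be represented (the field names and example texts here are illustrative, not the production schema):

```typescript
// Illustrative shape for one benchmark sample; field names are assumptions.
interface WritingSample {
  id: number;
  category: string;    // e.g. "hedging", "analyst-clean"
  text: string;
  shouldPass: boolean; // ground truth: true means expected score >= 70
}

const samples: WritingSample[] = [
  {
    id: 1,
    category: "hedging",
    text: "The results might arguably suggest a possible improvement.",
    shouldPass: false,
  },
  {
    id: 2,
    category: "analyst-clean",
    text: "Revenue grew 23% year over year on a 40% rise in cloud bookings.",
    shouldPass: true,
  },
];

// Ground-truth split used later when measuring accuracy.
const expectedFailures = samples.filter((s) => !s.shouldPass).length;
```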
## The 7 Violation Categories
The stop-slop detector scans for seven categories of problematic AI writing patterns:
| Category | Examples | Why It Fails |
|---|---|---|
| Hedging | "arguably", "might", "could potentially", "seems to", "may indicate" | Signals uncertainty in claims that should be stated directly. Undermines credibility. |
| Filler Phrases | "It is important to note", "Furthermore", "In conclusion", "At the end of the day", "Here's the thing" | Padding that adds length without adding information. Marks text as AI-generated. |
| Generic Adjectives | "significant", "notable", "key", "important", "critical", "major", "groundbreaking" | Vague intensifiers that substitute for specific data. "Revenue grew significantly" vs "Revenue grew 23%". |
| Passive Voice | "was published by", "were confirmed", "has been validated", "will be announced" | Obscures agency and reduces clarity. "The team published" beats "was published by the team". |
| WH-Openers | "What makes this different…", "Why does this matter?", "How did this happen?" | Rhetorical framing devices that avoid making direct claims. Common in generic AI explanations. |
| Binary Contrasts | "Not a setback. A strategic pivot.", "Not cost-cutting. Efficiency optimization." | Performative reframing that reads as spin, not analysis. |
| Em-Dash Overuse | Three or more em-dashes in a single response, often alongside other patterns. | Overusing em-dashes to create artificial rhythm is one of the most reliable markers of AI-generated prose. |
## The Stop-Slop Scoring Engine
The scoreContent() function from lib/utils/stop-slop-filter is a deterministic, rule-based scorer. It does not use an LLM. It runs regex and token-level pattern matching across the five dimensions below, returning scores on a 0–100 scale per dimension and an overall weighted composite.
Because the scorer is deterministic, Writing Quality benchmark results have zero variance — the same text always produces the same score. This makes the benchmark ideal for regression testing after model updates.
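As a sketch of what deterministic, regex-based checks look like in practice (the pattern lists and 10- and 15-point penalties below are illustrative assumptions, not the production rules in lib/utils/stop-slop-filter):

```typescript
// Illustrative rule-based scoring: count pattern hits, deduct from 100.
// Pattern lists and penalty sizes are assumptions.
const HEDGING = /\b(arguably|might|could potentially|seems to|may indicate)\b/gi;
const FILLER = /\b(it is important to note|at the end of the day|in conclusion)\b/gi;
const EM_DASH = /\u2014/g;

function countMatches(text: string, re: RegExp): number {
  return (text.match(re) ?? []).length;
}

// Directness: penalize hedging words and filler phrases.
function scoreDirectness(text: string): number {
  const hits = countMatches(text, HEDGING) + countMatches(text, FILLER);
  return Math.max(0, 100 - hits * 10);
}

// Rhythm: penalize em-dash overuse (three or more in one passage).
function scoreRhythm(text: string): number {
  const dashes = countMatches(text, EM_DASH);
  return dashes >= 3 ? Math.max(0, 100 - dashes * 15) : 100;
}
```

Because every rule is a pure pattern match, rerunning the scorer on the same text always returns the same numbers.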
## The 5-Dimension Rubric
| Dimension | What It Measures | Violations That Hurt It |
|---|---|---|
| Directness | Are claims stated without hedging, filler, or rhetorical padding? | Hedging words, WH-openers, filler phrases, binary contrasts |
| Rhythm | Is the sentence structure varied and natural, not formulaic? | Em-dash overuse, repetitive sentence structures, passive voice patterns |
| Trust | Does the prose read as if written by a knowledgeable expert, not a generic AI? | Generic adjectives, filler phrases, hedging, binary contrasts |
| Authenticity | Is the writing specific and grounded in real detail? | Generic adjectives instead of specific numbers, passive voice obscuring agency |
| Information Density | Is each sentence carrying real informational weight? | Filler phrases, WH-openers, binary contrasts that restate rather than inform |
**Overall Score Formula**
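The exact dimension weights are not published here. As a sketch, an equally weighted composite over the five dimensions would look like this (equal 20% weights are an assumption; the production formula may weight dimensions differently):

```typescript
// Equal weights per dimension are an assumption, not the production values.
interface DimensionScores {
  directness: number;
  rhythm: number;
  trust: number;
  authenticity: number;
  density: number;
}

const WEIGHTS: DimensionScores = {
  directness: 0.2,
  rhythm: 0.2,
  trust: 0.2,
  authenticity: 0.2,
  density: 0.2,
};

// Weighted composite, rounded to the 0-100 scale used elsewhere.
function overallScore(d: DimensionScores): number {
  let total = 0;
  for (const key of Object.keys(WEIGHTS) as (keyof DimensionScores)[]) {
    total += d[key] * WEIGHTS[key];
  }
  return Math.round(total);
}
```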
## Pass/Fail Threshold
A text sample passes when its overall score is ≥ 70/100. The benchmark measures pass rate (the fraction of the 50 prompts that score at or above the threshold) and accuracy against ground truth (the fraction of intentionally bad samples that correctly score below 70, plus the fraction of clean controls that correctly score at or above it).
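Both metrics fall out directly from the scored samples. A sketch (the ScoredSample shape is illustrative):

```typescript
// Pass rate and ground-truth accuracy over scored samples; threshold is 70.
interface ScoredSample {
  shouldPass: boolean; // ground truth label
  overall: number;     // 0-100 composite score
}

const THRESHOLD = 70;

function passRate(results: ScoredSample[]): number {
  return results.filter((r) => r.overall >= THRESHOLD).length / results.length;
}

function accuracy(results: ScoredSample[]): number {
  // A result is correct when the score lands on the side the label predicts.
  return (
    results.filter((r) => (r.overall >= THRESHOLD) === r.shouldPass).length /
    results.length
  );
}
```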
**What 70 Means**
A score of 70 is the dividing line built into the test set's ground truth: clean samples are expected to land at or above it, and samples with deliberate violations are expected to land below it.
## Benchmark Metrics
| Metric | Definition | Range |
|---|---|---|
| Pass Rate | Fraction of the 50 prompts that score ≥ 70. Primary leaderboard metric. | 0–100% |
| Average Overall Score | Mean overall score across all 50 prompts. | 0–100 |
| Average Directness | Mean directness score across all 50 prompts. | 0–100 |
| Average Rhythm | Mean rhythm score across all 50 prompts. | 0–100 |
| Average Trust | Mean trust score across all 50 prompts. | 0–100 |
| Average Authenticity | Mean authenticity score across all 50 prompts. | 0–100 |
| Average Density | Mean information density score across all 50 prompts. | 0–100 |
| Average Violations | Mean violation count per prompt. | Count |
| Trend | Direction of change vs prior runs: IMPROVING, STABLE, or DECLINING. | Categorical |
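The Trend metric compares the current run against prior runs. A sketch of one way such a classification could work (the 2-point stability band is an assumption):

```typescript
// Trend classification against the prior run's average overall score.
// The 2-point tolerance band is an assumption.
type Trend = "IMPROVING" | "STABLE" | "DECLINING";

function classifyTrend(current: number, prior: number, tolerance = 2): Trend {
  if (current - prior > tolerance) return "IMPROVING";
  if (prior - current > tolerance) return "DECLINING";
  return "STABLE";
}
```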
## Learning Signals
When a prompt fails (overall score < 70), the benchmark emits a BenchmarkLearningSignal with benchmarkType = WRITING_QUALITY and failureType set to the violation category that caused the failure.
The nightly learning loop reads these signals and adds new StopSlopPattern records for failure types that are not already in the library. This expands the stop-slop detection vocabulary over time, improving detection of emerging AI writing patterns.
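A sketch of that merge step (the record shapes mirror the names used in the text, but the full schemas are assumptions):

```typescript
// Nightly merge: add a StopSlopPattern for each failure type that is not
// already in the library. Record shapes are illustrative assumptions.
interface BenchmarkLearningSignal {
  benchmarkType: "WRITING_QUALITY";
  failureType: string; // violation category that caused the failure
}

interface StopSlopPattern {
  failureType: string;
}

function mergeSignals(
  library: StopSlopPattern[],
  signals: BenchmarkLearningSignal[],
): StopSlopPattern[] {
  const known = new Set(library.map((p) => p.failureType));
  const merged = [...library];
  for (const s of signals) {
    if (!known.has(s.failureType)) {
      known.add(s.failureType); // dedupe repeated signals within one batch
      merged.push({ failureType: s.failureType });
    }
  }
  return merged;
}
```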
## Reading the Results
**Interpreting Writing Quality Scores**
- Pass Rate > 80%: Model consistently writes clean, direct prose. Low editing overhead for report delivery.
- Pass Rate 60–80%: Mixed. Analyst-style prompts may pass; research summaries may need review.
- Pass Rate < 60%: High editing overhead. Model output is dense with AI-slop patterns.
- Check the radar chart on the benchmark page to see which dimension is dragging the score down.
- High Trust + Low Directness: Model sounds knowledgeable but hedges too much. Reduce temperature or prompt more directly.
- Low Authenticity: Model uses generic adjectives instead of specific data. Prompt for numerical specificity.