Quality Metrics
Understand and configure the metrics used to evaluate prompt quality, from built-in standards to custom domain-specific criteria.
Understanding Quality Metrics#
Quality metrics are standardized measurements used to evaluate how well a prompt's output meets expectations. They transform subjective quality assessment into objective, comparable numbers—enabling data-driven prompt optimization.
PromptReports uses AI-powered evaluation to score outputs automatically, providing:
Consistent Measurement
Same criteria applied uniformly across all evaluations for fair comparison.
Intelligent Assessment
AI evaluator understands context and nuance, not just keyword matching.
Multi-Dimensional
Capture different aspects of quality with multiple complementary metrics.
Explainable Scores
Get written rationale for each score, not just numbers.
Built-in Metrics#
PromptReports includes these standard quality metrics that work across most use cases:
| Metric | What It Measures | Key Indicators |
|---|---|---|
| Relevance | Does the response address what was asked? | On-topic, answers the question, doesn't go off on tangents |
| Coherence | Is the response well-organized and clear? | Logical flow, clear transitions, easy to follow |
| Completeness | Are all required elements present? | Covers all parts of the request, nothing important missing |
| Accuracy | Is the information factually correct? | True statements, correct calculations, valid reasoning |
| Tone | Does it match the expected style? | Appropriate formality, consistent voice, suitable sentiment |
| Overall | Aggregate quality assessment | Weighted combination of all metrics |
Metric Scoring Scale#
Each metric is scored on a consistent 1-5 scale:
| Score | Rating | Description | Action Needed |
|---|---|---|---|
| 5 | Excellent | Exceeds expectations, exemplary quality, no issues | None - this is the goal |
| 4 | Good | Meets expectations with only minor issues | Minor refinements possible |
| 3 | Acceptable | Functional but with notable issues | Improvements recommended |
| 2 | Poor | Significant problems affecting usefulness | Prompt revision needed |
| 1 | Failing | Unacceptable, major issues present | Significant rework required |
Half-point scores (e.g., 3.5, 4.5) are used when quality falls between two levels. The evaluator provides specific rationale for each score, explaining exactly what pushed the score up or down.
How Scoring Works#
PromptReports uses AI-powered evaluation to score outputs. Here's how the process works:
1. Input Analysis: the evaluator reviews the test case input (and any expected output provided).
2. Output Examination: the prompt's actual output is read in full.
3. Expected Comparison: the output is compared against the expected result or reference, where one exists.
4. Metric Assessment: each configured metric is scored on the 1-5 scale.
5. Rationale Generation: a written rationale is produced for every score.
6. Aggregate Calculation: the individual scores are combined into the overall score.
```json
{
"inputSummary": "Customer asking about return policy for electronics",
"outputLength": 245,
"scores": {
"relevance": {
"score": 5,
"rationale": "Response directly addresses the return policy question with specific details about electronics. No off-topic content."
},
"coherence": {
"score": 4,
"rationale": "Well-organized with clear steps. Minor issue: could benefit from numbered list format."
},
"completeness": {
"score": 4.5,
"rationale": "Covers return window, process, and conditions. Slight deduction for not mentioning receipt requirements."
},
"accuracy": {
"score": 5,
"rationale": "All policy details match expected reference. No factual errors."
},
"tone": {
"score": 5,
"rationale": "Friendly, helpful, and professional. Appropriate for customer support context."
}
},
"overall": 4.7,
"summary": "High-quality response that effectively addresses the customer's question with accurate policy information and appropriate tone."
}
```

Custom Metrics#
Create custom evaluation criteria for domain-specific quality requirements that built-in metrics don't capture:
1. Define the Metric: give it a name and a description of what it measures.
2. Specify Scoring Criteria: describe what each level of the 1-5 scale looks like for this metric.
3. Provide Examples: supply sample outputs that illustrate high, middle, and low scores.
4. Set Weight: decide how heavily the metric counts toward the overall score.
5. Test & Calibrate: run the metric against known outputs and adjust the criteria until scores match your own judgment.
```json
{
"name": "Brand Voice",
"description": "Alignment with company brand guidelines and communication standards",
"scoringCriteria": {
"5": "Perfect brand voice: uses all approved terminology, maintains consistent tone, embodies company values throughout",
"4": "Strong brand voice with minor deviations: mostly aligned with guidelines, occasional neutral language",
"3": "Acceptable brand voice: recognizable as company content but lacks some brand elements",
"2": "Weak brand voice: generic language, missing brand personality, some off-brand phrasing",
"1": "Off-brand: uses prohibited terms, inappropriate tone, or contradicts brand values"
},
"evaluationGuidelines": [
"Check for use of approved terminology (see brand guide)",
"Verify consistent first-person plural ('we') usage",
"Confirm positive, solution-oriented framing",
"Watch for prohibited phrases from blocklist"
],
"examples": {
"5": "We're here to help! Our team will have this sorted for you within 24 hours.",
"3": "Your issue has been noted. Someone will contact you.",
"1": "Unfortunately, that's not possible. You'll have to wait."
},
"weight": 1.5
}
```

Safety & Compliance
Check for prohibited content, required disclosures, or regulatory compliance.
Creativity
Measure originality, engaging language, or innovative problem-solving.
Format Adherence
Verify output follows required structure, length limits, or formatting rules.
Technical Accuracy
Assess domain-specific correctness for technical or specialized content.
Weighted Scoring#
Not all metrics are equally important for every use case. Configure weights to emphasize what matters most:
| Metric | Default Weight | Example: Support Bot | Example: Content Writer |
|---|---|---|---|
| Relevance | 1.0 | 1.5 (critical to answer correctly) | 1.0 (standard) |
| Coherence | 1.0 | 0.8 (less critical) | 1.5 (very important for readability) |
| Completeness | 1.0 | 1.2 (need full answers) | 1.0 (standard) |
| Accuracy | 1.0 | 1.5 (must be correct) | 0.8 (creative latitude) |
| Tone | 1.0 | 1.3 (brand representation) | 1.2 (engagement) |
```json
{
"weights": {
"relevance": 1.5,
"coherence": 0.8,
"completeness": 1.2,
"accuracy": 1.5,
"tone": 1.3,
"brand_voice": 1.5
},
"overallCalculation": "weighted_average",
"minimumThresholds": {
"accuracy": 4.0,
"brand_voice": 3.5
}
}
```

The overall score is calculated as a weighted average: `sum(score * weight) / sum(weights)`. You can also set minimum thresholds for critical metrics; if any threshold isn't met, the case is flagged regardless of its overall score.
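As an illustration, the weighted average might be computed like this minimal sketch; the function and variable names are hypothetical, not part of the PromptReports API:

```python
# Minimal sketch of the weighted-average overall score.
# Function and variable names are illustrative, not the PromptReports API.

def overall_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """sum(score * weight) / sum(weights) over the metrics that were scored."""
    total_weight = sum(weights.get(metric, 1.0) for metric in scores)
    weighted_sum = sum(score * weights.get(metric, 1.0) for metric, score in scores.items())
    return weighted_sum / total_weight

scores = {"relevance": 5, "coherence": 4, "completeness": 4.5, "accuracy": 5, "tone": 5}
weights = {"relevance": 1.5, "coherence": 0.8, "completeness": 1.2, "accuracy": 1.5, "tone": 1.3}

print(round(overall_score(scores, weights), 2))  # 4.78 with these sample weights
```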
Interpreting Scores#
Use these guidelines to translate numeric scores into actionable insights:
| Score Range | Quality Level | Interpretation | Recommended Action |
|---|---|---|---|
| 4.5 - 5.0 | Production-ready | Excellent quality suitable for customer-facing use | Deploy with confidence |
| 4.0 - 4.4 | Good | Solid quality with minor improvement opportunities | Deployable, consider refinements |
| 3.5 - 3.9 | Acceptable | Functional but noticeable issues exist | OK for internal use, improve for external |
| 3.0 - 3.4 | Needs Work | Significant gaps affecting user experience | Address issues before deployment |
| Below 3.0 | Failing | Major problems, not ready for use | Substantial revision required |
Track Trends
Monitor how scores change across versions to ensure continuous improvement.
Watch Variance
High score variance indicates inconsistent prompt behavior that needs attention.
Category Analysis
Break down scores by input category to find specific weak points.
Distribution Shape
Check if scores cluster high (good) or spread out (inconsistent).
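To make those checks concrete, here is a minimal sketch of how a batch of overall scores might be summarized. The band boundaries follow the interpretation table above; the helper names are hypothetical:

```python
# Illustrative sketch: summarize a batch of overall scores.
# Band boundaries follow the interpretation table above; names are hypothetical.
from collections import Counter
from statistics import mean, pstdev

BANDS = [
    (4.5, "Production-ready"),
    (4.0, "Good"),
    (3.5, "Acceptable"),
    (3.0, "Needs Work"),
    (0.0, "Failing"),
]

def band(score: float) -> str:
    """Map an overall score to its quality level."""
    return next(label for floor, label in BANDS if score >= floor)

def summarize(scores: list[float]) -> dict:
    return {
        "mean": round(mean(scores), 2),
        "stdev": round(pstdev(scores), 2),  # high spread = inconsistent behavior
        "distribution": Counter(band(s) for s in scores),
    }

print(summarize([4.7, 4.2, 4.8, 3.4, 4.6, 4.9]))
# e.g. mean 4.43, stdev 0.51, mostly Production-ready with one Needs Work outlier
```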
Quality Thresholds#
Set quality thresholds to create automated quality gates and alerts:
| Use Case | Recommended Threshold | Rationale |
|---|---|---|
| Customer Support | 4.5 | High stakes, brand reputation, user trust |
| Content Marketing | 4.0 | Quality matters but some variation acceptable |
| Internal Tools | 3.5 | Efficiency-focused, expert users can handle imperfection |
| Prototyping | 3.0 | Exploring ideas, not production-ready |
| Data Processing | 4.0 (Accuracy: 4.5) | Overall can be moderate but accuracy critical |
Thresholds are used in multiple contexts:
- Regression Testing: Block promotions if quality drops below threshold
- Evaluation Reports: Highlight cases failing to meet threshold
- Quality Alerts: Notify when production prompts fall below threshold
- CI/CD Integration: Fail builds if evaluations don't pass
- Pass Rate Calculation: Count percentage of cases meeting threshold
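For example, pass rate is simply the fraction of cases whose scores clear every configured threshold; a minimal sketch with hypothetical names, not the PromptReports API:

```python
# Illustrative pass-rate calculation against per-metric thresholds.
# Names are hypothetical, not the PromptReports API.

def passes(case_scores: dict[str, float], thresholds: dict[str, float]) -> bool:
    """A case passes only if every thresholded metric meets its minimum."""
    return all(case_scores.get(metric, 0.0) >= minimum for metric, minimum in thresholds.items())

def pass_rate(cases: list[dict[str, float]], thresholds: dict[str, float]) -> float:
    return sum(passes(c, thresholds) for c in cases) / len(cases)

thresholds = {"overall": 4.0, "accuracy": 4.5}
cases = [
    {"overall": 4.7, "accuracy": 5.0},
    {"overall": 4.2, "accuracy": 4.0},  # fails: accuracy below 4.5
    {"overall": 3.8, "accuracy": 4.8},  # fails: overall below 4.0
]
print(f"{pass_rate(cases, thresholds):.0%}")  # 33%
```

A full threshold configuration combines per-metric minimums with the actions to take: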
```json
{
"thresholds": {
"overall": 4.0,
"accuracy": 4.5,
"brand_voice": 4.0
},
"passRequirements": {
"overall": ">=",
"accuracy": ">="
},
"actions": {
"onThresholdFailed": ["alert_owner", "block_promotion"],
"onAllThresholdsPassed": ["allow_promotion"]
},
"alerts": {
"recipients": ["team@example.com"],
"frequency": "immediate"
}
}
```

Tracking Over Time#
PromptReports tracks quality metrics over time, enabling trend analysis and early detection of degradation:
Score Trends
Visualize how quality changes across versions and over time.
Historical Comparison
Compare any evaluation to historical baselines.
Degradation Alerts
Get notified when quality trends downward over multiple evaluations.
Metric Breakdown
Track individual metrics to see which aspects improve or decline.
Quality tracking helps you:
- Verify that prompt changes actually improve quality
- Detect gradual degradation from model updates or data drift (see the sketch after this list)
- Demonstrate quality improvements to stakeholders
- Identify the best-performing version for rollback if needed
- Build confidence in your prompt development process
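As a rough illustration, degradation detection can be as simple as comparing the average of recent evaluations against an earlier baseline. The names and the 0.2-point tolerance below are arbitrary choices for the sketch, not PromptReports defaults:

```python
# Illustrative degradation check: compare recent evaluations to a baseline window.
# Names and the 0.2-point tolerance are arbitrary choices for this sketch.
from statistics import mean

def is_degrading(history: list[float], window: int = 3, tolerance: float = 0.2) -> bool:
    """True if the average of the last `window` runs drops more than `tolerance`
    below the average of the runs before them."""
    if len(history) < 2 * window:
        return False
    baseline = mean(history[:-window])
    recent = mean(history[-window:])
    return recent < baseline - tolerance

overall_by_run = [4.6, 4.7, 4.6, 4.5, 4.3, 4.2, 4.1]
print(is_degrading(overall_by_run))  # True: recent runs average well below the baseline
```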
Advanced Scoring#
For specialized evaluation needs, PromptReports supports advanced scoring configurations:
Custom Evaluator Prompts
Write your own evaluation prompts for complete control over scoring logic.
Multi-Model Scoring
Use different models for generation and evaluation to reduce bias.
Human-in-the-Loop
Route edge cases to human reviewers for manual scoring.
Programmatic Scoring
Add code-based checks for format validation, length limits, or regex patterns.
```json
{
"scoring": {
"aiScoring": {
"enabled": true,
"model": "gpt-4-turbo",
"customPrompt": "Evaluate this customer service response...",
"temperature": 0.1
},
"programmaticChecks": [
{
"name": "length_check",
"type": "range",
"field": "output_length",
"min": 50,
"max": 500,
"scoreImpact": -1
},
{
"name": "prohibited_words",
"type": "regex_absent",
"pattern": "(?i)(unfortunately|cannot|impossible)",
"scoreImpact": -0.5
}
],
"humanReview": {
"enabled": true,
"triggerCondition": "score < 3.5 OR variance > 1.0",
"assignTo": "quality_team"
}
}
}
```
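To show how programmatic checks like the ones above might behave, here is a minimal sketch; the check semantics and score adjustments are assumptions for illustration, not the actual PromptReports implementation:

```python
# Illustrative sketch of applying programmatic checks to an output.
# The check semantics and score adjustments are assumptions for this example.
import re

def length_check(output: str, min_len: int = 50, max_len: int = 500, impact: float = -1.0) -> float:
    """Penalize outputs outside the allowed length range."""
    return 0.0 if min_len <= len(output) <= max_len else impact

def regex_absent(output: str, pattern: str, impact: float = -0.5) -> float:
    """Penalize outputs that contain a prohibited pattern."""
    return impact if re.search(pattern, output) else 0.0

output = "Unfortunately, returns are only accepted within 30 days of purchase with a valid receipt."
adjustment = (
    length_check(output)
    + regex_absent(output, r"(?i)(unfortunately|cannot|impossible)")
)
print(adjustment)  # -0.5: length is fine, but a prohibited word was found
```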
Quality metrics are the foundation of systematic prompt improvement. With consistent measurement, you can make confident decisions about which prompts to deploy and continuously raise the bar on output quality.