Quality Metrics
Understand and configure the metrics used to evaluate prompt quality, from built-in standards to custom domain-specific criteria.
Understanding Quality Metrics#
Quality metrics are standardized measurements used to evaluate how well a prompt's output meets expectations. They transform subjective quality assessment into objective, comparable numbers—enabling data-driven prompt optimization.
PromptReports uses AI-powered evaluation to score outputs automatically, providing:
Consistent Measurement
Same criteria applied uniformly across all evaluations for fair comparison.
Intelligent Assessment
AI evaluator understands context and nuance, not just keyword matching.
Multi-Dimensional
Capture different aspects of quality with multiple complementary metrics.
Explainable Scores
Get written rationale for each score, not just numbers.
Built-in Metrics#
PromptReports includes these standard quality metrics that work across most use cases:
| Metric | What It Measures | Key Indicators |
|---|---|---|
| Relevance | Does the response address what was asked? | On-topic, answers the question, doesn't go off on tangents |
| Coherence | Is the response well-organized and clear? | Logical flow, clear transitions, easy to follow |
| Completeness | Are all required elements present? | Covers all parts of the request, nothing important missing |
| Accuracy | Is the information factually correct? | True statements, correct calculations, valid reasoning |
| Tone | Does it match the expected style? | Appropriate formality, consistent voice, suitable sentiment |
| Overall | Aggregate quality assessment | Weighted combination of all metrics |
Metric Scoring Scale#
Each metric is scored on a consistent 1-5 scale:
| Score | Rating | Description | Action Needed |
|---|---|---|---|
| 5 | Excellent | Exceeds expectations, exemplary quality, no issues | None - this is the goal |
| 4 | Good | Meets expectations with only minor issues | Minor refinements possible |
| 3 | Acceptable | Functional but with notable issues | Improvements recommended |
| 2 | Poor | Significant problems affecting usefulness | Prompt revision needed |
| 1 | Failing | Unacceptable, major issues present | Significant rework required |
Half-point scores (e.g., 3.5, 4.5) are used when quality falls between two levels. The evaluator provides specific rationale for each score, explaining exactly what pushed the score up or down.
How Scoring Works#
PromptReports uses AI-powered evaluation to score outputs. Here's how the process works:
1. Input Analysis: the evaluator reviews the test case input (and any expected output provided).
2. Output Examination: the prompt's actual output is read in full.
3. Expected Comparison: the output is compared against the expected result or reference, where one exists.
4. Metric Assessment: each configured metric is scored on the 1-5 scale.
5. Rationale Generation: a written rationale is produced for every score.
6. Aggregate Calculation: the individual scores are combined into the overall score.
```json
{
"inputSummary": "Customer asking about return policy for electronics",
"outputLength": 245,
"scores": {
"relevance": {
"score": 5,
"rationale": "Response directly addresses the return policy question with specific details about electronics. No off-topic content."
},
"coherence": {
"score": 4,
"rationale": "Well-organized with clear steps. Minor issue: could benefit from numbered list format."
},
"completeness": {
"score": 4.5,
"rationale": "Covers return window, process, and conditions. Slight deduction for not mentioning receipt requirements."
},
"accuracy": {
"score": 5,
"rationale": "All policy details match expected reference. No factual errors."
},
"tone": {
"score": 5,
"rationale": "Friendly, helpful, and professional. Appropriate for customer support context."
}
},
"overall": 4.7,
"summary": "High-quality response that effectively addresses the customer's question with accurate policy information and appropriate tone."
}
```

Custom Metrics#
Create custom evaluation criteria for domain-specific quality requirements that built-in metrics don't capture:
1. Define the Metric: give it a name and a description of what it measures.
2. Specify Scoring Criteria: describe what each level of the 1-5 scale looks like for this metric.
3. Provide Examples: supply sample outputs that illustrate high, middle, and low scores.
4. Set Weight: decide how heavily the metric counts toward the overall score.
5. Test & Calibrate: run the metric against known outputs and adjust the criteria until scores match your own judgment.
```json
{
"name": "Brand Voice",
"description": "Alignment with company brand guidelines and communication standards",
"scoringCriteria": {
"5": "Perfect brand voice: uses all approved terminology, maintains consistent tone, embodies company values throughout",
"4": "Strong brand voice with minor deviations: mostly aligned with guidelines, occasional neutral language",
"3": "Acceptable brand voice: recognizable as company content but lacks some brand elements",
"2": "Weak brand voice: generic language, missing brand personality, some off-brand phrasing",
"1": "Off-brand: uses prohibited terms, inappropriate tone, or contradicts brand values"
},
"evaluationGuidelines": [
"Check for use of approved terminology (see brand guide)",
"Verify consistent first-person plural ('we') usage",
"Confirm positive, solution-oriented framing",
"Watch for prohibited phrases from blocklist"
],
"examples": {
"5": "We're here to help! Our team will have this sorted for you within 24 hours.",
"3": "Your issue has been noted. Someone will contact you.",
"1": "Unfortunately, that's not possible. You'll have to wait."
},
"weight": 1.5
}
```

Safety & Compliance
Check for prohibited content, required disclosures, or regulatory compliance.
Creativity
Measure originality, engaging language, or innovative problem-solving.
Format Adherence
Verify output follows required structure, length limits, or formatting rules.
Technical Accuracy
Assess domain-specific correctness for technical or specialized content.
Weighted Scoring#
Not all metrics are equally important for every use case. Configure weights to emphasize what matters most:
| Metric | Default Weight | Example: Support Bot | Example: Content Writer |
|---|---|---|---|
| Relevance | 1.0 | 1.5 (critical to answer correctly) | 1.0 (standard) |
| Coherence | 1.0 | 0.8 (less critical) | 1.5 (very important for readability) |
| Completeness | 1.0 | 1.2 (need full answers) | 1.0 (standard) |
| Accuracy | 1.0 | 1.5 (must be correct) | 0.8 (creative latitude) |
| Tone | 1.0 | 1.3 (brand representation) | 1.2 (engagement) |
```json
{
"weights": {
"relevance": 1.5,
"coherence": 0.8,
"completeness": 1.2,
"accuracy": 1.5,
"tone": 1.3,
"brand_voice": 1.5
},
"overallCalculation": "weighted_average",
"minimumThresholds": {
"accuracy": 4.0,
"brand_voice": 3.5
}
}
```

The overall score is calculated as a weighted average: `sum(score * weight) / sum(weights)`. You can also set minimum thresholds for critical metrics; if any threshold isn't met, the case is flagged regardless of its overall score.
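As an illustration, the weighted average might be computed like this minimal sketch; the function and variable names are hypothetical, not part of the PromptReports API:

```python
# Minimal sketch of the weighted-average overall score.
# Function and variable names are illustrative, not the PromptReports API.

def overall_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """sum(score * weight) / sum(weights) over the metrics that were scored."""
    total_weight = sum(weights.get(metric, 1.0) for metric in scores)
    weighted_sum = sum(score * weights.get(metric, 1.0) for metric, score in scores.items())
    return weighted_sum / total_weight

scores = {"relevance": 5, "coherence": 4, "completeness": 4.5, "accuracy": 5, "tone": 5}
weights = {"relevance": 1.5, "coherence": 0.8, "completeness": 1.2, "accuracy": 1.5, "tone": 1.3}

print(round(overall_score(scores, weights), 2))  # 4.78 with these sample weights
```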
Interpreting Scores#
Use these guidelines to translate numeric scores into actionable insights:
| Score Range | Quality Level | Interpretation | Recommended Action |
|---|---|---|---|
| 4.5 - 5.0 | Production-ready | Excellent quality suitable for customer-facing use | Deploy with confidence |
| 4.0 - 4.4 | Good | Solid quality with minor improvement opportunities | Deployable, consider refinements |
| 3.5 - 3.9 | Acceptable | Functional but noticeable issues exist | OK for internal use, improve for external |
| 3.0 - 3.4 | Needs Work | Significant gaps affecting user experience | Address issues before deployment |
| Below 3.0 | Failing | Major problems, not ready for use | Substantial revision required |
Track Trends
Monitor how scores change across versions to ensure continuous improvement.
Watch Variance
High score variance indicates inconsistent prompt behavior that needs attention.
Category Analysis
Break down scores by input category to find specific weak points.
Distribution Shape
Check if scores cluster high (good) or spread out (inconsistent).
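To make those checks concrete, here is a minimal sketch of how a batch of overall scores might be summarized. The band boundaries follow the interpretation table above; the helper names are hypothetical:

```python
# Illustrative sketch: summarize a batch of overall scores.
# Band boundaries follow the interpretation table above; names are hypothetical.
from collections import Counter
from statistics import mean, pstdev

BANDS = [
    (4.5, "Production-ready"),
    (4.0, "Good"),
    (3.5, "Acceptable"),
    (3.0, "Needs Work"),
    (0.0, "Failing"),
]

def band(score: float) -> str:
    """Map an overall score to its quality level."""
    return next(label for floor, label in BANDS if score >= floor)

def summarize(scores: list[float]) -> dict:
    return {
        "mean": round(mean(scores), 2),
        "stdev": round(pstdev(scores), 2),  # high spread = inconsistent behavior
        "distribution": Counter(band(s) for s in scores),
    }

print(summarize([4.7, 4.2, 4.8, 3.4, 4.6, 4.9]))
# e.g. mean 4.43, stdev 0.51, mostly Production-ready with one Needs Work outlier
```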
Quality Thresholds#
Set quality thresholds to create automated quality gates and alerts:
| Use Case | Recommended Threshold | Rationale |
|---|---|---|
| Customer Support | 4.5 | High stakes, brand reputation, user trust |
| Content Marketing | 4.0 | Quality matters but some variation acceptable |
| Internal Tools | 3.5 | Efficiency-focused, expert users can handle imperfection |
| Prototyping | 3.0 | Exploring ideas, not production-ready |
| Data Processing | 4.0 (Accuracy: 4.5) | Overall can be moderate but accuracy critical |
Thresholds are used in multiple contexts:
- Regression Testing: Block promotions if quality drops below threshold
- Evaluation Reports: Highlight cases failing to meet threshold
- Quality Alerts: Notify when production prompts fall below threshold
- CI/CD Integration: Fail builds if evaluations don't pass
- Pass Rate Calculation: Count percentage of cases meeting threshold
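For example, pass rate is simply the fraction of cases whose scores clear every configured threshold; a minimal sketch with hypothetical names, not the PromptReports API:

```python
# Illustrative pass-rate calculation against per-metric thresholds.
# Names are hypothetical, not the PromptReports API.

def passes(case_scores: dict[str, float], thresholds: dict[str, float]) -> bool:
    """A case passes only if every thresholded metric meets its minimum."""
    return all(case_scores.get(metric, 0.0) >= minimum for metric, minimum in thresholds.items())

def pass_rate(cases: list[dict[str, float]], thresholds: dict[str, float]) -> float:
    return sum(passes(c, thresholds) for c in cases) / len(cases)

thresholds = {"overall": 4.0, "accuracy": 4.5}
cases = [
    {"overall": 4.7, "accuracy": 5.0},
    {"overall": 4.2, "accuracy": 4.0},  # fails: accuracy below 4.5
    {"overall": 3.8, "accuracy": 4.8},  # fails: overall below 4.0
]
print(f"{pass_rate(cases, thresholds):.0%}")  # 33%
```

A full threshold configuration combines per-metric minimums with the actions to take: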
```json
{
"thresholds": {
"overall": 4.0,
"accuracy": 4.5,
"brand_voice": 4.0
},
"passRequirements": {
"overall": ">=",
"accuracy": ">="
},
"actions": {
"onThresholdFailed": ["alert_owner", "block_promotion"],
"onAllThresholdsPassed": ["allow_promotion"]
},
"alerts": {
"recipients": ["team@example.com"],
"frequency": "immediate"
}
}
```

Tracking Over Time#
PromptReports tracks quality metrics over time, enabling trend analysis and early detection of degradation:
Score Trends
Visualize how quality changes across versions and over time.
Historical Comparison
Compare any evaluation to historical baselines.
Degradation Alerts
Get notified when quality trends downward over multiple evaluations.
Metric Breakdown
Track individual metrics to see which aspects improve or decline.
Quality tracking helps you:
- Verify that prompt changes actually improve quality
- Detect gradual degradation from model updates or data drift (see the sketch after this list)
- Demonstrate quality improvements to stakeholders
- Identify the best-performing version for rollback if needed
- Build confidence in your prompt development process
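As a rough illustration, degradation detection can be as simple as comparing the average of recent evaluations against an earlier baseline. The names and the 0.2-point tolerance below are arbitrary choices for the sketch, not PromptReports defaults:

```python
# Illustrative degradation check: compare recent evaluations to a baseline window.
# Names and the 0.2-point tolerance are arbitrary choices for this sketch.
from statistics import mean

def is_degrading(history: list[float], window: int = 3, tolerance: float = 0.2) -> bool:
    """True if the average of the last `window` runs drops more than `tolerance`
    below the average of the runs before them."""
    if len(history) < 2 * window:
        return False
    baseline = mean(history[:-window])
    recent = mean(history[-window:])
    return recent < baseline - tolerance

overall_by_run = [4.6, 4.7, 4.6, 4.5, 4.3, 4.2, 4.1]
print(is_degrading(overall_by_run))  # True: recent runs average well below the baseline
```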
Advanced Scoring#
For specialized evaluation needs, PromptReports supports advanced scoring configurations:
Custom Evaluator Prompts
Write your own evaluation prompts for complete control over scoring logic.
Multi-Model Scoring
Use different models for generation and evaluation to reduce bias.
Human-in-the-Loop
Route edge cases to human reviewers for manual scoring.
Programmatic Scoring
Add code-based checks for format validation, length limits, or regex patterns.
```json
{
"scoring": {
"aiScoring": {
"enabled": true,
"model": "gpt-4-turbo",
"customPrompt": "Evaluate this customer service response...",
"temperature": 0.1
},
"programmaticChecks": [
{
"name": "length_check",
"type": "range",
"field": "output_length",
"min": 50,
"max": 500,
"scoreImpact": -1
},
{
"name": "prohibited_words",
"type": "regex_absent",
"pattern": "(?i)(unfortunately|cannot|impossible)",
"scoreImpact": -0.5
}
],
"humanReview": {
"enabled": true,
"triggerCondition": "score < 3.5 OR variance > 1.0",
"assignTo": "quality_team"
}
}
}
```
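To show how programmatic checks like the ones above might behave, here is a minimal sketch; the check semantics and score adjustments are assumptions for illustration, not the actual PromptReports implementation:

```python
# Illustrative sketch of applying programmatic checks to an output.
# The check semantics and score adjustments are assumptions for this example.
import re

def length_check(output: str, min_len: int = 50, max_len: int = 500, impact: float = -1.0) -> float:
    """Penalize outputs outside the allowed length range."""
    return 0.0 if min_len <= len(output) <= max_len else impact

def regex_absent(output: str, pattern: str, impact: float = -0.5) -> float:
    """Penalize outputs that contain a prohibited pattern."""
    return impact if re.search(pattern, output) else 0.0

output = "Unfortunately, returns are only accepted within 30 days of purchase with a valid receipt."
adjustment = (
    length_check(output)
    + regex_absent(output, r"(?i)(unfortunately|cannot|impossible)")
)
print(adjustment)  # -0.5: length is fine, but a prohibited word was found
```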
Quality metrics are the foundation of systematic prompt improvement. With consistent measurement, you can make confident decisions about which prompts to deploy and continuously raise the bar on output quality.