Running Evaluations

Execute your prompts against test datasets, analyze quality metrics, and ensure consistent performance before deployment.

Evaluation Basics#

An evaluation runs your prompt against every row in a test dataset, collecting outputs and quality metrics for analysis. This systematic approach replaces ad-hoc testing with reproducible, quantifiable quality measurement.

Systematic Testing

Every test case is executed with identical conditions for fair comparison.

Quality Metrics

Automatic scoring across relevance, coherence, completeness, and more.

Version Tracking

Compare results across prompt versions to measure improvement.

Fast Execution

Parallel processing enables evaluation of hundreds of cases in minutes.

Running an Evaluation#

Follow these steps to run a comprehensive evaluation:

  1. Select Your Prompt: Navigate to the prompt you want to evaluate. Open the Evaluation tab from the prompt editor or detail view. You can evaluate any prompt version, not just the latest.
  2. Choose a Test Dataset: Select a dataset from your available options. You can filter by dataset name, tags, or the prompt it's associated with. Consider using a pinned version for reproducibility.
  3. Configure Settings: Adjust model parameters (temperature, max tokens), parallel request settings, and scoring criteria. Use production-like settings for realistic results.
  4. Start the Evaluation: Click Run Evaluation to begin processing all test cases. A progress indicator shows completion status. You can navigate away; results are saved automatically.
  5. Review Results: Once complete, view aggregate metrics and drill into individual test case results. Use filters to focus on specific categories or problem areas.
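
If you trigger runs from a script or CI pipeline rather than the UI, the same steps can be driven programmatically. The sketch below is hypothetical: the base URL, endpoint path, payload field names, and the PROMPTREPORTS_API_KEY variable are assumptions for illustration, not a documented PromptReports API.

```python
# Hypothetical sketch: starting an evaluation from a script.
# Endpoint, payload fields, and environment variable are assumptions.
import os
import requests

API_BASE = "https://api.promptreports.example/v1"  # placeholder base URL
HEADERS = {"Authorization": f"Bearer {os.environ['PROMPTREPORTS_API_KEY']}"}

payload = {
    "promptId": "prompt_xyz",     # the prompt and version to evaluate
    "promptVersion": 5,
    "datasetId": "dataset_456",   # a pinned dataset version for reproducibility
    "datasetVersion": 2,
}

response = requests.post(f"{API_BASE}/evaluations", json=payload, headers=HEADERS, timeout=30)
response.raise_for_status()
print("Evaluation started:", response.json().get("evaluationId"))
```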

Evaluation Settings#

Configure these settings before running an evaluation to control execution behavior and scoring:

| Setting | Description | Recommendation |
| --- | --- | --- |
| Model | AI model to use for this evaluation | Use the same model you'll use in production |
| Temperature | Response randomness (0-1) | Lower values (0.3-0.5) for more consistent, reproducible results |
| Max Tokens | Maximum response length | Set based on expected output length plus a 20% buffer |
| Parallel Requests | Number of concurrent API calls | 3-5 balances speed and reliability; higher risks rate limits |
| Request Timeout | Maximum wait time per request | 30-60 seconds for complex prompts; increase for long outputs |
| Retry Failed | Automatically retry failed requests | Enable with 2-3 retries and exponential backoff |
| Auto-Score | Enable automatic quality scoring | Yes, unless you prefer manual review |
| Score Model | Model used for scoring (if different) | Use a capable model like GPT-4 for accurate scoring |

Example Evaluation Configuration
```json
{
  "model": "gpt-4-turbo",
  "temperature": 0.4,
  "maxTokens": 2000,
  "parallelRequests": 5,
  "timeoutMs": 45000,
  "retryConfig": {
    "maxRetries": 3,
    "backoffMultiplier": 2
  },
  "scoring": {
    "enabled": true,
    "model": "gpt-4-turbo",
    "metrics": ["relevance", "coherence", "completeness", "accuracy", "tone"]
  }
}
```
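
To make the retry settings concrete, here is a minimal sketch of the exponential backoff behaviour they imply. The one-second base delay and the jitter are assumptions; PromptReports' internal retry logic may differ.

```python
# Sketch of exponential-backoff retries as configured above
# (maxRetries: 3, backoffMultiplier: 2). Base delay and jitter are assumptions.
import random
import time

def call_with_retries(request_fn, max_retries=3, backoff_multiplier=2, base_delay_s=1.0):
    """Call request_fn, retrying failures with exponentially growing delays."""
    for attempt in range(max_retries + 1):
        try:
            return request_fn()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries; surface the last error
            delay = base_delay_s * (backoff_multiplier ** attempt)
            time.sleep(delay + random.uniform(0, 0.5))  # small jitter avoids retry bursts

# With these defaults, failed attempts wait roughly 1s, 2s, then 4s before giving up.
```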

Parallel Processing#

PromptReports processes multiple test cases simultaneously to reduce evaluation time. Configure parallelization based on your API rate limits and reliability requirements.

Speed vs. Reliability

Higher parallelism is faster but increases risk of rate limiting or errors.

Automatic Retry

Failed requests are automatically retried with exponential backoff.

Batch Ordering

Cases are processed in parallel but results maintain original order.

Rate Limit Awareness

Built-in throttling respects API rate limits to prevent failures.

| Parallel Requests | Speed | Reliability | Best For |
| --- | --- | --- | --- |
| 1-2 | Slow | Highest | Testing settings, debugging issues |
| 3-5 | Medium | High | Standard evaluations, most use cases |
| 6-10 | Fast | Medium | Large datasets, high API limits |
| 10+ | Fastest | Lower | Enterprise accounts with high limits |
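
The sketch below shows one way bounded parallelism with order-preserving results can work in plain Python. It illustrates the concept only; it is not PromptReports' internal implementation, and run_case is a stand-in for your actual model call.

```python
# Illustrative only: bounded parallel execution with results kept in dataset order.
from concurrent.futures import ThreadPoolExecutor

def run_case(case: dict) -> dict:
    # Stand-in for a real model call; returns the output plus case metadata.
    return {"caseId": case["caseId"], "output": f"response to {case['input']}"}

test_cases = [{"caseId": f"case_{i}", "input": f"question {i}"} for i in range(1, 101)]

# max_workers=5 mirrors "Parallel Requests: 5": at most five cases are in flight
# at once, and executor.map yields results in the original input order even
# though individual cases finish out of order.
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(run_case, test_cases))

print(results[0]["caseId"], results[-1]["caseId"])  # case_1 case_100 — order preserved
```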

Understanding Results#

After an evaluation completes, you'll see a comprehensive results dashboard:

| Metric | Description | Healthy Range |
| --- | --- | --- |
| Overall Score | Weighted average across all quality metrics | 4.0+ for production-ready prompts |
| Pass Rate | Percentage of cases meeting the quality threshold (default: 4.0) | 90%+ for critical customer-facing prompts |
| Avg Response Time | Mean response latency across all cases | Under 5 seconds for interactive use |
| Token Usage | Total input/output tokens consumed | Monitor for cost optimization |
| Error Rate | Percentage of failed API requests | Under 1% with retries enabled |
| Score Variance | Standard deviation of quality scores | Low variance indicates consistent behavior |

Score Distribution

Histogram showing how scores are distributed across test cases.

Category Breakdown

See average scores per category to identify problem areas.

Low Score Alerts

Automatic highlighting of cases scoring below threshold.

Performance Metrics

Latency percentiles (p50, p90, p99) for understanding response times.
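
As a rough illustration of how these aggregates relate to per-case results, the snippet below computes a pass rate, score spread, and latency percentiles from a few toy cases shaped like the export format shown later on this page.

```python
# Toy illustration of the aggregate metrics, computed from per-case results.
# Field names follow the export example under "Exporting Results".
import statistics

cases = [
    {"scores": {"relevance": 4, "coherence": 5, "completeness": 4}, "latencyMs": 2100},
    {"scores": {"relevance": 3, "coherence": 4, "completeness": 3}, "latencyMs": 2600},
    {"scores": {"relevance": 5, "coherence": 5, "completeness": 4}, "latencyMs": 1900},
    {"scores": {"relevance": 4, "coherence": 4, "completeness": 5}, "latencyMs": 2300},
]

case_scores = [statistics.mean(c["scores"].values()) for c in cases]
latencies = sorted(c["latencyMs"] for c in cases)

overall_score = statistics.mean(case_scores)
pass_rate = sum(s >= 4.0 for s in case_scores) / len(case_scores)  # default threshold 4.0
score_spread = statistics.pstdev(case_scores)                      # "Score Variance" column
p50 = statistics.median(latencies)
p90 = latencies[round(0.9 * (len(latencies) - 1))]                 # simple nearest-rank percentile

print(f"overall {overall_score:.2f}, pass rate {pass_rate:.0%}, "
      f"stddev {score_spread:.2f}, p50 {p50} ms, p90 {p90} ms")
```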

Reviewing Individual Outputs#

Drill down into individual test case results to understand quality patterns and identify specific improvements:

Full Output View

See the complete response with syntax highlighting and formatting.

Score Breakdown

View individual metric scores with detailed scoring rationale.

Input Context

See the exact variable values and context used for each case.

Performance Data

Response time and token counts for each individual execution.

For each test case, you can:

  • Compare to Expected: Side-by-side view of actual vs. expected output
  • View Scoring Rationale: Detailed explanation of why each metric received its score
  • Flag for Review: Mark cases that need human review or prompt adjustment
  • Add Notes: Document observations for future reference
  • Re-run Single Case: Execute just this case with different settings

Comparing Evaluations#

Compare evaluation results across different prompt versions, models, or settings to make data-driven decisions:

  1. Select Evaluations to Compare: Choose two or more completed evaluations from your history. They should use the same or similar datasets for meaningful comparison.
  2. View Side-by-Side Metrics: See aggregate metrics compared in a table, with color coding to highlight improvements and regressions.
  3. Analyze Case-by-Case Differences: For each test case, view both outputs and their respective scores to understand where versions differ.
  4. Generate Comparison Report: Create a summary report documenting the comparison results for stakeholders or future reference.
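
For instance, a minimal sketch of the side-by-side comparison in step 2 could diff the summary blocks of two exported evaluations (field names follow the export example below; both metrics shown here are "higher is better"):

```python
# Minimal sketch: diff two evaluation summaries metric by metric.
# Both metrics are "higher is better"; latency would need the opposite rule.
baseline = {"overallScore": 4.10, "passRate": 0.90}
candidate = {"overallScore": 4.23, "passRate": 0.92}

for metric, old in baseline.items():
    new = candidate[metric]
    change_pct = (new - old) / old * 100
    verdict = "improvement" if new > old else "regression" if new < old else "unchanged"
    print(f"{metric}: {old} -> {new} ({change_pct:+.1f}%, {verdict})")
```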

Regression Testing#

Regression testing automatically compares new prompt versions against a baseline to prevent quality degradation:

Baseline Comparison

New versions are compared against the current production baseline.

Regression Detection

Automatic alerts when quality drops below the baseline threshold.

Quality Gates

Block promotions if quality regresses more than the allowed percentage.

Promotion Blocking

Prevent deployment of versions that fail regression tests.

| Threshold | Behavior | Use Case |
| --- | --- | --- |
| 0% (No Regression) | Block if any metric is lower | Critical customer-facing prompts |
| 5% (Default) | Block if the overall score drops more than 5% | Standard production prompts |
| 10% (Lenient) | Allow small regressions for faster iteration | Internal tools, experimentation |
| Custom per Metric | Different thresholds for each quality metric | Nuanced quality requirements |
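
In essence, the default gate amounts to a check like the following sketch (the function name is illustrative, not part of the product):

```python
# Sketch of the 5% default regression gate: block promotion when the candidate's
# overall score drops more than the allowed fraction below the baseline.
def passes_regression_gate(baseline_score: float, candidate_score: float,
                           allowed_drop: float = 0.05) -> bool:
    """Return True if the candidate version may be promoted."""
    return candidate_score >= baseline_score * (1 - allowed_drop)

print(passes_regression_gate(4.20, 4.05))  # True:  ~3.6% drop, within the 5% allowance
print(passes_regression_gate(4.20, 3.90))  # False: ~7.1% drop, promotion blocked
```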

To set up regression testing:

  • Navigate to your prompt's Settings tab
  • Enable Regression Testing
  • Select a baseline version (usually the current production version)
  • Choose a test dataset to use for regression checks
  • Set your regression threshold (default: 5%)
  • Optionally enable automatic regression testing on version creation

Exporting Results#

Export evaluation results for further analysis, reporting, or integration with external tools:

| Format | Contents | Best For |
| --- | --- | --- |
| CSV | All test cases with inputs, outputs, and scores | Spreadsheet analysis, sharing with stakeholders |
| JSON | Complete evaluation data including metadata | Integration with other tools, programmatic analysis |
| PDF Report | Formatted summary with charts and key metrics | Executive summaries, documentation, audits |
| HTML Report | Interactive report for web viewing | Sharing via email or internal portals |

Export JSON Structure
```json
{
  "evaluationId": "eval_abc123",
  "promptId": "prompt_xyz",
  "promptVersion": 5,
  "datasetId": "dataset_456",
  "datasetVersion": 2,
  "runAt": "2024-01-15T10:30:00Z",
  "config": { /* evaluation settings */ },
  "summary": {
    "totalCases": 100,
    "passRate": 0.92,
    "overallScore": 4.23,
    "avgLatencyMs": 2340,
    "totalTokens": 125000
  },
  "results": [
    {
      "caseId": "case_1",
      "input": { /* input variables */ },
      "output": "...",
      "expectedOutput": "...",
      "scores": { "relevance": 4, "coherence": 5, /* ... */ },
      "latencyMs": 2100,
      "tokensUsed": 1250
    }
    /* ... more cases */
  ]
}
```
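
Once exported, the JSON is straightforward to post-process. The snippet below, assuming the export was saved to a local file, lists the cases whose average score falls below the default 4.0 threshold:

```python
# Load an exported evaluation and flag cases below the pass threshold.
# The filename is an assumption; use whatever path you exported to.
import json
from statistics import mean

with open("eval_abc123.json", encoding="utf-8") as f:
    export = json.load(f)

threshold = 4.0
print(f"Evaluation {export['evaluationId']}: pass rate {export['summary']['passRate']:.0%}")
for case in export["results"]:
    avg = mean(case["scores"].values())
    if avg < threshold:
        print(f"  {case['caseId']}: avg score {avg:.2f}, latency {case['latencyMs']} ms")
```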

Scheduling Evaluations#

Schedule evaluations to run automatically for continuous quality monitoring:

Periodic Runs

Schedule daily, weekly, or monthly evaluations on production prompts.

Quality Alerts

Get notified if scheduled evaluations detect quality degradation.

Trend Analysis

Track quality metrics over time to spot gradual degradation.

Auto-Baseline

Automatically update baselines when quality improves.

Scheduled evaluations are especially useful for:

  • Model Updates: Detect if upstream model changes affect your prompt's quality
  • Long-Term Monitoring: Track quality trends over weeks or months
  • Compliance: Maintain audit trails of quality measurements
  • SLA Monitoring: Ensure prompts consistently meet quality requirements
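
If you also pull scheduled results into your own monitoring, a simple alert check might look like the sketch below. The score floor, allowed drop, and function name are assumptions for illustration, not built-in PromptReports behaviour.

```python
# Hypothetical sketch: alert when a scheduled run's overall score falls below a
# floor or drops noticeably compared with the previous run. Thresholds are assumptions.
def check_quality(previous_score: float, latest_score: float,
                  floor: float = 4.0, max_drop: float = 0.05) -> list[str]:
    """Return a list of alert messages (empty when quality looks healthy)."""
    alerts = []
    if latest_score < floor:
        alerts.append(f"Overall score {latest_score:.2f} is below the {floor:.1f} floor")
    if latest_score < previous_score * (1 - max_drop):
        alerts.append(f"Score dropped more than {max_drop:.0%} since the last scheduled run")
    return alerts

# 4.05 is within the 4.0 floor but more than 5% below 4.30, so one alert is raised.
print(check_quality(previous_score=4.30, latest_score=4.05))
```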