A/B Testing
Compare prompt versions with statistical rigor to make confident, data-driven decisions about which performs better.
What is A/B Testing?#
A/B testing (also called split testing) compares two prompt versions by running them against the same inputs and statistically analyzing which performs better. Unlike casual comparison, A/B testing provides mathematical confidence that differences are real, not due to random variation.
- Controlled Comparison: Both versions run on identical inputs for fair, unbiased comparison.
- Statistical Analysis: Get p-values, confidence intervals, and effect sizes for your results.
- Clear Winner: Know with 95%+ confidence which version performs better.
- Quantify Improvement: Measure exactly how much better one version is (e.g., 12% improvement).
A/B vs. Batch Evaluation
A/B testing compares two specific versions head-to-head on the same inputs to pick a winner; batch evaluation scores a single version across a dataset to gauge overall quality. Reach for A/B testing when choosing between versions, and batch evaluation for general quality checks.
When to Use A/B Testing#
A/B testing is most valuable when you need to make a decision between two approaches and want confidence in your choice:
| Scenario | Use A/B Testing? | Why |
|---|---|---|
| Choosing between two prompt approaches | Yes | Get statistical confidence in which is better |
| Validating a prompt change improves quality | Yes | Confirm improvement is real, not noise |
| Comparing different AI models | Yes | Understand which model works best for your use case |
| Quick feedback during development | No | Use Playground for fast iteration |
| General quality check before deployment | No | Use batch evaluation for overall quality |
| Regression testing against baseline | Maybe | A/B test for close calls, regression for gates |
Setting Up a Test#
Follow these steps to run a rigorous A/B test:
1. Define Your Hypothesis
2. Select Version A (Control)
3. Select Version B (Challenger)
4. Choose Test Dataset
5. Configure Success Metrics
6. Run the Test
An example test configuration:

```json
{
"testName": "Step-by-step vs. General Instructions",
"hypothesis": "Explicit steps will improve completeness",
"versionA": {
"promptId": "prompt_123",
"version": 5,
"label": "General instructions (control)"
},
"versionB": {
"promptId": "prompt_123",
"version": 6,
"label": "Step-by-step instructions (challenger)"
},
"datasetId": "dataset_456",
"primaryMetric": "completeness",
"secondaryMetrics": ["coherence", "overall"],
"confidenceLevel": 0.95,
"minimumEffectSize": 0.1
}
```

Sample Size & Power#
Sample size determines how small a difference you can reliably detect. With too few test cases, you might miss real improvements or falsely conclude there's no difference.
| Sample Size | Detectable Effect | Statistical Power | Use Case |
|---|---|---|---|
| 30 cases | Large (0.5+ points) | ~80% | Quick checks, obvious differences |
| 50 cases | Medium (0.3+ points) | ~80% | Standard A/B tests |
| 100 cases | Small (0.2+ points) | ~80% | Precise comparison, subtle changes |
| 200+ cases | Very small (0.1+ points) | ~90% | High-stakes decisions, fine-tuning |
Power Calculator
A power calculation (sketched below this list) tells you how many test cases you need, based on:
- The minimum effect size you care about detecting
- Your desired confidence level (typically 95%)
- Historical variance in your quality scores
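A minimal sketch of such a calculation, using a normal approximation for comparing two versions' mean scores; scipy is assumed to be available, and the standard deviation in the example is a placeholder for your own historical data:

```python
from scipy.stats import norm

def required_sample_size(min_detectable_diff, score_std, alpha=0.05, power=0.8):
    """Approximate test cases needed per version to detect a given raw-score difference."""
    effect_size = min_detectable_diff / score_std     # standardize the difference
    z_alpha = norm.ppf(1 - alpha / 2)                 # two-sided significance threshold
    z_power = norm.ppf(power)                         # required statistical power
    n = 2 * ((z_alpha + z_power) / effect_size) ** 2  # per-version sample size
    return int(n) + 1

# Example: detect a 0.3-point difference, assuming scores vary with a std dev of ~0.55
print(required_sample_size(0.3, 0.55))  # ~53 cases per version
```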
Statistical Significance#
A/B testing uses statistical methods to determine whether observed differences are real or just random noise. The table below defines the key concepts; a short computation sketch follows it:
| Concept | Description | Target Value |
|---|---|---|
| p-value | Probability of seeing a difference this large if there were truly no difference | < 0.05 (5%) |
| Confidence Interval | Range likely containing the true difference | Narrow range, not crossing zero |
| Effect Size | Magnitude of the difference (in score points) | Depends on your requirements |
| Statistical Power | Ability to detect real differences | > 0.8 (80%) |
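Because both versions run on identical inputs, the per-case scores are paired. A minimal sketch of how these statistics could be computed from two score lists, assuming numpy and scipy are available (function and field names here are illustrative, not the platform's API):

```python
import numpy as np
from scipy import stats

def compare_versions(scores_a, scores_b, alpha=0.05):
    """Paired comparison of per-case quality scores for two prompt versions."""
    a, b = np.asarray(scores_a, dtype=float), np.asarray(scores_b, dtype=float)
    diffs = b - a                               # positive values mean B scored higher

    t_stat, p_value = stats.ttest_rel(b, a)     # paired t-test on the same inputs
    mean_diff = diffs.mean()

    # Confidence interval for the mean difference
    ci_low, ci_high = stats.t.interval(1 - alpha, len(diffs) - 1,
                                       loc=mean_diff, scale=stats.sem(diffs))

    # Effect size: mean difference in units of the differences' std dev (paired Cohen's d)
    effect_size = mean_diff / diffs.std(ddof=1)

    return {
        "difference": round(mean_diff, 3),
        "percentImprovement": round(100 * mean_diff / a.mean(), 1),
        "pValue": round(p_value, 4),
        "confidenceInterval": [round(ci_low, 2), round(ci_high, 2)],
        "effectSize": round(effect_size, 2),
        "significant": p_value < alpha,
    }
```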
Example results from a completed test:

```json
{
"comparison": {
"versionA_mean": 4.12,
"versionB_mean": 4.38,
"difference": 0.26,
"percentImprovement": 6.3
},
"statistics": {
"pValue": 0.023,
"confidenceInterval": [0.04, 0.48],
"effectSize": 0.34,
"statisticalPower": 0.85
},
"conclusion": "statistically_significant",
"winner": "versionB",
"recommendation": "Version B shows a statistically significant improvement of 6.3% in overall quality. The difference is unlikely to be due to chance (p = 0.023). Recommend deploying Version B."
}
```

Don't Cherry-Pick Metrics
Declare your primary metric before the test runs. Checking many metrics afterward and reporting whichever looks best inflates the false positive rate.
Interpreting Results#
After the test completes, you'll see one of four outcomes; the decision logic is sketched after the list:
- Clear Winner (p < 0.05): One version is statistically significantly better. Safe to deploy the winner with high confidence.
- No Significant Difference: Versions perform similarly. Choose based on other factors like simplicity or cost.
- Marginal Significance (p < 0.10): Suggestive but not conclusive. Consider running with more test cases.
- Inconclusive (Low Power): Not enough data to detect differences. Run with a larger dataset.
The results dashboard shows:
- Winner Declaration: Which version won (if any) with confidence level
- Score Comparison: Side-by-side metrics with improvement percentages
- Statistical Details: p-values, confidence intervals, effect sizes
- Distribution Visualization: Overlapping histograms showing score distributions
- Case-by-Case Breakdown: How each version performed on specific inputs
Pairwise Comparison#
While A/B testing provides aggregate statistics, pairwise comparison lets you examine how versions perform on each individual test case:
- Side-by-Side Outputs: View both responses for each input to understand qualitative differences.
- Per-Case Winner: See which version won each case and by how much.
- Win/Loss/Tie Counts: Overall tally of how often each version won, lost, or tied.
- Category Analysis: See which categories each version excels or struggles with.
Pairwise analysis helps you understand why one version is better—not just that it is. You might discover that Version B wins on complex cases but loses on simple ones, informing a more nuanced decision.
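A minimal sketch of such a per-case tally, assuming paired per-case scores and a small margin below which a case counts as a tie (the names and margin are illustrative):

```python
from collections import Counter

def pairwise_tally(scores_a, scores_b, tie_margin=0.05):
    """Count how often each version wins, loses, or ties on individual test cases."""
    tally = Counter()
    for score_a, score_b in zip(scores_a, scores_b):
        delta = score_b - score_a
        if abs(delta) <= tie_margin:
            tally["tie"] += 1
        elif delta > 0:
            tally["versionB_wins"] += 1
        else:
            tally["versionA_wins"] += 1
    return dict(tally)

print(pairwise_tally([4.0, 3.5, 4.2], [4.4, 3.5, 3.9]))
# {'versionB_wins': 1, 'tie': 1, 'versionA_wins': 1}
```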
Live A/B Testing#
For production prompts, you can run live A/B tests that split real traffic between versions (a sketch of deterministic traffic assignment follows the table below):
1. Configure Traffic Split
2. Set Duration or Sample Size
3. Monitor in Real-Time
4. Auto-Resolution
| Traffic Split | Risk Level | Use Case |
|---|---|---|
| 50/50 | Medium | Standard test when both versions are production-ready |
| 90/10 | Low | Testing new version cautiously while maintaining stability |
| 95/5 | Very Low | Gradual rollout with early signal detection |
| 0/100 (Shadow) | None | Run new version in shadow mode without affecting users |
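A minimal sketch of deterministic traffic assignment: hashing a stable identifier keeps each user on the same version for the duration of the test (the identifier, labels, and split value are assumptions, not the platform's API):

```python
import hashlib

def assign_version(user_id: str, version_b_share: float = 0.10) -> str:
    """Deterministically route a fixed share of traffic to Version B."""
    # Hash the ID into a stable bucket in [0, 1); the same user always lands in the same bucket
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return "versionB" if bucket < version_b_share else "versionA"

# Example: with a 90/10 split, roughly 10% of users see the challenger
print(assign_version("user_42"))
```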
Production Considerations
- Ensure both versions meet minimum quality standards
- Set up monitoring for error rates and user complaints
- Have a rollback plan if issues arise
- Consider starting with a small traffic percentage
Best Practices#
Follow these guidelines for reliable, actionable A/B testing:
- Test One Change at a Time
- Define Success Upfront
- Use Representative Data
- Ensure Adequate Sample Size
- Consider Practical Significance
- Document Everything
Common Pitfalls#
Avoid these common mistakes that undermine A/B testing validity:
| Pitfall | Problem | Solution |
|---|---|---|
| Peeking at Results | Early stopping inflates false positive rate | Pre-register sample size; use sequential testing if you must peek |
| Multiple Comparisons | Testing many metrics increases false positives | Declare primary metric upfront; apply Bonferroni correction if testing multiple (see the sketch after this table) |
| Imbalanced Dataset | Test data doesn't match production distribution | Use stratified sampling or real production data |
| Ignoring Variance | High variance means less reliable results | Increase sample size or reduce temperature for consistency |
| Post-Hoc Rationalization | Explaining away unexpected results | Trust the data; re-run if you suspect issues |
| Small Effect Obsession | Optimizing for tiny improvements | Focus on changes that matter practically |
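For the multiple-comparisons pitfall, a minimal sketch of a Bonferroni adjustment across several metrics (the p-values are placeholders):

```python
def bonferroni_significant(p_values: dict, alpha: float = 0.05) -> dict:
    """Flag each metric as significant only against alpha divided by the number of tests."""
    adjusted_alpha = alpha / len(p_values)
    return {metric: p < adjusted_alpha for metric, p in p_values.items()}

# Three metrics tested, so each must clear 0.05 / 3 ≈ 0.0167
print(bonferroni_significant({"completeness": 0.012, "coherence": 0.04, "overall": 0.20}))
# {'completeness': True, 'coherence': False, 'overall': False}
```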
When Results Surprise You
If the outcome contradicts your expectations, resist the urge to explain it away. Check your dataset and configuration for issues; if everything looks right, trust the data or re-run the test with a larger sample.
A/B testing is one of the most powerful tools for data-driven prompt optimization. Combined with quality metrics, it enables confident, systematic improvement of your prompts.