
A/B Testing

Compare prompt versions with statistical rigor to make confident, data-driven decisions about which performs better.

What is A/B Testing?#

A/B testing (also called split testing) compares two prompt versions by running them against the same inputs and statistically analyzing which performs better. Unlike casual comparison, A/B testing provides mathematical confidence that differences are real, not due to random variation.

Controlled Comparison

Both versions run on identical inputs for fair, unbiased comparison.

Statistical Analysis

Get p-values, confidence intervals, and effect sizes for your results.

Clear Winner

Know with 95%+ confidence which version performs better.

Quantify Improvement

Measure exactly how much better one version is (e.g., 12% improvement).

When to Use A/B Testing#

A/B testing is most valuable when you need to make a decision between two approaches and want confidence in your choice:

| Scenario | Use A/B Testing? | Why |
| --- | --- | --- |
| Choosing between two prompt approaches | Yes | Get statistical confidence in which is better |
| Validating a prompt change improves quality | Yes | Confirm improvement is real, not noise |
| Comparing different AI models | Yes | Understand which model works best for your use case |
| Quick feedback during development | No | Use Playground for fast iteration |
| General quality check before deployment | No | Use batch evaluation for overall quality |
| Regression testing against baseline | Maybe | A/B test for close calls, regression for gates |

Setting Up a Test#

Follow these steps to run a rigorous A/B test:

1. Define Your Hypothesis

Before testing, clearly state what you're testing and what you expect. For example: "Version B with explicit step-by-step instructions will score higher on completeness than Version A with general instructions."

2. Select Version A (Control)

Choose your baseline version, usually the current production prompt or the simpler of two approaches. This is your control against which you'll measure improvement.

3. Select Version B (Challenger)

Choose the version you want to test against the baseline. This should differ from Version A in a specific, documented way.

4. Choose Test Dataset

Select a test dataset with representative inputs. The dataset should be large enough for statistical significance (minimum 30 cases, ideally 50+).

5. Configure Success Metrics

Decide which metrics matter most before seeing results. Define your primary metric (the one that determines the winner) and any secondary metrics to monitor.

6. Run the Test

Click Start A/B Test. Both versions run against every test case. Results are collected and analyzed automatically.

A/B Test Configuration
json
{
  "testName": "Step-by-step vs. General Instructions",
  "hypothesis": "Explicit steps will improve completeness",
  "versionA": {
    "promptId": "prompt_123",
    "version": 5,
    "label": "General instructions (control)"
  },
  "versionB": {
    "promptId": "prompt_123",
    "version": 6,
    "label": "Step-by-step instructions (challenger)"
  },
  "datasetId": "dataset_456",
  "primaryMetric": "completeness",
  "secondaryMetrics": ["coherence", "overall"],
  "confidenceLevel": 0.95,
  "minimumEffectSize": 0.1
}
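
The paired run behind a test like this is conceptually simple: both versions are executed on every case in the dataset, so each case contributes a pair of scores. The sketch below illustrates that loop in Python; run_version and score_output are hypothetical placeholders for your own model-call and metric functions, not a PromptReports API.

Paired Run Sketch (illustrative)
python
def run_paired_ab_test(test_cases, run_version, score_output,
                       version_a="versionA", version_b="versionB"):
    """Run both prompt versions on every test case and collect paired scores.

    run_version(version, text) and score_output(output, case) are placeholders
    for your own model-call and metric functions (assumptions, not a real API).
    """
    paired_scores = []
    for case in test_cases:
        out_a = run_version(version_a, case["input"])
        out_b = run_version(version_b, case["input"])
        paired_scores.append({
            "caseId": case["id"],
            "scoreA": score_output(out_a, case),  # e.g., completeness on a 1-5 scale
            "scoreB": score_output(out_b, case),
        })
    return paired_scores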

Sample Size & Power#

Sample size determines how small a difference you can reliably detect. With too few test cases, you might miss real improvements or falsely conclude there's no difference.

| Sample Size | Detectable Effect | Statistical Power | Use Case |
| --- | --- | --- | --- |
| 30 cases | Large (0.5+ points) | ~80% | Quick checks, obvious differences |
| 50 cases | Medium (0.3+ points) | ~80% | Standard A/B tests |
| 100 cases | Small (0.2+ points) | ~80% | Precise comparison, subtle changes |
| 200+ cases | Very small (0.1+ points) | ~90% | High-stakes decisions, fine-tuning |
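
If you want to estimate a sample size for your own target, a paired power calculation with statsmodels looks roughly like the sketch below. This is a generic calculation, not the platform's power calculator, and effect_size here is a standardized Cohen's d (mean per-case difference divided by the standard deviation of the differences), which maps onto score "points" only through the variance of your scores.

Sample Size Sketch (illustrative)
python
from statsmodels.stats.power import TTestPower

# Paired design: both versions score the same cases, so power is computed
# on the per-case score differences rather than on two independent groups.
n_cases = TTestPower().solve_power(
    effect_size=0.3,          # standardized effect you want to detect (assumed value)
    alpha=0.05,               # significance threshold
    power=0.8,                # chance of detecting a real effect of that size
    alternative="two-sided",
)
print(f"Need roughly {n_cases:.0f} paired test cases")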

Statistical Significance#

A/B testing uses statistical methods to determine if observed differences are real or just random noise:

| Concept | Description | Target Value |
| --- | --- | --- |
| p-value | Probability of seeing a difference at least this large if there were no true difference | < 0.05 (5%) |
| Confidence Interval | Range likely containing the true difference | Narrow range, not crossing zero |
| Effect Size | Magnitude of the difference (in score points) | Depends on your requirements |
| Statistical Power | Ability to detect real differences | > 0.8 (80%) |

Example Statistical Output
json
{
  "comparison": {
    "versionA_mean": 4.12,
    "versionB_mean": 4.38,
    "difference": 0.26,
    "percentImprovement": 6.3
  },
  "statistics": {
    "pValue": 0.023,
    "confidenceInterval": [0.04, 0.48],
    "effectSize": 0.34,
    "statisticalPower": 0.85
  },
  "conclusion": "statistically_significant",
  "winner": "versionB",
  "recommendation": "Version B shows a statistically significant improvement of 6.3% in overall quality. The difference is unlikely to be due to chance (p = 0.023). Recommend deploying Version B."
}
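
Statistics like those in this report can be reproduced from the paired per-case scores with standard tools. The sketch below uses numpy and scipy to compute the same kinds of fields (paired t-test p-value, confidence interval on the mean difference, and a standardized effect size); it mirrors the report conceptually rather than reproducing PromptReports' exact implementation.

Statistics Sketch (illustrative)
python
import numpy as np
from scipy import stats

def compare_versions(scores_a, scores_b, alpha=0.05):
    """Paired comparison of per-case scores; positive differences favor version B."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    diff = b - a
    t_stat, p_value = stats.ttest_rel(b, a)             # paired t-test
    se = stats.sem(diff)                                 # std. error of the mean difference
    ci_low, ci_high = stats.t.interval(1 - alpha, df=len(diff) - 1,
                                       loc=diff.mean(), scale=se)
    effect_size = diff.mean() / diff.std(ddof=1)         # Cohen's d for paired data
    winner = ("versionB" if diff.mean() > 0 else "versionA") if p_value < alpha else None
    return {
        "versionA_mean": round(a.mean(), 2),
        "versionB_mean": round(b.mean(), 2),
        "difference": round(diff.mean(), 2),
        "percentImprovement": round(100 * diff.mean() / a.mean(), 1),
        "pValue": round(p_value, 3),
        "confidenceInterval": [round(ci_low, 2), round(ci_high, 2)],
        "effectSize": round(effect_size, 2),
        "winner": winner,
    }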

Interpreting Results#

After the test completes, you'll see one of these outcomes:

Clear Winner (p < 0.05)

One version is statistically significantly better. Safe to deploy the winner with high confidence.

No Significant Difference

Versions perform similarly. Choose based on other factors like simplicity or cost.

Marginal Significance (p < 0.10)

Suggestive but not conclusive. Consider running with more test cases.

Inconclusive (Low Power)

Not enough data to detect differences. Run with a larger dataset.

The results dashboard shows:

  • Winner Declaration: Which version won (if any) with confidence level
  • Score Comparison: Side-by-side metrics with improvement percentages
  • Statistical Details: p-values, confidence intervals, effect sizes
  • Distribution Visualization: Overlapping histograms showing score distributions
  • Case-by-Case Breakdown: How each version performed on specific inputs
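
If it helps to see the decision logic spelled out, the outcome buckets above can be expressed as a simple rule of thumb. The function below is an illustrative sketch using the conventional thresholds quoted on this page, not the dashboard's actual logic.

Outcome Classification Sketch (illustrative)
python
def classify_outcome(p_value, statistical_power, alpha=0.05):
    """Map a test's p-value and power onto the outcome buckets above (rule of thumb)."""
    if statistical_power < 0.8:
        return "inconclusive_low_power"        # not enough data; run a larger dataset
    if p_value < alpha:
        return "clear_winner"                  # deploy the winning version with confidence
    if p_value < 0.10:
        return "marginal_significance"         # suggestive; consider more test cases
    return "no_significant_difference"         # choose on simplicity, cost, or latency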

Pairwise Comparison#

While A/B testing provides aggregate statistics, pairwise comparison lets you examine how versions perform on each individual test case:

Side-by-Side Outputs

View both responses for each input to understand qualitative differences.

Per-Case Winner

See which version won each case and by how much.

Win/Loss/Tie Counts

Overall tally of how often each version won, lost, or tied.

Category Analysis

See which categories each version excels or struggles with.

Pairwise analysis helps you understand why one version is better—not just that it is. You might discover that Version B wins on complex cases but loses on simple ones, informing a more nuanced decision.
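
Given paired per-case scores (for example, from the run-loop sketch earlier), the win/loss/tie tally is straightforward to compute. The sketch below is illustrative; the margin parameter, which treats near-identical scores as ties, is an assumption you would tune to your score scale.

Pairwise Tally Sketch (illustrative)
python
from collections import Counter

def pairwise_tally(paired_scores, margin=0.0):
    """Count per-case wins, losses, and ties from paired scores."""
    tally = Counter(versionA_wins=0, versionB_wins=0, tie=0)
    for case in paired_scores:
        delta = case["scoreB"] - case["scoreA"]
        if abs(delta) <= margin:
            tally["tie"] += 1
        elif delta > 0:
            tally["versionB_wins"] += 1
        else:
            tally["versionA_wins"] += 1
    return dict(tally)  # e.g. {"versionA_wins": ..., "versionB_wins": ..., "tie": ...}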

Live A/B Testing#

For production prompts, you can run live A/B tests that split real traffic between versions:

1. Configure Traffic Split

Define what percentage of traffic goes to each version (e.g., 50/50 or 90/10 for cautious rollouts). A minimal routing sketch follows the table below.

2. Set Duration or Sample Size

Specify when the test should end: after a time period, reaching a sample size, or achieving statistical significance.

3. Monitor in Real-Time

Watch results accumulate in the dashboard. Pause the test if you see significant issues.

4. Auto-Resolution

Optionally, let PromptReports automatically promote the winner when significance is reached.

| Traffic Split | Risk Level | Use Case |
| --- | --- | --- |
| 50/50 | Medium | Standard test when both versions are production-ready |
| 90/10 | Low | Testing new version cautiously while maintaining stability |
| 95/5 | Very Low | Gradual rollout with early signal detection |
| 0/100 (Shadow) | None | Run new version in shadow mode without affecting users |
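
The percentages above only work if each user is routed consistently; a common way to do that is to hash a stable user or session id into a bucket. The snippet below is a generic sketch of that pattern, not PromptReports' routing implementation.

Traffic Split Sketch (illustrative)
python
import hashlib

def assign_version(user_id: str, percent_to_b: int = 10) -> str:
    """Deterministically route percent_to_b% of users to the challenger version."""
    bucket = int(hashlib.sha256(user_id.encode("utf-8")).hexdigest(), 16) % 100
    return "versionB" if bucket < percent_to_b else "versionA"

# Example: a cautious 90/10 split keeps each user's assignment sticky across requests.
# assign_version("user-42", percent_to_b=10)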

Best Practices#

Follow these guidelines for reliable, actionable A/B testing:

1. Test One Change at a Time

If Version B differs in multiple ways from Version A, you won't know which change caused the improvement. Isolate variables for clear learnings.

2. Define Success Upfront

Before running the test, document your primary metric, the minimum effect size you care about, and what decision you'll make for each possible outcome (see the sketch after this list).

3. Use Representative Data

Your test dataset should reflect real usage. Skewed test data leads to conclusions that don't hold in production.

4. Ensure Adequate Sample Size

Use the power calculator to determine how many test cases you need. Underpowered tests waste effort and give inconclusive results.

5. Consider Practical Significance

A 0.05-point improvement might be statistically significant but not practically meaningful. Define what improvement is worth deploying.

6. Document Everything

Record your hypothesis, test configuration, results, and decision. Future you (and teammates) will appreciate the context.
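
As a lightweight way to apply practices 2 and 6, you can write the plan down as a structured record before the test runs and file the outcome against it afterwards. The sketch below is purely illustrative; the field names are assumptions, not a PromptReports schema.

Pre-Registered Test Plan Sketch (illustrative)
python
from dataclasses import dataclass, field

@dataclass
class TestPlan:
    """A pre-registered plan: decisions are written down before results exist."""
    hypothesis: str
    primary_metric: str
    minimum_effect_size: float                  # smallest improvement worth deploying
    planned_sample_size: int                    # fixed up front to avoid peeking
    secondary_metrics: list = field(default_factory=list)
    decision_rules: dict = field(default_factory=dict)

plan = TestPlan(
    hypothesis="Explicit steps will improve completeness",
    primary_metric="completeness",
    minimum_effect_size=0.1,
    planned_sample_size=50,
    secondary_metrics=["coherence", "overall"],
    decision_rules={
        "significant_and_meaningful": "deploy Version B",
        "significant_but_below_minimum": "keep Version A (simpler)",
        "not_significant": "keep Version A",
    },
)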

Common Pitfalls#

Avoid these common mistakes that undermine A/B testing validity:

| Pitfall | Problem | Solution |
| --- | --- | --- |
| Peeking at Results | Early stopping inflates false positive rate | Pre-register sample size; use sequential testing if you must peek |
| Multiple Comparisons | Testing many metrics increases false positives | Declare primary metric upfront; apply Bonferroni correction if testing multiple |
| Imbalanced Dataset | Test data doesn't match production distribution | Use stratified sampling or real production data |
| Ignoring Variance | High variance means less reliable results | Increase sample size or reduce temperature for consistency |
| Post-Hoc Rationalization | Explaining away unexpected results | Trust the data; re-run if you suspect issues |
| Small Effect Obsession | Optimizing for tiny improvements | Focus on changes that matter practically |
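
The Bonferroni correction mentioned above is simple: divide your significance threshold by the number of metrics you test, or use a library helper. The sketch below uses statsmodels; the p-values are placeholder numbers for illustration.

Bonferroni Correction Sketch (illustrative)
python
from statsmodels.stats.multitest import multipletests

# Placeholder p-values for the primary and secondary metrics of a single test.
p_values = {"completeness": 0.02, "coherence": 0.04, "overall": 0.03}

reject, corrected, _, _ = multipletests(list(p_values.values()),
                                        alpha=0.05, method="bonferroni")
for metric, significant, p_corr in zip(p_values, reject, corrected):
    print(f"{metric}: corrected p = {p_corr:.3f}, "
          f"{'significant' if significant else 'not significant'} at alpha = 0.05")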

A/B testing is one of the most powerful tools for data-driven prompt optimization. Combined with quality metrics, it enables confident, systematic improvement of your prompts.