Test Datasets

Create and manage test datasets for systematic prompt evaluation, regression testing, and quality assurance in PromptReports.

What are Test Datasets?

Test datasets are collections of input-output pairs used to systematically evaluate prompt quality. They form the foundation of professional prompt testing in PromptReports, enabling reproducible evaluations, regression detection, and data-driven prompt optimization.

A well-designed test dataset provides:

Structured Testing

Run prompts against consistent, predefined inputs for reliable comparison across versions.

Regression Prevention

Detect quality drops before they reach production by testing against established baselines.

Categorized Analysis

Filter and analyze results by test case categories, difficulty levels, or custom tags.

Batch Processing

Evaluate hundreds or thousands of cases in a single automated run.

Creating Datasets

PromptReports offers multiple ways to create test datasets, each suited to different workflows:

1. Manual Entry

Add test cases one by one through the UI. Navigate to your prompt, select the Evaluation tab, click New Dataset, then add rows individually. Best for small, carefully curated datasets where each case is deliberately chosen.

2. CSV Import

Upload test cases from spreadsheets or external sources. Simply drag and drop a CSV file or click to upload. Column names are automatically mapped to prompt variables. Best for bulk data migration or when data exists in other systems.

3. From Execution History

Convert past Playground executions or production runs into test cases. PromptReports can sample from your execution history, including both inputs and actual outputs. Great for capturing real-world examples and building regression test suites.

4. AI-Generated

Use AI to generate synthetic test cases based on your prompt description and requirements. Specify the types of inputs you want (edge cases, typical usage, adversarial inputs) and let the system create diverse, representative test data automatically.

5. Public Templates

Start from community-created dataset templates and customize for your needs. Browse templates by category (customer support, content generation, analysis, etc.) and copy them to your workspace.

Dataset Structure

Each dataset consists of rows (test cases) with the following fields:

| Field | Required | Description |
| --- | --- | --- |
| Input Variables | Yes | JSON object with values for each prompt variable (must match your template) |
| Expected Output | No | Reference output for automated comparison and scoring |
| Category | No | Classification label for filtered analysis (e.g., "complaints", "inquiries") |
| Tags | No | Additional labels for fine-grained filtering (e.g., "edge-case", "high-priority") |
| Difficulty | No | Easy, Medium, or Hard classification for weighted analysis |
| Priority | No | Numeric weight (1-5) for importance in aggregate scoring |
| Notes | No | Free-form text for context or special handling instructions |
Example Dataset Row (JSON):

```json
{
  "inputVars": {
    "customer_message": "I need to return this product, it's been 3 weeks",
    "customer_type": "premium",
    "product_category": "electronics",
    "order_id": "ORD-2024-98765"
  },
  "expectedOutput": "I understand you'd like to return your electronics item. As a premium customer, you have a 90-day return window, so you're still well within the return period. I'll process this return for order ORD-2024-98765 right away...",
  "category": "returns",
  "tags": ["premium", "within-policy", "electronics"],
  "difficulty": "easy",
  "priority": 3,
  "notes": "Standard return request, should reference return policy timeframe"
}
```

Importing Data

Import datasets from external files for bulk test case creation. PromptReports supports multiple formats:

CSV Files

Standard comma-separated values with headers. Most common format.

JSON

Array of objects with key-value pairs. Best for complex nested data; a sample appears after the CSV example below.

Excel

Upload .xlsx files directly with automatic sheet detection.

Example CSV Format:

```csv
customer_message,customer_type,product_category,expected_output,category,difficulty
"I need a refund",standard,clothing,"We can process your refund within 3-5 business days...",refunds,easy
"When will my order arrive?",premium,electronics,"Let me check your order status. As a premium member...",shipping,easy
"This item is broken and I want to speak to a manager",standard,furniture,"I'm so sorry about the damage. I'll connect you with...",complaints,hard
"Can I change my delivery address?",standard,clothing,"Yes, I can update your delivery address...",modifications,medium
```

Column names should match your prompt variables. The system will automatically map them during import. You can also manually adjust mappings if column names don't exactly match your variable names.
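For JSON imports, the file is an array of objects whose keys play the role of the CSV columns above. As a reference point, the first two CSV rows could be expressed like this (the keys shown are assumed to match the column names; adjust them if your variable names or mappings differ):

```json
[
  {
    "customer_message": "I need a refund",
    "customer_type": "standard",
    "product_category": "clothing",
    "expected_output": "We can process your refund within 3-5 business days...",
    "category": "refunds",
    "difficulty": "easy"
  },
  {
    "customer_message": "When will my order arrive?",
    "customer_type": "premium",
    "product_category": "electronics",
    "expected_output": "Let me check your order status. As a premium member...",
    "category": "shipping",
    "difficulty": "easy"
  }
]
```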

AI-Generated Datasets

PromptReports can automatically generate diverse test cases using AI. This is especially useful when you need to quickly build out test coverage or explore edge cases you might not have considered.

1. Describe Your Use Case

Provide a description of what your prompt does and the types of inputs it handles.

2. Specify Categories

Define the categories of test cases you want: typical usage, edge cases, adversarial inputs, etc.

3. Set Distribution

Choose how many test cases to generate and the distribution across categories.

4. Review and Curate

Review generated cases, remove irrelevant ones, and edit as needed before saving.
AI Generation Request:

```json
{
  "promptDescription": "Customer support chatbot for e-commerce returns and refunds",
  "categories": {
    "typical": 20,
    "edge_cases": 10,
    "adversarial": 5,
    "multi_language": 5
  },
  "constraints": [
    "Include various customer sentiment levels",
    "Mix premium and standard customers",
    "Cover all product categories",
    "Include some requests that should be declined"
  ]
}
```

From Execution History

Build test datasets from real production data by sampling from your prompt's execution history. This ensures your test cases reflect actual usage patterns.

Real-World Coverage

Test cases based on actual user inputs, not hypothetical scenarios.

Smart Sampling

Automatically sample diverse inputs across categories and time periods.

Capture Good Examples

Mark high-quality outputs as expected results for future testing.

Learn from Failures

Include past failures as test cases to prevent regression.

To create a dataset from history:

  • Navigate to your prompt's Evaluation tab
  • Click New Dataset and select From History
  • Choose a date range and sampling strategy (random, stratified by category, or focused on specific outcomes), as sketched after this list
  • Review the selected executions and optionally mark outputs as expected results
  • Save the dataset with a descriptive name
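The exact interface may vary, but conceptually those choices boil down to a small selection spec. The sketch below is purely illustrative; the field names are hypothetical and not a documented PromptReports schema:

```jsonc
{
  // Hypothetical sketch of a "From History" selection; all field names are illustrative.
  "source": "execution_history",
  "dateRange": { "from": "2024-01-01", "to": "2024-03-31" },
  "samplingStrategy": "stratified_by_category",  // alternatives: "random", "focused_on_outcomes"
  "sampleSize": 200,
  "markOutputsAsExpected": true                  // keep high-quality outputs as expected results
}
```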

Dataset Versioning

Like prompts, datasets are versioned to maintain reproducibility and enable historical comparisons. When you modify a dataset, PromptReports creates a new version while preserving the original.

| Action | Creates New Version | Notes |
| --- | --- | --- |
| Add test cases | Yes | New rows appear in the latest version only |
| Edit existing cases | Yes | Modified rows are tracked with change history |
| Delete cases | Yes | Deleted rows are soft-deleted and retained in version history |
| Reorder cases | No | Order changes don't affect reproducibility |
| Update tags/metadata | No | Metadata changes are applied immediately across versions |
| Change settings | No | Configuration changes apply to all future evaluations |

Version History

View all versions of a dataset and see exactly what changed between them.

Pin to Version

Lock evaluations to a specific dataset version for reproducible results.
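To make the idea concrete, a pinned evaluation conceptually records both the dataset and the exact version it should run against. The fields below are hypothetical and shown only to illustrate the principle:

```jsonc
{
  // Hypothetical evaluation configuration; field names are illustrative only.
  "promptId": "customer-support-returns",
  "promptVersion": 12,
  "datasetId": "returns-regression-suite",
  "datasetVersion": 4,   // pinned: every re-run uses the same rows from version 4
  "metrics": ["accuracy", "tone"]
}
```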

Managing Datasets

PromptReports provides comprehensive tools for organizing and maintaining your test datasets:

Filter & Search

Find specific test cases using tags, categories, or full-text search across inputs and outputs.

Clone Datasets

Duplicate existing datasets as a starting point for variations or experiments.

Bulk Operations

Add, remove, tag, or update multiple test cases at once with batch actions.

Dataset Settings

Configure default metrics, scoring weights, and evaluation parameters per dataset. A sketch appears at the end of this section.

Export Datasets

Download datasets as CSV or JSON for backup, sharing, or external analysis.

Sync with Source

Re-import updated data from connected external sources automatically.

Each dataset can be associated with one or more prompts. You can also create "shared" datasets that apply across multiple prompts—useful when testing prompts that should handle similar types of inputs consistently.
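As a loose sketch of what per-dataset settings can capture (the field names and values here are hypothetical, not a documented schema), default metrics and weights might look like this:

```jsonc
{
  // Hypothetical per-dataset settings; field names and values are illustrative only.
  "defaultMetrics": ["relevance", "accuracy", "tone"],
  "scoringWeights": {
    "usePriority": true,                              // weight aggregate scores by each row's 1-5 priority
    "difficulty": { "easy": 1.0, "medium": 1.5, "hard": 2.0 }
  },
  "passThreshold": 0.8                                // minimum aggregate score for a passing run
}
```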

Best Practices

Follow these guidelines for effective test datasets:

1. Cover the Full Spectrum

Include easy, medium, and hard cases. Don't just test happy paths; include edge cases, boundary conditions, and potentially problematic inputs that might cause unexpected behavior.

2. Use Representative Data

Your test data should reflect real-world usage patterns. If 60% of production queries are simple inquiries, your dataset should have similar proportions. Use execution history to ensure representativeness.

3. Include Adversarial Cases

Add test cases specifically designed to break your prompt: ambiguous inputs, contradictory requests, attempts to bypass instructions, and inputs in unexpected formats.

4. Categorize Thoroughly

Use categories and tags liberally to enable filtered analysis. Common dimensions: category, difficulty, source, language, customer type, and domain-specific labels.

5. Document Expected Outputs

When possible, include expected outputs or at least expected characteristics. This enables automated scoring and makes it clear what "correct" looks like for each case.

6. Maintain Actively

Review and update datasets regularly. Remove outdated cases, update expected outputs when requirements change, and add new cases when you discover failure modes.

With well-structured datasets in place, you're ready to start running evaluations to measure and improve prompt quality systematically.