Interpreting eval results

Analyze Eval results to identify the best-performing model, prompt, or configuration for your use case.

Accessing results

Go to Evals in the sidebar
Find your Evaluation
Click to open the results page

Results overview

The results page shows high-level metrics for each variation:

Variation	Success Rate	Avg Latency	Avg Cost
claude-sonnet-4-5	100%	2.1s	$0.015
gpt-5-mini	95%	0.9s	$0.003
gemini-3-flash	100%	1.2s	$0.008

Key metrics

Success rate

Percentage of executions that completed without errors.

100% is ideal
Investigate failed cases before deploying that variation

Latency

Average response time per execution.

Faster models improve user experience
Balance speed against output quality
Some models are consistently faster than others across test cases

Cost

Average cost per execution based on token usage.

Important for high-volume use cases
Cheaper models may be sufficient for some tasks
Multiply by expected monthly volume to estimate budget
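
The budget estimate is simple arithmetic: average cost per execution times expected monthly volume. The sketch below uses the average costs from the example table above. The volume of 100,000 executions per month is an assumed figure for illustration, and `estimate_monthly_budget` is a hypothetical helper, not part of the product.

```python
# Average cost per execution (USD), taken from the example results table.
avg_cost_per_execution = {
    "claude-sonnet-4-5": 0.015,
    "gpt-5-mini": 0.003,
    "gemini-3-flash": 0.008,
}

# Assumption: expected executions per month. Replace with your own forecast.
expected_monthly_executions = 100_000


def estimate_monthly_budget(avg_cost: float, monthly_volume: int) -> float:
    """Return the estimated monthly spend in USD, rounded to cents."""
    return round(avg_cost * monthly_volume, 2)


monthly_budget = {
    name: estimate_monthly_budget(cost, expected_monthly_executions)
    for name, cost in avg_cost_per_execution.items()
}

for name, budget in monthly_budget.items():
    print(f"{name}: ${budget:,.2f} per month")
```

At this assumed volume, the gap between the most and least expensive variation is $1,200 per month, which is the kind of difference worth weighing against success rate and latency.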

Token usage

Prompt and completion token counts per variation. Use these counts to identify which configurations consume more input or output tokens.

Detailed comparison

Click Compare to see side-by-side outputs for each test case:

Select a test case
View outputs from all variations
See latency and cost for each
Identify patterns in which variation performs better

Winner indicators highlight which variation had the highest success rate, lowest cost, and fastest execution.
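
To reproduce the same comparison outside the UI, the following sketch picks a winner per metric from the example table above. The `VariationResult` type and `find_winners` function are hypothetical names used for illustration. How the product itself breaks ties is not documented here, so this sketch simply reports every variation that shares the top success rate.

```python
from dataclasses import dataclass


@dataclass
class VariationResult:
    name: str
    success_rate: float   # percent of executions without errors
    avg_latency_s: float  # average seconds per execution
    avg_cost_usd: float   # average dollars per execution


# Values copied from the example results table.
results = [
    VariationResult("claude-sonnet-4-5", 100.0, 2.1, 0.015),
    VariationResult("gpt-5-mini", 95.0, 0.9, 0.003),
    VariationResult("gemini-3-flash", 100.0, 1.2, 0.008),
]


def find_winners(results: list[VariationResult]) -> dict:
    """Return the best variation(s) for each headline metric."""
    best_success = max(r.success_rate for r in results)
    return {
        # Success rates often tie, so keep every variation at the top value.
        "highest_success_rate": [
            r.name for r in results if r.success_rate == best_success
        ],
        "lowest_cost": min(results, key=lambda r: r.avg_cost_usd).name,
        "fastest": min(results, key=lambda r: r.avg_latency_s).name,
    }


winners = find_winners(results)
print(winners)
```

In the example data no single variation wins all three metrics, which is why the interpretation guidance that follows matters.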

Interpreting results

Clear winner

One variation has the best success rate with acceptable latency and cost:

Action: Deploy that variation

Cost tradeoff

A higher-quality model costs significantly more than a cheaper alternative:

Action: Decide by use case. Route premium features to the expensive model and basic features to the cheaper one.

Inconsistent performance

A variation scores well on some cases but poorly on others:

Action: Investigate which types of inputs cause poor performance. Refine the prompt or add conditional logic.
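
One way to find the problem inputs is to tag each test case with a category and compute a pass rate per category. The sketch below assumes you have per-case results available as `(variation, category, passed)` tuples. The records, category labels, and the `pass_rate_by_category` helper are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical per-test-case records: (variation, input category, passed).
case_results = [
    ("variation-a", "short question", True),
    ("variation-a", "short question", True),
    ("variation-a", "long document", False),
    ("variation-a", "long document", False),
    ("variation-a", "non-English", True),
    ("variation-a", "long document", True),
]


def pass_rate_by_category(records, variation: str) -> dict[str, float]:
    """Return the fraction of passing cases per input category."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for name, category, passed in records:
        if name != variation:
            continue
        totals[category] += 1
        passes[category] += int(passed)
    return {category: passes[category] / totals[category] for category in totals}


rates = pass_rate_by_category(case_results, "variation-a")
weakest = min(rates, key=rates.get)

for category, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{category}: {rate:.0%} passed")
print(f"Weakest category: {weakest}")
```

A category with a clearly lower pass rate is a candidate for a prompt refinement or for conditional logic that handles that input type separately.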

All variations similar

Metrics are close across all variations:

Action: Choose the cheapest or fastest option

Keyword analysis

Use keyword analysis across step outputs to spot language patterns. Search for specific terms to see which variations use particular phrasing more or less often. This helps compare tone, terminology, and completeness across configurations.
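
If you export step outputs, a similar count can be approximated offline. This is a minimal sketch, not the product's implementation: the sample outputs, the keyword list, and the `keyword_counts` helper are hypothetical, and matching is case-insensitive on whole words only.

```python
import re
from collections import Counter

# Hypothetical step outputs, grouped by variation.
outputs_by_variation = {
    "variation-a": [
        "We apologize for the delay. Your refund is on its way.",
        "Sorry, we cannot process a refund.",
    ],
    "variation-b": [
        "Your refund has been issued.",
        "Refund issued. Anything else?",
    ],
}

# Terms to compare across variations.
keywords = ["refund", "sorry", "apologize"]


def keyword_counts(outputs: list[str], keywords: list[str]) -> Counter:
    """Count whole-word, case-insensitive occurrences of each keyword."""
    counts = Counter()
    for text in outputs:
        word_counts = Counter(re.findall(r"[a-z']+", text.lower()))
        for keyword in keywords:
            counts[keyword] += word_counts[keyword]
    return counts


counts_by_variation = {
    name: keyword_counts(outputs, keywords)
    for name, outputs in outputs_by_variation.items()
}

for name, counts in counts_by_variation.items():
    print(name, dict(counts))
```

In this made-up sample both variations mention the refund equally often, but only one uses apologetic language, which is the kind of tone difference keyword analysis is meant to surface.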

Common patterns

Cheaper model is sufficient

gpt-5-mini achieves a success rate comparable to claude-sonnet-4-5 at a fraction of the cost:

Decision: Use gpt-5-mini for most requests, claude-sonnet-4-5 only for complex cases

Prompt matters more than model

The same model with different prompts shows a large difference in output quality:

Decision: Focus on prompt engineering before switching models

Temperature affects consistency

Lower temperature produces more deterministic outputs with less variance:

Decision: Use low temperature for tasks requiring consistency (classification, extraction)
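
To put a number on consistency, one simple option is to run the same input several times and measure how often the most common output appears. The sketch below assumes a classification task with short label outputs. The two sample runs are invented to illustrate the calculation and are not measured results, and `consistency` is a hypothetical helper.

```python
from collections import Counter


def consistency(outputs: list[str]) -> float:
    """Return the share of outputs equal to the most common output.

    1.0 means every run produced the same output.
    """
    if not outputs:
        raise ValueError("outputs must not be empty")
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / len(outputs)


# Invented labels from five repeated runs on the same input.
low_temperature_runs = ["billing", "billing", "billing", "billing", "billing"]
high_temperature_runs = ["billing", "account", "billing", "refund", "billing"]

low_score = consistency(low_temperature_runs)
high_score = consistency(high_temperature_runs)

print(f"Low temperature consistency: {low_score:.0%}")
print(f"High temperature consistency: {high_score:.0%}")
```

Exact-match comparison suits classification and extraction. For free-form text, where wording naturally varies between runs, it would understate consistency.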

Acting on results

After identifying the best configuration:

Update your Flow with the winning model and settings
Monitor production logs to confirm improvement

Next steps

Running an evaluation to execute Evals
What are evals? for conceptual background
What are Logs? to monitor production performance