Analyze Eval results to identify the best-performing model, prompt, or configuration for your use case.
Accessing results
- Go to Evals in the sidebar
- Find your Evaluation
- Click it to open the results page
Results overview
The results page shows high-level metrics for each variation.
Key metrics
Success rate
Percentage of executions that completed without errors.
- 100% is ideal
- Investigate failed cases before deploying that variation
Latency
Average response time per execution.
- Faster models improve user experience
- Balance speed against output quality
- Some models are consistently faster than others, so check whether the gap holds across test cases
Cost
Average cost per execution based on token usage.
- Important for high-volume use cases
- Cheaper models may be sufficient for some tasks
- Multiply by expected monthly volume to estimate budget
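The budget estimate in the last bullet is a single multiplication. The following is a minimal sketch of that arithmetic; the function name and the figures in the usage lines are hypothetical placeholders, not values taken from any Eval.

```python
def estimate_monthly_budget(avg_cost_per_execution: float, monthly_executions: int) -> float:
    """Project monthly spend, in the same currency as the per-execution cost."""
    if avg_cost_per_execution < 0 or monthly_executions < 0:
        raise ValueError("cost and volume must be non-negative")
    return avg_cost_per_execution * monthly_executions


# Hypothetical inputs: replace with the average cost shown on the results page
# and your own expected monthly volume.
budget = estimate_monthly_budget(avg_cost_per_execution=0.004, monthly_executions=250_000)
print(f"Estimated monthly budget: {budget:.2f}")
```

Running the same calculation for each variation makes the cost gap concrete at your actual volume.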
Token usage
Prompt and completion token counts per variation. Helps identify which configurations consume more input or output tokens.
Detailed comparison
Click Compare to see side-by-side outputs for each test case:
- Select a test case
- View outputs from all variations
- See latency and cost for each
- Identify patterns in which variation performs better
Winner indicators highlight which variation had the highest success rate, lowest cost, and fastest execution.
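If you export the metrics and want to reproduce the three winner categories yourself, the logic can be sketched as below. The `VariationMetrics` type, its field names, and the tie-breaking behavior (first entry wins) are assumptions made for illustration; the product may resolve ties differently.

```python
from dataclasses import dataclass


@dataclass
class VariationMetrics:
    # Hypothetical record shape, not an export format defined by the product.
    name: str
    success_rate: float    # fraction between 0.0 and 1.0
    avg_latency_ms: float  # average response time per execution
    avg_cost: float        # average cost per execution


def pick_winners(variations: list[VariationMetrics]) -> dict[str, str]:
    """Return the variation name that wins each category."""
    if not variations:
        raise ValueError("at least one variation is required")
    return {
        "highest_success_rate": max(variations, key=lambda v: v.success_rate).name,
        "lowest_cost": min(variations, key=lambda v: v.avg_cost).name,
        "fastest": min(variations, key=lambda v: v.avg_latency_ms).name,
    }


# Illustrative numbers only.
winners = pick_winners([
    VariationMetrics("variation-a", success_rate=1.0, avg_latency_ms=2400, avg_cost=0.012),
    VariationMetrics("variation-b", success_rate=0.96, avg_latency_ms=900, avg_cost=0.002),
])
print(winners)
```

A single variation rarely wins all three categories, which is why the interpretation step matters.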
Interpreting results
Clear winner
One variation has the best success rate with acceptable latency and cost:
- Action: Deploy that variation
Cost tradeoff
A higher-quality model costs significantly more than a cheaper alternative:
- Action: Decide based on use case. For example, use the expensive model for premium features and the cheap model for basic features.
Mixed results
A variation scores well on some cases but poorly on others:
- Action: Investigate which types of inputs cause poor performance. Refine the prompt or add conditional logic.
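One way to find the input types behind poor performance is to tag each test case with a category and compute the success rate per variation and category. The sketch below assumes you have the results as `(variation, category, succeeded)` tuples; that shape and the category labels are hypothetical, not a format the product provides.

```python
from collections import defaultdict


def success_rate_by_category(results):
    """Map (variation, category) to the fraction of successful executions.

    `results` is an iterable of (variation, category, succeeded) tuples.
    """
    totals = defaultdict(lambda: [0, 0])  # (variation, category) -> [successes, runs]
    for variation, category, succeeded in results:
        bucket = totals[(variation, category)]
        bucket[0] += int(succeeded)
        bucket[1] += 1
    return {key: successes / runs for key, (successes, runs) in totals.items()}


# Illustrative data only.
rates = success_rate_by_category([
    ("variation-a", "short-input", True),
    ("variation-a", "short-input", True),
    ("variation-a", "long-input", False),
    ("variation-a", "long-input", True),
])
for (variation, category), rate in sorted(rates.items()):
    print(f"{variation} / {category}: {rate:.0%}")
```

Categories with a noticeably lower rate are the ones to target when refining the prompt or adding conditional logic.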
All variations similar
Metrics are close across all variations:
- Action: Choose the cheapest or fastest option
Keyword analysis
Use keyword analysis across step outputs to spot language patterns. Search for specific terms to see which variations use particular phrasing more or less often. This helps compare tone, terminology, and completeness across configurations.
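The same kind of comparison can be approximated offline on exported outputs. This is a rough sketch, assuming the outputs are available as plain strings grouped by variation; it counts whole-word, case-insensitive matches and is not the product's own keyword analysis.

```python
import re
from collections import Counter


def keyword_counts(outputs_by_variation: dict[str, list[str]],
                   keywords: list[str]) -> dict[str, Counter]:
    """Count whole-word, case-insensitive keyword occurrences per variation."""
    counts: dict[str, Counter] = {}
    for variation, outputs in outputs_by_variation.items():
        counter: Counter = Counter()
        for keyword in keywords:
            pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
            counter[keyword] = sum(len(pattern.findall(text)) for text in outputs)
        counts[variation] = counter
    return counts


# Illustrative outputs only.
counts = keyword_counts(
    {
        "variation-a": ["Sorry for the delay. We are sorry about that."],
        "variation-b": ["Your order has shipped."],
    },
    keywords=["sorry", "shipped"],
)
for variation, counter in counts.items():
    print(variation, dict(counter))
```

Comparing counts per variation shows at a glance whether one configuration leans on a term the others avoid.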
Common patterns
Cheaper model is sufficient
gpt-5-mini achieves a success rate comparable to claude-sonnet-4-5 at a fraction of the cost:
- Decision: Use gpt-5-mini for most requests, claude-sonnet-4-5 only for complex cases
Prompt matters more than model
The same model with different prompts shows a large difference in output quality:
- Decision: Focus on prompt engineering before switching models
Temperature affects consistency
Lower temperature produces more deterministic outputs with less variance:
- Decision: Use low temperature for tasks requiring consistency (classification, extraction)
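To put a number on consistency, you can run the same input several times per variation and measure how often the most common output appears. The sketch below assumes exact string matching is meaningful, as with classification labels; the function name and the sample labels are hypothetical.

```python
from collections import Counter


def agreement_rate(outputs: list[str]) -> float:
    """Fraction of runs that produced the most common output (1.0 = fully consistent)."""
    if not outputs:
        raise ValueError("at least one output is required")
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / len(outputs)


# Illustrative labels from repeated runs on one input, not measured results.
print(agreement_rate(["spam", "spam", "ham", "spam"]))
```

For free-form text, exact matching is too strict, so a similarity measure suited to your task would be a better fit.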
Acting on results
After identifying the best configuration:
- Update your Flow with the winning model and settings
- Monitor production logs to confirm improvement
Next steps
- Running an evaluation to execute Evals
- What are evals? for conceptual background
- What are Logs? to monitor production performance