Interpreting eval results

Analyze Eval results to identify the best-performing model, prompt, or configuration for your use case.

Accessing results

  1. Go to Evals in the sidebar
  2. Find your Eval
  3. Click to open the results page

Results overview

The results page shows high-level metrics for each variation:

Variation          Success Rate  Avg Latency  Avg Cost
claude-sonnet-4-5  100%          2.1s         $0.015
gpt-5-mini         95%           0.9s         $0.003
gemini-3-flash     100%          1.2s         $0.008

Key metrics

Success rate

Percentage of executions that completed without errors.

  • 100% is ideal
  • Investigate failed cases before deploying that variation

Latency

Average response time per execution.

  • Faster models improve user experience
  • Balance speed against output quality
  • Some models are consistently faster than others

Cost

Average cost per execution based on token usage.

  • Important for high-volume use cases
  • Cheaper models may be sufficient for some tasks
  • Multiply by expected monthly volume to estimate budget
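
The budget estimate in the last bullet is simple arithmetic. The sketch below applies it to the average costs from the example results table. The monthly volume of 100,000 executions and the helper name `estimate_monthly_budget` are assumptions made for illustration, not Runtype values or APIs.

```python
# Average cost per execution in dollars, copied from the example results table.
AVG_COST_PER_EXECUTION = {
    "claude-sonnet-4-5": 0.015,
    "gpt-5-mini": 0.003,
    "gemini-3-flash": 0.008,
}

# Hypothetical volume; replace with your own forecast.
MONTHLY_EXECUTIONS = 100_000


def estimate_monthly_budget(avg_cost: float, monthly_executions: int) -> float:
    """Return the projected monthly spend in dollars, rounded to cents."""
    return round(avg_cost * monthly_executions, 2)


budgets = {
    name: estimate_monthly_budget(cost, MONTHLY_EXECUTIONS)
    for name, cost in AVG_COST_PER_EXECUTION.items()
}

for name, budget in budgets.items():
    print(f"{name}: ${budget:,.2f} per month")
```

At this assumed volume, the gap between the cheapest and most expensive variation is $1,200 per month, which is the kind of figure to weigh against any difference in success rate.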

Token usage

Prompt and completion token counts per variation. Helps identify which configurations consume more input or output tokens.

Detailed comparison

Click Compare to see side-by-side outputs for each test case:

  1. Select a test case
  2. View outputs from all variations
  3. See latency and cost for each
  4. Identify patterns in which variation performs better

Winner indicators highlight which variation had the highest success rate, lowest cost, and fastest execution.
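
To reproduce the three winner indicators outside the UI, for example on exported metrics, a minimal sketch could look like the following. The `VariationResult` type and `find_winners` function are hypothetical names, the data comes from the example results table, and reporting every tied variation for success rate is an assumption about how ties might be handled.

```python
from dataclasses import dataclass


@dataclass
class VariationResult:
    name: str
    success_rate: float   # fraction between 0 and 1
    avg_latency_s: float  # seconds
    avg_cost_usd: float   # dollars per execution


# Values from the example results table.
RESULTS = [
    VariationResult("claude-sonnet-4-5", 1.00, 2.1, 0.015),
    VariationResult("gpt-5-mini", 0.95, 0.9, 0.003),
    VariationResult("gemini-3-flash", 1.00, 1.2, 0.008),
]


def find_winners(results: list[VariationResult]) -> dict:
    """Pick the best variation per metric; success-rate ties are all reported."""
    best_rate = max(r.success_rate for r in results)
    return {
        "highest_success_rate": [
            r.name for r in results if r.success_rate == best_rate
        ],
        "lowest_cost": min(results, key=lambda r: r.avg_cost_usd).name,
        "fastest": min(results, key=lambda r: r.avg_latency_s).name,
    }


winners = find_winners(RESULTS)
print(winners)
```

In the example data, no single variation wins all three metrics: two variations tie on success rate, while gpt-5-mini is both the cheapest and the fastest.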

Interpreting results

Clear winner

One variation has the best success rate with acceptable latency and cost:

  • Action: Deploy that variation

Cost tradeoff

A higher-quality model costs significantly more than a cheaper alternative:

  • Action: Decide per use case. Route premium features to the expensive model and basic features to the cheap one.

Inconsistent performance

A variation scores well on some cases but poorly on others:

  • Action: Investigate which types of inputs cause poor performance. Refine the prompt or add conditional logic.
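
One way to find which types of inputs cause poor performance is to tag each test case with a category and compute the success rate per category. The sketch below assumes you have per-case pass/fail results available. The category labels and the data are made up for illustration.

```python
from collections import defaultdict

# Hypothetical per-test-case results for one variation: (input category, passed?).
CASE_RESULTS = [
    ("short_question", True),
    ("short_question", True),
    ("long_document", False),
    ("long_document", True),
    ("long_document", False),
    ("non_english", True),
]


def success_rate_by_category(case_results):
    """Return the fraction of passing test cases for each input category."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in case_results:
        totals[category] += 1
        if passed:
            passes[category] += 1
    return {category: passes[category] / totals[category] for category in totals}


rates = success_rate_by_category(CASE_RESULTS)

# List the weakest categories first.
for category, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{category}: {rate:.0%}")
```

A category that scores far below the others is a candidate for a refined prompt or a dedicated conditional branch.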

All variations similar

Metrics are close across all variations:

  • Action: Choose the cheapest or fastest option

Keyword analysis

Use keyword analysis across step outputs to spot language patterns. Search for specific terms to see which variations use particular phrasing more or less often. This helps compare tone, terminology, and completeness across configurations.
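
The same kind of comparison can be approximated offline on exported outputs. The sketch below counts whole-word, case-insensitive keyword matches per variation. The variation names, sample outputs, and the `keyword_counts` helper are hypothetical, and this is not a description of how the built-in keyword analysis is implemented.

```python
import re

# Hypothetical step outputs collected for two variations.
OUTPUTS = {
    "variation_a": [
        "We apologize for the delay. Your refund is on its way.",
        "Sorry about that! A refund has been issued.",
    ],
    "variation_b": [
        "Your order was delayed.",
        "The refund was processed today.",
    ],
}


def keyword_counts(outputs, keywords):
    """Count whole-word, case-insensitive keyword matches per variation."""
    counts = {}
    for variation, texts in outputs.items():
        joined = " ".join(texts).lower()
        counts[variation] = {
            keyword: len(
                re.findall(rf"\b{re.escape(keyword.lower())}\b", joined)
            )
            for keyword in keywords
        }
    return counts


counts = keyword_counts(OUTPUTS, ["refund", "apologize", "sorry"])
for variation, per_keyword in counts.items():
    print(variation, per_keyword)
```

Here the first variation uses apologetic phrasing and the second does not, which is a difference in tone that the success rate alone would not reveal.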

Common patterns

Cheaper model is sufficient

gpt-5-mini achieves a success rate comparable to claude-sonnet-4-5 but costs a fraction of the price:

  • Decision: Use gpt-5-mini for most requests, claude-sonnet-4-5 only for complex cases
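
The routing decision above can be sketched as a small function that picks a model per request. The complexity heuristic here (a word-count threshold plus an explicit flag) is purely an assumption for illustration. Use whatever signal your Eval results show actually separates easy cases from hard ones.

```python
DEFAULT_MODEL = "gpt-5-mini"
PREMIUM_MODEL = "claude-sonnet-4-5"

# Hypothetical threshold: requests longer than this count as complex.
COMPLEX_WORD_COUNT = 200


def is_complex(request_text: str, needs_reasoning: bool = False) -> bool:
    """Treat long requests, or requests flagged by the caller, as complex."""
    return needs_reasoning or len(request_text.split()) > COMPLEX_WORD_COUNT


def choose_model(request_text: str, needs_reasoning: bool = False) -> str:
    """Send complex requests to the premium model and the rest to the default."""
    if is_complex(request_text, needs_reasoning):
        return PREMIUM_MODEL
    return DEFAULT_MODEL


print(choose_model("Summarize this tweet"))
print(choose_model("word " * 250))
print(choose_model("Compare these two contracts", needs_reasoning=True))
```

After adding a routing rule like this, it is worth re-running the Eval on each branch to confirm that the cheaper model still meets the success rate you need on the requests it receives.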

Prompt matters more than model

The same model shows a large difference in output quality across different prompts:

  • Decision: Focus on prompt engineering before switching models

Temperature affects consistency

Lower temperature produces more deterministic outputs with less variance:

  • Decision: Use low temperature for tasks requiring consistency (classification, extraction)
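
To put a number on consistency, one simple option is the agreement rate: the share of repeated runs on the same input that return the most common output. The sketch below uses made-up classification labels. The metric and the `agreement_rate` helper are illustrative choices, not something the results page is known to report.

```python
from collections import Counter


def agreement_rate(outputs):
    """Return the fraction of runs that match the most common output."""
    if not outputs:
        raise ValueError("outputs must not be empty")
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / len(outputs)


# Hypothetical labels from five repeated runs on the same input.
LOW_TEMPERATURE_RUNS = ["billing", "billing", "billing", "billing", "billing"]
HIGH_TEMPERATURE_RUNS = ["billing", "billing", "account", "billing", "other"]

low_agreement = agreement_rate(LOW_TEMPERATURE_RUNS)
high_agreement = agreement_rate(HIGH_TEMPERATURE_RUNS)

print(f"low temperature agreement: {low_agreement:.0%}")
print(f"high temperature agreement: {high_agreement:.0%}")
```

Exact-match agreement suits short, structured outputs such as labels or extracted fields. For free-form text, a similarity measure would be a better fit.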

Acting on results

After identifying the best configuration:

  1. Update your Flow with the winning model and settings
  2. Monitor production logs to confirm improvement

Next steps

  • Running an Eval to execute Evals
  • What are Evals? for conceptual background
  • What are Logs? to monitor production performance