
Running an Eval

Run Evals to compare model and prompt variations across test data, then choose the best-performing configuration for your Flow.

If you are new to Evals, start with What are Evals? for the conceptual overview.

Start an Eval

  1. Open the Flow you want to evaluate.
  2. Click Run.
  3. Select the Eval tab.
  4. Choose an execution mode, set up your configurations, and click Run Eval.

Choose an execution mode

Evals support two execution modes depending on how you want to test.

Realtime

Realtime runs your Eval immediately with streaming results. Use it for quick comparisons when you want to enter test input directly and watch outputs as they stream.

  • Enter messages for chat-based Flows or variables for other Flows.
  • Review results in real time so you can compare outputs immediately.

Batch

Batch queues your Eval to run against all Records of a selected type. Use it when you want to test at scale with data you already collected.

  • Select a Record type from your existing Records.
  • The Eval runs your Flow once per Record for each configuration.
  • Batch Evals run in the background, so you can leave the page and come back later.

On build and trial plans, Batch Evals are limited to the first 10 Records so you can validate your setup before running a full comparison. Paid plans can process up to 100 Records per Batch Eval. If you need to prepare test data first, see Creating and managing records.
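The Batch sizing rules above can be sketched as a small calculation. This is an illustrative helper only, not a Runtype API: the function name, the plan labels, and the assumption that a paid-plan Eval simply stops at 100 Records are all hypothetical.

```python
# Hypothetical sketch: estimates how many Flow executions one Batch Eval triggers.
# The caps come from the text above; names and plan labels are assumptions.
RECORD_CAPS = {"build": 10, "trial": 10, "paid": 100}


def batch_flow_runs(record_count: int, config_count: int, plan: str) -> int:
    """Return Records processed (after the plan cap) times configurations."""
    if record_count < 0 or config_count < 0:
        raise ValueError("counts must be non-negative")
    processed = min(record_count, RECORD_CAPS[plan])
    # The Flow runs once per Record for each configuration.
    return processed * config_count


print(batch_flow_runs(250, 3, "trial"))  # 30
print(batch_flow_runs(250, 3, "paid"))   # 300
print(batch_flow_runs(40, 2, "paid"))    # 80
```

An estimate like this can help you gauge how much work a Batch Eval will do before you queue it.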

Set up configurations

Configurations are the variations you want to compare. Each configuration can override settings on any prompt step in your Flow. Runtype labels them with letter badges such as A, B, and C so you can compare results more easily.

Give each configuration a clear name such as Baseline, Budget Option, or Creative Mode.

What you can override

  • Model — Compare model choices such as claude-sonnet-4-5, gpt-5-mini, or gemini-3-flash.
  • Temperature — Test more deterministic or more creative outputs.
  • Max tokens — Control response length.
  • Response format — Switch between JSON, markdown, XML, or HTML output.
  • Reasoning — Enable or adjust extended thinking for supported models.
  • Tools — Add, remove, or change which tools are available to the step.

Example setup

Configuration   Model               Temperature
A - Baseline    claude-sonnet-4-5   0.7
B - Budget      gpt-5-mini          0.7
C - Premium     claude-opus-4-5     0.3

With three configurations, this setup runs your Flow three times for each test case, once per configuration, so you can compare the tradeoffs directly.
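If it helps to reason about the setup as data, the example can be modeled as a list of named configurations with per-setting overrides. This is a minimal sketch for planning purposes: the `EvalConfig` class, the `overrides` keys, and both helper functions are hypothetical and do not reflect how Runtype stores configurations.

```python
from dataclasses import dataclass, field
from string import ascii_uppercase


@dataclass
class EvalConfig:
    """Hypothetical model of one configuration: a name plus overridden settings."""
    name: str
    overrides: dict = field(default_factory=dict)


configs = [
    EvalConfig("Baseline", {"model": "claude-sonnet-4-5", "temperature": 0.7}),
    EvalConfig("Budget", {"model": "gpt-5-mini", "temperature": 0.7}),
    EvalConfig("Premium", {"model": "claude-opus-4-5", "temperature": 0.3}),
]


def labeled(configs):
    """Attach letter badges (A, B, C, ...) in order, mirroring the table labels."""
    return [f"{badge} - {c.name}" for badge, c in zip(ascii_uppercase, configs)]


def planned_runs(configs, test_cases):
    """One Flow run per configuration for each test case."""
    return [(label, case) for case in test_cases for label in labeled(configs)]


print(labeled(configs))  # ['A - Baseline', 'B - Budget', 'C - Premium']
print(len(planned_runs(configs, ["case 1", "case 2"])))  # 6
```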

Understand your results

Realtime results

After a Realtime Eval completes, you will see a step-by-step comparison table with each configuration’s output, model, duration, and cost.

Batch results

Batch results are available on the Evals page. Evals in the same group are linked so you can compare them together.

Click Compare to open the comparison view, which includes:

  • Winner cards — Highlight the configurations with the highest success rate, lowest cost, and fastest execution.
  • Metrics table — Sortable by success rate, average duration, total cost, token usage, and step counts.
  • Record-level drill-down — Open individual Records to review step-by-step outputs across configurations.
  • Step analysis — Keyword analysis across step outputs to spot patterns.
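To make the winner cards concrete, the sketch below picks a winner per metric from a few per-configuration rows. Everything here is hypothetical: the `ConfigMetrics` fields, the `winners` function, and the sample numbers are made up for illustration and are not measured results or Runtype output.

```python
from dataclasses import dataclass


@dataclass
class ConfigMetrics:
    """Hypothetical per-configuration summary row."""
    name: str
    success_rate: float    # fraction of Records that completed successfully
    avg_duration_s: float  # average execution time in seconds
    total_cost_usd: float  # total cost across all Records


def winners(rows):
    """Pick one winner per metric, as the winner cards are described above."""
    return {
        "highest_success_rate": max(rows, key=lambda r: r.success_rate).name,
        "lowest_cost": min(rows, key=lambda r: r.total_cost_usd).name,
        "fastest": min(rows, key=lambda r: r.avg_duration_s).name,
    }


# Made-up numbers, for illustration only.
rows = [
    ConfigMetrics("A - Baseline", 0.96, 4.2, 1.80),
    ConfigMetrics("B - Budget", 0.91, 2.9, 0.40),
    ConfigMetrics("C - Premium", 0.99, 7.5, 6.10),
]

print(winners(rows))
```

Different configurations can win different cards, which is why the comparison view reports the three metrics separately.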

Tips for getting the most out of Evals

  • Start with Realtime — Use a quick Realtime Eval to validate your configurations before you run a full Batch Eval.
  • Include a baseline — Add your current production configuration so you can measure improvement.
  • Use representative data — For Batch Evals, include common cases and edge cases in your Records.
  • Name configurations clearly — Clear labels make results easier to scan later.
  • Compare one variable at a time — If you change both the model and the temperature, it is harder to explain the result.
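The "one variable at a time" tip can be checked mechanically before you run an Eval. The helper below is a hypothetical sketch that treats each configuration as a plain dictionary of overrides; the function name and dictionary keys are assumptions, not Runtype settings.

```python
def changed_settings(baseline: dict, variant: dict) -> list:
    """Return the sorted names of settings that differ between two configurations."""
    keys = set(baseline) | set(variant)
    return sorted(k for k in keys if baseline.get(k) != variant.get(k))


baseline = {"model": "claude-sonnet-4-5", "temperature": 0.7}
budget = {"model": "gpt-5-mini", "temperature": 0.7}
premium = {"model": "claude-opus-4-5", "temperature": 0.3}

print(changed_settings(baseline, budget))   # ['model']
print(changed_settings(baseline, premium))  # ['model', 'temperature']
```

A variant that reports more than one changed setting is still valid, but a difference in its results cannot be attributed to a single change.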

Eval limits

Each plan includes a daily Eval limit to help manage usage. If you run Evals often, check your plan limits in Settings. Each Eval submission counts as one Eval against your daily limit, regardless of how many configurations or Records are included.
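The counting rule can be sketched as a simple tally. The `DailyEvalQuota` class, the limit of 2, and the behavior of refusing submissions once the limit is reached are assumptions for illustration; check Settings for your actual limit.

```python
class DailyEvalQuota:
    """Hypothetical tally of Eval submissions against a daily limit."""

    def __init__(self, daily_limit: int):
        self.daily_limit = daily_limit
        self.used = 0

    def submit(self, config_count: int, record_count: int) -> bool:
        """Count one submission; return False if the limit is already reached."""
        if self.used >= self.daily_limit:
            return False
        # One submission counts as one Eval, regardless of
        # config_count or record_count.
        self.used += 1
        return True


quota = DailyEvalQuota(daily_limit=2)  # made-up limit for the example
print(quota.submit(config_count=3, record_count=100))  # True
print(quota.submit(config_count=1, record_count=1))    # True
print(quota.submit(config_count=2, record_count=10))   # False
print(quota.used)                                      # 2
```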

Next steps

  • What are Evals?
  • Creating and managing records
  • Creating and Editing Flows