Continuous testing for agent workflows on your MCP

You used to test user flows.
Now you need to test agent flows.

Armature continuously runs the workflows agents execute through your MCP — across multiple models and coding harnesses — then tells you exactly what broke, when, and why.

Start free trial · See how it works
No credit card · 14-day trial
armature.dev/runs/run_7B3F920A
Triage incoming bug · 17.42s · Pass
  Reasoning · Plan steps · 1.42s
  Tool · linear.search_issues · 2.84s
  Reasoning · Identify owning team · 1.18s
  Tool · linear.update_issue · 1.92s
  Judge · judge.evaluate · 2.21s
Runs against every major harness and frontier model
Claude Code · Codex · OpenCode · Cursor Agent

Agents don't take the happy path.

Click-test frameworks were built when software was deterministic. A button either worked or it didn't. Now your product is driven by a non-deterministic agent that picks tools at runtime, hallucinates arguments, and takes a different path on Tuesday than it did on Monday.

Unit tests can't catch this. Manual QA can't keep up. You need a harness that runs your agent the way your users run it — and tells you when reality drifts from the rubric.

Selenium / Playwright era
  Tests: user flows
  Outputs: deterministic
  Verdict: DOM diff
  Failure mode: flaky selector

Armature era
  Tests: agent flows
  Outputs: probabilistic
  Verdict: judge + assertions
  Failure mode: wrong tool, wrong order
How it works

From MCP to alerts in four steps

Connect once. Armature handles discovery, scheduling, evaluation, and notifications.

1. Connect your MCP

Point Armature at your MCP server URL. We catalog every tool, schema, and capability automatically.
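Under the hood, discovery amounts to MCP's standard `tools/list` call. A minimal sketch of cataloging what a server advertises — the response below is hand-written and the tool names are illustrative, not Armature's actual API:

```python
import json

# A hypothetical tools/list response from an MCP server (tools/list is MCP's
# standard discovery method); the tool names here are illustrative only.
response = json.loads("""
{"result": {"tools": [
  {"name": "linear.search_issues", "inputSchema": {"type": "object"}},
  {"name": "linear.update_issue", "inputSchema": {"type": "object"}},
  {"name": "linear.assign", "inputSchema": {"type": "object"}}
]}}
""")

def catalog_tools(response: dict) -> list[str]:
    """Extract the name of every tool the server advertises."""
    return [tool["name"] for tool in response["result"]["tools"]]

print(catalog_tools(response))
# ['linear.search_issues', 'linear.update_issue', 'linear.assign']
```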

MCP server URL: https://mcp.your-product.com/v1
Connected · 24 tools cataloged

Suggested workflows
  01 · Triage incoming bug → assign team
  02 · Open issue + add reviewer
  03 · Summarize support thread
  + 9 more proposed
2. Auto-discover workflows

We crawl your product surface and tools to propose realistic workflows your users actually run. Approve, edit, or write your own.

3. Test across the matrix

Each workflow runs on a schedule against multiple harnesses and models. Catch the regressions you can't reproduce by hand.
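The matrix itself is just a cross product of workflows, harnesses, and models. A sketch, with illustrative names (the real set is configured per project):

```python
from itertools import product

# Illustrative workflow, harness, and model names.
workflows = ["Triage incoming bug", "Open issue + add reviewer"]
harnesses = ["claude-code", "codex", "opencode", "cursor-agent"]
models = ["claude-opus-4", "gpt-5"]

# Every workflow runs on every harness with every model, each cycle.
matrix = list(product(workflows, harnesses, models))
print(len(matrix))  # 2 workflows x 4 harnesses x 2 models = 16 runs
```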

Workflows × harness × model: 16 combinations · 14 pass · 2 fail

Live alert: Success rate < 80%
Workflow "Triage incoming bug" dropped to 71% on claude-haiku-4.5.
Slack · PagerDuty · Email
4. Observe & get alerted

Full traces, judge verdicts, and assertion checks on every run. When success rate drops, you find out before your users do.

The test matrix

Every workflow × every harness × every model

An agent that works in Claude Code might fall over in Codex. We catch the regressions you can't reproduce by hand.

Coding harnesses
The agent runtime under test
Claude Code · Codex · OpenCode · Cursor Agent · + bring your own
Frontier models
Anthropic, OpenAI, Google
Claude (Opus, Sonnet, Haiku) · GPT-5 family · Gemini 3 · + any provider
Testing observability

See what works. Understand what doesn't.

Every run produces a complete trace — reasoning, tool calls, judge verdict, assertions — so you can debug failures the same way you debug code.

Success rate over time

Spot regressions the moment they appear, scoped by workflow, model, or harness.


Full traces

Every reasoning step, tool call, and result.

  Reasoning · Plan · 1.42s
  Tool · search_issues · 2.84s
  Tool · update_issue · 1.92s
  Judge · judge · 2.21s

Assertions

Deterministic checks, every run.

Tool linear.assign was called
No tool returned an error
Output mentions team_id
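Checks like these reduce to plain predicates over the run's trace. A hedged sketch of the three listed above; the trace shape and field names are assumptions for illustration, not Armature's actual API:

```python
# A finished run's trace, in an assumed illustrative shape.
trace = {
    "tool_calls": [
        {"name": "linear.search_issues", "error": None},
        {"name": "linear.assign", "error": None},
    ],
    "output": "Assigned to team_id=ENG-42 based on the incident source.",
}

checks = {
    # Tool linear.assign was called
    "tool_called(linear.assign)": any(
        call["name"] == "linear.assign" for call in trace["tool_calls"]
    ),
    # No tool returned an error
    "no_tool_errors": all(call["error"] is None for call in trace["tool_calls"]),
    # Output mentions team_id
    "output_mentions(team_id)": "team_id" in trace["output"],
}

print(checks)  # all three checks pass on this trace
```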

LLM-as-judge verdicts

Plain-language rubrics get evaluated by an independent model. Every PASS or FAIL is auditable.

PASS · score 0.92
Agent invoked the required tools in the correct order and the final output references the source incident.
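Auditability comes from forcing the judge's free-text answer into a structured record. A hedged sketch; the VERDICT/SCORE/REASON response format is an assumption for illustration:

```python
import re

# An assumed judge-model response format, for illustration only.
judge_response = (
    "VERDICT: PASS\n"
    "SCORE: 0.92\n"
    "REASON: Required tools invoked in order; output references the incident."
)

def parse_verdict(text: str) -> dict:
    """Extract a structured record so every PASS/FAIL can be audited later."""
    verdict = re.search(r"VERDICT:\s*(PASS|FAIL)", text).group(1)
    score = float(re.search(r"SCORE:\s*([0-9.]+)", text).group(1))
    reason = re.search(r"REASON:\s*(.+)", text).group(1)
    return {"verdict": verdict, "score": score, "reason": reason}

print(parse_verdict(judge_response))
# {'verdict': 'PASS', 'score': 0.92, 'reason': '...'}
```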
Alerting

Know before your users do.

Set thresholds on success rate, judge score, latency, or any assertion. Get paged the way your team already gets paged.

  • Success-rate, latency, and judge-score thresholds
  • Slack, PagerDuty, email, and webhooks
  • Alert on any single workflow or org-wide
  • Snooze, mute, and incident grouping built in
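A success-rate threshold is the simplest of these. A minimal sketch, with numbers mirroring the example alert (10 of 14 runs passing against an 80% threshold):

```python
def should_alert(passed: int, total: int, threshold: float = 0.80) -> bool:
    """Fire when the rolling success rate drops below the threshold."""
    return total > 0 and passed / total < threshold

print(should_alert(passed=10, total=14))  # True: 10/14 is ~71%, below 80%
```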
#alerts (Slack)
Armature: Workflow PagerDuty → Linear success rate dropped to 71% (threshold 80%) in the last hour.
2 min ago · 14 failed runs
PagerDuty · sre@acme
Judge score below 0.5 on Triage incoming bug. Last 3 runs failed assertion tool_called(linear.assign).
11 min ago · acked by Lila K.
Weekly digest
6 workflows · 1,247 runs · 92.4% success rate. 1 regression detected on claude-haiku-4.5.
Sent every Monday · 09:00

Ship agents you can trust.

Connect your MCP in 60 seconds. Watch Armature discover, test, and monitor your agents around the clock.

Start free trial · Book a demo
14 days free · No credit card · SOC 2 Type II