Continuous testing for agent workflows on your MCP

You used to test user flows.
Now you need to test agent flows.

Armature continuously runs the workflows agents execute through your MCP — across multiple models and coding harnesses — then tells you exactly what broke, when, and why.

Start free trial · See how it works
No credit card · 14-day trial
armature.dev/runs/run_7B3F920A
Triage incoming bug · 17.42s · Pass
  Reasoning · Plan steps · 1.42s
  Tool · linear.search_issues · 2.84s
  Reasoning · Identify owning team · 1.18s
  Tool · linear.update_issue · 1.92s
  Judge · judge.evaluate · 2.21s
Runs against every major harness and frontier model
Claude Code · Codex · OpenCode · Cursor Agent

Agents don't take the happy path.

Click-test frameworks were built when software was deterministic. A button either worked or it didn't. Now your product is driven by a non-deterministic agent that picks tools at runtime, hallucinates arguments, and takes a different path on Tuesday than it did on Monday.

Unit tests can't catch this. Manual QA can't keep up. You need a harness that runs your agent the way your users run it — and tells you when reality drifts from the rubric.

Selenium / Playwright era
  Tests: user flows
  Outputs: deterministic
  Verdict: DOM diff
  Failure mode: flaky selector

Armature era
  Tests: agent flows
  Outputs: probabilistic
  Verdict: judge + assertions
  Failure mode: wrong tool, wrong order
How it works

From MCP to alerts in four steps

Connect once. Armature handles discovery, scheduling, evaluation, and notifications.

1. Connect your MCP

Point Armature at your MCP server URL. We catalog every tool, schema, and capability automatically.
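Under the hood, discovery amounts to MCP's standard `tools/list` call. A minimal sketch of cataloging what a server advertises — the response below is hand-written and the tool names are illustrative, not Armature's actual API:

```python
import json

# A hypothetical tools/list response from an MCP server (tools/list is MCP's
# standard discovery method); the tool names here are illustrative only.
response = json.loads("""
{"result": {"tools": [
  {"name": "linear.search_issues", "inputSchema": {"type": "object"}},
  {"name": "linear.update_issue", "inputSchema": {"type": "object"}},
  {"name": "linear.assign", "inputSchema": {"type": "object"}}
]}}
""")

def catalog_tools(response: dict) -> list[str]:
    """Extract the name of every tool the server advertises."""
    return [tool["name"] for tool in response["result"]["tools"]]

print(catalog_tools(response))
# ['linear.search_issues', 'linear.update_issue', 'linear.assign']
```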

MCP server URL: https://mcp.your-product.com/v1
Connected · 24 tools cataloged

Suggested workflows
  01 · Triage incoming bug → assign team
  02 · Open issue + add reviewer
  03 · Summarize support thread
  + 9 more proposed
2. Auto-discover workflows

We crawl your product surface and tools to propose realistic workflows your users actually run. Approve, edit, or write your own.

3. Test across the matrix

Each workflow runs on a schedule against multiple harnesses and models. Catch the regressions you can't reproduce by hand.
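The matrix itself is just a cross product of workflows, harnesses, and models. A sketch, with illustrative names (the real set is configured per project):

```python
from itertools import product

# Illustrative workflow, harness, and model names.
workflows = ["Triage incoming bug", "Open issue + add reviewer"]
harnesses = ["claude-code", "codex", "opencode", "cursor-agent"]
models = ["claude-opus-4", "gpt-5"]

# Every workflow runs on every harness with every model, each cycle.
matrix = list(product(workflows, harnesses, models))
print(len(matrix))  # 2 workflows x 4 harnesses x 2 models = 16 runs
```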

Workflows × harness × model: 16 combinations · 14 pass · 2 fail

Live alert: Success rate < 80%
Workflow "Triage incoming bug" dropped to 71% on claude-haiku-4.5.
Slack · PagerDuty · Email
4. Observe & get alerted

Full traces, judge verdicts, and assertion checks on every run. When success rate drops, you find out before your users do.

The test matrix

Every workflow × every harness × every model

An agent that works in Claude Code might fall over in Codex. We catch the regressions you can't reproduce by hand.

Coding harnesses
The agent runtime under test
Claude Code · Codex · OpenCode · Cursor Agent · + bring your own
Frontier models
Anthropic, OpenAI, Google
Claude (Opus, Sonnet, Haiku) · GPT-5 family · Gemini 3 · + any provider
Testing observability

See what works. Understand what doesn't.

Every run produces a complete trace — reasoning, tool calls, judge verdict, assertions — so you can debug failures the same way you debug code.

Success rate over time

Spot regressions the moment they appear, scoped by workflow, model, or harness.


Full traces

Every reasoning step, tool call, and result.

  Reasoning · Plan · 1.42s
  Tool · search_issues · 2.84s
  Tool · update_issue · 1.92s
  Judge · judge · 2.21s

Assertions

Deterministic checks, every run.

Tool linear.assign was called
No tool returned an error
Output mentions team_id
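Checks like these reduce to plain predicates over the run's trace. A hedged sketch of the three listed above; the trace shape and field names are assumptions for illustration, not Armature's actual API:

```python
# A finished run's trace, in an assumed illustrative shape.
trace = {
    "tool_calls": [
        {"name": "linear.search_issues", "error": None},
        {"name": "linear.assign", "error": None},
    ],
    "output": "Assigned to team_id=ENG-42 based on the incident source.",
}

checks = {
    # Tool linear.assign was called
    "tool_called(linear.assign)": any(
        call["name"] == "linear.assign" for call in trace["tool_calls"]
    ),
    # No tool returned an error
    "no_tool_errors": all(call["error"] is None for call in trace["tool_calls"]),
    # Output mentions team_id
    "output_mentions(team_id)": "team_id" in trace["output"],
}

print(checks)  # all three checks pass on this trace
```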

LLM-as-judge verdicts

Plain-language rubrics get evaluated by an independent model. Every PASS or FAIL is auditable.

PASS · score 0.92
Agent invoked the required tools in the correct order and the final output references the source incident.
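Auditability comes from forcing the judge's free-text answer into a structured record. A hedged sketch; the VERDICT/SCORE/REASON response format is an assumption for illustration:

```python
import re

# An assumed judge-model response format, for illustration only.
judge_response = (
    "VERDICT: PASS\n"
    "SCORE: 0.92\n"
    "REASON: Required tools invoked in order; output references the incident."
)

def parse_verdict(text: str) -> dict:
    """Extract a structured record so every PASS/FAIL can be audited later."""
    verdict = re.search(r"VERDICT:\s*(PASS|FAIL)", text).group(1)
    score = float(re.search(r"SCORE:\s*([0-9.]+)", text).group(1))
    reason = re.search(r"REASON:\s*(.+)", text).group(1)
    return {"verdict": verdict, "score": score, "reason": reason}

print(parse_verdict(judge_response))
# {'verdict': 'PASS', 'score': 0.92, 'reason': '...'}
```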
Alerting

Know before your users do.

Set thresholds on success rate, judge score, latency, or any assertion. Get paged the way your team already gets paged.

  • Success-rate, latency, and judge-score thresholds
  • Slack, PagerDuty, email, and webhooks
  • Alert on any single workflow or org-wide
  • Snooze, mute, and incident grouping built in
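A success-rate threshold is the simplest of these. A minimal sketch, with numbers mirroring the example alert (10 of 14 runs passing against an 80% threshold):

```python
def should_alert(passed: int, total: int, threshold: float = 0.80) -> bool:
    """Fire when the rolling success rate drops below the threshold."""
    return total > 0 and passed / total < threshold

print(should_alert(passed=10, total=14))  # True: 10/14 is ~71%, below 80%
```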
#alerts (Slack)
Armature: Workflow PagerDuty → Linear success rate dropped to 71% (threshold 80%) in the last hour.
2 min ago · 14 failed runs
PagerDuty · sre@acme
Judge score below 0.5 on Triage incoming bug. Last 3 runs failed assertion tool_called(linear.assign).
11 min ago · acked by Lila K.
Weekly digest
6 workflows · 1,247 runs · 92.4% success rate. 1 regression detected on claude-haiku-4.5.
Sent every Monday · 09:00

Ship agents you can trust.

Connect your MCP in 60 seconds. Watch Armature discover, test, and monitor your agents around the clock.

Start free trial · Book a demo
14 days free · No credit card · SOC 2 Type II