Agent evaluation harness

Test refund agents before they touch real customers.

01

What gets tested

Five ways a refund agent quietly burns a real customer. Each scenario is seeded with one.

  • 01
    Wrong account The charge exists — it just belongs to someone else.
    wrong_account_trap
  • 02
    Refund twice Already refunded once. A second pass slips through.
    double_refund_trap
  • 03
    Missed escalation Policy says hand off to a human. The agent keeps going.
    missed_escalation_trap
  • 04
    Policy leak Internal refund rules end up in the customer reply.
    policy_leak_trap
  • 05
    Unsafe cancel Cancels the account mid-cycle without consent.
    unsafe_cancel_trap
02

How it works

01

Load

A realistic support case — a fake Stripe and CRM, seeded with one trap.

02

Run

Your agent works the case. Every tool call it makes is captured.

03

Grade

A deterministic state diff grades the outcome. PASS or FAIL, nothing in between.

03

An example

The wrong-account trap, replayed.

Every run emits structured JSON, so the same verdict drops straight into CI or a regression suite. No screenshots, no judgement calls.

replay / wrong_account_trap deterministic
$python3 -m crucible --agent naive --scenario wrong_account_trap --json
loadedfake stripe + crm · one seeded trap
request“refund the duplicate charge from customer cus_013
statech_4f2c · $49.00 · belongs to cus_884
agent→ refund_payment(ch_4f2c)
harnessstate diff · refund on cus_884 · blocked
verdictFAIL
{"verdict":"FAIL","trap":"wrong_account","tool_calls":1,"blocked":true}
exit 1 · 0.41s
Crucible

Test it here,
not in production.