Public evals
VerifiedX integrates into systems that already exist, so the public proof we lead with is evals against real workflow classes.
Current featured release: the Legal Action Boundary Eval. It measures whether legal AI systems execute unjustified high-impact actions and whether the workflow still completes the real job.
Open-source evidence
Featured now
Public proxy eval based on legal workflow classes Luminance publicly markets: negotiation, compliance, and orchestrated legal review flows. Same harness, same prompts, same playbooks. Baseline versus VerifiedX.
Baseline unjustified high-impact action points executed: 18
VerifiedX unjustified high-impact action points executed: 0
False blocks in the current suite: 0
Surviving-goal completion: 41.7% -> 100%
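The scorecard numbers above come down to simple tallies over raw run artifacts. A minimal sketch of that scoring, assuming hypothetical record fields (`high_impact`, `justified`, `blocked`, `goal_completed`) rather than the actual VerifiedX artifact schema:

```python
# Sketch of eval scoring. Field names are illustrative assumptions,
# not the real harness schema.

def score(runs):
    """Tally unjustified executed actions, false blocks, and goal completion."""
    executed_unjustified = sum(
        1 for r in runs
        if r["high_impact"] and not r["justified"] and not r["blocked"]
    )
    # A "false block" is a justified action that got blocked anyway.
    false_blocks = sum(1 for r in runs if r["blocked"] and r["justified"])
    completion = sum(1 for r in runs if r["goal_completed"]) / len(runs)
    return executed_unjustified, false_blocks, completion

baseline = [
    {"high_impact": True, "justified": False, "blocked": False, "goal_completed": False},
    {"high_impact": True, "justified": True,  "blocked": False, "goal_completed": True},
]
protected = [
    {"high_impact": True, "justified": False, "blocked": True,  "goal_completed": True},
    {"high_impact": True, "justified": True,  "blocked": False, "goal_completed": True},
]

print(score(baseline))   # (1, 0, 0.5)
print(score(protected))  # (0, 0, 1.0)
```

The same three counters, run over the published artifacts, reproduce the scorecard: unjustified actions drop to zero without introducing false blocks, while completion rises.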
Negotiation
Accepting counterparty positions, applying redrafts, marking issues resolved, and routing to signature only when the workflow is actually ready.
Compliance
Marking agreements compliant, applying remediation markup, escalating failed checks, and blocking false clearance.
Composed systems
Intake agent to execution agent to upstream legal or compliance review, with the wrong action blocked and the workflow kept alive through the correct lane.
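The composed lane above can be sketched as a gate between agents: a high-impact action without justification is escalated to upstream review instead of executed, and the run keeps going. All names here (`intake_agent`, `review_lane`, `preapproved`) are hypothetical stand-ins, not the real pipeline:

```python
# Hypothetical composed pipeline: intake -> execution, with high-impact
# actions routed through an upstream review lane. Names are illustrative.

def intake_agent(request):
    return {"action": request["action"], "high_impact": request["high_impact"]}

def review_lane(request):
    # Upstream legal/compliance review approves or rejects
    # instead of killing the run.
    return request.get("preapproved", False)

def execution_agent(task):
    return f"executed:{task['action']}"

def run_pipeline(request):
    task = intake_agent(request)
    if task["high_impact"] and not review_lane(request):
        # Wrong action blocked; workflow stays alive in the review lane.
        return "escalated-for-review"
    return execution_agent(task)

print(run_pipeline({"action": "route_to_signature", "high_impact": True}))
print(run_pipeline({"action": "apply_redraft", "high_impact": False}))
```

The design point is that the blocked path returns a live workflow state, not a dead end.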
Legal and governance
Clear methodology, concrete scenarios, and raw artifacts instead of hand-wavy claims about safer AI.
Founders and product
The point is not only to stop bad actions. It is to stop them while keeping the real workflow moving.
Builders
Same harness, same prompts, same playbooks, baseline versus protected. Then inspect the exact GitHub evidence and raw outputs.
The full eval lives on GitHub with the scorecard, scenario catalog, methodology, raw artifacts, and repro steps. The website stays intentionally short so you can get the signal fast and then inspect the proof in the place builders already live.
npx skills add bigkan8/verifiedx-agent-skills
"Install VerifiedX into this repo."