Agent Release Safety Gates

A public release-readiness harness for measuring AI-agent reliability across incident replay, grounded retrieval, safe refusal, approval-gated mock tools, safety/usefulness trade-offs, audit events, and observability.

This project does not reproduce, evaluate, or criticize any real company's internal AI system. All data, teams, tickets, runbooks, controlled benchmark metrics, and workflows are synthetic. TechQA and WixQA results are separate public-data RAG benchmarks.

At A Glance

What it is

A reproducible benchmark and dashboard for evaluating incident replay, grounded retrieval, safe refusal, approval-gated tools, and auditability in AI-agent workflows.

Why it matters

Agent safety work needs mitigation-aware measurement that reports both unsafe misses and usefulness costs, not only a single headline safety score.

Evidence available

Controlled synthetic operations tests are paired with public TechQA and WixQA RAG validation, safety trade-off analysis, and release gates.

Validation boundary

The controlled operations benchmark is synthetic. Independent human labels and broader multi-model comparison are the next validation steps.

Key Findings

Safety needs trade-off reporting

Unsafe-request capture is meaningful only when reported alongside benign blocks, review load, weak-evidence handling, and false-negative risk.
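
As a minimal sketch of this kind of trade-off reporting, the Python below computes unsafe-request recall together with its usefulness costs from a list of labelled cases. The `Case` fields, the `tradeoff_report` function, and the choice to count an escalated unsafe request as "caught" are illustrative assumptions, not the project's actual schema or metric definitions.

```python
from dataclasses import dataclass


@dataclass
class Case:
    # Hypothetical record layout, assumed for illustration only.
    unsafe: bool          # ground-truth label: the request is unsafe
    blocked: bool         # the agent refused or blocked the request
    sent_to_review: bool  # the case was escalated to a human reviewer


def tradeoff_report(cases):
    """Report safety capture next to its usefulness costs."""
    unsafe = [c for c in cases if c.unsafe]
    benign = [c for c in cases if not c.unsafe]
    # Assumption: an unsafe request counts as caught if it was blocked or escalated.
    caught = sum(c.blocked or c.sent_to_review for c in unsafe)
    return {
        "unsafe_recall": caught / len(unsafe) if unsafe else None,
        "unsafe_misses": len(unsafe) - caught,
        "benign_block_rate": (
            sum(c.blocked for c in benign) / len(benign) if benign else None
        ),
        "review_per_100": (
            100 * sum(c.sent_to_review for c in cases) / len(cases) if cases else None
        ),
    }
```

Reporting the four numbers together makes it harder for a high recall figure to hide a large benign block rate or an unmanageable review queue.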

Public RAG checks strengthen the lab

TechQA and WixQA results show that retrieval evaluation is not limited to the controlled synthetic environment.

Intervention evidence

The current study compares frozen baseline behavior against layered safeguards for prompt injection, action gating, and safety classification.

Goal-conflict arbitration

The lab now measures when agents should redirect user goals that conflict with safety, evidence, privacy, or tool-risk boundaries.

Incident replay

Redacted synthetic incidents are replayed as regression tests before a release gate can pass.
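
The replay-then-gate idea can be sketched as follows. This is a simplified illustration under stated assumptions: `ReplayResult`, `incident_gate`, and the default requirement that every replayed incident is closed are hypothetical names and thresholds, not the project's implementation.

```python
from dataclasses import dataclass


@dataclass
class ReplayResult:
    # Hypothetical result of replaying one synthetic incident.
    incident_id: str
    closed: bool  # True if the candidate agent handled the replayed incident correctly


def incident_gate(results, required_closure_rate=1.0):
    """Pass the gate only if enough replayed incidents are closed."""
    if not results:
        # Assumption: an empty replay set cannot justify a release.
        return {"closure_rate": 0.0, "status": "Fail", "open": []}
    rate = sum(r.closed for r in results) / len(results)
    return {
        "closure_rate": rate,
        "status": "Pass" if rate >= required_closure_rate else "Fail",
        "open": [r.incident_id for r in results if not r.closed],
    }
```

Returning the list of still-open incidents, rather than only a pass/fail flag, gives reviewers something concrete to inspect when a gate fails.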

Auditability is part of reliability

The project publishes release gates, trace summaries, and generated artifacts so reviewers can inspect how results were produced.

Evidence Snapshot

Synthetic golden cases: 358
Red-team cases: 60
Intervention experiments: 3
Incident replay cases: 8
Incident closure rate: 100.00%
Incident gate status: Pass
Manual golden-case share: 28.49%
Public RAG cases: 640
Public weighted RAG@3: 79.92%
TechQA public RAG@3: 80.73%
WixQA public RAG@3: 77.50%
Moderate grounding unsupported: 23.53%
Strict grounding review / 100: 22.97
Synthetic citation coverage: 98.26%
Synthetic abstention accuracy: 100.00%
Memory pollution follow rate: 0.00%
Memory review / 100: 66.67
Goal conflict unsafe compliance: 0.00%
Goal conflict review / 100: 58.33
Safety classifier recall: 90.91%
High-severity unsafe misses: 0
Synthetic unsafe prevalence: 10.02%
Red-team safe response rate: 100.00%
Indexed observability traces: 21
Release gate status: Pass
External review status: Awaiting independent labels

These are engineering checks over controlled benchmarks. They should be read with the benchmark cards, dataset boundaries, and full report.
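
The snapshot does not define how RAG@3 or the weighted public figure is computed; the benchmark cards and full report are the authoritative source. As a rough sketch, assuming RAG@3 means "a gold document appears among the top 3 retrieved results" and that "weighted" means weighting each track by its case count, the calculation could look like this. All function names and the toy inputs are illustrative.

```python
def hit_at_k(ranked_ids, gold_ids, k=3):
    """True if any gold document appears in the top-k retrieved ids."""
    return any(doc_id in gold_ids for doc_id in ranked_ids[:k])


def hit_rate_at_k(examples, k=3):
    """Fraction of (ranked_ids, gold_ids) examples with a hit in the top k."""
    hits = sum(hit_at_k(ranked, gold, k) for ranked, gold in examples)
    return hits / len(examples)


def case_weighted_mean(tracks):
    """Combine per-track rates, weighting each track by its number of cases.

    `tracks` is a list of (rate, n_cases) pairs.
    """
    total_cases = sum(n for _, n in tracks)
    return sum(rate * n for rate, n in tracks) / total_cases
```

Under these assumptions, a track with more cases pulls the combined figure toward its own rate, which is why a weighted figure is reported next to the per-track results instead of replacing them.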

Explore The Project

Interactive dashboard

Explore metrics, cases, safety analysis, retrieval comparisons, and observability views in Streamlit.

Open dashboard

Full evaluation report

Read the deeper method, metrics, limitations, and generated evaluation narrative.

Open report

Evaluate your agent

Convert generic agent logs or LangChain/LangSmith traces into candidate results, then run the incident replay release gate.

Open quickstart
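
To illustrate the conversion step in general terms, the sketch below maps one JSON log line to a candidate-result record. The input fields (`case_id`, `output`, `tool_calls`, `approved`) and the output layout are hypothetical and chosen only for this example; the quickstart defines the formats the project actually accepts.

```python
import json


def to_candidate_result(log_line):
    """Convert one JSON-encoded agent log line into a candidate-result dict.

    The field names used here are assumptions for illustration, not the
    project's schema.
    """
    event = json.loads(log_line)
    tool_calls = event.get("tool_calls", [])
    return {
        "case_id": event["case_id"],
        "response": event.get("output", ""),
        "tool_calls": [call["name"] for call in tool_calls],
        # True only if every tool call was approved; vacuously True when
        # the agent made no tool calls.
        "all_tools_approved": all(call.get("approved", False) for call in tool_calls),
    }
```

Keeping the converter a pure function from one log line to one record makes it straightforward to unit-test against a handful of recorded traces before running a full gate.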

Reviewer handoff

Inspect the external-review packet, labelling workflow, and reviewer-facing instructions.

Open reviewer handoff

Technical artifacts

View 64 generated JSON, CSV, report, and reproducibility artifacts on a separate technical page.

Open artifact index

Benchmark Transparency

High scores on synthetic cases are useful only when the benchmark mix is visible. This project keeps the synthetic benchmark, public RAG tracks, and remaining validation gaps separate.

Current benchmark-quality labels
  • Provider-backed embedding comparison is available as an optional credentialed run but is not published yet.
  • Real company data is intentionally excluded; synthetic and public benchmarks are reported separately.

Recommended next data work
  • Compare local retrieval with a provider-backed embedding option.
  • Keep synthetic and public benchmark data separated and reproducible.

Run Locally

The FastAPI service and dashboard are containerized so the full stack can be run from the public repository.

docker compose up --build

Then open http://localhost:8510 for the dashboard and http://localhost:8000/health for the API.
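
For a scripted check instead of a browser visit, a small standard-library probe such as the one below can confirm that both endpoints respond. It assumes a healthy service answers with HTTP 200, which the text above does not state; `is_up` is an illustrative helper, and the `opener` parameter exists only so the function can be exercised without a running stack.

```python
import urllib.error
import urllib.request


def is_up(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if `url` answers with HTTP 200 within `timeout` seconds."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

With the containers running, `is_up("http://localhost:8000/health")` and `is_up("http://localhost:8510")` should both return True if the assumption about the status code holds.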