Agent Release Safety Gates
A public release-readiness harness for measuring AI-agent reliability across incident replay, grounded retrieval, safe refusal, approval-gated mock tools, safety/usefulness trade-offs, audit events, and observability.
This project does not reproduce, evaluate, or criticize any real company's internal AI system. All data, teams, tickets, runbooks, controlled benchmark metrics, and workflows are synthetic. TechQA and WixQA results are separate public-data RAG benchmarks.
At A Glance
What it is
A reproducible benchmark and dashboard for evaluating incident replay, grounded retrieval, safe refusal, approval-gated tools, and auditability in AI-agent workflows.
Why it matters
Agent safety work needs mitigation-aware measurement that reports both unsafe misses and usefulness costs, not only a single headline safety score.
Evidence available
Controlled synthetic operations tests are paired with public TechQA and WixQA RAG validation, safety trade-off analysis, and release gates.
Validation boundary
The controlled operations benchmark is synthetic. Independent human labels and broader multi-model comparison are the next validation steps.
Key Findings
Safety needs trade-off reporting
Unsafe-request capture is useful only when reviewed alongside benign blocks, review load, weak-evidence handling, and false-negative risk.
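As a rough illustration of reading these numbers side by side, the sketch below computes a few of them from labelled cases. The `Case` fields, the decision values ("allow", "block", "review"), and the metric names are assumptions made for this example, not the project's actual schema, and weak-evidence handling is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    # Hypothetical schema: a ground-truth label plus the agent's decision.
    is_unsafe: bool
    decision: str  # assumed values: "allow", "block", "review"


def _rate(numerator: int, denominator: int) -> float:
    """Return a ratio, or 0.0 when there is nothing to divide by."""
    return numerator / denominator if denominator else 0.0


def tradeoff_report(cases: list[Case]) -> dict[str, float]:
    """Report safety capture together with its usefulness costs."""
    unsafe = [c for c in cases if c.is_unsafe]
    benign = [c for c in cases if not c.is_unsafe]
    return {
        # Unsafe requests that were blocked or sent to review.
        "unsafe_capture_rate": _rate(
            sum(c.decision != "allow" for c in unsafe), len(unsafe)
        ),
        # Unsafe requests that were allowed through.
        "false_negative_rate": _rate(
            sum(c.decision == "allow" for c in unsafe), len(unsafe)
        ),
        # Benign requests that were blocked outright.
        "benign_block_rate": _rate(
            sum(c.decision == "block" for c in benign), len(benign)
        ),
        # Share of all requests routed to a human reviewer.
        "review_load": _rate(
            sum(c.decision == "review" for c in cases), len(cases)
        ),
    }


if __name__ == "__main__":
    sample = [
        Case(is_unsafe=True, decision="block"),
        Case(is_unsafe=True, decision="allow"),
        Case(is_unsafe=False, decision="allow"),
        Case(is_unsafe=False, decision="block"),
    ]
    for name, value in tradeoff_report(sample).items():
        print(f"{name}: {value:.2f}")
```

Reporting the four rates together makes it harder for a high capture rate to hide a high benign block rate or a heavy review queue.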
Public RAG checks strengthen the lab
TechQA and WixQA results show that retrieval evaluation is not limited to the controlled synthetic environment.
Intervention evidence
The current study compares frozen baseline behavior against layered safeguards for prompt injection, action gating, and safety classification.
Goal-conflict arbitration
The lab now measures when agents should redirect user goals that conflict with safety, evidence, privacy, or tool-risk boundaries.
Incident replay
Redacted synthetic incidents are replayed as regression tests before a release gate can pass.
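One way to picture the replay step is as a loop that feeds each recorded incident to the agent and fails the gate on any mismatch. This is a minimal sketch under stated assumptions: the `Incident` fields, the `replay_gate` function, and the `max_failures` threshold are hypothetical names for illustration, and the project's own gate may use different criteria.

```python
from typing import Callable, Iterable, NamedTuple


class Incident(NamedTuple):
    # Hypothetical record for one redacted synthetic incident.
    incident_id: str
    prompt: str
    expected_decision: str


def replay_gate(
    incidents: Iterable[Incident],
    agent: Callable[[str], str],
    max_failures: int = 0,
) -> tuple[bool, list[str]]:
    """Replay every incident and return (gate_passed, failing_incident_ids)."""
    failures = [
        incident.incident_id
        for incident in incidents
        if agent(incident.prompt) != incident.expected_decision
    ]
    return len(failures) <= max_failures, failures


if __name__ == "__main__":
    def toy_agent(prompt: str) -> str:
        # Stand-in agent: blocks anything that mentions deletion.
        return "block" if "delete" in prompt else "allow"

    replayed = [
        Incident("inc-001", "please delete the prod database", "block"),
        Incident("inc-002", "summarize the runbook", "allow"),
    ]
    passed, failing = replay_gate(replayed, toy_agent)
    print("gate passed" if passed else f"gate failed: {failing}")
```

Returning the failing incident IDs alongside the pass/fail flag keeps the gate result inspectable rather than a bare boolean.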
Auditability is part of reliability
The project publishes release gates, trace summaries, and generated artifacts so reviewers can inspect how results were produced.
Evidence Snapshot
These are engineering checks over controlled benchmarks. They should be read with the benchmark cards, dataset boundaries, and full report.
Explore The Project
Interactive dashboard
Explore metrics, cases, safety analysis, retrieval comparisons, and observability views in Streamlit.
Full evaluation report
Read the deeper method, metrics, limitations, and generated evaluation narrative.
Evaluate your agent
Convert generic agent logs or LangChain/LangSmith traces into candidate results, then run the incident replay release gate.
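The conversion step might look something like the sketch below for generic logs stored as JSON Lines. The input keys (`case_id`, `output`, `tool_calls`) and the output shape are assumptions chosen for illustration; they are not the project's candidate-result format and not the LangChain or LangSmith trace schema.

```python
import json


def logs_to_candidates(jsonl_text: str) -> list[dict]:
    """Turn JSON Lines agent logs into a list of candidate-result dicts.

    All field names here are hypothetical; adapt them to the real log format.
    """
    candidates = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no} is not valid JSON") from exc
        candidates.append(
            {
                "case_id": record["case_id"],
                "response": record.get("output", ""),
                # Keep only tool names so the gate can check approvals.
                "tool_calls": [
                    call["name"] for call in record.get("tool_calls", [])
                ],
            }
        )
    return candidates


if __name__ == "__main__":
    raw = (
        '{"case_id": "inc-001", "output": "Refused.", '
        '"tool_calls": [{"name": "restart_service"}]}\n'
        '{"case_id": "inc-002"}\n'
    )
    for candidate in logs_to_candidates(raw):
        print(candidate)
```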
Reviewer handoff
Inspect the external-review packet, labelling workflow, and reviewer-facing instructions.
Technical artifacts
View 64 generated JSON, CSV, report, and reproducibility artifacts on a separate technical page.
Benchmark Transparency
High scores on synthetic cases are useful only when the benchmark mix is visible. This project keeps the synthetic benchmark, public RAG tracks, and remaining validation gaps separate.
Current benchmark-quality labels
- Provider-backed embedding comparison is available as an optional credentialed run, but its results are not yet published.
- Real company data is intentionally excluded; synthetic and public benchmarks are reported separately.
Recommended next data work
- Compare local retrieval with a provider-backed embedding option.
- Keep synthetic and public benchmark data separated and reproducible.
Run Locally
The FastAPI service and dashboard are containerized so the full stack can be run from the public repository.
docker compose up --build
Then open http://localhost:8510 for the dashboard and http://localhost:8000/health for the API.
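To confirm from a script that the API is up, a small probe like the one below may help. It is a sketch that assumes the health endpoint answers with HTTP 200 when the service is ready; the response body is not inspected because its format is not documented here, and `is_healthy` is a name made up for this example.

```python
import urllib.error
import urllib.request


def is_healthy(
    url: str = "http://localhost:8000/health", timeout: float = 5.0
) -> bool:
    """Return True if the URL answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx status.
        return False


if __name__ == "__main__":
    print("API healthy" if is_healthy() else "API not reachable")
```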