Paper 4 · Historical replay

When documented AI failures meet a frozen non-compensatory gate set

Monorepo root commit: Not recorded in the public portfolio system_snapshot.json (v1.2, 2026-04-11T07:37:21Z) used for this binding. Not invented on this page.
Tier-0 shared-core commit (portfolio snapshot): cd9ad79fe16f34ad861bd6527670dcfbef8fe864
Paper 4 repository commit (released): 061744534c14872268bae3d596511a7ea0ec9081
Zenodo DOI: https://doi.org/10.5281/zenodo.19388835
Release version: v2.0.0 (portfolio release designation); CITATION.cff may list package version 1.0.0. Treat commit + DOI as authoritative if they diverge.
Page generated (UTC): 2026-04-12

Executive overview

Problem: Policy documents describe trustworthy AI principles, but teams still lack transparent, replayable rules for “stop vs go” that do not let one strong score mask a failing domain.

Why it matters: After high-profile system failures, leaders need to know whether a declared governance rule set would have blocked deployment ex ante, rather than relying on a narrative constructed after the fact. Historical replay answers that question for a fixed, pre-specified engine.

Core insight

A conjunctive, non-compensatory gate architecture changes which deployments pass compared with compensatory scoring—even when both use similar underlying measurements. The gap is empirically visible on real public failure cases, not only in toy examples.
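
To make the contrast concrete, here is a minimal sketch. It is not the repository's engine. The five gate names follow the domains named in this summary (safety, bias, calibration, traceability, evidence); the per-gate thresholds, the equal-weight mean used as the compensatory comparator, and the example scores are illustrative assumptions. Only the 0.50 compensatory threshold is taken from the traceability table (P4-C18).

```python
from statistics import mean

# Hypothetical per-gate minimum scores (assumed for illustration only).
GATE_THRESHOLDS = {
    "safety": 0.60,
    "bias": 0.50,
    "calibration": 0.50,
    "traceability": 0.50,
    "evidence": 0.40,
}
COMPENSATORY_THRESHOLD = 0.50  # threshold cited in P4-C18


def failed_gates(scores):
    """Return the gates whose score falls below its threshold."""
    return [g for g, t in GATE_THRESHOLDS.items() if scores[g] < t]


def non_compensatory_verdict(scores):
    """Conjunctive rule: every gate must pass; no offsetting."""
    return "REJECT" if failed_gates(scores) else "APPROVE"


def compensatory_verdict(scores):
    """Equal-weight mean (an assumed comparator): strong scores can offset weak ones."""
    avg = mean(scores[g] for g in GATE_THRESHOLDS)
    return "APPROVE" if avg >= COMPENSATORY_THRESHOLD else "REJECT"


# Hypothetical case: a weak safety score masked by strong scores elsewhere.
example = {"safety": 0.30, "bias": 0.80, "calibration": 0.45,
           "traceability": 0.70, "evidence": 0.60}

print(failed_gates(example))              # gates below threshold
print(non_compensatory_verdict(example))  # conjunctive verdict
print(compensatory_verdict(example))      # mean-based verdict
```

In this invented example the mean score is 0.57, so the compensatory comparator approves, while the conjunctive rule rejects because the safety and calibration gates both fail.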

What was done

Researchers encoded twelve well-documented failure cases (2014–2021) across seven sectors using a structured supply-chain-style feature model with explicit provenance classes (direct, derived, imputed, uncertain). A versioned overlay reconciles “core-equivalent” legacy rows to expanded evidence. A frozen five-gate engine (four threshold profiles, including “moderate”) evaluates each case; a parallel compensatory comparator scores the same evidence with offsetting allowed.

The study adds regulatory-cleared control devices, expanded benchmark tiers, Monte Carlo bands on confidence, perturbation sweeps, dual-schema invariance checks, and single- versus pairwise-gate ablations to test whether rejections are fragile artefacts of one gate.
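
The Monte Carlo stability check can be pictured with the following sketch, which is an assumption-laden illustration and not the study's code. The 200 iterations follow P4-C20; the seed value, the uniform sampling within each [low, high] band, the two-gate engine, the thresholds, and the three cases are all hypothetical.

```python
import random

# Hypothetical thresholds and confidence bands (not values from the study).
THRESHOLDS = {"safety": 0.60, "bias": 0.50}
CASES = {
    "case_a": {"safety": (0.20, 0.40), "bias": (0.55, 0.75)},  # safety band below threshold
    "case_b": {"safety": (0.70, 0.90), "bias": (0.60, 0.80)},  # both bands above threshold
    "case_c": {"safety": (0.55, 0.65), "bias": (0.60, 0.80)},  # safety band straddles threshold
}


def verdict(scores):
    """Conjunctive rule: reject if any gate score is below its threshold."""
    return "REJECT" if any(scores[g] < t for g, t in THRESHOLDS.items()) else "APPROVE"


def monte_carlo_verdicts(bands, iterations=200, seed=42):
    """Sample each gate score uniformly within its band; return the set of verdicts seen."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(iterations):
        sample = {g: rng.uniform(lo, hi) for g, (lo, hi) in bands.items()}
        seen.add(verdict(sample))
    return seen


for name, bands in CASES.items():
    seen = monte_carlo_verdicts(bands)
    print(name, sorted(seen), "stable" if len(seen) == 1 else "unstable")
```

A case counts as outcome-stable in this sketch when every sampled draw yields the same verdict; only a band that straddles a threshold, as in the invented case_c, produces mixed verdicts.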

What was found

Under the moderate profile on the twelve core failures, eleven are rejected and one is a false negative (narrow safety margin). All twelve control devices are approved, supporting specificity. The safety gate binds most often; bias, calibration, and traceability failures appear repeatedly; evidence depth binds less often but matters in ablations.

Layer discipline: headline confusion-matrix statistics for the twelve-failure presentation do not match the single exported confusion-matrix file, which encodes a different evaluation layer (expanded failures plus controls). Both are documented; mixing them invalidates interpretation (P4-C10). The flip-point count in the sensitivity narrative also differs slightly between manuscript (8) and logs (7); the traceability row for P4-C21 preserves the engineer flag.
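
The two layers can be kept apart with a few lines of arithmetic. The counts below are those stated in the traceability row for P4-C10; the function and dictionary names are illustrative and do not come from the repository.

```python
def sensitivity(tp, fn):
    """Share of failures that the engine rejects."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Share of controls that the engine approves."""
    return tn / (tn + fp)


# Counts as stated in P4-C10; layer labels are descriptive only.
LAYERS = {
    "headline (12 failures + 12 controls)": {"tp": 11, "fn": 1, "tn": 12, "fp": 0},
    "confusion_matrix.csv (20 failures + 12 controls)": {"tp": 20, "fn": 0, "tn": 12, "fp": 0},
}

for name, c in LAYERS.items():
    print(f"{name}: sensitivity={sensitivity(c['tp'], c['fn']):.3f}, "
          f"specificity={specificity(c['tn'], c['fp']):.3f}")
```

The headline layer gives sensitivity 0.917 and specificity 1.000, while the exported file's layer gives 1.000 for both, which is why the two must not be read as the same matrix.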

Removing any one gate never flips a rejection, which under a conjunctive rule means every rejected case fails at least two gates; removing selected pairs does flip verdicts. Compensatory scoring agrees on most failures but would approve two notable cases that the non-compensatory engine rejects. Expanded benchmarks can show perfect separation under the stated encoding; this is a structural observation, not a claim of prospective predictive validation (P4-C25).
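
The ablation logic can be sketched as follows. This is a hypothetical illustration, not the repository's ablation code: the gate names follow this summary, while the two cases and their failing gates are invented.

```python
from itertools import combinations

GATES = ("safety", "bias", "calibration", "traceability", "evidence")

# Hypothetical rejected cases, each listed with the gates it fails.
FAILED = {
    "case_x": {"safety", "bias"},
    "case_y": {"safety", "calibration", "traceability"},
}


def verdict(failed, removed=()):
    """Conjunctive rule evaluated with the removed gates switched off."""
    return "REJECT" if set(failed) - set(removed) else "APPROVE"


def flipping_removals(failed, k):
    """All k-gate removals that turn a rejection into an approval."""
    return [combo for combo in combinations(GATES, k)
            if verdict(failed, combo) == "APPROVE"]


for name, failed in FAILED.items():
    print(name,
          "single:", flipping_removals(failed, 1),
          "pairs:", flipping_removals(failed, 2))
```

In this sketch no single removal flips either case; removing safety and bias together flips case_x, and case_y survives every pairwise removal because it fails three gates.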

Why this matters for regulation, safety, and deployment

  • Procurement and assurance: buyers can demand replayable gate evidence instead of narrative compliance alone.
  • Risk committees: conjunctive rules surface multi-domain weaknesses early; compensatory models may hide them.
  • Post-incident review: the same frozen engine can be applied consistently across cases for audit defensibility.

Limitations and ethics

Convenience sample; heterogeneous provenance; selection and survivorship bias; not representative of all jurisdictions or deployments. Retrospective computational study on public materials only—no human subjects. Assistive tools may support documentation; scientific judgments remain author-owned (P4-C35).

Audit posture: Independent QA (2026-04-12) passed pytest, full reproduce_all.py, and strict output validation for this pin. Traceability rows preserve engineer flags such as partially verified quantities where logs and manuscript differ (P4-C21).

Technical detail: notebook walkthrough (conceptual)

No repository code is reproduced here. In the repository, four notebooks execute in order:

01. Dataset intake: Builds inventories, provenance summaries, and dataset statistics that ground encoding claims (P4-C03–C06, P4-C19).
02. Replay execution: Runs the frozen engine on core, control, and expanded layers; records verdicts, compensation comparison, and assertions for headline metrics (P4-C07–C18, P4-C22–C24).
03. Metrics & calibration: Monte Carlo stability, sensitivity sweeps, invariance and perturbation regimes (P4-C20–C29).
04. Figures & tables: Publication-style charts for gate failures, ablations, provenance, and compensation (P4-C36–C39).


Full claim traceability (P4-C01–P4-C40)

One-to-one with docs/claim_traceability.md in the pinned repository. Status labels are textual (never colour-only).

Claim ID | Claim (paraphrase) | Code | Notebook | Output / path | Status
P4-C01 | Institutional frameworks (NIST AI RMF, EU AI Act, ISO/IEC 23894) guide process but do not operationalise deterministic threshold gate logic. | N/A (background) | 01 (narrative) | | narrative
P4-C02 | Historical replay applies a pre-specified five-gate non-compensatory engine; engine not modified for this analysis. | engine/corrected_public_engine_v1_1.py | 02 | outputs/logs/replay_run_log.txt | implemented
P4-C03 | 12 documented failure cases (2014–2021), seven sectors, convenience sample from strong documentary evidence. | data/canonical/canonical_dataset.json | 01 | outputs/tables/dataset_inventory.csv | implemented
P4-C04 | 64 independent documentary sources; 15 SCM features per case; rubric-based encoding with provenance classes. | canonical + EEE overlay | 01 | outputs/tables/dataset_inventory.csv | implemented
P4-C05 | Triangulation yields 57 triangulated and 123 passthrough features; timing: 4 pre-deployment vs 60 post-incident sources. | EEE / provenance JSON | 01 | notebook stdout / inventory | implemented
P4-C06 | 20 declared feature-dependency overlaps (consistent with dependency matrix). | provenance artefacts | 01 | docs/provenance.md | documented
P4-C07 | Five non-compensatory gates + four threshold profiles + parallel compensatory comparator. | corrected_public_engine_v1_1.py | 02–04 | outputs/tables/replay_results.csv | implemented
P4-C08 | Eight core-equivalent failures upgraded via EEE overlay (v1.2.0): 16 features upgraded, 8 imputations removed, 16 confidence uplifts; two cases LOW→MODERATE. | engine/eee_overlay_adapter.py, overlay JSON | 01–02 | config/core_equivalent_cases.json | implemented
P4-C09 | 12 FDA-cleared control devices for specificity (Extended Data 5). | benchmark cases | 02 | outputs/tables/replay_results.csv (layer rows) | implemented
P4-C10 | Manuscript headline 12 failures + 12 controls under moderate: TP=11, FN=1, TN=12, FP=0; sensitivity 0.917, specificity 1.000. Repo: Layer 1 (12 failures) matches 11/12 rejections and sole approve google_dr; 12 FDA controls all approve in notebook Layer 2 block. confusion_matrix.csv encodes Layer 2 only (20 failures + 12 controls): TP=20, FN=0, TN=12, FP=0; this is not the 12+12 headline matrix. | engine | 02 | notebooks/02_historical_replay_execution.ipynb (§2.2, §2.6); outputs/tables/confusion_matrix.csv (Layer 2) | implemented; VERIFIED (QA 2026-04-12): layer split / artefact scope
P4-C11 | Sole false negative: google_dr (narrow margins; safety margin 0.05). | engine | 02–03 | notebooks/02_historical_replay_execution.ipynb (assertions + stdout) | implemented; VERIFIED (QA 2026-04-12)
P4-C12 | Safety gate binding most often: 10/12 (83%) failures under moderate. | engine | 02, 04 | outputs/figures/figure1_gate_failure.png; notebook 02 stdout | implemented; VERIFIED (QA 2026-04-12)
P4-C13 | Bias gate 6/12; calibration and traceability each 5/12; evidence gate 3/12. | engine | 04 | figure1_gate_failure.png; notebook 02 stdout | implemented; VERIFIED (QA 2026-04-12)
P4-C14 | Rejected cases average 2.6 gate failures each (29 total across 11 rejections). | engine | 03–04 | notebook 02 stdout (mean 2.636, total 29) | implemented; VERIFIED (QA 2026-04-12) (mean rounded in manuscript)
P4-C15 | Ablation: removing any single gate does not flip a rejection to approval (12 cases). | engine | 04 | figure2_ablation.png, ablation_matrix.csv | implemented
P4-C16 | Pairwise ablation: safety+bias flips Optum, Gender Shades, UK A-levels; safety+calibration flips Google Flu & Uber AV; evidence+traceability flips Babylon. | engine | 04 | figure2_ablation.png | implemented
P4-C17 | Non-compensatory vs compensatory agree on 10/12; two divergences (google_flu, uber_av) where compensatory would approve. | engine | 02–04 | figure4_compensation.png; notebook 02 stdout / assertions | implemented; VERIFIED (QA 2026-04-12)
P4-C18 | Compensatory scores at divergence: Google Flu ~0.57 (threshold 0.50); Uber AV ~0.51 (threshold 0.50). | engine | 02 | notebook 02 stdout (0.5675, 0.5125) | implemented; VERIFIED (QA 2026-04-12)
P4-C19 | Provenance mix across 180 encodings: 27.2% direct, 37.8% rule-derived, 28.3% imputed, 6.7% uncertain; mean confidence 0.591. | canonical features | 01, 04 | figure3_provenance_stability.png | implemented
P4-C20 | Monte Carlo on [low, high] bands: 200 iterations, seed 42; 12/12 outcome-stable under moderate. | engine | 03 | outputs/figures/calibration_summary.txt; outputs/tables/metrics_summary.csv | implemented; VERIFIED (QA 2026-04-12)
P4-C21 | ±0.20 sensitivity: only google_dr shows flip points (8 across 4 features); 11 rejections robust. | engine | 03 | metrics_summary.csv; calibration_summary.txt | implemented; partially verified (QA 2026-04-12): flip-point count 7 in calibration_summary.txt / metrics vs manuscript 8
P4-C22 | Expanded benchmark 91 cases (61 failures, 30 controls): 100% sensitivity & specificity; no misclassifications under moderate. | benchmark dir + engine | 02 | replay_results.csv; outputs/logs/replay_run_log.txt | implemented; VERIFIED (QA 2026-04-12)
P4-C23 | Tier 2 (49 cases): lower mean confidence (~0.383), 4.0 mean gate failures, safety binding 100%. | benchmark metadata | 01–02 | dataset inventory / replay | implemented
P4-C24 | Tier 3 (30 FDA-authorised devices): all APPROVE under moderate (specificity controls). | benchmark | 02 | replay_results expanded rows | implemented
P4-C25 | Perfect separation on expanded set is a structural consequence of encoding + non-compensatory logic, not prospective validation. | N/A (interpretive) | 02 (markdown) | | narrative
P4-C26 | Dual dataset structural invariance: normalised public schema vs canonical → 480/480 field comparisons identical (12×4×10). | engine | 03 | outputs/tables/metrics_summary.csv (invariance,fields_matched,480) | implemented; VERIFIED (QA 2026-04-12)
P4-C27 | Replay vs canonical full mode: under moderate, zero verdict change on 12 cases. | engine modes | 03 | metrics_summary.csv (mode_sensitivity,moderate_divergences,0) | implemented; VERIFIED (QA 2026-04-12)
P4-C28 | Controlled ±0.05 perturbation on calibration/bias/traceability: 46/48 verdict-stable in replay; 2 flips permissive only (Epic Sepsis, Babylon). | data/canonical/perturbation_dataset.json | 03 | metrics_summary.csv (perturbation,verdicts_stable,46; verdict_flips,2) | implemented; VERIFIED (QA 2026-04-12)
P4-C29 | Moderate profile: zero verdict flips under that perturbation regime; compensatory Uber AV can flip near 0.50. | engine | 03 | metrics_summary.csv (moderate_flips,0); Uber AV score in notebook 02 | implemented; VERIFIED (QA 2026-04-12) (compensatory clause via 02)
P4-C30 | Three primary discriminative gates under expanded + core scopes: G1, G4, G5 (manuscript framing). | engine | 02–04 | gate failure charts | implemented
P4-C31 | Contribution: reproducible deterministic pipeline, historical replay methodology, empirical defence-in-depth via redundant gates. | N/A | all notebooks | reproduce_all.py | process
P4-C32 | Limitations: convenience sample, tiered provenance heterogeneity, selection/survivorship bias, not representative of all deployments. | N/A | 01–02 text | | narrative
P4-C33 | Ethics: retrospective computational study on public cases; no human subjects; no ethics approval required. | N/A | manuscript only | | narrative
P4-C34 | Data/code availability: Zenodo DOI placeholder; GitHub URLs cited (external). | N/A | | repro_manifest.json | inputs/external
P4-C35 | AI disclosure: Claude used for code/docs assistance; author retains scientific decisions. | N/A | manuscript only | | narrative
P4-C36 | Figure 1: gate failure counts (83% safety, 50% bias, 42% cal/trace, 25% evidence). | engine | 04 | outputs/figures/figure1_gate_failure.png; notebook 02 gate stdout | implemented; VERIFIED (QA 2026-04-12)
P4-C37 | Figure 2: single-gate removal stable; pairwise removals as specified. | engine | 04 | outputs/figures/figure2_ablation.png; outputs/tables/ablation_matrix.csv | implemented; VERIFIED (QA 2026-04-12) (artefacts present + pipeline PASS)
P4-C38 | Figure 3: provenance distribution + MC stability + ±0.20 sensitivity summary. | notebooks | 04 | outputs/figures/figure3_provenance_stability.png | implemented; VERIFIED (QA 2026-04-12) (figure + validation PASS)
P4-C39 | Figure 4: compensation effect highlighting google_flu and uber_av. | engine | 04 | outputs/figures/figure4_compensation.png | implemented; VERIFIED (QA 2026-04-12)
P4-C40 | PhysioNet / extended analyses referenced as Extended Data (not re-proven in-repo here). | external | 03 (refs) | docs/provenance.md | out-of-notebook scope
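
Several figures in the table can be cross-checked against each other with simple arithmetic. The sketch below uses only numbers stated in the rows above; the variable names are illustrative, and the per-class provenance counts it derives are implied by the percentages rather than stated in the source.

```python
# Case and feature counts from P4-C03 and P4-C04.
CASES, FEATURES_PER_CASE = 12, 15
encodings = CASES * FEATURES_PER_CASE

assert encodings == 180          # P4-C19: 180 encodings
assert 57 + 123 == encodings     # P4-C05: triangulated + passthrough features
assert 4 + 60 == 64              # P4-C05 source timing vs P4-C04 source count
assert 12 * 4 * 10 == 480        # P4-C26: field comparisons
assert 46 + 2 == 48              # P4-C28: stable + flipped verdicts

# P4-C19 percentages converted to implied counts (derived, not stated in the source).
shares = {"direct": 27.2, "rule-derived": 37.8, "imputed": 28.3, "uncertain": 6.7}
counts = {k: round(v / 100 * encodings) for k, v in shares.items()}
assert sum(counts.values()) == encodings

# P4-C12 and P4-C13 gate-failure counts should sum to the P4-C14 total.
gate_failures = {"safety": 10, "bias": 6, "calibration": 5,
                 "traceability": 5, "evidence": 3}
assert sum(gate_failures.values()) == 29

print("mean gate failures per rejection:", round(29 / 11, 3))
for gate, n in gate_failures.items():
    print(gate, f"{n / CASES:.0%}")
```

The per-gate percentages printed here match the Figure 1 summary in P4-C36 (83%, 50%, 42%, 42%, 25%), and 29/11 reproduces the 2.636 mean reported for P4-C14.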

For independent verification, run the automated test suite and the full reproduction driver from the repository root exactly as documented in that release’s README and traceability file. Executable commands are intentionally omitted from this public summary page.