Paper 4 · Historical replay

When documented AI failures meet a frozen non-compensatory gate set

Monorepo root commit: Not recorded in the public portfolio system_snapshot.json (v1.2, 2026-04-11T07:37:21Z) used for this binding. Not invented on this page.
Tier-0 shared-core commit (portfolio snapshot): cd9ad79fe16f34ad861bd6527670dcfbef8fe864
Paper 4 repository commit (released): 061744534c14872268bae3d596511a7ea0ec9081
Zenodo DOI: https://doi.org/10.5281/zenodo.19388835
Release version: v2.0.0 (portfolio release designation); CITATION.cff may list package version 1.0.0 — treat commit + DOI as authoritative if they diverge.
Page generated (UTC): 2026-04-12

Executive overview

Problem: Policy documents describe trustworthy AI principles, but teams still lack transparent, replayable rules for “stop vs go” that do not let one strong score mask a failing domain.

Why it matters: After high-profile system failures, leaders need to know whether a declared governance rule set would have blocked deployment ex ante—not a narrative after the fact. Historical replay answers that question for a fixed, pre-specified engine.

Core insight

A conjunctive, non-compensatory gate architecture changes which deployments pass compared with compensatory scoring—even when both use similar underlying measurements. The gap is empirically visible on real public failure cases, not only in toy examples.

What was done

Researchers encoded twelve well-documented failure cases (2014–2021) across seven sectors using a structured supply-chain-style feature model with explicit provenance classes (direct, derived, imputed, uncertain). A versioned overlay reconciles “core-equivalent” legacy rows to expanded evidence. A frozen five-gate engine (four threshold profiles, including “moderate”) evaluates each case; a parallel compensatory comparator scores the same evidence with offsetting allowed.

The study adds regulatory-cleared control devices, expanded benchmark tiers, Monte Carlo bands on confidence, perturbation sweeps, dual-schema invariance checks, and single- versus pairwise-gate ablations to test whether rejections are fragile artefacts of one gate.

What was found

Under the moderate profile on the twelve core failures, eleven are rejected and one is a false negative (narrow safety margin). All twelve control devices approve, supporting specificity. The safety gate binds most often; bias, calibration, and traceability failures appear repeatedly; evidence depth binds less often but matters in ablations.

Layer discipline: headline confusion-matrix statistics for the twelve-failure presentation do not match the single exported confusion-matrix file, which encodes a different evaluation layer (expanded failures plus controls). Both are documented; mixing them invalidates interpretation (P4-C10). One quantity in the sensitivity narrative shows a small manuscript–log discrepancy; the traceability row for P4-C21 preserves the engineer flag.

Removing any one gate never flips a rejection; selected pairs do—showing conjunctive structure. Compensatory scoring agrees on most failures but would approve two notable cases the non-compensatory engine rejects. Expanded benchmarks can show perfect separation under the stated encoding—a structural observation, not a claim of prospective predictive validation (P4-C25).

Why this matters for regulation, safety, and deployment

Procurement and assurance: buyers can demand replayable gate evidence instead of narrative compliance alone.
Risk committees: conjunctive rules surface multi-domain weaknesses early; compensatory models may hide them.
Post-incident review: the same frozen engine can be applied consistently across cases for audit defensibility.

Limitations and ethics

Convenience sample; heterogeneous provenance; selection and survivorship bias; not representative of all jurisdictions or deployments. Retrospective computational study on public materials only—no human subjects. Assistive tools may support documentation; scientific judgments remain author-owned (P4-C35).

Audit posture: Independent QA (2026-04-12) passed pytest, full reproduce_all.py, and strict output validation for this pin. Traceability rows preserve engineer flags such as partially verified quantities where logs and manuscript differ (P4-C21).

View technical detail — notebook walkthrough (conceptual)

No code is shown here. In the repository, four notebooks execute in order:

01 — Dataset intake Builds inventories, provenance summaries, and dataset statistics that ground encoding claims (P4-C03–C06, P4-C19).

02 — Replay execution Runs the frozen engine on core, control, and expanded layers; records verdicts, compensation comparison, and assertions for headline metrics (P4-C07–C18, P4-C22–C24).

03 — Metrics & calibration Monte Carlo stability, sensitivity sweeps, invariance and perturbation regimes (P4-C20–C29).

04 — Figures & tables Publication-style charts for gate failures, ablations, provenance, and compensation (P4-C36–C39).

Closing the box does not hide any conclusion above; it adds methodological depth for technical readers.

Full claim traceability (P4-C01–P4-C40)

One-to-one with docs/claim_traceability.md in the pinned repository. Status labels are textual (never colour-only).

Claim ID	Claim (paraphrase)	Code	Notebook	Output / path	Status
P4-C01	Institutional frameworks (NIST AI RMF, EU AI Act, ISO/IEC 23894) guide process but do not operationalise deterministic threshold gate logic.	N/A (background)	01 (narrative)	—	narrative
P4-C02	Historical replay applies a pre-specified five-gate non-compensatory engine; engine not modified for this analysis.	`engine/corrected_public_engine_v1_1.py`	02	`outputs/logs/replay_run_log.txt`	implemented
P4-C03	12 documented failure cases (2014–2021), seven sectors, convenience sample from strong documentary evidence.	`data/canonical/canonical_dataset.json`	01	`outputs/tables/dataset_inventory.csv`	implemented
P4-C04	64 independent documentary sources; 15 SCM features per case; rubric-based encoding with provenance classes.	canonical + EEE overlay	01	`outputs/tables/dataset_inventory.csv`	implemented
P4-C05	Triangulation yields 57 triangulated and 123 passthrough features; timing: 4 pre-deployment vs 60 post-incident sources.	EEE / provenance JSON	01	notebook stdout / inventory	implemented
P4-C06	20 declared feature-dependency overlaps (consistent with dependency matrix).	provenance artefacts	01	`docs/provenance.md`	documented
P4-C07	Five non-compensatory gates + four threshold profiles + parallel compensatory comparator.	`corrected_public_engine_v1_1.py`	02–04	`outputs/tables/replay_results.csv`	implemented
P4-C08	Eight core-equivalent failures upgraded via EEE overlay (v1.2.0): 16 features upgraded, 8 imputations removed, 16 confidence uplifts; two cases LOW→MODERATE.	`engine/eee_overlay_adapter.py`, overlay JSON	01–02	`config/core_equivalent_cases.json`	implemented
P4-C09	12 FDA-cleared control devices for specificity (Extended Data 5).	benchmark cases	02	`outputs/tables/replay_results.csv` (layer rows)	implemented
P4-C10	Manuscript headline 12 failures + 12 controls under moderate: TP=11, FN=1, TN=12, FP=0; sensitivity 0.917, specificity 1.000. Repo: Layer 1 (12 failures) matches 11/12 rejections and sole approve google_dr; 12 FDA controls all approve in notebook Layer 2 block. `confusion_matrix.csv` encodes Layer 2 only (20 failures + 12 controls): TP=20, FN=0, TN=12, FP=0 — not the 12+12 headline matrix.	engine	02	`notebooks/02_historical_replay_execution.ipynb` (§2.2, §2.6); `outputs/tables/confusion_matrix.csv` (Layer 2)	implemented; VERIFIED (QA 2026-04-12) — layer split / artefact scope
P4-C11	Sole false negative: google_dr (narrow margins; safety margin 0.05).	engine	02–03	`notebooks/02_historical_replay_execution.ipynb` (assertions + stdout)	implemented; VERIFIED (QA 2026-04-12)
P4-C12	Safety gate binding most often: 10/12 (83%) failures under moderate.	engine	02, 04	`outputs/figures/figure1_gate_failure.png`; notebook 02 stdout	implemented; VERIFIED (QA 2026-04-12)
P4-C13	Bias gate 6/12; calibration and traceability each 5/12; evidence gate 3/12.	engine	04	`figure1_gate_failure.png`; notebook 02 stdout	implemented; VERIFIED (QA 2026-04-12)
P4-C14	Rejected cases average 2.6 gate failures each (29 total across 11 rejections).	engine	03–04	notebook 02 stdout (mean 2.636, total 29)	implemented; VERIFIED (QA 2026-04-12) (mean rounded in manuscript)
P4-C15	Ablation: removing any single gate does not flip a rejection to approval (12 cases).	engine	04	`figure2_ablation.png`, `ablation_matrix.csv`	implemented
P4-C16	Pairwise ablation: safety+bias flips Optum, Gender Shades, UK A-levels; safety+calibration flips Google Flu & Uber AV; evidence+traceability flips Babylon.	engine	04	`figure2_ablation.png`	implemented
P4-C17	Non-compensatory vs compensatory agree on 10/12; two divergences (google_flu, uber_av) where compensatory would approve.	engine	02–04	`figure4_compensation.png`; notebook 02 stdout / assertions	implemented; VERIFIED (QA 2026-04-12)
P4-C18	Compensatory scores at divergence: Google Flu ~0.57 (threshold 0.50); Uber AV ~0.51 (threshold 0.50).	engine	02	notebook 02 stdout (0.5675, 0.5125)	implemented; VERIFIED (QA 2026-04-12)
P4-C19	Provenance mix across 180 encodings: 27.2% direct, 37.8% rule-derived, 28.3% imputed, 6.7% uncertain; mean confidence 0.591.	canonical features	01, 04	`figure3_provenance_stability.png`	implemented
P4-C20	Monte Carlo on [low, high] bands: 200 iterations, seed 42 → 12/12 outcome-stable under moderate.	engine	03	`outputs/figures/calibration_summary.txt`; `outputs/tables/metrics_summary.csv`	implemented; VERIFIED (QA 2026-04-12)
P4-C21	±0.20 sensitivity: only google_dr shows flip points (8 across 4 features); 11 rejections robust.	engine	03	`metrics_summary.csv`; `calibration_summary.txt`	implemented; partially verified (QA 2026-04-12) — flip-point count 7 in `calibration_summary.txt` / metrics vs manuscript 8
P4-C22	Expanded benchmark 91 cases (61 failures, 30 controls): 100% sensitivity & specificity; no misclassifications under moderate.	benchmark dir + engine	02	`replay_results.csv`; `outputs/logs/replay_run_log.txt`	implemented; VERIFIED (QA 2026-04-12)
P4-C23	Tier 2 (49 cases): lower mean confidence (~0.383), 4.0 mean gate failures, safety binding 100%.	benchmark metadata	01–02	dataset inventory / replay	implemented
P4-C24	Tier 3 (30 FDA-authorised devices): all APPROVE under moderate (specificity controls).	benchmark	02	replay_results expanded rows	implemented
P4-C25	Perfect separation on expanded set is a structural consequence of encoding + non-compensatory logic, not prospective validation.	N/A (interpretive)	02 (markdown)	—	narrative
P4-C26	Dual dataset structural invariance: normalised public schema vs canonical → 480/480 field comparisons identical (12×4×10).	engine	03	`outputs/tables/metrics_summary.csv` (`invariance,fields_matched,480`)	implemented; VERIFIED (QA 2026-04-12)
P4-C27	Replay vs canonical full mode: under moderate, zero verdict change on 12 cases.	engine modes	03	`metrics_summary.csv` (`mode_sensitivity,moderate_divergences,0`)	implemented; VERIFIED (QA 2026-04-12)
P4-C28	Controlled ±0.05 perturbation on calibration/bias/traceability: 46/48 verdict-stable in replay; 2 flips permissive only (Epic Sepsis, Babylon).	`data/canonical/perturbation_dataset.json`	03	`metrics_summary.csv` (`perturbation,verdicts_stable,46`; `verdict_flips,2`)	implemented; VERIFIED (QA 2026-04-12)
P4-C29	Moderate profile: zero verdict flips under that perturbation regime; compensatory Uber AV can flip near 0.50.	engine	03	`metrics_summary.csv` (`moderate_flips,0`); Uber AV score in notebook 02	implemented; VERIFIED (QA 2026-04-12) (compensatory clause via 02)
P4-C30	Three primary discriminative gates under expanded + core scopes: G1, G4, G5 (manuscript framing).	engine	02–04	gate failure charts	implemented
P4-C31	Contribution: reproducible deterministic pipeline, historical replay methodology, empirical defence-in-depth via redundant gates.	N/A	all notebooks	`reproduce_all.py`	process
P4-C32	Limitations: convenience sample, tiered provenance heterogeneity, selection/survivorship bias, not representative of all deployments.	N/A	01–02 text	—	narrative
P4-C33	Ethics: retrospective computational study on public cases; no human subjects; no ethics approval required.	N/A	—	manuscript only	narrative
P4-C34	Data/code availability: Zenodo DOI placeholder; GitHub URLs cited (external).	N/A	`repro_manifest.json`	inputs/	external
P4-C35	AI disclosure: Claude used for code/docs assistance; author retains scientific decisions.	N/A	—	manuscript only	narrative
P4-C36	Figure 1: gate failure counts (83% safety, 50% bias, 42% cal/trace, 25% evidence).	engine	04	`outputs/figures/figure1_gate_failure.png`; notebook 02 gate stdout	implemented; VERIFIED (QA 2026-04-12)
P4-C37	Figure 2: single-gate removal stable; pairwise removals as specified.	engine	04	`outputs/figures/figure2_ablation.png`; `outputs/tables/ablation_matrix.csv`	implemented; VERIFIED (QA 2026-04-12) (artefacts present + pipeline PASS)
P4-C38	Figure 3: provenance distribution + MC stability + ±0.20 sensitivity summary.	notebooks	04	`outputs/figures/figure3_provenance_stability.png`	implemented; VERIFIED (QA 2026-04-12) (figure + validation PASS)
P4-C39	Figure 4: compensation effect highlighting google_flu and uber_av.	engine	04	`outputs/figures/figure4_compensation.png`	implemented; VERIFIED (QA 2026-04-12)
P4-C40	PhysioNet / extended analyses referenced as Extended Data (not re-proven in-repo here).	external	03 (refs)	`docs/provenance.md`	out-of-notebook scope

For independent verification, run the automated test suite and the full reproduction driver from the repository root exactly as documented in that release’s README and traceability file. Executable commands are intentionally omitted from this public summary page.