Paper 4 · Historical replay
When documented AI failures meet a frozen non-compensatory gate set
- Monorepo root commit: not recorded in the public portfolio system_snapshot.json (v1.2, 2026-04-11T07:37:21Z) used for this binding. Not invented on this page.
- Tier-0 shared-core commit (portfolio snapshot): cd9ad79fe16f34ad861bd6527670dcfbef8fe864
- Paper 4 repository commit (released): 061744534c14872268bae3d596511a7ea0ec9081
- Zenodo DOI: https://doi.org/10.5281/zenodo.19388835
- Release version: v2.0.0 (portfolio release designation); CITATION.cff may list package version 1.0.0. Treat commit + DOI as authoritative if they diverge.
- Page generated (UTC): 2026-04-12
Executive overview
Problem: Policy documents describe trustworthy AI principles, but teams still lack transparent, replayable rules for “stop vs go” that do not let one strong score mask a failing domain.
Why it matters: After high-profile system failures, leaders need to know whether a declared governance rule set would have blocked deployment ex ante, rather than relying on a narrative constructed after the fact. Historical replay answers that question for a fixed, pre-specified engine.
Core insight
A conjunctive, non-compensatory gate architecture changes which deployments pass compared with compensatory scoring—even when both use similar underlying measurements. The gap is empirically visible on real public failure cases, not only in toy examples.
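The contrast between the two decision rules can be sketched in a few lines of Python. This is a minimal illustration, not the frozen engine: the per-gate thresholds, the example scores, and the use of an unweighted mean for the compensatory comparator are all assumptions made for this sketch. Only the five gate names and the 0.50 compensatory cut-off come from the traceability table (P4-C13, P4-C18).

```python
# Hypothetical per-gate thresholds, for illustration only; the frozen
# engine's real thresholds are not stated on this page.
THRESHOLDS = {
    "safety": 0.60,
    "bias": 0.60,
    "calibration": 0.60,
    "traceability": 0.60,
    "evidence": 0.60,
}
COMPENSATORY_THRESHOLD = 0.50  # cut-off quoted in P4-C18


def failed_gates(scores: dict[str, float]) -> list[str]:
    """Return every gate whose score falls below its threshold."""
    return [gate for gate, limit in THRESHOLDS.items() if scores[gate] < limit]


def conjunctive_verdict(scores: dict[str, float]) -> str:
    """Non-compensatory rule: a single failing gate rejects the case."""
    return "REJECT" if failed_gates(scores) else "APPROVE"


def compensatory_verdict(scores: dict[str, float]) -> str:
    """Compensatory rule (assumed here to be an unweighted mean)."""
    mean = sum(scores[gate] for gate in THRESHOLDS) / len(THRESHOLDS)
    return "APPROVE" if mean >= COMPENSATORY_THRESHOLD else "REJECT"


# Invented scores: strong in four domains, weak on safety.
case = {
    "safety": 0.30,
    "bias": 0.90,
    "calibration": 0.85,
    "traceability": 0.80,
    "evidence": 0.75,
}

print(failed_gates(case))          # which gates bind
print(conjunctive_verdict(case))   # safety alone is enough to reject
print(compensatory_verdict(case))  # strong domains offset the weak one
```

In this invented case the mean score is 0.72, so the compensatory rule approves, while the conjunctive rule rejects because the safety score sits below its threshold.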
What was done
Researchers encoded twelve well-documented failure cases (2014–2021) across seven sectors using a structured supply-chain-style feature model with explicit provenance classes (direct, derived, imputed, uncertain). A versioned overlay reconciles “core-equivalent” legacy rows to expanded evidence. A frozen five-gate engine (four threshold profiles, including “moderate”) evaluates each case; a parallel compensatory comparator scores the same evidence with offsetting allowed.
The study adds regulatory-cleared control devices, expanded benchmark tiers, Monte Carlo bands on confidence, perturbation sweeps, dual-schema invariance checks, and single- versus pairwise-gate ablations to test whether rejections are fragile artefacts of one gate.
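A Monte Carlo stability check of the kind mentioned above might look like the following sketch. The iteration count (200) and seed (42) are those reported in P4-C20; everything else is an assumption for illustration, including uniform sampling within each [low, high] band, the thresholds, the band values, and the function names.

```python
import random


def verdict(values: dict[str, float], thresholds: dict[str, float]) -> str:
    """Conjunctive rule: reject if any gate value is below its threshold."""
    failing = [g for g, limit in thresholds.items() if values[g] < limit]
    return "REJECT" if failing else "APPROVE"


def monte_carlo_stability(bands, thresholds, iterations=200, seed=42):
    """Resample each feature inside its band and report how often the
    verdict matches the verdict at the band midpoints."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    midpoint = {g: (lo + hi) / 2 for g, (lo, hi) in bands.items()}
    baseline = verdict(midpoint, thresholds)
    stable = 0
    for _ in range(iterations):
        draw = {g: rng.uniform(lo, hi) for g, (lo, hi) in bands.items()}
        if verdict(draw, thresholds) == baseline:
            stable += 1
    return baseline, stable / iterations


# Hypothetical thresholds and confidence bands for one invented case.
thresholds = {"safety": 0.60, "bias": 0.60, "calibration": 0.60}
bands = {
    "safety": (0.20, 0.40),       # whole band lies below the threshold
    "bias": (0.70, 0.90),
    "calibration": (0.65, 0.85),
}

baseline, stable_fraction = monte_carlo_stability(bands, thresholds)
print(baseline, stable_fraction)
```

Because the entire safety band lies below the assumed threshold, every draw rejects and the case counts as outcome-stable. A case whose band straddles a threshold would show a stable fraction below 1.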
What was found
Under the moderate profile, eleven of the twelve core failures are rejected and one (google_dr) is a false negative with a narrow safety margin. All twelve control devices are approved, supporting specificity. The safety gate binds most often (10 of 12 failures); bias (6), calibration (5), and traceability (5) failures appear repeatedly; the evidence-depth gate binds least often (3) but matters in ablations.
Layer discipline: the headline confusion-matrix statistics for the twelve-failure presentation (TP=11, FN=1, TN=12, FP=0) do not match the single exported confusion-matrix file, which encodes a different evaluation layer (20 expanded failures plus 12 controls). Both are documented; mixing them invalidates interpretation (P4-C10). The flip-point count in the ±0.20 sensitivity analysis differs slightly between the manuscript (8) and the logs (7); the traceability row for P4-C21 preserves the engineer flag.
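The two layers give different rates from the same formulas, which is why they must not be mixed. The short calculation below uses the counts reported for P4-C10 (headline: TP=11, FN=1, TN=12, FP=0; exported Layer 2 file: TP=20, FN=0, TN=12, FP=0). The helper name `rates` is introduced only for this illustration.

```python
def rates(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float]:
    """Return (sensitivity, specificity), each rounded to three decimals."""
    sensitivity = tp / (tp + fn)  # share of failures that were rejected
    specificity = tn / (tn + fp)  # share of controls that were approved
    return round(sensitivity, 3), round(specificity, 3)


headline = rates(tp=11, fn=1, tn=12, fp=0)  # 12 core failures + 12 controls
layer_2 = rates(tp=20, fn=0, tn=12, fp=0)   # 20 failures + 12 controls

print(headline)  # (0.917, 1.0)
print(layer_2)   # (1.0, 1.0)
```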
Removing any one gate never flips a rejection; selected pairs do—showing conjunctive structure. Compensatory scoring agrees on most failures but would approve two notable cases the non-compensatory engine rejects. Expanded benchmarks can show perfect separation under the stated encoding—a structural observation, not a claim of prospective predictive validation (P4-C25).
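The ablation logic can be sketched as follows. The gate names come from the traceability table; the function names and the example case are hypothetical. The case is assumed to fail exactly the safety and calibration gates, the pair whose joint removal is reported in P4-C16 to flip Google Flu and Uber AV.

```python
from itertools import combinations

GATES = ("safety", "bias", "calibration", "traceability", "evidence")


def verdict(failed: set[str], removed: tuple[str, ...] = ()) -> str:
    """Conjunctive verdict after ignoring the removed (ablated) gates."""
    still_failing = set(failed) - set(removed)
    return "REJECT" if still_failing else "APPROVE"


def flipping_removals(failed: set[str], k: int) -> list[tuple[str, ...]]:
    """List every k-gate removal that turns a rejection into an approval."""
    return [
        combo
        for combo in combinations(GATES, k)
        if verdict(failed, combo) == "APPROVE"
    ]


# Hypothetical case failing exactly two gates.
failed = {"safety", "calibration"}

single = flipping_removals(failed, 1)
pairwise = flipping_removals(failed, 2)
print(single)    # []
print(pairwise)  # [('safety', 'calibration')]
```

A case that fails two or more gates survives any single-gate removal, which is the redundancy the single-gate ablation result reflects; only removing every failing gate at once produces an approval.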
Why this matters for regulation, safety, and deployment
- Procurement and assurance: buyers can demand replayable gate evidence instead of narrative compliance alone.
- Risk committees: conjunctive rules surface multi-domain weaknesses early; compensatory models may hide them.
- Post-incident review: the same frozen engine can be applied consistently across cases for audit defensibility.
Limitations and ethics
Convenience sample; heterogeneous provenance; selection and survivorship bias; not representative of all jurisdictions or deployments. Retrospective computational study on public materials only—no human subjects. Assistive tools may support documentation; scientific judgments remain author-owned (P4-C35).
Audit posture: Independent QA (2026-04-12) passed pytest, full reproduce_all.py, and strict output validation for this pin. Traceability rows preserve engineer flags such as partially verified quantities where logs and manuscript differ (P4-C21).
Technical detail: notebook walkthrough (conceptual)
No repository code is shown here. In the repository, four notebooks (numbered 01 to 04 in the traceability table below) execute in order. This detail adds methodological depth for technical readers and does not change any conclusion above.
Full claim traceability (P4-C01–P4-C40)
One-to-one with docs/claim_traceability.md in the pinned repository. Status labels are textual (never colour-only).
| Claim ID | Claim (paraphrase) | Code | Notebook | Output / path | Status |
|---|---|---|---|---|---|
| P4-C01 | Institutional frameworks (NIST AI RMF, EU AI Act, ISO/IEC 23894) guide process but do not operationalise deterministic threshold gate logic. | N/A (background) | 01 (narrative) | — | narrative |
| P4-C02 | Historical replay applies a pre-specified five-gate non-compensatory engine; engine not modified for this analysis. | engine/corrected_public_engine_v1_1.py | 02 | outputs/logs/replay_run_log.txt | implemented |
| P4-C03 | 12 documented failure cases (2014–2021), seven sectors, convenience sample from strong documentary evidence. | data/canonical/canonical_dataset.json | 01 | outputs/tables/dataset_inventory.csv | implemented |
| P4-C04 | 64 independent documentary sources; 15 SCM features per case; rubric-based encoding with provenance classes. | canonical + EEE overlay | 01 | outputs/tables/dataset_inventory.csv | implemented |
| P4-C05 | Triangulation yields 57 triangulated and 123 passthrough features; timing: 4 pre-deployment vs 60 post-incident sources. | EEE / provenance JSON | 01 | notebook stdout / inventory | implemented |
| P4-C06 | 20 declared feature-dependency overlaps (consistent with dependency matrix). | provenance artefacts | 01 | docs/provenance.md | documented |
| P4-C07 | Five non-compensatory gates + four threshold profiles + parallel compensatory comparator. | corrected_public_engine_v1_1.py | 02–04 | outputs/tables/replay_results.csv | implemented |
| P4-C08 | Eight core-equivalent failures upgraded via EEE overlay (v1.2.0): 16 features upgraded, 8 imputations removed, 16 confidence uplifts; two cases LOW→MODERATE. | engine/eee_overlay_adapter.py, overlay JSON | 01–02 | config/core_equivalent_cases.json | implemented |
| P4-C09 | 12 FDA-cleared control devices for specificity (Extended Data 5). | benchmark cases | 02 | outputs/tables/replay_results.csv (layer rows) | implemented |
| P4-C10 | Manuscript headline 12 failures + 12 controls under moderate: TP=11, FN=1, TN=12, FP=0; sensitivity 0.917, specificity 1.000. Repo: Layer 1 (12 failures) matches 11/12 rejections and sole approve google_dr; 12 FDA controls all approve in notebook Layer 2 block. confusion_matrix.csv encodes Layer 2 only (20 failures + 12 controls): TP=20, FN=0, TN=12, FP=0 — not the 12+12 headline matrix. | engine | 02 | notebooks/02_historical_replay_execution.ipynb (§2.2, §2.6); outputs/tables/confusion_matrix.csv (Layer 2) | implemented; VERIFIED (QA 2026-04-12) — layer split / artefact scope |
| P4-C11 | Sole false negative: google_dr (narrow margins; safety margin 0.05). | engine | 02–03 | notebooks/02_historical_replay_execution.ipynb (assertions + stdout) | implemented; VERIFIED (QA 2026-04-12) |
| P4-C12 | Safety gate binding most often: 10/12 (83%) failures under moderate. | engine | 02, 04 | outputs/figures/figure1_gate_failure.png; notebook 02 stdout | implemented; VERIFIED (QA 2026-04-12) |
| P4-C13 | Bias gate 6/12; calibration and traceability each 5/12; evidence gate 3/12. | engine | 04 | figure1_gate_failure.png; notebook 02 stdout | implemented; VERIFIED (QA 2026-04-12) |
| P4-C14 | Rejected cases average 2.6 gate failures each (29 total across 11 rejections). | engine | 03–04 | notebook 02 stdout (mean 2.636, total 29) | implemented; VERIFIED (QA 2026-04-12) (mean rounded in manuscript) |
| P4-C15 | Ablation: removing any single gate does not flip a rejection to approval (12 cases). | engine | 04 | figure2_ablation.png, ablation_matrix.csv | implemented |
| P4-C16 | Pairwise ablation: safety+bias flips Optum, Gender Shades, UK A-levels; safety+calibration flips Google Flu & Uber AV; evidence+traceability flips Babylon. | engine | 04 | figure2_ablation.png | implemented |
| P4-C17 | Non-compensatory vs compensatory agree on 10/12; two divergences (google_flu, uber_av) where compensatory would approve. | engine | 02–04 | figure4_compensation.png; notebook 02 stdout / assertions | implemented; VERIFIED (QA 2026-04-12) |
| P4-C18 | Compensatory scores at divergence: Google Flu ~0.57 (threshold 0.50); Uber AV ~0.51 (threshold 0.50). | engine | 02 | notebook 02 stdout (0.5675, 0.5125) | implemented; VERIFIED (QA 2026-04-12) |
| P4-C19 | Provenance mix across 180 encodings: 27.2% direct, 37.8% rule-derived, 28.3% imputed, 6.7% uncertain; mean confidence 0.591. | canonical features | 01, 04 | figure3_provenance_stability.png | implemented |
| P4-C20 | Monte Carlo on [low, high] bands: 200 iterations, seed 42 → 12/12 outcome-stable under moderate. | engine | 03 | outputs/figures/calibration_summary.txt; outputs/tables/metrics_summary.csv | implemented; VERIFIED (QA 2026-04-12) |
| P4-C21 | ±0.20 sensitivity: only google_dr shows flip points (8 across 4 features); 11 rejections robust. | engine | 03 | metrics_summary.csv; calibration_summary.txt | implemented; partially verified (QA 2026-04-12) — flip-point count 7 in calibration_summary.txt / metrics vs manuscript 8 |
| P4-C22 | Expanded benchmark 91 cases (61 failures, 30 controls): 100% sensitivity & specificity; no misclassifications under moderate. | benchmark dir + engine | 02 | replay_results.csv; outputs/logs/replay_run_log.txt | implemented; VERIFIED (QA 2026-04-12) |
| P4-C23 | Tier 2 (49 cases): lower mean confidence (~0.383), 4.0 mean gate failures, safety binding 100%. | benchmark metadata | 01–02 | dataset inventory / replay | implemented |
| P4-C24 | Tier 3 (30 FDA-authorised devices): all APPROVE under moderate (specificity controls). | benchmark | 02 | replay_results expanded rows | implemented |
| P4-C25 | Perfect separation on expanded set is a structural consequence of encoding + non-compensatory logic, not prospective validation. | N/A (interpretive) | 02 (markdown) | — | narrative |
| P4-C26 | Dual dataset structural invariance: normalised public schema vs canonical → 480/480 field comparisons identical (12×4×10). | engine | 03 | outputs/tables/metrics_summary.csv (invariance,fields_matched,480) | implemented; VERIFIED (QA 2026-04-12) |
| P4-C27 | Replay vs canonical full mode: under moderate, zero verdict change on 12 cases. | engine modes | 03 | metrics_summary.csv (mode_sensitivity,moderate_divergences,0) | implemented; VERIFIED (QA 2026-04-12) |
| P4-C28 | Controlled ±0.05 perturbation on calibration/bias/traceability: 46/48 verdict-stable in replay; 2 flips permissive only (Epic Sepsis, Babylon). | data/canonical/perturbation_dataset.json | 03 | metrics_summary.csv (perturbation,verdicts_stable,46; verdict_flips,2) | implemented; VERIFIED (QA 2026-04-12) |
| P4-C29 | Moderate profile: zero verdict flips under that perturbation regime; compensatory Uber AV can flip near 0.50. | engine | 03 | metrics_summary.csv (moderate_flips,0); Uber AV score in notebook 02 | implemented; VERIFIED (QA 2026-04-12) (compensatory clause via 02) |
| P4-C30 | Three primary discriminative gates under expanded + core scopes: G1, G4, G5 (manuscript framing). | engine | 02–04 | gate failure charts | implemented |
| P4-C31 | Contribution: reproducible deterministic pipeline, historical replay methodology, empirical defence-in-depth via redundant gates. | N/A | all notebooks | reproduce_all.py | process |
| P4-C32 | Limitations: convenience sample, tiered provenance heterogeneity, selection/survivorship bias, not representative of all deployments. | N/A | 01–02 text | — | narrative |
| P4-C33 | Ethics: retrospective computational study on public cases; no human subjects; no ethics approval required. | N/A | — | manuscript only | narrative |
| P4-C34 | Data/code availability: Zenodo DOI placeholder; GitHub URLs cited (external). | N/A | repro_manifest.json | inputs/ | external |
| P4-C35 | AI disclosure: Claude used for code/docs assistance; author retains scientific decisions. | N/A | — | manuscript only | narrative |
| P4-C36 | Figure 1: gate failure counts (83% safety, 50% bias, 42% cal/trace, 25% evidence). | engine | 04 | outputs/figures/figure1_gate_failure.png; notebook 02 gate stdout | implemented; VERIFIED (QA 2026-04-12) |
| P4-C37 | Figure 2: single-gate removal stable; pairwise removals as specified. | engine | 04 | outputs/figures/figure2_ablation.png; outputs/tables/ablation_matrix.csv | implemented; VERIFIED (QA 2026-04-12) (artefacts present + pipeline PASS) |
| P4-C38 | Figure 3: provenance distribution + MC stability + ±0.20 sensitivity summary. | notebooks | 04 | outputs/figures/figure3_provenance_stability.png | implemented; VERIFIED (QA 2026-04-12) (figure + validation PASS) |
| P4-C39 | Figure 4: compensation effect highlighting google_flu and uber_av. | engine | 04 | outputs/figures/figure4_compensation.png | implemented; VERIFIED (QA 2026-04-12) |
| P4-C40 | PhysioNet / extended analyses referenced as Extended Data (not re-proven in-repo here). | external | 03 (refs) | docs/provenance.md | out-of-notebook scope |
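Several rows in the table are arithmetically linked, and a reader can cross-check them by hand or with a short script. The sketch below is not part of the repository; it only recombines the per-gate counts from P4-C12 and P4-C13 and compares them with the totals in P4-C14 and the percentages in P4-C36. Variable names are illustrative.

```python
# Gate failure counts under the moderate profile (P4-C12, P4-C13).
GATE_FAILURES = {
    "safety": 10,
    "bias": 6,
    "calibration": 5,
    "traceability": 5,
    "evidence": 3,
}
CORE_FAILURES = 12  # documented failure cases (P4-C03)
REJECTED = 11       # rejections under moderate (P4-C10)

total_failures = sum(GATE_FAILURES.values())
mean_per_rejection = round(total_failures / REJECTED, 3)
binding_pct = {
    gate: round(100 * count / CORE_FAILURES)
    for gate, count in GATE_FAILURES.items()
}

print(total_failures)      # 29, as in P4-C14
print(mean_per_rejection)  # 2.636, reported as 2.6 in the manuscript
print(binding_pct)         # 83, 50, 42, 42, 25 as in P4-C36
```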
For independent verification, run the automated test suite and the full reproduction driver from the repository root exactly as documented in that release’s README and traceability file. Executable commands are intentionally omitted from this public summary page.