Paper 2 · Simulation
How often do “reasonable” scoring rules approve unsafe clinical AI?
- Monorepo root commit: Not recorded in the public portfolio system_snapshot.json (v1.2, 2026-04-11T07:37:21Z) used for this binding. Not invented on this page.
- Tier-0 shared-core commit (portfolio snapshot): cd9ad79fe16f34ad861bd6527670dcfbef8fe864
- Paper 2 repository commit (released): f7046d4c1da7029cbe362c9181a2deaf80c07b2a
- Zenodo DOI: https://doi.org/10.5281/zenodo.19499791
- Release version: v2.0.0 (portfolio release designation); CITATION.cff may list package version 1.0.0. Treat commit + DOI as authoritative if they diverge.
- Page generated (UTC): 2026-04-12
Executive overview
Problem: Teams often choose between conjunctive gate rules, weighted composites, or informal majority heuristics without a quantitative picture of how each behaves under heterogeneous, noisy evidence.
Why it matters: Unsafe latent states can be rare yet catastrophic. In simulation, small differences in deployment rules can change the rate of unsafe releases by orders of magnitude, which informs how conservative governance should be when evidence is imperfect.
Core insight
Under the primary calibrated model, non-compensatory gates can drive unsafe deployment rates to zero, while matched and moderate composite rules still admit non-zero unsafe deployments. Aggregation design is therefore not a cosmetic choice.
What was done
A Monte Carlo engine simulates 1,000 synthetic clinical AI tools (P2-C01) with latent safety states, five-domain evidence signals, risk-tiered non-compensatory thresholds, constrained overrides, and bootstrap confidence intervals. Comparator policies include matched-threshold composites, a moderate composite calibration, and a permissive majority baseline.
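To make the difference between the three rule families concrete, the following is a minimal sketch, not the repository's implementation. The function names, the threshold of 0.6, the equal weights, and the example scores are all hypothetical, and the override mechanism is omitted. It shows how a single weak domain blocks deployment under a conjunctive gate but is averaged away by a composite and outvoted under a majority rule.

```python
from typing import Sequence


def decide_gates(scores: Sequence[float], thresholds: Sequence[float]) -> bool:
    """Conjunctive (non-compensatory) rule: every domain must clear its threshold."""
    return all(s >= t for s, t in zip(scores, thresholds))


def decide_composite(scores: Sequence[float], weights: Sequence[float], cutoff: float) -> bool:
    """Compensatory rule: the weighted mean must clear a single cutoff."""
    weighted_mean = sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    return weighted_mean >= cutoff


def decide_majority(scores: Sequence[float], thresholds: Sequence[float], k: int = 3) -> bool:
    """Permissive baseline: at least k domains clear their thresholds."""
    return sum(s >= t for s, t in zip(scores, thresholds)) >= k


# Hypothetical tool: strong evidence in four domains, weak in one.
scores = [0.9, 0.9, 0.3, 0.9, 0.9]
thresholds = [0.6] * 5   # assumed values, not the repository's thresholds
weights = [1.0] * 5      # assumed equal weights

gate_ok = decide_gates(scores, thresholds)
composite_ok = decide_composite(scores, weights, cutoff=0.6)
majority_ok = decide_majority(scores, thresholds)
print(gate_ok, composite_ok, majority_ok)  # False True True
```

The gate refuses because the third domain scores 0.3, while the composite mean of 0.78 and the four-of-five majority both approve the same tool.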
Additional notebooks sweep threshold multipliers and observation noise, run structured verification scenarios, explore calibration sensitivity over portfolio composition and unsafe priors, and assemble manuscript-aligned summary tables.
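The sweep ranges can be sketched as simple grids. The endpoints and step counts below come from the traceability entries P2-C09 (multipliers 60–140% of default, 17 steps) and P2-C10 (observation SD 0.01–0.20, 15 steps); the even spacing, the `linspace` helper, and the default threshold values are assumptions made for illustration and may differ from the engine.

```python
from typing import List


def linspace(start: float, stop: float, num: int) -> List[float]:
    """Return num evenly spaced values from start to stop inclusive."""
    if num < 2:
        raise ValueError("num must be at least 2")
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


# Grids per P2-C09 and P2-C10; linear spacing is an assumption.
threshold_multipliers = linspace(0.60, 1.40, 17)
noise_sds = linspace(0.01, 0.20, 15)

# Hypothetical default thresholds for the five domains.
default_thresholds = [0.6, 0.6, 0.6, 0.6, 0.6]

# One scaled threshold vector per multiplier in the sweep.
scaled_thresholds = [
    [round(t * m, 4) for t in default_thresholds]
    for m in threshold_multipliers
]

print(len(threshold_multipliers), len(noise_sds))  # 17 15
print(round(threshold_multipliers[1] - threshold_multipliers[0], 4))  # 0.05
```

Under even spacing, the 17-step multiplier grid advances in 5-percentage-point increments, which is one reason linear spacing is a plausible reading of the threshold sweep.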
What was found
Primary-model point estimates show gates keeping unsafe deployments at zero, while composite policies admit small but non-zero unsafe deployment rates (about 0.9% for the moderate composite, P2-C07) and the permissive baseline is materially worse (about 2.2%, P2-C08). Verification scenarios (uniform failure, random failure, partial heterogeneity) stress how the gate advantage shrinks or reappears depending on failure geometry.
An Epic Sepsis–style row demonstrates five-domain refusal under gates with transparent domain outputs. Supplementary numeric reporting is generated by a dedicated reporter module.
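The rates above are reported with bootstrap 95% confidence intervals from 1,000 resamples (P2-C20). The sketch below shows one common way to compute such an interval, the percentile bootstrap. The choice of percentile method, the function name `bootstrap_rate_ci`, and the example data are assumptions; the repository's `bootstrap_ci` may work differently.

```python
import random
from typing import Sequence, Tuple


def bootstrap_rate_ci(
    outcomes: Sequence[int],
    n_bootstrap: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    rates = []
    for _ in range(n_bootstrap):
        # Resample tools with replacement and recompute the rate.
        resample = rng.choices(outcomes, k=n)
        rates.append(sum(resample) / n)
    rates.sort()
    lo_idx = max(0, round((alpha / 2) * n_bootstrap))
    hi_idx = min(n_bootstrap - 1, round((1 - alpha / 2) * n_bootstrap) - 1)
    return rates[lo_idx], rates[hi_idx]


# Illustrative data only: 9 unsafe deployments among 1,000 tools (0.9%).
composite_outcomes = [1] * 9 + [0] * 991
lo, hi = bootstrap_rate_ci(composite_outcomes)
print(f"rate=0.009, 95% CI=({lo:.4f}, {hi:.4f})")

# With zero observed unsafe deployments, every resample is also zero.
gate_outcomes = [0] * 1000
print(bootstrap_rate_ci(gate_outcomes))  # (0.0, 0.0)
```

One consequence of this design is worth noting: when no unsafe deployments are observed, a percentile bootstrap returns a degenerate interval of exactly zero, so a zero point estimate under gates should not be read as a proven zero risk.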
Why this matters for regulation, safety, and deployment
- Policy choice: surfaces quantitative trade-offs before live experimentation.
- Assurance arguments: supports why conjunctive designs differ from scoring dashboards.
- Risk appetite: connects threshold multipliers and noise levels to deployment outcomes.
Limitations and ethics
Simulation parameters are illustrative and are not calibrated to any single health system. Companion historical statistics require Paper 4 pins. Figure hash modes may be advisory for select PNG exports; see the repository QA notes.
Verification snapshot: Independent QA (2026-04-12) reported full harness pass with strict validation on core JSON/CSV artefacts; per-claim promotion notes remain in the source traceability file.
Technical detail: notebook walkthrough (conceptual)
Harness scripts (reproduce_all.py, validation, manifests) ground P2-C21. Supplementary reporter grounds P2-C22.
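As a rough illustration of what manifest validation involves, the sketch below hashes every file under an output directory and compares the result with an expected manifest. It is a generic pattern built on the standard library, not the code in scripts/hash_manifest.py or scripts/validate_outputs.py; the manifest layout (relative path mapped to SHA-256 digest) and all function names are assumptions.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path relative to root to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_manifests(expected: Dict[str, str], actual: Dict[str, str]) -> Dict[str, List[str]]:
    """Report files that are missing, unexpected, or changed."""
    shared = expected.keys() & actual.keys()
    return {
        "missing": sorted(set(expected) - set(actual)),
        "unexpected": sorted(set(actual) - set(expected)),
        "changed": sorted(k for k in shared if expected[k] != actual[k]),
    }


# Demonstration in a temporary directory with a hypothetical output file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    target = root / "metrics_summary.json"
    target.write_text(json.dumps({"demo": 1}))
    expected = build_manifest(root)
    target.write_text(json.dumps({"demo": 2}))  # simulate drift
    report = compare_manifests(expected, build_manifest(root))

print(report)
```

A strict mode would fail the run on any non-empty list, while an advisory mode would only log the `changed` entries, which is one way to accommodate outputs such as PNG exports that may not be byte-stable across platforms.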
Full claim traceability (P2-C01–P2-C22)
The table below corresponds one-to-one with docs/claim_traceability.md in the pinned repository.
| Claim ID | Manuscript-grounded statement (paraphrase) | Primary evidence (repo) | Notebook / module | Key outputs |
|---|---|---|---|---|
| P2-C01 | Monte Carlo simulation of 1,000 clinical AI tools with latent safety state and five-domain evidence; theory-testing frame | Engine config n_tools, design in Methods | notebooks/01_primary_simulation.ipynb, src/run_simulation.py | outputs/data/simulation_outputs.csv, outputs/data/metrics_summary.json |
| P2-C02 | Portfolio mix 30% high-risk / 70% standard-risk reflects institutional composition narrative | p_high_risk, risk tiers in config | src/run_simulation.py, src/params_default.json | metrics_summary.json → config |
| P2-C03 | Latent unsafe probabilities 0.35 (high-risk) and 0.15 (standard-risk) | p_unsafe_high, p_unsafe_standard | src/run_simulation.py | simulation_outputs.csv, metrics_summary.json |
| P2-C04 | Non-compensatory gates: five domains, risk-tiered thresholds; override on gates 2–4 at modelled invocation rate | Threshold + override parameters | src/run_simulation.py | metrics_summary.json (override rates) |
| P2-C05 | Weighted composite at matched threshold (to gate deploy rate) and moderate threshold (~2.2× gate rate, capped) | set_threshold_to_match_rate, 2.2× rule | src/run_simulation.py, scripts/run_direct.py | metrics_summary.json → composite_thresholds |
| P2-C06 | Permissive baseline: majority rule (≥3 of 5 gates) | decide_permissive | src/run_simulation.py | simulation_outputs.csv, metrics |
| P2-C07 | Primary heterogeneous model: gates ~28.5% deployment; zero unsafe deployments; moderate composite unsafe deployment ~0.9% with CI; ~1.4% unsafe among deployed | Point estimates + bootstrap | 01_primary_simulation.ipynb | metrics_summary.json, figures |
| P2-C08 | Permissive baseline ~2.2% unsafe deployment rate under primary model | Metrics column Permissive | 01_primary_simulation.ipynb | metrics_summary.json, outputs/tables/table2_unsafe_rates.csv |
| P2-C09 | Threshold sensitivity: multipliers 60–140% of default (17 steps); gate safety under primary model | Sweep implementation | 02_sensitivity_and_noise.ipynb | outputs/data/sensitivity_thresholds.csv, fig5_* |
| P2-C10 | Noise robustness: observation SD 0.01–0.20 (15 steps in engine); matched composite calibration at each level | noise_robustness | 02_sensitivity_and_noise.ipynb | outputs/data/sensitivity_noise.csv, fig6_* |
| P2-C11 | Extended noise rows (low/high SD) for Table 2 noise section | noise_extended generation | 02_sensitivity_and_noise.ipynb, scripts/run_direct.py | outputs/data/noise_extended.csv |
| P2-C12 | Verification scenarios: uniform failure, random failure, partial heterogeneity — rates vs composite (moderate/matched) and permissive | Scenario definitions | 03_verification_simulations.ipynb, src/run_verification.py | verification_summary.csv, verification_results.json, fig_verification_* |
| P2-C13 | Uniform failure: gate advantage vs moderate composite collapses toward matched comparison (manuscript Table 1–2 narrative) | Scenario outputs | 03_verification_simulations.ipynb | verification_summary.csv, table1_scope_conditions.csv |
| P2-C14 | Random failure: small non-zero unsafe rate under gates; higher under moderate composite | Scenario outputs | 03_verification_simulations.ipynb | verification_summary.csv, tables |
| P2-C15 | Partial heterogeneity: non-zero gate unsafe rate; composite moderate higher | Scenario outputs | 03_verification_simulations.ipynb | verification_summary.csv, tables |
| P2-C16 | Calibration sensitivity: portfolio composition and unsafe-probability sweeps (supplementary / robustness narrative) | Formal axes in appendix reference | 04_calibration_sensitivity.ipynb, src/run_calibration_sensitivity.py | calibration_portfolio.csv, calibration_unsafe_prob.csv, fig_portfolio_*, fig_unsafe_prob_* |
| P2-C17 | Table 1: scope-condition summary (mechanism column aligned to manuscript) | Assembled from primary + verification | 05_tables_and_summary.ipynb | outputs/tables/table1_scope_conditions.csv |
| P2-C18 | Table 2: unsafe deployment rates across scenarios and rules | Joins primary metrics + verification + noise | 05_tables_and_summary.ipynb | outputs/tables/table2_unsafe_rates.csv |
| P2-C19 | Epic Sepsis case scenario: under gates, fails all five domains with transparent domain-level refusal | epic_case_row, decisions | 01_primary_simulation.ipynb | epic_case_outputs.csv, fig4_epic_case.* |
| P2-C20 | Bootstrap 95% CIs (1,000 resamples) for rates | bootstrap_ci, n_bootstrap | src/run_simulation.py | metrics_summary.json, verification_results.json |
| P2-C21 | Reproducibility: single-command pipeline and manifest validation | Harness | reproduce_all.py, scripts/validate_outputs.py, scripts/hash_manifest.py | logs/actual_manifest.json, config/expected_outputs.json |
| P2-C22 | Aggregated supplementary numeric report (appendix alignment) | Reporter | src/report_supplementary.py | outputs/logs/supplementary_report.txt |
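Two entries in the table lend themselves to a short worked sketch: the matched and moderate composite thresholds (P2-C05) and the two unsafe rates quoted for the moderate composite (P2-C07). The code below assumes that a matched threshold is chosen as the score cutoff that reproduces a target deployment rate, and that the "unsafe deployment" rate is measured over all tools while "unsafe among deployed" is measured over deployed tools only. The helper names, the synthetic scores, and the cap value of 1.0 are hypothetical; the repository's `set_threshold_to_match_rate` and its cap may differ.

```python
import random
from typing import Sequence


def threshold_for_deploy_rate(scores: Sequence[float], target_rate: float) -> float:
    """Return a cutoff such that about target_rate of scores are at or above it."""
    ranked = sorted(scores, reverse=True)
    n_deploy = min(round(target_rate * len(ranked)), len(ranked))
    if n_deploy <= 0:
        return float("inf")  # nothing is deployed
    return ranked[n_deploy - 1]


def deploy_rate(scores: Sequence[float], cutoff: float) -> float:
    """Share of tools whose composite score clears the cutoff."""
    return sum(s >= cutoff for s in scores) / len(scores)


rng = random.Random(0)
composite_scores = [rng.random() for _ in range(1000)]  # synthetic scores

gate_rate = 0.285          # gate deployment rate quoted in P2-C07
cap = 1.0                  # hypothetical cap; the real cap is not stated here
moderate_target = min(2.2 * gate_rate, cap)  # 2.2x rule from P2-C05

matched_cutoff = threshold_for_deploy_rate(composite_scores, gate_rate)
moderate_cutoff = threshold_for_deploy_rate(composite_scores, moderate_target)
matched_rate = deploy_rate(composite_scores, matched_cutoff)
moderate_rate = deploy_rate(composite_scores, moderate_cutoff)

# Cross-check: 0.9% of all tools unsafe-and-deployed, divided by the
# moderate deployment rate, gives the unsafe share among deployed tools.
unsafe_among_deployed = 0.009 / moderate_target
print(round(moderate_target, 3), round(unsafe_among_deployed, 4))  # 0.627 0.0144
```

Under these assumptions the moderate composite deploys about 62.7% of tools, and 0.9% divided by 62.7% is about 1.4%, which agrees with the "unsafe among deployed" figure in P2-C07.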