Paper 2 · Simulation

How often do “reasonable” scoring rules approve unsafe clinical AI?

Monorepo root commit: not recorded in the public portfolio system_snapshot.json (v1.2, 2026-04-11T07:37:21Z) used for this binding, and not invented on this page.
Tier-0 shared-core commit (portfolio snapshot): cd9ad79fe16f34ad861bd6527670dcfbef8fe864
Paper 2 repository commit (released): f7046d4c1da7029cbe362c9181a2deaf80c07b2a
Zenodo DOI: https://doi.org/10.5281/zenodo.19499791
Release version: v2.0.0 (portfolio release designation). CITATION.cff may list package version 1.0.0; treat commit + DOI as authoritative if they diverge.
Page generated (UTC): 2026-04-12

Executive overview

Problem: Teams often choose between conjunctive gate rules, weighted composites, or informal majority heuristics without a quantitative picture of how each behaves under heterogeneous, noisy evidence.

Why it matters: Unsafe latent states can be rare yet catastrophic; small differences in deployment rules can change the rate of unsafe releases by orders of magnitude in simulation—informing how conservative governance should be when evidence is imperfect.

Core insight

Under the primary calibrated model, non-compensatory gates can drive unsafe deployment rates to zero while matched and moderate composite rules still admit material unsafe deployment mass—illustrating that aggregation design is not a cosmetic choice.

What was done

A Monte Carlo engine simulates 1,000 synthetic clinical AI tools with latent safety states, five-domain evidence signals, risk-tiered non-compensatory thresholds, constrained overrides, and bootstrap confidence intervals (95%, 1,000 resamples). Comparator policies include matched-threshold composites, a moderate composite calibration, and a permissive majority baseline (at least 3 of 5 gates).
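The difference between the three rule families can be made concrete with a minimal sketch. It is not the repository's code: the function names, the uniform threshold of 0.6, the equal weights, and the evidence vector are all hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical illustration values; not the repository's calibrated parameters.
N_DOMAINS = 5


def decide_gate(evidence, thresholds):
    """Non-compensatory: deploy only if every domain clears its threshold."""
    return bool(np.all(evidence >= thresholds))


def decide_composite(evidence, weights, cutoff):
    """Compensatory: deploy if the weighted score clears a single cutoff."""
    return bool(np.dot(weights, evidence) >= cutoff)


def decide_majority(evidence, thresholds, k=3):
    """Permissive: deploy if at least k domains clear their thresholds."""
    return bool(np.sum(evidence >= thresholds) >= k)


thresholds = np.full(N_DOMAINS, 0.6)          # assumed uniform threshold
weights = np.full(N_DOMAINS, 1 / N_DOMAINS)   # assumed equal weights

# A tool that looks strong in four domains but fails one badly.
evidence = np.array([0.9, 0.9, 0.9, 0.9, 0.2])

gate_ok = decide_gate(evidence, thresholds)
composite_ok = decide_composite(evidence, weights, cutoff=0.6)
majority_ok = decide_majority(evidence, thresholds)
print(gate_ok, composite_ok, majority_ok)
```

In this toy case the gate refuses the tool because one domain fails, while the composite and majority rules approve it because strong domains compensate for the weak one.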

Additional notebooks sweep threshold multipliers and observation noise, run structured verification scenarios, explore calibration sensitivity over portfolio composition and unsafe priors, and assemble manuscript-aligned summary tables.
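For readers unfamiliar with bootstrap intervals on a rate, the following is a minimal percentile-bootstrap sketch. It assumes one boolean flag per simulated tool (unsafe and deployed); the function name, the seed, and the example counts are hypothetical and not taken from the repository.

```python
import numpy as np


def bootstrap_rate_ci(flags, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of a 0/1 flag vector."""
    rng = np.random.default_rng(seed)
    flags = np.asarray(flags, dtype=float)
    n = flags.size
    # Resample tool indices with replacement, one row per bootstrap replicate.
    idx = rng.integers(0, n, size=(n_boot, n))
    rates = flags[idx].mean(axis=1)
    lo, hi = np.quantile(rates, [alpha / 2, 1 - alpha / 2])
    return float(flags.mean()), float(lo), float(hi)


# Hypothetical data: 9 unsafe deployments among 1,000 simulated tools.
example_flags = np.zeros(1000)
example_flags[:9] = 1
point, lo, hi = bootstrap_rate_ci(example_flags)
print(f"rate={point:.4f}  95% CI=({lo:.4f}, {hi:.4f})")
```

One caveat applies to any rule with zero observed unsafe deployments: every resample also contains zero events, so the percentile interval collapses to a zero-width interval at 0. That reflects a limit of the method on this sample, not proof that the underlying rate is exactly zero.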

What was found

Primary-model point estimates show gates deploying about 28.5% of tools with zero unsafe deployments, while composite policies admit small but non-zero unsafe deployment rates (about 0.9% for the moderate composite); the permissive baseline is materially worse (about 2.2%). Verification scenarios (uniform failure, random failure, partial heterogeneity) test how the gate advantage shrinks or reappears depending on failure geometry.

An Epic Sepsis–style row demonstrates five-domain refusal under gates with transparent domain outputs. Supplementary numeric reporting is generated by a dedicated reporter module.

Why this matters for regulation, safety, and deployment

  • Policy choice: surfaces quantitative trade-offs before live experimentation.
  • Assurance arguments: supports why conjunctive designs differ from scoring dashboards.
  • Risk appetite: connects threshold multipliers and noise levels to deployment outcomes.
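The link between threshold multipliers and deployment outcomes can be sketched as a simple sweep. The grid of 17 multipliers from 60% to 140% follows the sensitivity claim (P2-C09), and 0.15 is the stated standard-risk unsafe prior; everything else here (the evidence model, the base threshold of 0.5, the noise SD of 0.1, the function name) is an assumption made for illustration and is not the repository's engine.

```python
import numpy as np


def simulate_gate_sweep(multipliers, n_tools=1000, p_unsafe=0.15,
                        base_threshold=0.5, noise_sd=0.1, seed=0):
    """Return (multiplier, deploy rate, unsafe deployment rate) per step."""
    rng = np.random.default_rng(seed)
    unsafe = rng.random(n_tools) < p_unsafe
    # Hypothetical evidence model: safe tools centre on 0.7, unsafe on 0.4,
    # with independent Gaussian observation noise in each of five domains.
    centre = np.where(unsafe, 0.4, 0.7)[:, None]
    evidence = centre + rng.normal(0.0, noise_sd, size=(n_tools, 5))
    rows = []
    for m in multipliers:
        # Conjunctive gate: every domain must clear the scaled threshold.
        deployed = np.all(evidence >= base_threshold * m, axis=1)
        rows.append((float(m),
                     float(deployed.mean()),
                     float((deployed & unsafe).mean())))
    return rows


multipliers = np.linspace(0.6, 1.4, 17)  # 60% to 140% of default, 17 steps
results = simulate_gate_sweep(multipliers)
for m, deploy_rate, unsafe_rate in results:
    print(f"{m:.2f}  deploy={deploy_rate:.3f}  unsafe_deploy={unsafe_rate:.3f}")
```

Because the same simulated evidence is reused at every step, raising the multiplier can only shrink the deployed set, which makes the trade-off between deployment volume and unsafe releases directly readable from the rows.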

Limitations and ethics

Simulation parameters are illustrative; not calibrated to a single health system. Companion historical statistics require Paper 4 pins. Figure hash modes may be advisory for select PNG exports—see repository QA notes.

Verification snapshot: Independent QA (2026-04-12) reported full harness pass with strict validation on core JSON/CSV artefacts; per-claim promotion notes remain in the source traceability file.

Technical detail: notebook walkthrough (conceptual)

  • 01_primary_simulation: Primary Monte Carlo, metrics, Epic scenario, bootstrap CIs (P2-C01–C08, P2-C19–C20).
  • 02_sensitivity_and_noise: Threshold sweeps and noise robustness (P2-C09–C11).
  • 03_verification_simulations: Structured failure scenarios vs comparators (P2-C12–C15).
  • 04_calibration_sensitivity: Portfolio and unsafe-probability axes (P2-C16).
  • 05_tables_and_summary: Manuscript Tables 1–2 assembly (P2-C17–C18).

Harness scripts (reproduce_all.py, validation, manifests) ground P2-C21. Supplementary reporter grounds P2-C22.
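Manifest validation of this kind usually amounts to hashing each output file and comparing against recorded digests. The sketch below shows that general idea under stated assumptions: the flat path-to-SHA-256 mapping, the function names, and the demo file contents are hypothetical, and the actual schema of config/expected_outputs.json and the behaviour of the repository scripts may differ.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path):
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root, relative_paths):
    """Map each relative path to the digest of the file under root."""
    return {rel: sha256_of(Path(root) / rel) for rel in relative_paths}


def compare_manifests(expected, actual):
    """List every expected file that is missing or whose digest changed."""
    problems = []
    for rel, digest in expected.items():
        if rel not in actual:
            problems.append(f"missing: {rel}")
        elif actual[rel] != digest:
            problems.append(f"hash mismatch: {rel}")
    return problems


# Demo on a temporary file (hypothetical contents, not real outputs).
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "metrics_summary.json"
    out.write_text(json.dumps({"example": 1}))
    expected = build_manifest(tmp, ["metrics_summary.json"])
    clean = compare_manifests(expected, build_manifest(tmp, ["metrics_summary.json"]))
    out.write_text(json.dumps({"example": 2}))  # simulate drift in an output
    drifted = compare_manifests(expected, build_manifest(tmp, ["metrics_summary.json"]))
print(clean, drifted)
```

Byte-level hashes are strict by design, which is one reason a harness might treat hashes of rendered figures as advisory: image encoders can produce different bytes for visually identical plots.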

Full claim traceability (P2-C01–P2-C22)

One-to-one with docs/claim_traceability.md in the pinned repository.

Claim ID | Manuscript-grounded statement (paraphrase) | Primary evidence (repo) | Notebook / module | Key outputs
P2-C01 | Monte Carlo simulation of 1,000 clinical AI tools with latent safety state and five-domain evidence; theory-testing frame | Engine config n_tools, design in Methods | notebooks/01_primary_simulation.ipynb, src/run_simulation.py | outputs/data/simulation_outputs.csv, outputs/data/metrics_summary.json
P2-C02 | Portfolio mix 30% high-risk / 70% standard-risk reflects institutional composition narrative | p_high_risk, risk tiers in config | src/run_simulation.py, src/params_default.json | metrics_summary.json config
P2-C03 | Latent unsafe probabilities 0.35 (high-risk) and 0.15 (standard-risk) | p_unsafe_high, p_unsafe_standard | src/run_simulation.py | simulation_outputs.csv, metrics_summary.json
P2-C04 | Non-compensatory gates: five domains, risk-tiered thresholds; override on gates 2–4 at modelled invocation rate | Threshold + override parameters | src/run_simulation.py | metrics_summary.json (override rates)
P2-C05 | Weighted composite at matched threshold (to gate deploy rate) and moderate threshold (~2.2× gate rate, capped) | set_threshold_to_match_rate, 2.2× rule | src/run_simulation.py, scripts/run_direct.py | metrics_summary.json composite_thresholds
P2-C06 | Permissive baseline: majority rule (≥3 of 5 gates) | decide_permissive | src/run_simulation.py | simulation_outputs.csv, metrics
P2-C07 | Primary heterogeneous model: gates ~28.5% deployment; zero unsafe deployments; moderate composite unsafe deployment ~0.9% with CI; ~1.4% unsafe among deployed | Point estimates + bootstrap | 01_primary_simulation.ipynb | metrics_summary.json, figures
P2-C08 | Permissive baseline ~2.2% unsafe deployment rate under primary model | Metrics column Permissive | 01_primary_simulation.ipynb | metrics_summary.json, outputs/tables/table2_unsafe_rates.csv
P2-C09 | Threshold sensitivity: multipliers 60–140% of default (17 steps); gate safety under primary model | Sweep implementation | 02_sensitivity_and_noise.ipynb | outputs/data/sensitivity_thresholds.csv, fig5_*
P2-C10 | Noise robustness: observation SD 0.01–0.20 (15 steps in engine); matched composite calibration at each level | noise_robustness | 02_sensitivity_and_noise.ipynb | outputs/data/sensitivity_noise.csv, fig6_*
P2-C11 | Extended noise rows (low/high SD) for Table 2 noise section | noise_extended generation | 02_sensitivity_and_noise.ipynb, scripts/run_direct.py | outputs/data/noise_extended.csv
P2-C12 | Verification scenarios (uniform failure, random failure, partial heterogeneity): rates vs composite (moderate/matched) and permissive | Scenario definitions | 03_verification_simulations.ipynb, src/run_verification.py | verification_summary.csv, verification_results.json, fig_verification_*
P2-C13 | Uniform failure: gate advantage vs moderate composite collapses toward matched comparison (manuscript Table 1–2 narrative) | Scenario outputs | 03_verification_simulations.ipynb | verification_summary.csv, table1_scope_conditions.csv
P2-C14 | Random failure: small non-zero unsafe rate under gates; higher under moderate composite | Scenario outputs | 03_verification_simulations.ipynb | verification_summary.csv, tables
P2-C15 | Partial heterogeneity: non-zero gate unsafe rate; composite moderate higher | Scenario outputs | 03_verification_simulations.ipynb | verification_summary.csv, tables
P2-C16 | Calibration sensitivity: portfolio composition and unsafe-probability sweeps (supplementary / robustness narrative) | Formal axes in appendix reference | 04_calibration_sensitivity.ipynb, src/run_calibration_sensitivity.py | calibration_portfolio.csv, calibration_unsafe_prob.csv, fig_portfolio_*, fig_unsafe_prob_*
P2-C17 | Table 1: scope-condition summary (mechanism column aligned to manuscript) | Assembled from primary + verification | 05_tables_and_summary.ipynb | outputs/tables/table1_scope_conditions.csv
P2-C18 | Table 2: unsafe deployment rates across scenarios and rules | Joins primary metrics + verification + noise | 05_tables_and_summary.ipynb | outputs/tables/table2_unsafe_rates.csv
P2-C19 | Epic Sepsis case scenario: under gates, fails all five domains with transparent domain-level refusal | epic_case_row, decisions | 01_primary_simulation.ipynb | epic_case_outputs.csv, fig4_epic_case.*
P2-C20 | Bootstrap 95% CIs (1,000 resamples) for rates | bootstrap_ci, n_bootstrap | src/run_simulation.py | metrics_summary.json, verification_results.json
P2-C21 | Reproducibility: single-command pipeline and manifest validation | Harness | reproduce_all.py, scripts/validate_outputs.py, scripts/hash_manifest.py | logs/actual_manifest.json, config/expected_outputs.json
P2-C22 | Aggregated supplementary numeric report (appendix alignment) | Reporter | src/report_supplementary.py | outputs/logs/supplementary_report.txt
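The matched and moderate composite calibrations in P2-C05 can be illustrated with a quantile-based sketch: pick the composite cutoff so that the composite rule deploys at a chosen target rate. This is an assumed approach and not the repository's set_threshold_to_match_rate; the function name, the uniform random scores, and the cap of 1.0 are hypothetical. The gate rate of 0.285 and the 2.2× factor are taken from the table above.

```python
import numpy as np


def match_threshold_to_rate(scores, target_rate):
    """Cutoff such that roughly target_rate of scores are >= the cutoff."""
    if not 0.0 <= target_rate <= 1.0:
        raise ValueError("target_rate must lie in [0, 1]")
    scores = np.asarray(scores, dtype=float)
    # The (1 - rate) quantile leaves about `rate` of the mass at or above it.
    return float(np.quantile(scores, 1.0 - target_rate))


rng = np.random.default_rng(0)
composite_scores = rng.random(1000)   # hypothetical composite scores

gate_rate = 0.285                      # gate deployment rate reported in P2-C07
rate_cap = 1.0                         # hypothetical cap; the real cap is not stated here
moderate_rate = min(2.2 * gate_rate, rate_cap)

matched_cutoff = match_threshold_to_rate(composite_scores, gate_rate)
moderate_cutoff = match_threshold_to_rate(composite_scores, moderate_rate)

matched_deploy = float((composite_scores >= matched_cutoff).mean())
moderate_deploy = float((composite_scores >= moderate_cutoff).mean())
print(matched_deploy, moderate_deploy)
```

Matching deployment rates in this way isolates the effect of the aggregation rule itself: any remaining difference in unsafe deployments between gates and the matched composite cannot be attributed to one rule simply approving fewer tools.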