Paper 3 · Framework comparison
Beyond undocumented thresholds: a six-layer justification stack
- Monorepo root commit: Not recorded in the public portfolio system_snapshot.json (v1.2, 2026-04-11T07:37:21Z) used for this binding. Not invented on this page.
- Tier-0 shared-core commit (portfolio snapshot): cd9ad79fe16f34ad861bd6527670dcfbef8fe864
- Paper 3 repository commit (released): 951fb441e7564e9e84c2d0ccdb03578d3e167ae6
- Zenodo DOI: https://doi.org/10.5281/zenodo.19499798
- Release version: v2.0.0 (portfolio release designation); CITATION.cff may list package version 1.0.0. Treat commit + DOI as authoritative if they diverge.
- Page generated (UTC): 2026-04-12
Executive overview
Problem: Clinical AI systems depend on numeric thresholds (performance targets, alert cut-offs, deployment gates) whose evidentiary justification is often thin—“magic numbers” embedded in models without traceable documentation.
Why it matters: Undocumented thresholds are hard to audit, hard to defend after incidents, and easy to game. A structured documentation stack makes expectations explicit for safety leaders and committee review.
Core insight
Threshold failures cluster into a small number of structural mechanisms (proxying, context collapse, coupling, boundary gaming, epistemic asymmetry, audit disconnect). Addressing them requires layered documentation—not a single R² figure in isolation.
What was done
The repository renders manuscript Tables 1–3 into structured CSV from JSON sources: six failure mechanisms, the six-layer Threshold Justification Stack with tiered requirements (primary safety vs secondary operational), and an interpretive comparison to other governance instruments. Supplementary pathways cover gaming resistance sketches and NHS governance artefact mapping tables; illustrative TJS records demonstrate schema completion for both tiers.
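As a rough illustration of the JSON-to-CSV rendering described above, the sketch below uses only the Python standard library. The `_caveat` key is mentioned in the traceability table, but the `rows` key, the column names, and the function name `render_table_csv` are assumptions made for this example; the repository's actual schema and notebook code may differ.

```python
import csv
import io
import json

# Hypothetical table source: "_caveat" appears in the repository's table JSON,
# but the "rows" key and the column names are assumed for illustration.
EXAMPLE_JSON = json.dumps({
    "_caveat": "Interpretive coding by the author.",
    "rows": [
        {"mechanism": "Proxy thresholding", "order": 1},
        {"mechanism": "Context collapse", "order": 2},
    ],
})


def render_table_csv(json_text: str) -> str:
    """Render the data rows of a table JSON document as CSV text."""
    table = json.loads(json_text)
    # Only the "rows" list is rendered; top-level metadata such as
    # "_caveat" is left out of the CSV.
    rows = table["rows"]
    fieldnames = list(rows[0].keys())
    buffer = io.StringIO()
    # A fixed line terminator keeps the output byte-identical across
    # platforms, which matters if the CSV is later hash-validated.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(render_table_csv(EXAMPLE_JSON))
```

Keeping the caveat in the JSON source rather than in the CSV is one possible design choice; it leaves the rendered table purely tabular while the interpretive notice stays with the source of record.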
What was found
The artefacts make the epistemic status of each table explicit—what is empirical versus interpretive—and separate in-repo reproduction scope from companion empirical papers (see boundary statement in the repository). Deterministic notebook outputs are hash-validated against expected manifests in QA.
Why this matters for regulation, safety, and deployment
- Committees: receive a checklist-like stack for what must be documented before thresholds are accepted.
- Integration: maps TJS layers to existing hospital risk artefacts to reduce duplicate paperwork while preserving safety intent.
- Integrity: gaming pathways are named so assessments are proactive rather than purely reactive.
Limitations and ethics
Framework comparisons are the author's interpretive codings, not regulators’ official positions. Feasibility pilots, inter-rater reliability studies, and further empirical validation are required before any scalability claim can be made (P3-C14–C15); these sit outside this repository’s numerical scope per docs/claim_boundary_statement.md.
Reproducibility: in the QA session of 2026-04-12, reproduce_all.py reported VALIDATION PASSED for the pinned outputs referenced in the traceability matrix.
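The hash validation mentioned above can be pictured with a minimal sketch. It assumes a manifest is a plain mapping from relative file path to SHA-256 digest; the function names (`sha256_of`, `build_manifest`, `validate`) and the example file are hypothetical and are not taken from the repository's scripts/hash_manifest.py or scripts/validate_outputs.py, whose internals are not shown on this page.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (as a relative POSIX path) to its digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def validate(expected: dict[str, str], actual: dict[str, str]) -> list[str]:
    """Return human-readable mismatches; an empty list means a pass."""
    problems = []
    for name, digest in sorted(expected.items()):
        if name not in actual:
            problems.append(f"missing: {name}")
        elif actual[name] != digest:
            problems.append(f"hash mismatch: {name}")
    for name in sorted(set(actual) - set(expected)):
        problems.append(f"unexpected: {name}")
    return problems


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Hypothetical output file standing in for a rendered table.
        (root / "table1.csv").write_bytes(b"mechanism\nProxy thresholding\n")
        expected = build_manifest(root)  # stand-in for a pinned manifest
        actual = build_manifest(root)    # stand-in for a fresh run
        issues = validate(expected, actual)
        print("VALIDATION PASSED" if not issues else "\n".join(issues))
```

A check of this kind only confirms that outputs are byte-identical to a pinned baseline; it says nothing about whether the interpretive content of a table is correct.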
Technical detail: notebook walkthrough (conceptual)
Full claim traceability (P3-C01–P3-C20)
Two sub-tables mirror docs/claim_traceability.md (manuscript-grounded claims; then repository fidelity claims).
Manuscript-grounded claims
| Claim ID | Claim (paraphrase) | Manuscript anchor | Notebook / code | Output / evidence path | Status |
|---|---|---|---|---|---|
| P3-C01 | Clinical AI governance often operationalises safety via quantitative thresholds whose methodological rationale is undocumented (“magic numbers”). | Introduction; “What is already known” | notebooks/01_tjs_framework_and_failure_mechanisms.ipynb | Narrative setup; data/tables/table1_failure_mechanisms.json context | Traced |
| P3-C02 | Conceptual synthesis identified six structural failure mechanisms in threshold design (proxy thresholding; context collapse; threshold coupling; boundary gaming; epistemic asymmetry; audit–lifecycle disconnect). | Results; Table 1 | 01_…ipynb | outputs/tables/table1_failure_mechanisms.csv; data/tables/table1_failure_mechanisms.json (6 data rows) | VERIFIED (QA 2026-04-12: tabular row count + repro pipeline) |
| P3-C03 | Methods combine conceptual synthesis and regulatory analysis (targeted searches, Jan 2018–Dec 2025; structural inclusion criterion). | Methods | 01_…ipynb (epistemic notice); manuscript/Paper3_Manuscript.docx | data/tables/table1_failure_mechanisms.json (_caveat) | Traced |
| P3-C04 | The six mechanisms are argued structurally distinct and to cover major structural failure modes; not claimed exhaustive; inter-rater exercise proposed. | Methods (Reproducibility subsection) | 01_…ipynb | JSON _caveat on interpretive tables | Traced |
| P3-C05 | Regulatory alignment assessments are interpretive inferences by the author, not statements of regulatory intent or legal requirement. | Methods | 01_…ipynb; 02_…ipynb | Table JSON _caveat fields | Traced |
| P3-C06 | Threshold Justification Stack (TJS) specifies six documentation layers with tiered requirements: Primary Safety (failure can directly harm patients) vs Secondary Operational; Primary applies by default when uncertain. | Results; Table 2 | 01_…ipynb | outputs/tables/table2_tjs_specification.csv; data/tables/table2_tjs_specification.json (6 data rows; tier fields) | VERIFIED (QA 2026-04-12: tabular structure + repro pipeline) |
| P3-C07 | Table 2 documents each TJS layer (description, mechanisms addressed, tier requirement, regulatory counterpart). | Table 2 | 01_…ipynb | data/tables/table2_tjs_specification.json → CSV | VERIFIED (QA 2026-04-12: same artefact as C06) |
| P3-C08 | Table 3 compares TJS threshold documentation expectations to other governance instruments; all characterisations interpretive. | Results; Table 3 | 02_…ipynb | outputs/tables/table3_framework_comparison.csv | VERIFIED (QA 2026-04-12: artefact + hash manifest; interpretive caveat retained) |
| P3-C09 | TJS is positioned to augment NHS clinical risk management under DCB0129/0160, not replace hazard logs. | Discussion / integration | 02_…ipynb (mappings); 03_…ipynb | Narrative in notebooks; outputs/tables/nhs_governance_mapping.csv | Traced |
| P3-C10 | Hospital integration entails procedural changes (tier classification; populate layers; route record to committee); field mapping to audit schemas in Supplementary Appendix C (see supplementary PDF). | Discussion | 02_…ipynb; 03_…ipynb | inputs/supplementary.pdf; schema demos | Traced |
| P3-C11 | Gaming resistance pathways and assessment methodology are detailed in Supplementary Appendix F. | Results / Discussion | 02_…ipynb | outputs/tables/gaming_resistance_pathways.csv; inputs/supplementary.pdf | Traced |
| P3-C12 | NHS governance artefact mapping is specified in Supplementary Appendix G (see supplementary PDF). | Supplementary (referenced in manuscript) | 02_…ipynb | outputs/tables/nhs_governance_mapping.csv; inputs/supplementary.pdf | Traced |
| P3-C13 | Full worked examples for both threshold tiers appear in supplementary appendices. | Abstract; Methods | 03_…ipynb | outputs/schemas/*_rendered.json; supplementary | Traced |
| P3-C14 | Empirical validation (e.g. inter-rater reliability; feasibility pilot with median completion time and kappa for tier classification) is required before scalability claims. | Conclusions; Discussion | 04_…ipynb (scope checks); docs/claim_boundary_statement.md | QA harness; boundary doc XC-7 | Traced |
| P3-C15 | TJS is advanced as a normative governance proposal; adoption at scale requires empirical feasibility assessment. | Abstract; Discussion | 01_…ipynb; 03_…ipynb (epistemic notices) | Schema _epistemic_status | Traced |
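
The tier rule in claim P3-C06 (Primary Safety when failure can directly harm patients, Secondary Operational otherwise, and Primary by default when uncertain) can be sketched as a small function. The names `Tier` and `classify_tier`, and the use of `None` to represent an uncertain assessment, are assumptions made for this illustration and are not taken from the repository.

```python
from enum import Enum
from typing import Optional


class Tier(Enum):
    PRIMARY_SAFETY = "Primary Safety"
    SECONDARY_OPERATIONAL = "Secondary Operational"


def classify_tier(failure_can_harm_patients: Optional[bool]) -> Tier:
    """Assign a threshold tier; None means the assessor is uncertain."""
    # Only an explicit "no direct patient harm" judgement yields the
    # secondary tier. Both True and None (uncertain) fall through to
    # Primary Safety, which is the default under uncertainty.
    if failure_can_harm_patients is False:
        return Tier.SECONDARY_OPERATIONAL
    return Tier.PRIMARY_SAFETY


if __name__ == "__main__":
    for judgement in (True, None, False):
        print(judgement, "->", classify_tier(judgement).value)
```

Testing for `is False` rather than truthiness is deliberate in this sketch: a missing or unknown judgement can never be silently downgraded to the less demanding tier.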
Repository fidelity claims (claim_boundary_statement.md)
| Claim ID | Claim | Source doc | Notebook / script | Output / artefact | Status |
|---|---|---|---|---|---|
| P3-C16 | Manuscript Tables 1–3 rendered as structured CSV from JSON sources extracted from the manuscript text. | RC-1 | 01_…ipynb, 02_…ipynb | outputs/tables/table1_failure_mechanisms.csv, table2_tjs_specification.csv, table3_framework_comparison.csv | VERIFIED (QA 2026-04-12: python reproduce_all.py + VALIDATION PASSED; hashes match config/expected_outputs.json) |
| P3-C17 | Glossary of key terms consolidated from the manuscript. | RC-2 | 02_…ipynb | outputs/tables/glossary.csv | VERIFIED (QA 2026-04-12: same validation pass) |
| P3-C18 | TJS layer → NHS governance artefacts mapping as specified in Supplementary Appendix G. | RC-3 | 02_…ipynb | outputs/tables/nhs_governance_mapping.csv | VERIFIED (QA 2026-04-12: same validation pass) |
| P3-C19 | Two illustrative TJS records (Primary Safety; Secondary Operational) demonstrate the audit schema from worked examples. | RC-4 | 03_…ipynb | outputs/schemas/tjs_record_primary_safety_rendered.json, tjs_record_secondary_operational_rendered.json | VERIFIED (QA 2026-04-12: same validation pass) |
| P3-C20 | Notebook outputs are deterministic and hash-validated against baselines. | RC-5 | reproduce_all.py → scripts/hash_manifest.py, scripts/validate_outputs.py | config/expected_outputs.json, logs/actual_manifest.json | VERIFIED (QA 2026-04-12: VALIDATION PASSED; manifests aligned in session) |