"Publication-grade medical prediction workflow with strict anti-data-leakage controls, phenotype-definition safeguards, lineage-based leakage detection, split-protocol verification, class-imbalance policy validation, hyperparameter-tuning isolation checks, falsification tests, and reproducibility gates. Use when building, reviewing, or debugging disease risk or prognosis models in EHR/claims/registry data, especially when target definitions, diagnosis codes, lab criteria, medications, temporal windows, and derived features can leak target information."
npx skillscat add furinaaa-cancan/medical-ml-leakage-guard Install via the SkillsCat registry.
ML Leakage Guard
AI Operation Guide (Quick Dispatch)
When the user makes a request, pick the action path from the decision table below:
User intent → command
| What the user says | What you should do |
|---|---|
| "Train a model for me" / "Run a prediction" | python3 scripts/mlgg.py play (launches the interactive wizard) |
| "Train on my data" / "I have a CSV" | python3 scripts/mlgg.py play → choose "Use your own dataset" |
| "Show the training results" / "How did it go" | python3 scripts/quick_summary.py <output_dir> |
| "Download a test dataset" | python3 examples/download_real_data.py <name> (heart/breast/pima/mammographic/thyroid/eeg_eye/vitaldb/framingham/diabetes130/diabetes130_full/rhc/sepsis_survival) |
| "Download a CDC dataset" | python3 examples/download_cdc_data.py <name> (brfss/nhis/covid/all) |
| "Download the NHANES dataset" | python3 examples/download_nhanes.py --cycles both --output examples/nhanes_diabetes.csv |
| "Download NCI cancer data" | python3 examples/download_nci_gdc.py --output examples/nci_gdc_cancer_survival.csv |
| "Review a paper's Methods (Qwen)" | DASHSCOPE_API_KEY=sk-... python3 experiments/paper/review_methods_llm.py --pmcid PMCxxxxxx |
| "Compare Methods vs. code" | python3 experiments/paper/compare_methods_vs_code.py --methods-dir ... --audit-log ... --blind-list ... --output ... |
| "Statistical analysis" | python3 experiments/paper/statistical_analysis.py --output experiments/paper/output/statistical_results.json |
| "Run the pipeline in batch overnight" | nohup bash experiments/overnight_pipeline_run.sh > experiments/overnight_run.log 2>&1 & |
| "Strict audit" / "Publication-grade validation" | python3 scripts/mlgg.py workflow --strict |
| "Check the environment" / "Installation problems" | python3 scripts/mlgg.py doctor |
| "Initialize a project" | python3 scripts/mlgg.py onboarding |
| "Compare two runs" | python3 scripts/compare_runs.py --run-a <dir1> --run-b <dir2> |
| "Generate a remediation plan" | python3 scripts/remediation_plan.py --evidence-dir <dir> |
| "Explain why a gate failed" | python3 scripts/explain_gate.py --report <gate_report.json> |
| "Check code for data leakage" | python3 scripts/mlgg.py lint check <file.py> |
| "Check code (JSON for an agent)" | python3 scripts/mlgg.py lint check <file.py> --format json |
| "Check code (CI gating)" | python3 scripts/mlgg.py lint check <dir> --exit-code |
| "SHAP interpretability" / "Feature importance" | python3 scripts/shap_interpretability_gate.py --model-pool evidence/model_pool.pkl --train-data data/train.csv --test-data data/test.csv --target-col y --report evidence/shap_interpretability_report.json |
| "Data exploration" / "Is the sample size sufficient" / "EPV" | python3 scripts/cohort_definition_gate.py --data data.csv --target-col y --id-col patient_id --report evidence/cohort_report.json |
| "Cross-sectional data" / "Survey data" / "NHANES" | python3 scripts/split_data.py --input data.csv --strategy stratified_grouped --cross-sectional --patient-id-col patient_id --target-col y --output-dir data/ |
| "How is the calibration" / "calibration slope" | See calibration_metrics() in _gate_utils.py: calibration intercept/slope, O:E, ECE, Hosmer-Lemeshow, Brier Skill Score |
| "NRI IDI" / "Improvement in model comparison" | Call compute_nri_idi(y_true, y_old, y_new) in _gate_utils.py: categorical NRI, continuous NRI, IDI |
| "Learning curve" / "Is there enough data" | Call learning_curve_data(estimator, X_train, y_train, X_test, y_test) in _gate_utils.py |
| "VIF" / "Collinearity" / "Multicollinearity" | Call compute_vif(X, feature_names) in _gate_utils.py: VIF > 5 is a warning, > 10 is severe |
| "Nonlinearity" / "Linearity assumption" / "spline" | Call check_nonlinearity(X, y, feature_names) in _gate_utils.py: likelihood-ratio (LR) test |
| "MNAR" / "Missing not at random" / "Sensitivity analysis" | Call mnar_sensitivity_analysis(...) in _gate_utils.py: δ-adjustment + tipping point |
| "Temporal drift" / "Calibration drift" / "concept drift" | Call temporal_drift_analysis(y_true, y_score, times) in _gate_utils.py: CUSUM detection |
| "Model Card" / "Model documentation" | Call generate_model_card(...) in _gate_utils.py: auto-generates Markdown |
| "Imputation sensitivity" / "Switch imputation method" | Call imputation_sensitivity(X_raw, y, estimator, features) in _gate_utils.py |
| "Subgroup DCA" / "Fairness net benefit" | Call subgroup_dca(y_true, y_score, groups) in _gate_utils.py: equity gap |
| "Baseline comparison" / "How much better than random" | Call baseline_comparisons(y_true, y_score, y_pred) in _gate_utils.py: AUROC over random + BSS |
| "Ablation experiment" / "ablation" / "Drop features" | Call feature_ablation(estimator, X_train, y_train, X_test, y_test, features) in _gate_utils.py |
| "Training time" / "Compute resources" / "Hardware" | Call compute_resource_report(t0, t1, model_name, n_train, n_features) in _gate_utils.py |
| "Show the list of lint rules" | python3 scripts/mlgg.py lint rules |
| "Review a paper (from metadata)" | python3 scripts/score_paper_metadata.py --metadata <metadata.json> |
| "Review papers in batch" | python3 scripts/score_paper_metadata.py --batch-dir papers/ |
| "Collect papers with code from PMC" | python3 experiments/paper/collect_papers_with_code.py --output <out.jsonl> |
| "Verify paper repo quality" | python3 experiments/paper/verify_repos.py --input <in.jsonl> --output <out.jsonl> |
| "Batch-scan paper code for leakage" | python3 experiments/paper/scan_published_repos.py --manifest <verified.jsonl> --output <out.json> |
Five common commands (cover 90% of scenarios)
# 1. One-step trial for newcomers (recommended entry point)
python3 scripts/mlgg.py play
# 2. Quick look at the results
python3 scripts/quick_summary.py ~/Desktop/MLGG_Output/breast_cancer
# 3. Download a real dataset
python3 examples/download_real_data.py breast --output /tmp/breast.csv
# 4. Strict publication-grade flow
python3 scripts/mlgg.py onboarding && python3 scripts/mlgg.py workflow --strict
# 5. Environment diagnostics
python3 scripts/mlgg.py doctor
Steps for adding a new dataset
- Add the download URL to the URLS dictionary in examples/download_real_data.py
- Create a prepare_<name>() function (follow the format of the existing functions)
- Call add_patient_id_and_time(df, seed=N) (the seed must be unique)
- Output column order: patient_id, event_time, y, features...
- Add it to the PREPARE dictionary and the CLI choices
- Add the i18n strings and a PLAY_DOWNLOAD_DATASETS entry in scripts/mlgg_pixel.py
- Test: python3 examples/download_real_data.py <name> --output /tmp/test.csv
Steps for adding a new model family
Modify 5 places in scripts/train_select_evaluate.py:
1. The SUPPORTED_MODEL_FAMILIES set
2. _family_grid(): hyperparameter grid
3. _build_estimator_for_family(): Pipeline construction
4. _family_base_complexity(): complexity ranking
5. _family_friendly_name(): display name
Modify 4 places in scripts/mlgg_pixel.py:
6. The MODEL_POOL list
7. The BASE_FAMILY_GRID_SIZES dictionary
8. The _T i18n strings
9. MODEL_PROFILE_PRESETS (balanced/comprehensive)
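Steps for adding a new gate
All gate scripts must follow the unified CLI contract:
- CLI arguments: use add_common_arguments(parser), or add --report, --strict, and --timeout manually
- Timing: call start_gate_timer() at the entry point
- Report output: use build_report_envelope() to produce the standard envelope format
- Terminal output: use print_gate_summary() to print a structured summary
- Exit logic: should_fail = bool(failures) or (args.strict and bool(warnings)); return 2 if should_fail else 0
- Registration: register the gate name and path in _gate_registry.py
- No manual syncing of gate lists: the following utility scripts read the gate list dynamically from _gate_registry.py, so a new gate takes effect automatically:
  - scripts/report_health_check.py → EXPECTED_REPORTS
  - scripts/remediation_plan.py → GATE_ORDER
  - scripts/evidence_digest.py → gate_files
  - scripts/compare_runs.py → REPORT_FILES
- Still updated by hand: scripts/render_user_summary.py → DEFAULT_GATE_FILES (display subset only) and scripts/run_strict_pipeline.py → gate_script_inputs (manifest fingerprint)
- Tests: create a matching test file under tests/ with coverage ≥85%
Strictly forbidden:
- Custom strict-mode logic (such as a warning_is_blocking() filter)
- Bypassing the effect of --strict on warnings
- Manually promoting warnings into the failures list (the should_fail logic handles this uniformly)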
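The exit rule in the contract above can be sketched as a small pure function. This is an illustration only: `gate_exit_code` and `gate_status` are hypothetical names, not helpers from `_gate_utils.py`, and the real gates wrap this logic in the shared envelope and summary helpers.

```python
def gate_exit_code(failures, warnings, strict):
    """Apply the contract's exit rule: failures always block,
    warnings block only when --strict is set."""
    should_fail = bool(failures) or (strict and bool(warnings))
    return 2 if should_fail else 0


def gate_status(failures, warnings, strict):
    """Map the exit code onto the machine-readable pass/fail status."""
    return "fail" if gate_exit_code(failures, warnings, strict) else "pass"


# A warning-only run passes in normal mode and fails in strict mode.
print(gate_status([], ["hypothetical_warning"], strict=False))
print(gate_status([], ["hypothetical_warning"], strict=True))
```

Keeping the rule in one place is what makes the forbidden patterns above easy to spot: any gate that filters warnings before this step, or promotes them into `failures` by hand, changes strict-mode behavior for that gate alone.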
Steps for adding a new lint rule (R0xx)
- Create r0xx_rule_name.py in plugin/mlgg_lint/rules/, inheriting from BaseRule
- Set id, name, severity, description, remediation, and tags
- Create r0xx_bad.py (triggers a diagnostic) and r0xx_good.py (no diagnostics) in plugin/tests/samples/
- Add test_r0xx_bad_has_diagnostics() and test_r0xx_good_no_r0xx() to plugin/tests/test_engine.py
- Run python3 -m pytest plugin/tests/test_engine.py -v to verify
Rule implementation checklist: every new rule must ship both a bad and a good test sample before it is merged.
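Common error recovery
| Error message | Root cause | Fix |
|---|---|---|
| Unsupported model family | New model not added to SUPPORTED_MODEL_FAMILIES | Update the allowlist (see the 5 places above) |
| candidate_pool_too_small | Fewer than 3 candidate models | Add model families or raise --max-trials-per-family |
| NaN to integer | NaN assigned into a numpy integer array | Use DataFrame.loc[mask, col] = np.nan |
| Training timeout (>20 min) | Large dataset + many models + bootstrap | Reduce the number of models/trials, or use a conservative preset |
| FileNotFoundError | Wrong path, or an earlier step was not run | Check that the CSV exists under data/ |
| R001 FP on utility files | File has fit() but no train_test_split | Fixed in R001: skipped when skip_line is None (ERR-089) |
| R005 FP on unused thresholds | roc_curve result captured in a single variable but result[2] never used | Fixed in R005: checks index-2 access (ERR-090) |
| Empty metadata passes validation | validate_metadata({}) returns 0 issues | Fixed: added REQUIRED field check (ERR-092) |
| BRFSS ZIP filename has spaces | Filenames inside the CDC ZIP have trailing spaces | Fixed: handled with .strip() (ERR-098) |
| NCI GDC disease_type is a list | API returns a list rather than a string | Fixed: takes [0] or a default (ERR-097) |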
Available datasets (14 datasets, 526K rows)
| Dataset | Rows | Source | Download command | Gate coverage |
|---|---|---|---|---|
| Sepsis Survival | 129K | UCI | download_real_data.py sepsis_survival | C (39%) |
| Diabetes 130 Full | 102K | UCI | download_real_data.py diabetes130_full | A (94%) |
| BRFSS 2022 | 100K | CDC | download_cdc_data.py brfss | B (81%) |
| COVID-19 | 100K | CDC | download_cdc_data.py covid | C (39%) |
| NHIS 2022 | 28K | CDC | download_cdc_data.py nhis | A (94%) |
| NCI GDC Cancer | 25K | NCI/NIH | download_nci_gdc.py | A (94%) |
| NHANES | 16K | CDC | download_nhanes.py --cycles both | A (94%) |
| SUPPORT2 | 9K | Vanderbilt | Already downloaded | A (94%) |
| RHC | 5.7K | Vanderbilt | download_real_data.py rhc | A (94%) |
| 4 × small UCI | <1K | UCI | download_real_data.py heart/breast/pima/ckd | B (68-84%) |
Gate coverage: A = 29/31 testable, B = 21-26/31, C = 12/31. See references/dataset-gate-coverage-matrix.md for details.
Gate strictness profiles
| Profile | Use case | EPV floor | Minimum events | L3 reachable? |
|---|---|---|---|---|
| standard | N≥1000, prevalence ≥10% | 10 | 100 | ✅ |
| small_cohort | N=200-1000 | 7 | 50 | ⚠️ must be noted |
| rare_disease | N<200, prevalence <5% | 5 | 20 | ❌ |
| exploratory | Feasibility study | 5 | 20 | ❌ |
Specify the profile in request.json: "thresholds": {"profile": "rare_disease"}. See references/gate-strictness-profiles.md for details.
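To make the profile table concrete, here is a minimal sketch of how the EPV floor and minimum event count could be applied. EPV (events per variable) is taken as the number of outcome events divided by the number of candidate predictor parameters. The `PROFILES` dictionary mirrors the table; the function name and issue codes are hypothetical, not the project's own.

```python
# Thresholds copied from the profile table; names below are illustrative.
PROFILES = {
    "standard": {"min_epv": 10, "min_events": 100},
    "small_cohort": {"min_epv": 7, "min_events": 50},
    "rare_disease": {"min_epv": 5, "min_events": 20},
    "exploratory": {"min_epv": 5, "min_events": 20},
}


def check_sample_size(n_events, n_predictor_params, profile="standard"):
    """Return (epv, issues) for a cohort under the chosen strictness profile."""
    limits = PROFILES[profile]
    epv = n_events / n_predictor_params
    issues = []
    if epv < limits["min_epv"]:
        issues.append("epv_below_floor")  # hypothetical issue code
    if n_events < limits["min_events"]:
        issues.append("too_few_events")  # hypothetical issue code
    return epv, issues


# 60 events with 10 predictor parameters: blocked under "standard",
# acceptable under "rare_disease".
print(check_sample_size(60, 10, "standard"))
print(check_sample_size(60, 10, "rare_disease"))
```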
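Data leakage & academic integrity detection coverage
The project's 33 gates cover the following academic integrity risks:
Data leakage detection (4 gates):
- leakage_gate: row-level overlap, patient ID overlap, temporal crossover (training data later than test data)
- split_protocol_gate: split protocol verification (no patient overlap, temporal ordering, locked seed)
- definition_variable_guard: future-information leakage in phenotype definitions (defining the current label from future events)
- feature_lineage_gate: feature provenance tracing (whether a feature contains label information or future data)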
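Two of the checks listed for leakage_gate, patient ID overlap and temporal crossover, can be sketched in a few lines. This is a simplified illustration, not the gate's implementation: it assumes rows are dictionaries with `patient_id` and `event_time` keys (the column names used by the bundled datasets) and that timestamps are ISO-8601 strings, which sort chronologically.

```python
def find_split_leakage(train_rows, test_rows,
                       id_key="patient_id", time_key="event_time"):
    """Return a list of (issue_code, detail) tuples; an empty list means clean."""
    issues = []

    # Patient-level overlap: the same patient must not sit in both splits.
    shared = {r[id_key] for r in train_rows} & {r[id_key] for r in test_rows}
    if shared:
        issues.append(("patient_id_overlap", sorted(shared)))

    # Temporal crossover: no training row may be later than the earliest test row.
    latest_train = max(r[time_key] for r in train_rows)
    earliest_test = min(r[time_key] for r in test_rows)
    if latest_train > earliest_test:
        issues.append(("temporal_crossover", (latest_train, earliest_test)))
    return issues


train = [{"patient_id": "p1", "event_time": "2020-01-01"},
         {"patient_id": "p2", "event_time": "2020-06-01"}]
test = [{"patient_id": "p2", "event_time": "2020-03-01"},
        {"patient_id": "p3", "event_time": "2021-01-01"}]
for issue in find_split_leakage(train, test):
    print(issue)
```

The issue codes here are placeholders; row-level overlap would need a third check that compares hashed feature rows across splits.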
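Tuning leakage / p-hacking (3 gates):
- tuning_leakage_gate: whether hyperparameter search used test data; verification of the model-selection data source
- model_selection_audit_gate: candidate pool size, selection criteria, presence of selection bias
- evaluation_quality_gate: whether the primary metric has a CI and beats the baseline (guards against selective reporting)
Overfitting & generalization (4 gates):
- generalization_gap_gate: whether the train-test performance gap exceeds the threshold
- covariate_shift_gate: whether feature distributions drift between train and test
- robustness_gate: performance robustness across time slices and groups
- seed_stability_gate: whether results are stable across random seeds
Statistical rigor (3 gates):
- permutation_significance_gate: permutation-test p-value (whether the model beats chance)
- ci_matrix_gate: bootstrap CI completeness (every metric has a confidence interval)
- prediction_replay_gate: whether predictions can be reproduced exactly (guards against result tampering)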
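As a rough illustration of the permutation idea behind permutation_significance_gate, the sketch below shuffles the labels, recomputes the metric, and reports how often the shuffled metric matches or beats the observed one. It uses a plain pairwise AUROC for brevity, which is an assumption for this example; the gate itself works from the configured primary metric and a precomputed null-metrics file.

```python
import random


def auroc(y_true, y_score):
    """Pairwise AUROC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_p_value(y_true, y_score, n_permutations=200, seed=0):
    """Return (observed metric, one-sided permutation p-value)."""
    rng = random.Random(seed)
    observed = auroc(y_true, y_score)
    labels = list(y_true)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(labels)  # break any label-score association
        if auroc(labels, y_score) >= observed:
            hits += 1
    # The +1 correction keeps the p-value strictly above zero.
    return observed, (hits + 1) / (n_permutations + 1)


y_true = [0] * 15 + [1] * 15
y_score = [i / 30 for i in range(30)]  # scores that separate the classes
observed, p_value = permutation_p_value(y_true, y_score)
print(observed, p_value)
```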
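Clinical validity & reporting completeness (3 gates):
- calibration_dca_gate: probability calibration quality + decision curve analysis
- reporting_bias_gate: TRIPOD+AI / PROBAST+AI / STARD-AI checklist compliance
- clinical_metrics_gate: confusion-matrix consistency, full clinical metrics panel
Publication-grade aggregation (2 gates):
- publication_gate: aggregates all gate results + verifies execution signatures
- self_critique_gate: global quality score + reviewer-grade self-critique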
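Missing-value imputation & Pipeline isolation
Missing-value handling (train_select_evaluate.py):
- SimpleImputer (default): median fill + missing-indicator columns
- IterativeImputer (MICE): multiple iterative imputation (--imputation-strategy mice)
- The imputer lives inside the sklearn Pipeline: it is fit on the training set only, and validation/test sets are only transformed
- Feature filter threshold: strict mode drops features with a missing rate >60%
Pipeline isolation guarantees:
Each candidate model's Pipeline is structured as imputer → scaler → classifier:
- The imputer's statistics (median/parameters) are computed from the training set only
- The scaler's mean/standard deviation are computed from the training set only
- The classifier is fit on the training set only
- Validation/test sets only go through transform + predict and do not affect any parameter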
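The fit-on-train, transform-elsewhere rule can be shown with a dependency-free stand-in for the imputer step. `TrainOnlyMedianImputer` is a hypothetical toy class written for this example; in the project the same behavior comes from placing sklearn's imputer inside the Pipeline.

```python
from statistics import median


class TrainOnlyMedianImputer:
    """Toy imputer: learns the median in fit(), reuses it in transform()."""

    def fit(self, column):
        observed = [v for v in column if v is not None]
        self.fill_value_ = median(observed)
        return self

    def transform(self, column):
        # Uses only the value learned in fit(); never looks at `column` stats.
        return [self.fill_value_ if v is None else v for v in column]


train = [1.0, None, 3.0, 5.0]
test = [None, 100.0]

imputer = TrainOnlyMedianImputer().fit(train)  # statistics from train only
train_filled = imputer.transform(train)
test_filled = imputer.transform(test)  # the test value 100.0 never leaks in

print(train_filled)
print(test_filled)
```

Had the imputer been fit on train and test together, the outlying test value would have shifted the fill value, which is exactly the leakage the Pipeline structure prevents.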
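Hyperparameter search isolation (enforced by tuning_leakage_gate):
- model_selection_data: only valid / cv_inner / nested_cv are allowed (test is forbidden)
- early_stopping_data: only none / valid / cv_inner are allowed (test is forbidden)
- preprocessing_fit_scope: must be train_only
- feature_selection_scope: must be train_only
- final_model_refit_scope: only train_only / train_plus_valid_no_test are allowed
All of the above are fail-closed checks: violating any one of them is a failure.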
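A fail-closed check over these five fields might look like the following sketch. The allowed values are copied from the list above; the function name and the shape of the `spec` dictionary are assumptions for illustration, not the gate's actual interface.

```python
# Allowed values per field, as listed above.
ALLOWED = {
    "model_selection_data": {"valid", "cv_inner", "nested_cv"},
    "early_stopping_data": {"none", "valid", "cv_inner"},
    "preprocessing_fit_scope": {"train_only"},
    "feature_selection_scope": {"train_only"},
    "final_model_refit_scope": {"train_only", "train_plus_valid_no_test"},
}


def tuning_violations(spec):
    """Return one entry per field that is missing or outside its allowlist."""
    violations = []
    for key, allowed in ALLOWED.items():
        value = spec.get(key)  # a missing key yields None, which is never allowed
        if value not in allowed:
            violations.append(f"{key}={value!r}")
    return violations


spec = {
    "model_selection_data": "test",  # forbidden
    "early_stopping_data": "valid",
    "preprocessing_fit_scope": "train_only",
    "feature_selection_scope": "train_only",
    # final_model_refit_scope omitted on purpose
}
print(tuning_violations(spec))
```

Using an allowlist rather than a blocklist is what makes the check fail-closed: an unknown or absent value is rejected without needing a rule written for it.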
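Security Hardening
The project has built-in layered defenses covering the following attack surfaces:
Model artifact security:
- HMAC-SHA256 signing: after training, a signature (.pkl.sig) is generated automatically for each .pkl file
- Secure loading: SecureModelLoader verifies the signature before deserialization and refuses to load tampered models
- Size limit: model files over 500MB are rejected automatically (guards against zip bomb attacks)
Evidence integrity:
- A SHA256 manifest (.manifest.json) is generated automatically at the end of training, recording the hash and size of every evidence file
- It can be verified at any time: python3 scripts/_security.py audit evidence/
- Detects tampering, missing files, and sensitive-data exposure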
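The manifest idea, recording a hash and size per evidence file and re-checking them later, can be sketched with the standard library alone. The function names and the manifest layout below are assumptions for illustration; the project's `.manifest.json` format may differ.

```python
import hashlib
from pathlib import Path


def build_manifest(evidence_dir):
    """Map each file's relative path to its SHA256 digest and byte size."""
    root = Path(evidence_dir)
    entries = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            data = path.read_bytes()
            entries[path.relative_to(root).as_posix()] = {
                "sha256": hashlib.sha256(data).hexdigest(),
                "size": len(data),
            }
    return entries


def verify_manifest(evidence_dir, manifest):
    """Return (problem, filename) pairs for missing or modified files."""
    current = build_manifest(evidence_dir)
    problems = []
    for name, expected in manifest.items():
        if name not in current:
            problems.append(("missing", name))
        elif current[name] != expected:
            problems.append(("tampered", name))
    return problems
```

A typical use would be to call `build_manifest("evidence")` once training ends, store the result, and run `verify_manifest` before any claim is released. Reading whole files into memory keeps the sketch short; large artifacts would be hashed in chunks.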
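Input validation:
- safe_path() / resolve_path(): path traversal protection (null-byte injection, .. escapes, system-directory blocking, sandbox enforcement via Path.relative_to())
- safe_load_json(): JSON size limit (100MB) + nesting depth limit (50 levels) to prevent stack overflow and memory exhaustion
- check_csv_row_limit(): CSV row limit to prevent memory-exhaustion DoS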
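The sandbox check based on `Path.relative_to()` can be illustrated as follows. `resolve_inside` is a hypothetical helper written for this example and covers only the null-byte and escape cases; it is not the project's `safe_path()` and omits the system-directory blocking.

```python
from pathlib import Path


def resolve_inside(sandbox, user_path):
    """Resolve user_path under sandbox, raising ValueError if it escapes."""
    if "\x00" in str(user_path):
        raise ValueError("null byte in path")
    root = Path(sandbox).resolve()
    candidate = (root / user_path).resolve()  # collapses any ".." segments
    try:
        candidate.relative_to(root)  # raises ValueError when outside root
    except ValueError:
        raise ValueError(f"path escapes sandbox: {user_path}") from None
    return candidate
```

Resolving before comparing matters: a plain string-prefix test would accept `data/../../etc/passwd`, while the resolved path makes the escape visible. An absolute `user_path` replaces the sandbox root when joined, so it is rejected by the same check.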
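Cryptographic safety:
- All HMAC/signature comparisons must use hmac.compare_digest() (constant-time comparison, guards against timing attacks)
- Using == / != to compare any cryptographic value is forbidden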
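A minimal sketch of signing and verifying with HMAC-SHA256 and a constant-time comparison is shown below. The helper names are illustrative and the key handling is deliberately simplified; the project's own signing code and key file management are not reproduced here.

```python
import hashlib
import hmac


def sign_bytes(key, payload):
    """Return the hex HMAC-SHA256 tag of payload under key."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_bytes(key, payload, signature_hex):
    """Check a tag with hmac.compare_digest instead of == to avoid timing leaks."""
    expected = sign_bytes(key, payload)
    return hmac.compare_digest(expected, signature_hex)


key = b"example-key-not-for-production"  # placeholder key for the demo
model_bytes = b"serialized model contents"
tag = sign_bytes(key, model_bytes)

print(verify_bytes(key, model_bytes, tag))
print(verify_bytes(key, model_bytes + b"x", tag))
```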
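Privacy protection:
- perturb_predictions(): perturbs predicted probabilities with the Laplace mechanism to defend against membership inference attacks
- Sensitive-data scanning: the audit tool automatically scans evidence files for API keys, passwords, tokens, PEM private keys, medical identifiers (MRN/insurance_id), and similar values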
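As a sketch of the Laplace mechanism applied to predicted probabilities, the example below adds Laplace noise with scale sensitivity/epsilon and clips the result back into [0, 1]. The function name, the default epsilon, and the sensitivity of 1.0 are assumptions made for this illustration; they are not taken from the project's `perturb_predictions()`.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturb_probabilities_sketch(probabilities, epsilon=1.0,
                                 sensitivity=1.0, seed=None):
    """Add Laplace noise to each probability, then clip to [0, 1]."""
    rng = random.Random(seed)
    scale = sensitivity / epsilon  # smaller epsilon means more noise
    return [min(1.0, max(0.0, p + laplace_noise(scale, rng)))
            for p in probabilities]


noisy = perturb_probabilities_sketch([0.1, 0.5, 0.9], epsilon=5.0, seed=42)
print(noisy)
```

Clipping is post-processing and does not weaken the privacy guarantee, though it does distort calibration near 0 and 1, so perturbed scores are better suited to release than to calibration analysis.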
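Supply-chain verification:
- verify_critical_imports(): verifies at runtime that sklearn/numpy/pandas are the genuine libraries (not monkey-patched)
- .mlgg_model_key: generated automatically, permissions 600, already added to .gitignore
CLI tool: python3 scripts/_security.py [sign|verify|manifest|audit|check-deps]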
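Capability boundaries
What it can do:
- Tabular binary classification for medical prediction (EHR/clinical/registry data)
- Automatic leakage-safe splitting + model training + evaluation + publication-grade audit
- 9 real datasets + custom CSV (Chinese column names supported)
- 20 sklearn model families + 4 optional backends
- Security hardening: HMAC signing + evidence manifest + path traversal protection + membership inference defense
What it cannot do:
- Non-tabular data such as images, text, or time series
- Multiclass or regression tasks (binary classification only)
- Deep learning models (TabNet, Transformer, etc.)
- Model deployment / API serving
- Interactive visualization dashboards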
Objective (Goal Clarity)
Solve one narrow problem: produce leakage-safe, publication-grade medical prediction evidence.
Success is binary:
- pass: all hard gates pass and self-critique score reaches threshold.
- fail: any hard gate fails or strict review conditions are not met.
Never produce publication-grade claims without machine-checkable evidence artifacts.
Input Contract (Structured Input)
Accept a structured request JSON, not free-form text.
Data input modes:
- Pre-split mode: user provides separate train/valid/test CSV files.
- Single-file mode: user provides one complete CSV; use scripts/split_data.py to auto-split with patient-level disjointness, temporal ordering, and prevalence checks. The interactive wizard (mlgg interactive --command train) and onboarding (mlgg onboarding --input-csv) support this mode natively.
Required fields:
- study_id
- run_id
- target_name
- prediction_unit
- index_time_col
- label_col
- patient_id_col
- primary_metric
- claim_tier_target (leakage-audited or publication-grade)
- phenotype_definition_spec
- split_paths.train
- split_paths.test
Publication-grade required fields:
- feature_lineage_spec
- feature_group_spec
- split_protocol_spec
- imbalance_policy_spec
- missingness_policy_spec
- tuning_protocol_spec
- performance_policy_spec
- reporting_bias_checklist_spec
- execution_attestation_spec
- model_selection_report_file
- feature_engineering_report_file
- distribution_report_file
- robustness_report_file
- seed_sensitivity_report_file
- evaluation_report_file
- prediction_trace_file
- external_cohort_spec
- external_validation_report_file
- ci_matrix_report_file
- evaluation_metric_path
- permutation_null_metrics_file
- actual_primary_metric
- primary_metric must be pr_auc for publication-grade strict mode.
- evaluation_metric_path terminal token must match primary_metric (after normalization).
Optional threshold keys under thresholds:
- alpha and min_delta for permutation significance gate.
- min_baseline_delta, ci_min_resamples, and ci_max_width for evaluation quality gate.
Path semantics:
- All relative paths in request JSON are resolved relative to the request file directory.
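The path rule above can be sketched as a small helper: relative entries are joined to the directory that holds the request file, not to the current working directory. `resolve_request_path` is a hypothetical name used for illustration only.

```python
from pathlib import Path


def resolve_request_path(request_file, value):
    """Resolve a path from request JSON relative to the request file's folder."""
    path = Path(value)
    if path.is_absolute():
        return path
    return (Path(request_file).resolve().parent / path).resolve()
```

Under this rule, a request at `configs/request.json` that lists `../data/train.csv` points at `data/train.csv` in the project root, regardless of where the command is launched from.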
Templates:
- references/request-schema.example.json
- references/feature-lineage.example.json
- references/split-protocol.example.json
- references/imbalance-policy.example.json
- references/missingness-policy.example.json
- references/tuning-protocol.example.json
- references/performance-policy.example.json
- references/external-cohort-spec.example.json
- references/reporting-bias-checklist.example.json
- references/execution-attestation.example.json
- references/attestation-payload.example.json
- references/key-revocations.example.json
- references/attestation-timestamp-record.example.json
- references/attestation-transparency-record.example.json
- references/attestation-execution-receipt-record.example.json
- references/attestation-execution-log-record.example.json
- references/attestation-witness-record.example.json
- references/evaluation-report.example.json
- references/external-validation-report.example.json
- references/prediction-trace.example.csv
Validate request first:
python3 scripts/request_contract_gate.py \
--request configs/request.json \
--report evidence/request_contract_report.json \
--strict
Hidden Workflow (Internal, Fail-Closed)
Use this internal sequence in order:
- Validate request contract.
- Lock data/config fingerprints (manifest_lock.py).
- Run execution attestation gate (execution_attestation_gate.py).
- Run split/time leakage gate (leakage_gate.py).
- Run split protocol gate (split_protocol_gate.py).
- Run covariate-shift gate (covariate_shift_gate.py).
- Run reporting/bias checklist gate (reporting_bias_gate.py).
- Run phenotype-definition leakage gate (definition_variable_guard.py).
- Run lineage leakage gate (feature_lineage_gate.py).
- Run imbalance policy gate (imbalance_policy_gate.py).
- Run missingness policy gate (missingness_policy_gate.py).
- Run tuning leakage gate (tuning_leakage_gate.py).
- Run model-selection audit gate (model_selection_audit_gate.py).
- Run feature-engineering audit gate (feature_engineering_audit_gate.py).
- Run clinical-metrics gate (clinical_metrics_gate.py).
- Run prediction-replay gate (prediction_replay_gate.py).
- Run distribution-generalization gate (distribution_generalization_gate.py).
- Run generalization-gap gate (generalization_gap_gate.py).
- Run robustness gate (robustness_gate.py).
- Run seed-stability gate (seed_stability_gate.py).
- Run external-validation gate (external_validation_gate.py).
- Run calibration+DCA gate (calibration_dca_gate.py).
- Run CI-matrix gate (ci_matrix_gate.py).
- Run metric consistency gate (metric_consistency_gate.py).
- Run evaluation quality gate (evaluation_quality_gate.py).
- Run permutation falsification gate (permutation_significance_gate.py).
- Aggregate publication gate (publication_gate.py).
- Run self-critique scoring gate (self_critique_gate.py).
- Run security audit gate (security_audit_gate.py).
- Run fairness & equity gate (fairness_equity_gate.py).
- Run sample size adequacy gate (sample_size_gate.py).
- Emit final report only if all strict gates pass.
Treat execution-attestation failures (signature/fingerprint/key-revocation/timestamp/transparency/execution-receipt/execution-log/witness-quorum/cross-role-authority-distinctness), disease-definition leakage, lineage ambiguity, metric-source ambiguity, split protocol violations, covariate-shift anomalies, class-imbalance misuse, missingness/imputation misuse, and tuning/test leakage as critical failures in strict mode.
Output Contract (Machine-Parseable)
Produce these deterministic artifacts:
- evidence/request_contract_report.json
- evidence/manifest.json
- evidence/execution_attestation_report.json
- evidence/reporting_bias_report.json
- evidence/leakage_report.json
- evidence/split_protocol_report.json
- evidence/covariate_shift_report.json
- evidence/definition_guard_report.json
- evidence/lineage_report.json
- evidence/imbalance_policy_report.json
- evidence/missingness_policy_report.json
- evidence/tuning_leakage_report.json
- evidence/model_selection_audit_report.json
- evidence/feature_engineering_audit_report.json
- evidence/clinical_metrics_report.json
- evidence/prediction_replay_report.json
- evidence/distribution_generalization_report.json
- evidence/generalization_gap_report.json
- evidence/robustness_gate_report.json
- evidence/seed_stability_report.json
- evidence/external_validation_gate_report.json
- evidence/calibration_dca_report.json
- evidence/ci_matrix_gate_report.json
- evidence/metric_consistency_report.json
- evidence/evaluation_quality_report.json
- evidence/permutation_report.json
- evidence/publication_gate_report.json
- evidence/self_critique_report.json
- evidence/security_audit_gate_report.json
- evidence/fairness_equity_report.json
- evidence/sample_size_report.json
- evidence/dag_pipeline_report.json
Report status from each file must be machine-readable (pass or fail) with issue codes.
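Because every report carries a machine-readable status, a consumer can aggregate them without parsing prose. The sketch below assumes each report is a JSON object with a top-level `status` key holding `pass` or `fail`; that key name is an assumption for illustration, and anything unreadable or unexpected is treated as a failure to stay fail-closed.

```python
import json
from pathlib import Path


def summarize_reports(evidence_dir):
    """Map each *_report.json file name to 'pass' or 'fail'."""
    summary = {}
    for path in sorted(Path(evidence_dir).glob("*_report.json")):
        try:
            report = json.loads(path.read_text(encoding="utf-8"))
            status = report.get("status")  # assumed key name
        except (OSError, json.JSONDecodeError, AttributeError):
            status = None
        # Anything other than an explicit pass/fail counts as a failure.
        summary[path.name] = status if status in ("pass", "fail") else "fail"
    return summary


def all_passed(summary):
    """True only when at least one report exists and every one passed."""
    return bool(summary) and all(s == "pass" for s in summary.values())
```

An empty evidence directory returns `False` from `all_passed`, so a run that produced no reports cannot be mistaken for a clean one.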
Quality Control (Self-Critique)
Do not stop at initial gate pass.
Run self_critique_gate.py to score evidence quality and produce recommendations.
Publication-grade readiness requires:
- Strict-mode component reports.
- No blocking failures.
- Self-critique score at or above threshold (default 95).
Composability (Workflow Node Ready)
Each script is a composable node:
- Deterministic CLI interface.
- Deterministic JSON output.
- Deterministic exit code (0 pass, 2 fail).
Use one-command orchestration for production use:
python3 scripts/run_strict_pipeline.py \
--request configs/request.json \
--evidence-dir evidence \
--compare-manifest evidence/manifest_baseline.json \
--strict
Productized one-command wrapper:
python3 scripts/run_productized_workflow.py \
--request configs/request.json \
--evidence-dir evidence \
--allow-missing-compare \
--strict
Novice onboarding wrapper (guided 8-step flow):
python3 scripts/mlgg.py onboarding \
--project-root /tmp/mlgg_demo \
--mode guided \
--yes
Onboarding contract:
- scripts/mlgg_onboarding.py is strict-only (no policy downgrade path).
- Failure behavior:
  - default --stop-on-fail (fail-fast)
  - optional --no-stop-on-fail (collect full diagnostics while keeping fail-closed result)
  - guided mode without interactive stdin fails closed with onboarding_interactive_input_unavailable (use --yes or --mode auto)
  - wrapper route-conflict failure code: authority_preset_route_override_forbidden
- Modes:
  - guided: step-by-step command preview + confirmation.
  - preview: print the full 8-step command plan only; report includes preview_only=true and display_status=preview.
  - auto: execute all steps non-interactively.
- Step order is fixed:
  1. env_doctor.py
  2. init_project.py
  3. generate_demo_medical_dataset.py
  4. config alignment to demo schema (request/lineage/group/external spec)
  5. train_select_evaluate.py
  6. generate_execution_attestation.py (+ keypair bootstrap if needed)
  7. run_productized_workflow.py --strict --allow-missing-compare
  8. run_productized_workflow.py --strict --compare-manifest ...
- Required report:
  - evidence/onboarding_report.json (contract_version=onboarding_report.v2)
  - report fields include stop_on_fail, termination_reason, failure_codes, next_actions, copy_ready_commands, preview_only, display_status
  - copy_ready_commands uses absolute mlgg.py path so commands are runnable from any working directory.
- Offline demo data artifacts:
  - data/train.csv, data/valid.csv, data/test.csv
  - data/external_2025_q4.csv (cross_period)
  - data/external_site_b.csv (cross_institution)
The productized wrapper (run_productized_workflow.py) runs:
1. env_doctor.py
2. schema_preflight.py
3. run_strict_pipeline.py
4. render_user_summary.py
For first-run baseline bootstrap, you may omit --compare-manifest only with --allow-missing-compare:
- run_strict_pipeline.py always enforces --strict for publication-grade execution.
- --allow-missing-compare is bootstrap-only for artifact generation; publication-grade readiness still fails until baseline manifest comparison exists.
- run_strict_pipeline.py is publication-grade only; non-publication claim tiers are rejected.
Personal UX Quickstart (Signed Attestation)
Create keypair once:
mkdir -p keys
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:3072 -out keys/attestation_priv.pem
openssl pkey -in keys/attestation_priv.pem -pubout -out keys/attestation_pub.pem
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:3072 -out keys/timestamp_priv.pem
openssl pkey -in keys/timestamp_priv.pem -pubout -out keys/timestamp_pub.pem
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:3072 -out keys/execution_priv.pem
openssl pkey -in keys/execution_priv.pem -pubout -out keys/execution_pub.pem
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:3072 -out keys/execution_log_priv.pem
openssl pkey -in keys/execution_log_priv.pem -pubout -out keys/execution_log_pub.pem
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:3072 -out keys/witness_a_priv.pem
openssl pkey -in keys/witness_a_priv.pem -pubout -out keys/witness_a_pub.pem
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:3072 -out keys/witness_b_priv.pem
openssl pkey -in keys/witness_b_priv.pem -pubout -out keys/witness_b_pub.pem
Generate payload + signature + spec in one command:
python3 scripts/generate_execution_attestation.py \
--study-id sepsis-risk-icu-v1 \
--run-id sepsis-risk-icu-v1-train-2026-02-24-001 \
--payload-out evidence/attestation_payload.json \
--signature-out evidence/attestation.sig \
--spec-out configs/execution_attestation.json \
--private-key-file keys/attestation_priv.pem \
--public-key-file keys/attestation_pub.pem \
--timestamp-private-key-file keys/timestamp_priv.pem \
--timestamp-public-key-file keys/timestamp_pub.pem \
--execution-private-key-file keys/execution_priv.pem \
--execution-public-key-file keys/execution_pub.pem \
--execution-log-private-key-file keys/execution_log_priv.pem \
--execution-log-public-key-file keys/execution_log_pub.pem \
--require-independent-timestamp-authority \
--require-independent-execution-authority \
--require-independent-log-authority \
--require-witness-quorum \
--min-witness-count 2 \
--require-independent-witness-keys \
--require-witness-independence-from-signing \
--witness "witness-a|keys/witness_a_pub.pem|keys/witness_a_priv.pem" \
--witness "witness-b|keys/witness_b_pub.pem|keys/witness_b_priv.pem" \
--command "python train.py --config configs/train_config.json --seed 42" \
--artifact training_log=evidence/train.log \
--artifact training_config=configs/train_config.json \
--artifact model_artifact=models/model_v1.bin \
--artifact evaluation_report=evidence/evaluation_report.json \
--artifact prediction_trace=evidence/prediction_trace.csv.gz \
--artifact external_validation_report=evidence/external_validation_report.json
This command also creates:
- configs/key_revocations.json (bootstrapped if missing)
- evidence/attestation_timestamp_record.json + .sig
- evidence/attestation_transparency_record.json + .sig
- evidence/attestation_execution_receipt_record.json + .sig
- evidence/attestation_execution_log_record.json + .sig
- evidence/attestation_witness_record_1.json + .sig
- evidence/attestation_witness_record_2.json + .sig
Manual Strict Execution Order
If orchestration is unavailable, run in this exact order:
1. request_contract_gate.py
2. manifest_lock.py (with optional --compare-with)
3. execution_attestation_gate.py
4. leakage_gate.py
5. split_protocol_gate.py
6. covariate_shift_gate.py
7. reporting_bias_gate.py
8. definition_variable_guard.py
9. feature_lineage_gate.py
10. imbalance_policy_gate.py
11. missingness_policy_gate.py
12. tuning_leakage_gate.py
13. model_selection_audit_gate.py
14. feature_engineering_audit_gate.py
15. clinical_metrics_gate.py
16. prediction_replay_gate.py
17. distribution_generalization_gate.py
18. generalization_gap_gate.py
19. robustness_gate.py
20. seed_stability_gate.py
21. external_validation_gate.py
22. calibration_dca_gate.py
23. ci_matrix_gate.py
24. metric_consistency_gate.py
25. evaluation_quality_gate.py
26. permutation_significance_gate.py
27. publication_gate.py
28. self_critique_gate.py
29. security_audit_gate.py
30. fairness_equity_gate.py
31. sample_size_gate.py
Note: Steps 30-31 run in METRIC_VALIDATION layer (parallel with steps 16-26 in DAG mode). In manual sequential mode, run them after step 29 to ensure all dependencies are available.
If any step returns non-zero, stop and block claim release.
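The stop-on-first-failure rule for manual execution can be sketched as a tiny runner. This is an illustration under stated assumptions: `run_in_order` is a hypothetical helper, and the commands a caller passes in (script paths and flags) must be the real ones for their project; none are filled in here.

```python
import subprocess


def run_in_order(commands):
    """Run each command in sequence and stop at the first non-zero exit.

    Returns None when every step succeeded, otherwise a tuple of
    (1-based step number, exit code) for the step that blocked the run.
    """
    for index, command in enumerate(commands, start=1):
        result = subprocess.run(command)
        if result.returncode != 0:
            return index, result.returncode
    return None
```

A caller would build the list from the 31 steps above, each as an argument list such as `["python3", "scripts/request_contract_gate.py", ...]`, and treat any non-`None` return value as a blocked claim release.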
Medical Non-Negotiable Rules
- Never tune on test data.
- Never fit preprocessors on combined train+validation+test.
- Never apply resampling/SMOTE on validation or test splits.
- Never select thresholds or calibrate probabilities on test split.
- Never fit imputers on validation/test distributions.
- Never use target/outcome information for feature imputation.
- Never run MICE at oversized scale without audited fallback evidence (mice_with_scale_guard).
- Never ignore severe train-vs-holdout distribution separability without explicit mitigation and downgrade.
- Never perform model ranking/selection with any test-derived signal.
- Never release without full split-level clinical metrics (accuracy/precision/PPV/NPV/sensitivity/specificity/F1/F2-beta/ROC-AUC/PR-AUC/Brier).
- Never ignore train/valid/test gap breaches beyond configured fail thresholds.
- Never claim publication-grade without signed execution attestation proving run command, timing, and artifact hashes.
- Never reuse revoked/expired/over-age signing keys for publication-grade claims.
- Never omit trusted timestamp or transparency-log records for publication-grade claims.
- Never omit signed execution-receipt proof (with exit code and timing consistency) for publication-grade claims.
- Never omit signed execution-log attestation binding training_log to payload hash for publication-grade claims.
- Never omit witness-quorum evidence with independent witness keys and minimum validated witness count for publication-grade claims.
- Never claim publication-grade if TRIPOD+AI/PROBAST+AI checklist has unmet required items.
- Never accept publication-grade primary metrics from non-test evaluation splits; evaluation report must explicitly declare split=test.
- Never claim publication-grade without valid primary-metric confidence interval and explicit baseline comparison in the evaluation artifact.
- Never include variables used to define the disease label as model predictors.
- Never include derived features whose lineage contains disease-defining variables.
- Never include post-index features for pre-index prediction tasks.
- Never report point estimates without uncertainty and robustness checks.
- Never claim causality from predictive associations.
- Never publish subgroup predictions without fairness/equity assessment (equalized odds, disparate impact).
- Never claim adequate sample size without EPV ≥ 10 justification (Riley et al. 2019).
- Never omit IDI/NRI when comparing against baseline models for top-tier journals.
- Never use ICD diagnostic codes from the same admission as predictors without verifying temporal precedence.
- Never claim TRIPOD+AI adherence without the 2024 expanded 27-item checklist (BMJ 2024;385:e078378).
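The fairness rule above names two quantities, equalized odds and disparate impact. A minimal sketch of both follows, using one common formulation: the equalized-odds gap is the larger of the TPR and FPR spreads across subgroups, and the disparate impact ratio is the lowest subgroup selection rate divided by the highest, checked against the four-fifths (0.8) rule. The function names and output keys are hypothetical, and the project's gate may define the gap differently.

```python
def group_rates(y_true, y_pred):
    """Return (TPR, FPR, selection rate) for one subgroup of binary labels."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("subgroup needs both positive and negative cases")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / positives, fp / negatives, (tp + fp) / len(y_true)


def fairness_summary(y_true, y_pred, groups):
    """Compute the equalized-odds gap and disparate impact ratio across groups."""
    by_group = {}
    for t, p, g in zip(y_true, y_pred, groups):
        labels, preds = by_group.setdefault(g, ([], []))
        labels.append(t)
        preds.append(p)
    rates = [group_rates(ts, ps) for ts, ps in by_group.values()]
    tprs, fprs, sels = zip(*rates)
    eo_gap = max(max(tprs) - min(tprs), max(fprs) - min(fprs))
    di_ratio = min(sels) / max(sels) if max(sels) > 0 else 1.0
    return {
        "equalized_odds_gap": eo_gap,
        "disparate_impact_ratio": di_ratio,
        "passes_four_fifths": di_ratio >= 0.8,
    }


summary = fairness_summary(
    y_true=[1, 1, 0, 0, 1, 1, 0, 0],
    y_pred=[1, 1, 0, 0, 1, 0, 0, 0],
    groups=["A", "A", "A", "A", "B", "B", "B", "B"],
)
print(summary)
```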
Resources
scripts/
- scripts/run_strict_pipeline.py: single-entry strict orchestrator.
- scripts/request_contract_gate.py: request schema/path validation and publication-policy anti-downgrade checks.
- scripts/mlgg.py: unified command entrypoint (onboarding, interactive, init, train, workflow, ...).
- scripts/mlgg_onboarding.py: novice-guided strict onboarding flow and report emitter.
- scripts/split_data.py: split a single CSV into train/valid/test with patient-level disjoint, temporal ordering, prevalence safety checks, NaN patient_id/target exclusion, row count preservation, SHA256 input fingerprint, min 10 pos/neg per split, min 5 patients per split, and prevalence shift warning.
- scripts/generate_demo_medical_dataset.py: offline reproducible demo dataset generator.
- scripts/manifest_lock.py: dataset/protocol/evaluation/gate-script fingerprint and baseline comparison.
- scripts/execution_attestation_gate.py: signed run-attestation and artifact-hash verification gate.
- scripts/generate_execution_attestation.py: one-command payload/signature/spec/timestamp/transparency/execution-receipt/execution-log/witness-quorum generator for personal users.
- scripts/reporting_bias_gate.py: TRIPOD+AI / PROBAST+AI / STARD-AI checklist hard gate.
- scripts/leakage_gate.py: split contamination, ID overlap, and temporal boundary checks.
- scripts/split_protocol_gate.py: enforce split protocol consistency and temporal/group safeguards.
- scripts/covariate_shift_gate.py: train-vs-holdout covariate-shift and split separability risk gate.
- scripts/definition_variable_guard.py: hard gate against disease-definition variable leakage.
- scripts/feature_lineage_gate.py: hard gate against lineage-derived leakage.
- scripts/imbalance_policy_gate.py: validate class-imbalance strategy and train-only resampling policy.
- scripts/missingness_policy_gate.py: validate missing-data strategy, large-scale method suitability, and imputer isolation policy.
- scripts/tuning_leakage_gate.py: validate hyperparameter tuning/test-isolation protocol.
- scripts/model_selection_audit_gate.py: validate candidate pool, one-SE replay, and test-isolated model selection.
- scripts/feature_engineering_audit_gate.py: validate feature-group provenance, train-only engineering scope, stability evidence, and reproducibility fields.
- scripts/clinical_metrics_gate.py: validate clinical metric completeness and confusion-matrix consistency per split.
- scripts/distribution_generalization_gate.py: train-vs-holdout distribution shift, split separability, and transport-readiness gate.
- scripts/generalization_gap_gate.py: fail-closed overfitting gap checks across train/valid/test.
- scripts/ci_matrix_gate.py: bootstrap CI matrix gate for primary metric and transport-drop CI on internal and external cohorts.
- scripts/metric_consistency_gate.py: extract and validate metric from evaluation report.
- scripts/evaluation_quality_gate.py: enforce primary-metric CI quality and baseline improvement checks.
- scripts/permutation_significance_gate.py: falsification significance gate.
- scripts/publication_gate.py: aggregate fail-closed publication gate.
- scripts/self_critique_gate.py: quality scoring and reviewer-grade self-critique gate.
- scripts/train_select_evaluate.py: terminal-ready training, model selection, threshold selection, and evaluation artifact generator.
- scripts/train_select_evaluate.py model-pool controls: --model-pool, --include-optional-models, --max-trials-per-family, --hyperparam-search, --n-jobs.
- scripts/train_select_evaluate.py optional model backends: xgboost and catboost are auto-detected and fail-closed when explicitly requested but unavailable.
- scripts/init_project.py: one-command initialization for configs/, data/, evidence/, models/, keys/, plus configs/request.json.
- scripts/schema_preflight.py: train/valid/test schema checks with semantic column auto-mapping report.
- scripts/env_doctor.py: dependency and environment diagnostics with optional-backend checks.
- scripts/render_user_summary.py: user-facing markdown/json summary from strict evidence artifacts.
- scripts/run_productized_workflow.py: full UX wrapper (doctor -> preflight -> strict pipeline -> user summary).
- scripts/mlgg_interactive.py: terminal interactive wizard for core commands (init/workflow/train/authority) with command preview, confirm-before-run, and profile save/load.
- scripts/mlgg_pixel.py: pixel-art interactive CLI wizard (mlgg.py play) for guided pipeline setup and execution with bilingual (en/zh) support, dataset-size-aware defaults, small-sample strict mode, and play-mode quick-readiness card.
- scripts/_gate_utils.py: shared utility functions (add_issue, load_json, write_json, to_float) for gate scripts.
- scripts/_security.py: security hardening module (HMAC model signing, path traversal protection, secure JSON loading, artifact integrity manifest, membership inference defense, dependency verification, security audit CLI).
- scripts/security_audit_gate.py: 29th pipeline gate (FINAL layer); verifies model HMAC signatures, evidence manifest integrity, dependency authenticity, file permissions, sensitive data exposure, artifact sizes.
- scripts/fairness_equity_gate.py: 30th pipeline gate (METRIC_VALIDATION layer); equalized odds gap across demographic/clinical subgroups, disparate impact ratio (four-fifths rule), per-subgroup PR-AUC validation.
- scripts/sample_size_gate.py: 31st pipeline gate (METRIC_VALIDATION layer); EPV (Riley et al. 2019/2025), shrinkage factor, minimum events/non-events adequacy.
- scripts/policy_generator.py: generate recommended performance_policy.json from evidence reports with configurable margin and presets.
- scripts/gate_timeline.py: analyze gate execution timeline, identify bottleneck gates, compute wall-clock span.
- scripts/gate_coverage_matrix.py: scan evidence directory against full gate registry to produce coverage matrix.
- scripts/evidence_comparator.py: compare two evidence directories side-by-side showing improved/regressed/new/removed gates.
- scripts/evidence_digest.py: generate compact one-page summary from evidence directory.
- scripts/report_health_check.py: scan all gate reports for completeness and pass rate.
- scripts/remediation_plan.py: generate prioritized remediation plan from gate failures.
- scripts/threshold_sensitivity.py: analyze how close metrics sit to pass/fail thresholds.
- scripts/compare_runs.py: compare two pipeline runs side-by-side.
- scripts/export_latex.py: generate LaTeX tables from evaluation/CI/model-selection reports.
- scripts/explain_gate.py: explain a single gate result in human-readable form.
- scripts/quick_summary.py: one-command training results viewer with key metrics, overfitting risk, model selection top-10.
- scripts/audit_external_project.py: 10-dimension quantitative audit tool for evaluating medical ML projects (100-point scale) with journal-specific gap analysis.
- scripts/fairness_equity_gate.py: fail-closed fairness and equity gate; equalized odds gap, disparate impact ratio (four-fifths rule), per-subgroup PR-AUC validation.
- scripts/sample_size_gate.py: fail-closed sample size adequacy gate; EPV (Riley et al.
2019/2025), shrinkage factor, min events/non-events.scripts/batch_journal_review.py: batch audit N projects in parallel with comparison matrix, cross-cutting analysis, and aggregated remediation priorities.experiments/authority-e2e/scan_stress_diabetes_feasibility.py: stress-case diabetes feasibility scanner across target modes and row caps; outputs a fail-closed feasibility report.
plugin/
- `plugin/mlgg_lint/`: AST-based static analysis for ML Python code (10 rules: R001–R010, 57 tests).
  - R001 fit-before-split (ERROR), R002 scaler-on-test (ERROR), R003 resample-on-test (ERROR), R004 split-without-group (WARNING), R005 threshold-on-test (ERROR), R006 feature-selection-on-full (ERROR), R007 target-as-feature (ERROR), R008 temporal-split-shuffle (WARNING), R009 no-confidence-intervals (INFO), R010 train-metric-as-final (WARNING).
  - Detection: keyword args (`fit(X=X_test)`), chained calls (`SMOTE().fit_resample()`), DataFrame origin tracking + `.drop()` re-assignment, Pipeline exclusion, word-boundary variable classification.
  - CLI: `python3 scripts/mlgg.py lint check [--format text|json|sarif] [--exit-code] [--severity warning] [--disable R004,R008] PATH...`
  - Supports `# noqa: R001` / `# noqa` inline suppression and `.mlgg-lint.toml` config auto-discovery.
  - Output: relative paths (no absolute path leakage), ANSI-stripped in no-color mode.
  - Security: 16 MB file limit, 1 MB config limit, symlink skip, stat-error handling, malformed TOML graceful fallback.
  - VS Code extension at `plugin/vscode/` (SARIF-based diagnostics on save/open).
  - Pre-commit hook at `plugin/.pre-commit-hooks.yaml`.
examples/
- `examples/download_real_data.py`: download and prepare 9 real medical datasets (UCI/PhysioNet/GitHub) + 2 synthetic generators.
  - Real datasets: heart(297), breast(569), pima(768), mammographic(961), framingham(4240), vitaldb(6388), thyroid(7200), diabetes130(10000), eeg_eye(14980).
  - All produce pipeline-ready CSV with `patient_id`, `event_time`, `y` columns.
tests/
- `tests/`: 2905+ pytest unit tests covering all gate scripts and analysis tools.
  - Direct `main()` tests for 20+ gate scripts (bypass subprocess for in-process coverage).
  - All gate modules ≥86% coverage; publication_gate 97%, evaluation_quality_gate 94%.
  - Run: `python3 -m pytest tests/ -q --tb=short` (~10 min for full suite).
references/
- `references/Beginner-Quickstart.md`: bilingual novice quickstart (minimal loop + publication-grade loop).
- `references/Troubleshooting-Top20.md`: high-frequency failure code to diagnosis/fix/verify mapping.
- `references/request-schema.example.json`: structured request template.
- `references/feature-lineage.example.json`: lineage map template.
- `references/split-protocol.example.json`: split protocol template.
- `references/imbalance-policy.example.json`: class-imbalance policy template.
- `references/missingness-policy.example.json`: missing-data/imputation policy template.
- `references/tuning-protocol.example.json`: hyperparameter tuning protocol template.
- `references/performance-policy.example.json`: metric panel/threshold/gap policy template.
- `references/reporting-bias-checklist.example.json`: TRIPOD+AI / PROBAST+AI / STARD-AI checklist template.
- `references/execution-attestation.example.json`: signed execution-attestation spec template.
- `references/attestation-payload.example.json`: signed payload template with artifact hashes.
- `references/key-revocations.example.json`: key revocation list template.
- `references/attestation-timestamp-record.example.json`: trusted timestamp record template.
- `references/attestation-transparency-record.example.json`: transparency log record template.
- `references/attestation-execution-receipt-record.example.json`: execution receipt record template.
- `references/attestation-execution-log-record.example.json`: execution-log attestation record template.
- `references/attestation-witness-record.example.json`: witness attestation record template.
- `references/feature-group-spec.example.json`: feature group specification template (groups, train-only scope).
- `references/feature-engineering-report.example.json`: feature-engineering audit report template.
- `references/distribution-report.example.json`: distribution/shift report template.
- `references/ci-matrix-report.example.json`: CI matrix report template.
- `references/external-validation-report.example.json`: external validation report template.
- `references/evaluation-report.example.json`: evaluation metrics report template.
- `references/interactive-profile.example.json`: interactive CLI profile contract example (`contract_version`/`command`/`saved_at_utc`/`argument_values`/`python`/`cwd`).
- `references/benchmark-registry.json`: frozen benchmark dataset registry (contract `benchmark_registry.v1`).
- `references/stress-seed-search-report.v2.example.json`: stress seed/profile search contract template.
- `references/medical-disease-leakage.md`: medical phenotype leakage patterns and controls.
- `references/leakage-taxonomy.md`: leakage classes, red flags, and mitigations.
- `references/top-tier-rigor-checklist.md`: submission-grade hard gates.
- `references/external-benchmark-comparison.md`: external tool/guideline comparison and gap map.
- `references/release-benchmark-suite.md`: structured benchmark profile matrix and pass contract.
- `references/report-template.md`: reporting template for methods/results/robustness.
- `references/error-knowledge-base.json`: self-improving error pattern database with 25 known patterns, agent-appendable.
- `references/journal-rigor-standards.json`: top-tier journal requirements mapped to gates (Nature Medicine, Lancet DH, JAMA, BMJ, npj DM).
- `references/literature-knowledge-base.json`: curated top-journal literature database (30 entries, LIT-001–LIT-030), searchable by category/gate/dimension.
- `references/mlgg-review-standard.json`: independent MLGG Medical ML Review Standard: 10 dimensions × 73 criteria across 3 review levels (quick/standard/comprehensive).
- `references/batch-manifest.example.json`: batch manifest template for multi-project review.
Authority E2E Execution Notes
Recommended single-entry CLI:
- `python3 scripts/mlgg.py <command> [command-args]`
- Examples:
  - `python3 scripts/mlgg.py init --project-root /tmp/mlgg_demo`
  - `python3 scripts/mlgg.py train --interactive`
  - `python3 scripts/mlgg.py interactive --command workflow --profile-name demo --save-profile`
  - `python3 scripts/mlgg.py workflow --request /tmp/mlgg_demo/configs/request.json --strict --allow-missing-compare`
  - `python3 scripts/mlgg.py authority --include-stress-cases`
  - `python3 scripts/mlgg.py benchmark-suite --profile release` (recommended multi-dataset stability verdict)
  - `python3 scripts/mlgg.py benchmark-suite --profile release --repeat 3 --registry-file references/benchmark-registry.json`
  - `python3 scripts/mlgg.py authority-release` (recommended release stress path)
  - `python3 scripts/mlgg.py authority-research-heart --stress-seed-min 20250003 --stress-seed-max 20250060` (research/high-pressure mode)
- preset wrappers are fixed-route; conflicting route flags are rejected fail-closed
- add `--error-json` for machine-readable failures (`contract_version=mlgg_error.v1`)
New-user order of operations:
- `init` -> place split CSVs -> `train` (emit required evidence artifacts) -> `workflow --strict --allow-missing-compare`.
- Follow-up reproducible runs should pass `--compare-manifest <project>/evidence/manifest_baseline.bootstrap.json`.
Interactive wizard defaults:
- Supports `init`/`workflow`/`train`/`authority`.
- Preview command before execution, then require one confirm step.
- Train wizard defaults `--include-optional-models` to off; enable manually only when optional backends are installed.
- Train wizard defaults `--n-jobs` to `1` for cross-platform stability; increase manually for multi-core runs.
- Train wizard default artifact outputs are auto-scoped to split project base (`<project>/evidence`) inferred from train split path.
- Train wizard emits `--external-validation-report-out` only when `external_cohort_spec` is provided.
- Train wizard emits `--feature-engineering-report-out` only when `feature_group_spec` is provided.
- Profile reuse:
  - `--profile-name <name> --save-profile`
  - `--profile-name <name> --load-profile`
  - `--accept-defaults` for non-blocking execution with defaults/profile values
- Profile path defaults to `~/.mlgg/profiles` (override with `--profile-dir`).
- For workflow wizard, `--strict` is always injected and cannot be bypassed by interactive mode.
- Workflow wizard first-run default enables `--allow-missing-compare` when no baseline manifest is provided/found.
- Workflow wizard now auto-suggests evidence output under request project base (`<project>/evidence` when request is under `configs/`).
- Authority wizard now defaults to release-grade stress path (`--include-stress-cases --stress-case-id uci-chronic-kidney-disease`); selecting `uci-heart-disease` is treated as advanced research/high-pressure mode.
Use isolated output paths in concurrent runs:
- `--summary-file`
- `--stress-seed-cache-file`
- `--stress-selection-file`
Optional benchmark case switches:
- `--include-ckd-case` (UCI Chronic Kidney Disease)
- `--include-large-cases` (Diabetes130 large-cohort path)
- `--diabetes-target-mode {lt30,gt30,any}` and `--diabetes-max-rows`
Stress dataset selection:
- `--stress-case-id {uci-diabetes-130-readmission,uci-heart-disease,uci-chronic-kidney-disease,uci-breast-cancer-wdbc}`
- default is `uci-chronic-kidney-disease` (most stable publication-grade stress path in current benchmark set)
Release benchmark blocking suites are `authority_release_core` + `adversarial_fail_closed`; `authority_release_extended` (Diabetes130) is kept as observational/non-blocking in release profile.
Non-blocking authority failures are summarized as `observational_diagnostics` in matrix report and written to `*.observational_diagnostics.json` sidecar.
Case-specific training configuration is enabled in authority E2E:
- larger cohorts (e.g., Diabetes130) use expanded model pool (includes `xgboost` when installed), higher `max-trials-per-family`, and multi-core `--n-jobs`.
Use `--run-tag` to bind all generated stress artifacts to a unique execution token.
Stress seed-search profile bundles are selected with `--stress-profile-set` (default `strict_v1`).
`--stress-seed-search` applies only to `--stress-case-id uci-heart-disease`; other stress cases run without seed search.
CI coverage:
- `.github/workflows/ci-smoke.yml` (push/PR/workflow_dispatch)
- `.github/workflows/ci-full.yml` (nightly/workflow_dispatch release blocking benchmark-suite)
- `.github/workflows/ci-extended.yml` (weekly/workflow_dispatch extended observational benchmark-suite)
Optional diabetes feasibility auto-scan on failure:
- `--auto-scan-diabetes-feasibility`
- `--diabetes-feasibility-target-modes`
- `--diabetes-feasibility-max-rows-options`
- `--diabetes-feasibility-summary-dir`
- `--diabetes-feasibility-report-file`
Summary rows now include strict-pipeline root-cause fields for failed cases:
- `root_failure_code_primary`
- `root_failure_codes`
- `failed_steps`
Summary rows now also include `clinical_floor_gap_summary` with internal/external floor margins (`observed - required_min`) for `sensitivity`/`npv`/`specificity`/`ppv`.
`stress_seed_search_report` v2 contract requires:
- `contract_version`
- `run_tag`
- `policy_sha256`
- `search_profile_set`
- `selected_profile`
- `dataset_fingerprint`
- `code_revision_hint`
Deep Review Fix Log
Session 1 (Fixes applied to request_contract_gate.py, train_select_evaluate.py)
Fix 1 — request_contract_gate.py: wrong error code in validate_feature_engineering_report_shape
- The `except` block for JSON parse failure used `feature_group_spec_missing_or_invalid` instead of `feature_engineering_report_invalid`.
- Fixed: error code now correctly reflects `feature_engineering_report_invalid`.
Fix 2 — train_select_evaluate.py: misleading hard-coded CI bounds in transport_drop_ci
- `ci_95` and `ci_width` in the transport drop block were hard-coded to `[0.0, 0.0]` / `0.0`, falsely implying CIs were bootstrapped.
- Fixed: replaced with `null` and added `ci_note: "not_computed_point_estimate_only"`.
- Verified: `ci_matrix_gate.py` independently recomputes these CIs from prediction traces; downstream not affected.
Session 2 (Fixes applied to feature_engineering_audit_gate.py, generalization_gap_gate.py, robustness_gate.py, seed_stability_gate.py)
Fix 3 — feature_engineering_audit_gate.py: wrong error code for feature_engineering_report parse failure
- Mirror of Fix 1: the `except` block used `feature_group_spec_missing_or_invalid` when parsing `feature_engineering_report` JSON.
- Fixed: error code now correctly set to `feature_engineering_report_invalid`.
Fix 4 — feature_engineering_audit_gate.py: to_float missing math.isfinite guard
- `to_float` accepted `inf` and `nan` as valid float values, inconsistent with all other gate scripts.
- Fixed: added `math.isfinite` guard and added `import math`.
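A minimal sketch of the guard described in Fix 4. It assumes `to_float` returns `None` for anything that is not a finite number; the real helper in the gate scripts may use a different signature or return convention.

```python
import math
from typing import Any, Optional


def to_float(value: Any) -> Optional[float]:
    """Return value as a finite float, or None (assumed contract, not the project's exact code)."""
    if isinstance(value, bool):
        # bool is a subclass of int; treating it as non-numeric is an assumption of this sketch.
        return None
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    # The guard from Fix 4: reject inf and nan instead of passing them to threshold checks.
    return number if math.isfinite(number) else None
```

With this shape, a metric such as `"nan"` or `float("inf")` read from a report is treated the same as a missing value, so it cannot silently satisfy or violate a threshold comparison.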
Fix 5 — generalization_gap_gate.py: finish() ignored --strict for warning escalation
- `should_fail = bool(failures)` silently swallowed warnings even in strict mode.
- Fixed: `should_fail = bool(failures) or (args.strict and bool(warnings))`.
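The corrected exit rule can be isolated as a small function. This is an illustrative sketch: the function name and list-of-strings inputs are assumptions, and only the boolean expression comes from the fix itself.

```python
from typing import Sequence


def should_fail(failures: Sequence[str], warnings: Sequence[str], strict: bool) -> bool:
    """Failures always fail the gate; warnings fail it only when strict mode is on."""
    return bool(failures) or (strict and bool(warnings))
```

The pre-fix behavior corresponds to dropping the second operand, which is why a run with warnings only passed even under `--strict`.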
Fix 6 — robustness_gate.py: same strict-mode bug as Fix 5
- Fixed: `should_fail = bool(failures) or (args.strict and bool(warnings))`.
Fix 7 — seed_stability_gate.py: same strict-mode bug as Fix 5
- Fixed: `should_fail = bool(failures) or (args.strict and bool(warnings))`.
Verified clean (no bugs found)
- `execution_attestation_gate.py`: `finish()` already correct; all validation logic and key/timestamp/transparency/receipt/log/witness-quorum checks are robust.
- `generalization_gap_gate.py`: `to_float` already had `math.isfinite`.
- All 27 gate scripts now uniformly use `bool(failures) or (args.strict and bool(warnings))` in `finish()`.
- All 11 `to_float` implementations across gate scripts now reject `inf`/`nan`.
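Agent Skill Protocol
This section defines how an AI agent uses this project as a skill to quickly build and audit medical ML projects.
Five operating modes
Mode A: Build a research project from scratch (Build)
When the user says "help me build a prediction model" or "build a medical prediction project":
Standardized 8-step flow:
Step 1: Environment check → python3 scripts/mlgg.py doctor
Step 2: Project initialization → python3 scripts/mlgg.py init --project-root <dir>
Step 3: Data preparation → download a dataset or place the user's data, then split it with split_data.py
Step 4: Configuration alignment → make sure request.json and all spec files are correct
Step 5: Model training → python3 scripts/mlgg.py train ...
Step 6: Execution attestation → python3 scripts/generate_execution_attestation.py ...
Step 7: Strict audit → python3 scripts/mlgg.py workflow --strict
Step 8: Quality report → python3 scripts/quick_summary.py + python3 scripts/audit_external_project.py
Agent decision points:
- Step 3: insufficient data (<100 rows)? → warn and recommend a larger dataset
- Step 5: too few candidate models? → automatically expand the model pool
- Step 7: a gate fails? → look up the fix in `references/error-knowledge-base.json`
- Step 8: score <90? → generate a remediation plan and fix the items one by one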
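Mode B: Audit someone else's project (Audit)
When the user says "help me review this project" or "review this ML project":
# 1. Quantitative scoring
python3 scripts/audit_external_project.py --project-dir <dir> --target-journal nature_medicine --json
# 2. If an evidence directory already exists, run the full gate check
python3 scripts/report_health_check.py --evidence-dir <dir>/evidence
# 3. Generate a remediation plan
python3 scripts/remediation_plan.py --evidence-dir <dir>/evidence
Audit output: 12-dimension quantitative score (out of 100) + journal gap analysis + prioritized fix list
Mode C: Incremental fix (Fix)
When a gate fails:
1. Read the gate report JSON → extract the failure codes
2. Look them up in references/error-knowledge-base.json → get the fix
3. If no entry is found → diagnose the root cause → apply a fix → append it to error-knowledge-base.json
4. Rerun the failed gate → verify that it passes
5. Rerun publication_gate → verify that the full chain passes
Mode D: LLM review skill (zero deployment, bring your own LLM)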
When the user says "generate a review prompt for me", "I want to review with ChatGPT/Gemini", or "export review prompt":
# 1. Quick red-line check prompt (18 criteria; paste into any LLM)
python3 scripts/export_review_prompt.py --level quick --output review_prompt_quick.md
# 2. Standard review prompt (53 criteria)
python3 scripts/export_review_prompt.py --level standard --output review_prompt.md
# 3. Top-journal-level prompt with Nature Medicine-specific requirements
python3 scripts/export_review_prompt.py --level comprehensive \
  --journal nature_medicine --output review_prompt_nm.md
# 4. JSON format (suited to API calls)
python3 scripts/export_review_prompt.py --level standard --format json \
  --journal jama --output review_payload.json
# 5. With literature citations
python3 scripts/export_review_prompt.py --level comprehensive \
  --include-literature --output review_with_refs.md
Usage: paste the contents of the generated .md file into any LLM chat (Claude, GPT-4, and Gemini all work), then paste the text of the paper PDF; the LLM returns a structured JSON scoring report.
Supported values for --journal: nature_medicine · jama · lancet_digital_health · bmj · npj_digital_medicine
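Mode E: Batch review (Batch Review)
When the user says "help me batch review" or "review these projects":
# 1. Prepare the review manifest (see references/batch-manifest.example.json)
# 2. Run the batch review
python3 scripts/mlgg.py batch-review \
  --manifest batch_manifest.json \
  --target-journal nature_medicine \
  --workers 4 \
  --format json \
  --output batch_report.json
# 3. Optional: write a CSV summary
python3 scripts/mlgg.py batch-review \
  --manifest batch_manifest.json \
  --summary-csv batch_summary.csv
Batch review output:
- Comparison matrix: 12-dimension scores + total score + grade for each project
- Cross-project analysis: most frequently failed dimensions + most widespread gaps
- Aggregated remediation priorities: deduplicated, sorted by severity × number of affected projects
Literature retrieval protocol:
- Query `references/literature-knowledge-base.json` (30 top-journal entries)
- Search by category (`category`), implementing gates (`gates_implementing`), or affected dimensions (`dimensions_affected`)
- Cite `LIT-NNN` IDs in review reports
- New entries must come from a journal with IF>10, an EQUATOR guideline, or a PRISMA systematic review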
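12-Dimension Quantitative Scoring Standard (100-point scale)
Used to quantify the quality of any medical ML project:
| # | Dimension | Weight | Scoring criteria |
|---|---|---|---|
| 1 | Data integrity | 12 | Split isolation, patient-level disjointness, temporal ordering, no row overlap |
| 2 | Leakage prevention | 15 | No target leakage, no definition-variable leakage, no lineage leakage, no future features |
| 3 | Pipeline isolation | 12 | Preprocessors fit on train only, imputer isolation, resampling on train only |
| 4 | Model selection rigor | 10 | Candidate pool ≥3, one-SE rule, no test-set peeking, baseline comparison |
| 5 | Statistical validity | 12 | Bootstrap CI, permutation test, calibration, DCA, metric consistency |
| 6 | Generalization evidence | 10 | Train-test gap, external cohort, transport-drop CI, seed stability |
| 7 | Clinical completeness | 7 | Full metric panel, confusion-matrix consistency, threshold feasibility |
| 8 | Reporting standards | 7 | TRIPOD+AI, PROBAST+AI, STARD-AI, documented exclusion criteria, documented limitations |
| 9 | Reproducibility | 6 | Seed logging, version tracking, execution attestation, manifest lock |
| 10 | Security and provenance | 3 | Model signing, artifact integrity, sensitive-data protection |
| 11 | Fairness and equity | 3 | Equalized odds gap, disparate impact ratio, minimum subgroup performance |
| 12 | Sample size adequacy | 3 | EPV ≥10, shrinkage factor ≥0.90, minimum events/non-events ≥100 |
Score interpretation:
- 90-100: Publication-grade. Ready to submit to Nature Medicine / Lancet DH / JAMA / BMJ
- 75-89: Solid but gaps. Specific dimensions need strengthening
- 60-74: Major issues. Systematic fixes required
- <60: Not publishable. Redesign required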
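As a rough illustration of how the weights and score bands combine, the sketch below assumes each dimension is scored as the fraction of its criteria met and then multiplied by its weight. That aggregation rule and the dictionary keys are assumptions made for this example; `audit_external_project.py` may score differently.

```python
# Weights copied from the 12-dimension table; the key names are hypothetical identifiers.
WEIGHTS = {
    "data_integrity": 12, "leakage_prevention": 15, "pipeline_isolation": 12,
    "model_selection_rigor": 10, "statistical_validity": 12, "generalization_evidence": 10,
    "clinical_completeness": 7, "reporting_standards": 7, "reproducibility": 6,
    "security_provenance": 3, "fairness_equity": 3, "sample_size_adequacy": 3,
}


def total_score(fractions: dict) -> float:
    """fractions maps a dimension to the share of its criteria met, in [0, 1]."""
    return sum(weight * fractions.get(name, 0.0) for name, weight in WEIGHTS.items())


def grade(score: float) -> str:
    """Map a 0-100 score to the interpretation bands (lower bounds treated as inclusive)."""
    if score >= 90:
        return "publication-grade"
    if score >= 75:
        return "solid but gaps"
    if score >= 60:
        return "major issues"
    return "not publishable"
```

For example, a project that meets every criterion except half of the leakage-prevention ones would score 100 - 7.5 = 92.5 under this assumed rule and still land in the top band.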
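Top-Journal Standard Mapping
The core requirements of each top journal are mapped to this framework's gates:
- See `references/journal-rigor-standards.json`
- Supported journals: Nature Medicine, Lancet Digital Health, JAMA, BMJ, npj Digital Medicine
- The agent can run the gap analysis automatically: `audit_external_project.py --target-journal <name>`
Self-Improving Error Knowledge Base Protocol
This project maintains a structured error-pattern database (references/error-knowledge-base.json):
Agent operating rules:
- New error encountered → first check whether the knowledge base already has an entry
- Entry found → follow its `fix` field → verify the fix
- No entry found → diagnose the root cause → apply a fix → verify → append a new entry (ERR-NNN format)
- Commit: `git commit -m "knowledge-base: add ERR-NNN <description>"`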
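The lookup-then-append rule above could be sketched as follows. The sketch works on an in-memory list of entries that carry the `id`, `code`, and `fix` fields of the entry structure documented in this section; how those entries are nested inside the JSON file, and the three-digit zero padding of `ERR-NNN`, are assumptions.

```python
from typing import Optional


def find_entry(entries: list, code: str) -> Optional[dict]:
    """Return the first knowledge-base entry whose error code matches, or None."""
    for entry in entries:
        if entry.get("code") == code:
            return entry
    return None


def next_error_id(entries: list) -> str:
    """Compute the next ERR-NNN id (zero-padded to three digits by assumption)."""
    numbers = [
        int(str(entry["id"]).split("-")[1])
        for entry in entries
        if str(entry.get("id", "")).startswith("ERR-")
    ]
    return f"ERR-{max(numbers, default=0) + 1:03d}"


def lookup_or_append(entries: list, code: str, new_fields: dict) -> dict:
    """Reuse an existing entry, or append a new one built from new_fields after a verified fix."""
    existing = find_entry(entries, code)
    if existing is not None:
        return existing
    entry = {"id": next_error_id(entries), "code": code, **new_fields}
    entries.append(entry)
    return entry
```

Appending should happen only after the fix has been verified, matching the order in the rules above.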
Entry structure:
{
  "id": "ERR-NNN",
  "code": "error_code_string",
  "symptom": "symptom the user sees",
  "root_cause": "root-cause analysis",
  "fix": "concrete fix steps",
  "prevention": "how to prevent this class of problem",
  "category": "data|leakage|pipeline|model|gate|config|environment|attestation|security|statistical",
  "severity": "CRITICAL|ERROR|WARNING|INFO",
  "affected_files": ["file1.py"],
  "first_seen": "YYYY-MM",
  "resolved": true
}
Agent Quick Reference Card
┌────────────────────────────────────────────────────────────────┐
│ ML Leakage Guard: Agent Quick Reference                        │
├────────────────────────────────────────────────────────────────┤
│ Build project: python3 scripts/mlgg.py onboarding --mode auto  │
│ Audit project: python3 scripts/audit_external_project.py       │
│ Error lookup:  references/error-knowledge-base.json            │
│ Journal rules: references/journal-rigor-standards.json         │
│ Fix plan:      python3 scripts/remediation_plan.py             │
│ Health check:  python3 scripts/report_health_check.py          │
│ Evidence diff: python3 scripts/evidence_comparator.py          │
│ Threshold:     python3 scripts/threshold_sensitivity.py        │
│ LaTeX export:  python3 scripts/export_latex.py                 │
├────────────────────────────────────────────────────────────────┤
│ Scoring tool:  audit_external_project.py --target-journal X    │
│ Journals:      nature_medicine | lancet_digital_health |       │
│                jama | bmj | npj_digital_medicine               │
├────────────────────────────────────────────────────────────────┤
│ Gate failed?   1. Read report  2. Check KB  3. Fix  4. Rerun   │
│ Score <90?     1. Run remediation_plan  2. Fix item by item    │
│ New error?     Append to error-knowledge-base.json             │
└────────────────────────────────────────────────────────────────┘
Standardized Deliverables Checklist (Publication-Ready Deliverables)
After completing the full flow, the agent should produce the following deliverables:
<project>/
├── data/
│   ├── train.csv, valid.csv, test.csv   # split data
│   └── external_*.csv                   # external validation cohorts
├── configs/
│   ├── request.json                     # experiment request contract
│   ├── execution_attestation.json       # execution attestation spec
│   └── *.json                           # other spec files
├── evidence/
│   ├── *_report.json (×33)              # 33 gate reports
│   ├── manifest.json                    # SHA256 artifact manifest
│   ├── prediction_trace.csv.gz          # row-level prediction trace
│   ├── evaluation_report.json           # evaluation metrics report
│   ├── model_selection_report.json      # model selection report
│   └── audit_report.json                # 12-dimension quantitative audit report
├── models/
│   ├── model.pkl + model.pkl.sig        # signed model artifact
│   └── .mlgg_model_key                  # HMAC key
├── keys/
│   └── *.pem                            # attestation key pair
└── results/
    ├── summary.md                       # human-readable summary
    └── tables.tex                       # LaTeX tables
Methodology Quick Reference
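Phase 1 Agent Guidance Protocol
When the user says "help me analyze my data", "I have a CSV", or "start modeling", the agent must guide them through the following steps in order, without skipping any. After each step, use the answers to build the arguments for cohort_definition_gate.py.
Step 1: Confirm basic information
Ask: What is the path to your data file?
Ask: Which column is the target variable (the outcome to predict)?
Ask: Which column is the patient/individual ID? (If there is none, I will generate one for you.)
→ yields --data, --target-col, --id-col
Step 2: Data source and sampling design
Ask: Where does this data come from?
  a) Public survey database (NHANES / BRFSS / NHIS / MEPS) → has a complex sampling design
  b) Hospital EHR / electronic medical record system
  c) Clinical trial / prospective cohort
  d) Administrative claims / health insurance data
  e) Disease registry (cancer registry, diabetes registry)
  f) Other
If (a): ask whether there is a sampling weight column (such as NHANES's WTMEC2YR),
  and remind the user: "Standard ML models do not use survey weights; this will be stated in the paper's Limitations."
  → set --weight-col, --survey-source
If (b)-(e): ask whether the data is single-center or multi-center, and what time span it covers.
→ yields --weight-col, --survey-source
Step 3: Outcome definition (most critical)
This is the first point reviewers will challenge. Guide the user to a precise clinical definition.
Ask: What is the clinical definition of the outcome you want to predict (y=1)?
Please tell me the following:
1. Which sources do the diagnostic criteria come from? (select all that apply)
  □ ICD codes (give the specific codes, e.g. E11 = T2D)
  □ Laboratory values (e.g. HbA1c ≥ 6.5% or ≥ 48 mmol/mol)
  □ Fasting glucose ≥ 7.0 mmol/L
  □ Physician diagnosis records
  □ Patient self-report (questionnaire)
  □ Medication records (e.g. taking glucose-lowering drugs)
  □ Disease registry confirmation
  □ Other: ___
2. If multiple sources are used, how is a case adjudicated?
  □ Positive if any one source is met (sensitive, may yield more false positives)
  □ At least two sources agree (UKB gold standard, recommended)
  □ All sources are met (very strict)
3. What is the disease subtype?
  e.g. type 2 diabetes (excluding type 1, gestational, secondary, MODY)
4. Exclusion criteria: who should be excluded?
  e.g. type 1 diabetes (E10) / gestational diabetes (O24) / age <18
5. Time window:
  □ Disease already present at baseline (prevalent)
  □ New onset during follow-up (incident), follow-up ___ years
  □ Event outcome (e.g. 30-day readmission)
After collecting the answers, build the JSON:
{
  "criteria": [
    {"source": "icd", "codes": ["E11"], "system": "ICD-10"},
    {"source": "lab", "test": "HbA1c", "threshold": ">=6.5%"},
    {"source": "medication", "drugs": ["metformin", "insulin"]}
  ],
  "adjudication": "at_least_two",
  "subtype": "type_2_diabetes",
  "exclusions": ["type_1_E10", "gestational_O24", "age_under_18"],
  "time_window": "prevalent_at_baseline",
  "ascertainment": ["hospital_ehr", "lab_system"],
  "validation": "cross_source_concordance"
}
→ pass to --outcome-definition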
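To make the adjudication choice in Step 3 concrete, here is a hedged sketch that applies a rule to per-source evidence flags and serializes the definition for the command line. `any_one` and `at_least_two` appear in this document's templates; the `all` rule name and the `adjudicate` helper are hypothetical and not part of the gate's documented interface.

```python
import json


def adjudicate(source_hits: dict, rule: str) -> bool:
    """source_hits maps each evidence source to whether its criterion is met for one patient."""
    positives = sum(bool(hit) for hit in source_hits.values())
    if rule == "any_one":
        return positives >= 1
    if rule == "at_least_two":
        return positives >= 2
    if rule == "all":  # hypothetical name for the "all sources are met" option
        return positives > 0 and positives == len(source_hits)
    raise ValueError(f"unknown adjudication rule: {rule}")


definition = {
    "criteria": [
        {"source": "icd", "codes": ["E11"], "system": "ICD-10"},
        {"source": "lab", "test": "HbA1c", "threshold": ">=6.5%"},
    ],
    "adjudication": "at_least_two",
}
# Compact JSON string suitable for passing as the --outcome-definition value.
outcome_definition_arg = json.dumps(definition, separators=(",", ":"))
```

A patient with an E11 code and a qualifying medication record but a normal HbA1c is positive under `at_least_two` and `any_one`, and negative under the stricter all-sources rule.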
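Step 3b: Inclusion/exclusion criteria and CONSORT flow
Ask: What are your inclusion criteria? (Who is enrolled in the study?)
  e.g. age ≥ 18, ≥ 1 hospitalization record, free of the target disease at baseline
Ask: Who was excluded, and how many people did each exclusion criterion remove?
  This feeds the CONSORT/STROBE flow diagram. List them in order:
  Exclusion criterion 1: ___ → excluded ___ people
  Exclusion criterion 2: ___ → excluded ___ people
  ...
  Final inclusion: ___ people
  (If unsure, the Phase 1 report provides total row counts and missingness statistics,
  but the specific exclusion logic must be decided by you from clinical knowledge.)
Step 3c: Prediction time point and feature timing (MLGG-F05)
Ask: At what point in time does your model make its prediction?
  a) At admission (only information known before admission may be used)
  b) At discharge (information from the hospital stay may be used)
  c) At an outpatient visit (only data from before that visit may be used)
  d) At a fixed time point during follow-up
Ask: Which of your features only become known after the prediction time point?
  These are "future information" and must never be used as predictive features!
  For example:
  - Admission-time model → discharge diagnosis, procedure type, and length of stay are all "future information"
  - Discharge-time model → the result of a follow-up visit 30 days later is "future information"
  Tell me the names of these "future information" columns and I will exclude them for you.
Step 4: Definition-variable leakage check
Ask: Do the variables used to define the outcome above (such as HbA1c or ICD codes)
  also appear among your feature columns?
  If HbA1c is used to define diabetes (y=1 when HbA1c >= 6.5%),
  then HbA1c must never be used as a predictive feature: it IS the outcome itself.
  Please list every column name used to define the outcome:
→ yields --definition-cols
→ these columns are automatically excluded from the feature set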
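The exclusion described in Steps 3c and 4 amounts to removing a blocklist of columns before any model sees the data. The helper below is a sketch under that assumption; its name and return shape are hypothetical and it does not reproduce the gate's internal logic.

```python
def split_feature_columns(all_columns, target_col, id_col, definition_cols, future_cols=()):
    """Return (features, excluded): excluded holds definition and future-information columns."""
    leaky = set(definition_cols) | set(future_cols)
    blocked = leaky | {target_col, id_col}
    features = [col for col in all_columns if col not in blocked]
    excluded = [col for col in all_columns if col in leaky]
    return features, excluded


columns = ["patient_id", "age", "bmi", "hba1c", "icd_e11", "length_of_stay", "y"]
features, excluded = split_feature_columns(
    columns,
    target_col="y",
    id_col="patient_id",
    definition_cols=["hba1c", "icd_e11"],   # variables that define y=1
    future_cols=["length_of_stay"],          # unknown at an admission-time prediction
)
```

Reporting `excluded` back to the user makes the removal auditable, which is what Step 6 asks for when it says to state exactly which columns were dropped.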
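Step 5: Run the gate
After collecting the information above, build and run the command:
python3 scripts/cohort_definition_gate.py \
  --data <path> \
  --target-col <col> \
  --id-col <col> \
  --outcome-definition '<JSON>' \
  --definition-cols <cols> \
  --weight-col <col> \
  --survey-source <source> \
  --report evidence/cohort_definition_report.json \
  --output-dir evidence/
Step 6: Interpret the results and guide the next step
Explain the warnings/failures in the report to the user:
- Whether the Riley sample size is adequate → if not, recommend fewer features or more data
- Disease-definition quality rating → for a single source, recommend adding a validation source
- Definition-variable leakage → state exactly which columns were excluded
- Survey weights → remind the user to declare this in the paper
Then say: "Phase 1 is complete. Now entering Phase 2: data splitting. Is your data longitudinal or cross-sectional?"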
Disease Definition Knowledge Base (RAG retrieval source)
When the user names a disease to predict, the agent should immediately consult references/disease-definition-knowledge-base.json to obtain, for that disease:
- ICD-10 code list
- Laboratory diagnostic criteria (thresholds, units)
- Common medication list (medication records serve as a supporting evidence source)
- Exclusion criteria (easily confused conditions)
- Definition variables that must be excluded (`definition_variables_to_exclude`)
- Recommended adjudication strategy
- Disease subtype information
The knowledge base covers 10 common diseases plus the 30-day readmission outcome:
T2D · hypertension · coronary heart disease · CKD · heart failure · stroke · COPD · depression · cancer (multiple sites) · atrial fibrillation · 30-day readmission
Usage:
# The agent reads the knowledge base while guiding Step 3
import json
with open("references/disease-definition-knowledge-base.json", encoding="utf-8") as f:
    kb = json.load(f)
disease = kb["diseases"]["type_2_diabetes"]
# → provides ICD codes, lab criteria, medications, exclusions, definition_variables_to_exclude
If the user's disease is not in the knowledge base, the agent should guide the user to build their own definition following the 7 principles in general_guidance.choosing_definition.
Common Disease Definition Templates (quick reference)
The agent can offer the following templates to the user directly:
Type 2 diabetes (T2D):
{"criteria":[{"source":"icd","codes":["E11"],"system":"ICD-10"},{"source":"lab","test":"HbA1c","threshold":">=6.5% or >=48mmol/mol"},{"source":"lab","test":"FPG","threshold":">=7.0mmol/L"},{"source":"medication","drugs":["metformin","glipizide","glimepiride","insulin"]},{"source":"self_report","question":"doctor_diagnosed_diabetes"}],"adjudication":"at_least_two","subtype":"type_2_diabetes","exclusions":["type_1_E10","gestational_O24","MODY","secondary","age_under_18"],"time_window":"prevalent_at_baseline"}
Hypertension:
{"criteria":[{"source":"icd","codes":["I10","I11","I12","I13","I15"],"system":"ICD-10"},{"source":"measurement","test":"SBP","threshold":">=140mmHg"},{"source":"measurement","test":"DBP","threshold":">=90mmHg"},{"source":"medication","drugs":["amlodipine","lisinopril","losartan","hydrochlorothiazide"]},{"source":"self_report","question":"doctor_diagnosed_hypertension"}],"adjudication":"at_least_two","subtype":"essential_hypertension","exclusions":["secondary_hypertension","white_coat","pregnancy_induced"],"time_window":"prevalent_at_baseline"}
Coronary heart disease (CHD/CAD):
{"criteria":[{"source":"icd","codes":["I20","I21","I22","I23","I24","I25"],"system":"ICD-10"},{"source":"procedure","codes":["CABG","PCI","coronary_angiography"]},{"source":"medication","drugs":["aspirin","clopidogrel","statin","nitroglycerin"]},{"source":"self_report","question":"doctor_diagnosed_heart_disease"}],"adjudication":"at_least_two","subtype":"coronary_artery_disease","exclusions":["heart_failure_only","valvular","congenital"],"time_window":"prevalent_at_baseline"}
Chronic kidney disease (CKD):
{"criteria":[{"source":"icd","codes":["N18"],"system":"ICD-10"},{"source":"lab","test":"eGFR","threshold":"<60mL/min/1.73m2"},{"source":"lab","test":"UACR","threshold":">=30mg/g"},{"source":"medication","drugs":["SGLT2_inhibitors","ACE_inhibitors"]}],"adjudication":"at_least_two","subtype":"CKD_stage_3_plus","exclusions":["acute_kidney_injury","dialysis_dependent"],"time_window":"prevalent_at_baseline"}
30-day readmission:
{"criteria":[{"source":"administrative","definition":"unplanned_admission_within_30_days_of_discharge"}],"adjudication":"any_one","subtype":"all_cause_readmission","exclusions":["planned_readmission","death_before_30_days","transfer","left_AMA"],"time_window":"30_day_post_discharge"}
Sample size (Phase 1)
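Riley 2019 three criteria (riley_sample_size() in cohort_definition_gate.py), where p is the number of candidate predictor parameters and φ is the outcome prevalence:
- C1: shrinkage factor S ≥ 0.9 → n ≥ p / ((1-S) × φ)
- C2: R² optimism ≤ 0.05 → n ≥ p / 0.05
- C3: risk-precision CI half-width ≤ 0.05 → n ≥ φ(1-φ) / (0.05/1.96)²
- Take the maximum of the three. EPV < 5 → FAIL, 5-10 → WARNING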
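The three bounds above can be evaluated directly. The sketch below implements the simplified inequalities exactly as written in this section, not the full closed forms from the Riley et al. paper and not the project's `riley_sample_size()`; the function names are hypothetical, and treating EPV exactly 10 as a pass is an assumption.

```python
import math


def _ceil(value: float) -> int:
    # Round first so floating-point noise (e.g. 500.00000000000006) does not add a spurious unit.
    return math.ceil(round(value, 6))


def minimum_sample_size(p: int, prevalence: float, shrinkage: float = 0.9,
                        optimism: float = 0.05, half_width: float = 0.05) -> dict:
    """Evaluate C1-C3 as stated in this section and take the largest requirement."""
    c1 = _ceil(p / ((1 - shrinkage) * prevalence))
    c2 = _ceil(p / optimism)
    c3 = _ceil(prevalence * (1 - prevalence) / (half_width / 1.96) ** 2)
    return {"c1": c1, "c2": c2, "c3": c3, "required_n": max(c1, c2, c3)}


def epv_status(n_events: int, p: int) -> str:
    """Events per candidate predictor parameter: below 5 fails, below 10 warns."""
    epv = n_events / p
    if epv < 5:
        return "FAIL"
    if epv < 10:
        return "WARNING"
    return "PASS"
```

For p = 10 and φ = 0.2 the three bounds are 500, 200, and 246, so C1 drives the requirement; a cohort of 350 patients with 70 events would also draw an EPV warning (70 / 10 = 7).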
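Splitting (Phase 2)
Three strategies: grouped_temporal (longitudinal), grouped_random (cross-sectional), and stratified_grouped (cross-sectional with a consistent positive rate across splits). For cross-sectional data, pass the --cross-sectional flag to skip temporal checks automatically.
Three split modes (choose by data size):
| Mode | Parameters | When to use | Model selection |
|---|---|---|---|
| Three-way split | --train-ratio 0.6 --valid-ratio 0.2 --test-ratio 0.2 | Large sample (n > 5000) | Tune on the valid set + evaluate on the test set |
| Two-way split | --train-ratio 0.8 --valid-ratio 0.0 --test-ratio 0.2 | Medium sample (n 1000-5000) | Tune with CV + evaluate on the test set |
| CV only | --train-ratio 1.0 --valid-ratio 0.0 --test-ratio 0.0 | Small sample (n < 1000) | Nested CV / bootstrap internal validation |
During guidance, the agent should recommend a mode automatically from the sample size in the Phase 1 report:
n > 5000 → "Sample size is ample; recommend a three-way split (60/20/20)"
n 1000-5000 → "Medium sample; recommend a two-way split (80/20) + 5-fold CV in place of a validation set"
n < 1000 → "Small sample; consider training on all data + nested CV or bootstrap internal validation"
n < 200 → "⚠️ Sample size may be insufficient; check the Riley sample size result first"
Downstream compatibility:
- Two-way split (valid_ratio=0): `train_select_evaluate.py` switches automatically to `--selection-data=cv_inner`, using 5-fold CV in place of the valid set for model selection
- CV-only (test_ratio=0): Phase 6 evaluation uses bootstrap optimism correction in place of test-set evaluation
- The `--valid` and `--test` arguments are now optional (no longer required)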
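The size-based recommendation in this section can be written as a small lookup. The ratio flags come from the table above; the function name, the mode labels, and the returned dictionary shape are hypothetical.

```python
def recommend_split(n: int) -> dict:
    """Map a cohort size to the split mode recommended in this section."""
    if n > 5000:
        mode, ratios = "three_way", (0.6, 0.2, 0.2)
    elif n >= 1000:
        mode, ratios = "two_way", (0.8, 0.0, 0.2)
    else:
        mode, ratios = "cv_only", (1.0, 0.0, 0.0)
    train, valid, test = ratios
    return {
        "mode": mode,
        "args": ["--train-ratio", str(train), "--valid-ratio", str(valid),
                 "--test-ratio", str(test)],
        # Below 200 rows the section advises checking the Riley result before anything else.
        "underpowered_warning": n < 200,
    }
```

The boundaries follow the text: exactly 5000 rows falls in the two-way band, and anything under 1000 goes to CV-only with an extra warning below 200.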
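Known limitations:
- StratifiedKFold shuffles within temporal data (CV performance estimates may be overly optimistic for features with time trends)
- MIN_POSITIVE_PER_SPLIT=10 may be too strict for rare diseases (<3% prevalence); adjust via `--min-rows-per-split`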
Encoding (Phase 3)
Automatic detection (encode_categorical_features()):
- Binary (2 values) → 0/1 mapping, OOD → 0.5 sentinel (a neutral value; no extra column is added)
- Categorical (3-15 values) → one-hot, OOD → all-zero row
- Numeric (>15 values) → kept as-is
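A pure-Python sketch of the three encoding rules, fitted on training values only so that unseen (out-of-distribution, OOD) values at validation or test time fall back to the documented sentinels. This is not `encode_categorical_features()` itself; the two helper names and the encoder dictionary are hypothetical, and columns with fewer than 2 distinct values are passed through as numeric by assumption.

```python
def fit_encoder(train_values) -> dict:
    """Learn the encoding for one column from training data only."""
    levels = sorted(set(train_values), key=str)
    if len(levels) == 2:
        return {"kind": "binary", "levels": levels}
    if 3 <= len(levels) <= 15:
        return {"kind": "onehot", "levels": levels}
    return {"kind": "numeric", "levels": None}


def encode(value, encoder: dict) -> list:
    """Encode one value; unseen categories map to 0.5 (binary) or an all-zero row (one-hot)."""
    if encoder["kind"] == "binary":
        levels = encoder["levels"]
        return [float(levels.index(value))] if value in levels else [0.5]
    if encoder["kind"] == "onehot":
        return [1.0 if value == level else 0.0 for level in encoder["levels"]]
    return [float(value)]
```

Because the level list is frozen at fit time, a category that first appears in the test split cannot add a column or shift existing ones, which keeps the feature layout identical across splits.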
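Feature selection (Phase 4)
Elastic Net CV (α ∈ {0.1-1.0}, C ∈ {0.001-10}) + stability selection (100 runs, threshold 0.6) + group LASSO (the one-hot columns of a feature enter or leave together) + Ridge control (fall back if the performance loss exceeds 0.005). Univariate screening is no longer used.
Model selection (Phase 5)
Select the model with the best validation PR-AUC, using the one-SE rule as tie-breaker. The train-test gap is not used for selection. Internal validation uses bootstrap optimism correction. Learning curves are used to assess convergence.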
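One common reading of the one-SE rule is: find the best mean validation PR-AUC, treat every candidate within one standard error of it as tied, and prefer the simplest of those. The sketch below follows that reading; the `Candidate` fields and the integer complexity rank are assumptions for illustration, and the replay logic in `model_selection_audit_gate.py` may differ.

```python
from typing import NamedTuple, Sequence


class Candidate(NamedTuple):
    name: str
    pr_auc: float     # mean validation PR-AUC
    se: float         # standard error of that mean
    complexity: int   # hypothetical rank: lower means simpler


def select_one_se(candidates: Sequence[Candidate]) -> Candidate:
    """Pick the simplest candidate whose PR-AUC is within one SE of the best."""
    best = max(candidates, key=lambda c: c.pr_auc)
    threshold = best.pr_auc - best.se
    tied = [c for c in candidates if c.pr_auc >= threshold]
    # Simplest first; among equally simple candidates prefer the higher PR-AUC.
    return min(tied, key=lambda c: (c.complexity, -c.pr_auc))
```

Only validation-side numbers enter the decision, so the test split stays untouched until the final evaluation.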
Evaluation (Phase 6)
Complete 5-domain panel (calibration_metrics() + metric_panel() + compute_nri_idi() in _gate_utils.py):
- Discrimination: AUROC, AUPRC
- Calibration: intercept (→0), slope (→1), O:E (→1), ECE, Hosmer-Lemeshow
- Overall: Brier, Brier Skill Score (>0 = better than baseline)
- Classification: MCC, LR+/LR-, Sensitivity, Specificity, PPV, NPV
- Clinical: DCA net benefit, NRI (categorical + continuous), IDI
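A few of the panel entries have short closed forms that are easy to check by hand. The sketch below uses textbook definitions, with the prevalence-only predictor as the Brier reference; it is not the project's `metric_panel()` and it does not guard against empty cells or zero denominators.

```python
import math


def brier(y_true, y_prob) -> float:
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def brier_skill_score(y_true, y_prob) -> float:
    """1 - Brier / Brier_ref, where the reference always predicts the prevalence."""
    prevalence = sum(y_true) / len(y_true)
    reference = brier(y_true, [prevalence] * len(y_true))
    return 1.0 - brier(y_true, y_prob) / reference


def observed_expected_ratio(y_true, y_prob) -> float:
    """O:E ratio: observed events over the sum of predicted risks (target 1)."""
    return sum(y_true) / sum(y_prob)


def classification_panel(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Threshold-dependent metrics from a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "lr_plus": sensitivity / (1 - specificity),
        "lr_minus": (1 - sensitivity) / specificity,
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }
```

A positive Brier Skill Score means the model beats the prevalence-only baseline, which is the ">0 = better than baseline" reading in the list above.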
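SHAP (Phase 7)
Multi-model SHAP (shap_interpretability_gate.py):
- Compute per model family → L1-normalize to proportions (sum=1) → equal-weight average
- TreeExplainer (RF/XGB/CatBoost/LGBM), LinearExplainer (LR), KernelExplainer (others)
- Agreement: Kendall tau + Top-N Jaccard
- Output: Table A (ensemble ranking), B (per-model detail), C (agreement), D (individual case explanations)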
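The aggregation step can be illustrated without the `shap` library by starting from per-model mean absolute SHAP values, which are assumed to be computed already. The function names are hypothetical, and Kendall tau is left out because it would normally come from a statistics library.

```python
def l1_normalize(importance: dict) -> dict:
    """Scale absolute importances so they sum to 1 within one model."""
    total = sum(abs(value) for value in importance.values())
    return {feature: abs(value) / total for feature, value in importance.items()}


def ensemble_importance(per_model: dict) -> dict:
    """Equal-weight average of the L1-normalized importances across model families."""
    normalized = [l1_normalize(importance) for importance in per_model.values()]
    features = sorted({feature for importance in normalized for feature in importance})
    return {
        feature: sum(importance.get(feature, 0.0) for importance in normalized) / len(normalized)
        for feature in features
    }


def top_n_jaccard(a: dict, b: dict, n: int) -> float:
    """Overlap of the top-n features of two rankings: |A ∩ B| / |A ∪ B|."""
    top_a = set(sorted(a, key=a.get, reverse=True)[:n])
    top_b = set(sorted(b, key=b.get, reverse=True)[:n])
    return len(top_a & top_b) / len(top_a | top_b)
```

Normalizing before averaging matters because raw SHAP magnitudes depend on each model's output scale; without it, one model family could dominate the ensemble ranking.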
Gate Failure Recovery Workflow
When any gate fails, troubleshoot in the following order:
1. Inspect the failure report:
  python3 scripts/explain_gate.py --report evidence/<gate_name>_report.json
2. Identify the error code:
  failures[].code in the report → look it up in references/error-knowledge-base.json
3. Quick fixes for common errors:
  - patient_id_overlap → check --patient-id-col in split_data.py
  - temporal_leakage → confirm train time < valid < test
  - feature_name_suspicious → check feature_lineage_spec
  - calibration_poor → add Platt scaling (calibrate.py)
  - seed_instability → increase model regularization strength
  - permutation_not_significant → the model shows no real signal; consider a different feature set
  - SHAP_RANK_DISAGREEMENT → low Kendall tau between models; check feature interactions
  - COHORT_EPV_CRITICAL → reduce the number of candidate features or collect more data
  - COHORT_RILEY_UNDERPOWERED → same as above; see the Riley 2019 three criteria
4. Rerun after fixing:
  python3 scripts/mlgg.py workflow --request configs/request.json --strict
5. Still failing → search the full knowledge base:
  python3 -m json.tool references/error-knowledge-base.json | grep -A5 "<error_code>"
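Steps 2 and 3 can be scripted once a gate report is loaded. The sketch below relies only on the `failures[].code` path named above; the rest of the report layout is unknown here, the helper names are hypothetical, and the quick-fix table is a partial copy of the list in this section.

```python
import json

# Partial copy of the quick-fix list above; extend it from the knowledge base as needed.
QUICK_FIXES = {
    "patient_id_overlap": "check --patient-id-col in split_data.py",
    "temporal_leakage": "confirm train time < valid < test",
    "calibration_poor": "add Platt scaling",
    "COHORT_EPV_CRITICAL": "reduce candidate features or collect more data",
}
FALLBACK = "not in quick list; search references/error-knowledge-base.json"


def failure_codes(report: dict) -> list:
    """Collect failures[].code from one gate report, skipping malformed items."""
    return [
        item["code"]
        for item in report.get("failures", [])
        if isinstance(item, dict) and "code" in item
    ]


def triage(report: dict) -> dict:
    """Map each failure code to a suggested first action."""
    return {code: QUICK_FIXES.get(code, FALLBACK) for code in failure_codes(report)}


example_report = json.loads(
    '{"failures": [{"code": "temporal_leakage"}, {"code": "some_new_code"}]}'
)
suggestions = triage(example_report)
```

Codes that fall through to the fallback message are the ones worth adding to the knowledge base after the root cause is diagnosed and the fix verified.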