ml-subgroup-calibration-shift

Classical ML, scikit-learn, rigorous codebase

Description

Subgroup Calibration Under Distribution Shift

Research Question

Design a post-hoc calibration method that remains reliable when subgroup composition shifts between calibration and test time.

Background

Many calibration methods look good on average but fail on protected or operational subgroups once the test distribution shifts. This task isolates that failure mode. The fixed pipeline trains a tabular classifier, then applies a user-defined calibration mapping on held-out calibration data before evaluation on shifted test data.

Classical baselines include:

Temperature scaling: one global temperature for all samples
Isotonic regression: non-parametric monotone calibration
Beta calibration: a richer parametric mapping on probabilities
Group-wise temperature scaling: separate temperatures per subgroup

Task

Modify the CalibrationMethod class in custom_subgroup_calibration.py. The fixed code loads data, creates a shifted split, trains the base classifier, and computes metrics. Your method only controls the post-hoc calibration mapping.

class CalibrationMethod:
    def fit(self, probs, labels, groups=None):
        ...

    def predict_proba(self, probs, groups=None):
        ...

Inputs are positive-class probabilities from the base classifier. groups contains subgroup IDs when available and may be ignored by group-agnostic methods.
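
As an illustration of how the interface might be filled in, the sketch below implements group-wise temperature scaling with a fallback to a global temperature. Several details are assumptions rather than facts from the task: that predict_proba returns a 1-D array of calibrated positive-class probabilities, the min_group_size threshold, the grid of candidate temperatures, and the fallback rule for groups that are missing, small, or unseen at test time.

```python
import numpy as np


def _logit(p, eps=1e-6):
    """Map probabilities to logits, clipping to avoid infinities."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return np.log(p) - np.log1p(-p)


def _fit_temperature(logits, labels, grid):
    """Return the grid temperature with the lowest binary log loss."""
    best_t, best_loss = 1.0, np.inf
    for t in grid:
        z = logits / t
        # log(1 + exp(z)) - y * z is the per-sample binary log loss.
        loss = np.mean(np.logaddexp(0.0, z) - labels * z)
        if loss < best_loss:
            best_t, best_loss = float(t), loss
    return best_t


class CalibrationMethod:
    """Sketch: per-group temperatures with a global fallback (assumed design)."""

    def __init__(self, min_group_size=30):
        self.min_group_size = min_group_size  # hypothetical threshold
        self.grid = np.logspace(-1, 1, 101)   # assumed search range 0.1..10
        self.global_t_ = 1.0
        self.group_t_ = {}

    def fit(self, probs, labels, groups=None):
        logits = _logit(probs)
        labels = np.asarray(labels, dtype=float)
        self.global_t_ = _fit_temperature(logits, labels, self.grid)
        self.group_t_ = {}
        if groups is not None:
            groups = np.asarray(groups)
            for g in np.unique(groups):
                mask = groups == g
                # Only trust a per-group fit when the group is large enough.
                if mask.sum() >= self.min_group_size:
                    self.group_t_[g] = _fit_temperature(
                        logits[mask], labels[mask], self.grid
                    )
        return self

    def predict_proba(self, probs, groups=None):
        logits = _logit(probs)
        # Start from the global temperature, then override known groups.
        temps = np.full(logits.shape, self.global_t_)
        if groups is not None:
            groups = np.asarray(groups)
            for g, t in self.group_t_.items():
                temps[groups == g] = t
        return 1.0 / (1.0 + np.exp(-logits / temps))
```

The fallback matters under subgroup shift: a group that is rare in the calibration data can dominate the test data, and a temperature fitted on a handful of samples is likely to be noisy.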

Evaluation

This benchmark uses three lightweight tabular proxies that are already available in the current scikit-learn package setup. We would normally prefer Adult, ACSIncome, COMPAS, and Law School Admissions, but those require package-level data changes that are outside this task directory. To keep the benchmark runnable offline, we use cached scikit-learn datasets with similar calibration and subgroup-shift behavior:

breast_cancer: binary classification on the scikit-learn breast cancer dataset
california_housing: binary high-value/low-value decision built from California housing
diabetes: binary high-risk/low-risk decision built from the diabetes target

For each dataset, the split is intentionally shifted:

a domain score determines the held-out test tail
subgroup labels are quartiles of a separate proxy feature
calibration is fit on the source region and evaluated on the shifted region
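
The three steps above could be sketched roughly as follows. This is not the benchmark's fixed split code, which is not shown here: the function name make_shifted_split, the choice of feature columns, the 30% test tail, and the random train/calibration division of the source region are all assumptions.

```python
import numpy as np


def make_shifted_split(X, domain_col=0, group_col=1,
                       test_frac=0.3, cal_frac=0.5, seed=0):
    """Hypothetical shifted split: test = upper tail of a domain score."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Subgroup IDs 0..3 are quartiles of a separate proxy feature.
    proxy = X[:, group_col]
    edges = np.quantile(proxy, [0.25, 0.5, 0.75])
    groups = np.searchsorted(edges, proxy, side="right")

    # The samples with the highest domain scores form the held-out test tail.
    order = np.argsort(X[:, domain_col])
    n_test = int(round(test_frac * n))
    test_idx = order[n - n_test:]
    source_idx = order[: n - n_test]

    # The remaining source region is divided at random into
    # calibration and training indices.
    rng = np.random.default_rng(seed)
    source_idx = rng.permutation(source_idx)
    n_cal = int(round(cal_frac * len(source_idx)))
    cal_idx, train_idx = source_idx[:n_cal], source_idx[n_cal:]
    return train_idx, cal_idx, test_idx, groups
```

When the domain score and the proxy feature are correlated, the quartile mix in the test tail differs from the mix in the source region, which is the subgroup composition shift the task targets.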

Metrics

Lower is better for:

worst_group_ece
brier
max_subgroup_gap

Higher is better for:

subgroup_auroc
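
The task text does not give the exact metric definitions, so the sketch below shows only one common reading of two of them: the Brier score as mean squared error between probabilities and labels, and worst_group_ece as the maximum over subgroups of a binned expected calibration error. The 10 equal-width bins and all function names are assumptions, and the benchmark's fixed metric code may differ.

```python
import numpy as np


def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return float(np.mean((probs - labels) ** 2))


def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE with equal-width bins (assumed definition)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Bin index in 0..n_bins-1; a probability of exactly 1.0 joins the last bin.
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # Gap between mean confidence and observed positive rate,
            # weighted by the fraction of samples in the bin.
            gap = abs(probs[mask].mean() - labels[mask].mean())
            ece += mask.mean() * gap
    return float(ece)


def worst_group_ece(probs, labels, groups, n_bins=10):
    """Largest per-subgroup ECE (assumed definition)."""
    probs, labels, groups = map(np.asarray, (probs, labels, groups))
    return max(
        expected_calibration_error(probs[groups == g], labels[groups == g], n_bins)
        for g in np.unique(groups)
    )
```

Taking the maximum over subgroups means a method cannot hide a badly calibrated subgroup behind a good average, which is why this metric can move independently of the overall Brier score.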

Notes

The task is deliberately low compute and should run with a small tabular classifier.
If you need the exact Adult/ACSIncome/COMPAS/Law School datasets, they should be added through a package-level data change, not inside this task directory.

Code

custom_subgroup_calibration.py


"""Subgroup calibration under distribution shift.

The benchmark is intentionally offline and low compute. It uses cached
scikit-learn tabular proxies instead of downloading Adult/ACSIncome/COMPAS/
Law School because this task directory cannot change package-level data setup.

Fixed:
- dataset loading
- shifted train/calibration/test split
- base classifier training
- metric computation

Editable:
- CalibrationMethod
"""

Results

Model	Type	worst group ece breast cancer ↓	brier breast cancer ↓	subgroup auroc breast cancer ↑	max subgroup gap breast cancer ↓	worst group ece diabetes ↓	brier diabetes ↓	subgroup auroc diabetes ↑	max subgroup gap diabetes ↓	worst group ece california housing ↓	brier california housing ↓	subgroup auroc california housing ↑	max subgroup gap california housing ↓
beta_calibration	baseline	0.185	0.112	0.985	0.166	0.145	0.160	0.765	0.072	0.379	0.323	0.991	0.124
group_temperature_scaling	baseline	0.338	0.179	0.960	0.330	0.169	0.171	0.765	0.062	0.377	0.311	0.991	0.103
isotonic_regression	baseline	0.233	0.129	0.975	0.217	0.164	0.162	0.770	0.073	0.380	0.330	0.900	0.093
temperature_scaling	baseline	0.349	0.181	0.941	0.341	0.131	0.163	0.765	0.041	0.371	0.310	0.991	0.107
anthropic/claude-opus-4.6	vanilla	0.360	0.180	0.956	0.352	0.154	0.162	0.765	0.062	0.374	0.309	0.991	0.133
deepseek-reasoner	vanilla	0.349	0.181	0.941	0.341	0.194	0.169	0.765	0.087	0.377	0.311	0.991	0.104
google/gemini-3.1-pro-preview	vanilla	0.320	0.155	0.989	0.314	0.113	0.160	0.765	0.044	0.375	0.315	0.991	0.108
qwen/qwen3.6-plus	vanilla	0.097	0.049	0.985	0.085	0.154	0.168	0.765	0.048	0.401	0.284	0.991	0.286
anthropic/claude-opus-4.6	agent	0.180	0.123	0.985	0.124	0.143	0.162	0.765	0.064	0.373	0.316	0.991	0.117
deepseek-reasoner	agent	0.230	0.131	0.985	0.207	0.178	0.160	0.765	0.082	0.376	0.312	0.991	0.129
google/gemini-3.1-pro-preview	agent	0.320	0.155	0.989	0.314	0.113	0.160	0.765	0.044	0.375	0.315	0.991	0.108
openai/gpt-5.4	agent	-	-	-	-	-	-	-	-	-	-	-	-
qwen/qwen3.6-plus	agent	0.097	0.049	0.985	0.085	0.154	0.168	0.765	0.048	0.401	0.284	0.991	0.286

Agent Conversations

anthropic/claude-opus-4.6

7 steps

deepseek-reasoner

7 steps

google/gemini-3.1-pro-preview