ml-calibration

Tags: Classical ML · scikit-learn · rigorous codebase

Description

Probability Calibration Method Design

Research Question

Design a novel post-hoc probability calibration method that maps uncalibrated classifier outputs to well-calibrated probabilities, minimizing the gap between predicted confidence and actual accuracy.

Background

Modern classifiers (neural networks, gradient boosting, random forests) often produce overconfident or poorly calibrated probability estimates. A well-calibrated model should satisfy: among all predictions where the model outputs probability p for a class, the fraction that are actually correct should be approximately p.

Classic calibration methods include Platt scaling (logistic regression on logits), isotonic regression (non-parametric monotonic mapping), and histogram binning (piecewise constant). Recent advances include temperature scaling, beta calibration, and spline-based methods. Each has trade-offs in flexibility, data efficiency, and monotonicity preservation.
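To make the trade-offs concrete, here is a minimal sketch of temperature scaling, one of the baselines above: a single scalar T is fit by minimizing NLL on the calibration set, which preserves the ranking of predictions by construction. The function name `fit_temperature` and the recovery of logits as log-probabilities are illustrative choices, not part of the benchmark's fixed code.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import log_softmax

def fit_temperature(probs, labels):
    """Fit a scalar temperature T by minimizing NLL on a calibration set.

    probs: (n, C) uncalibrated probabilities; labels: (n,) int class labels.
    Logits are recovered as log(probs), which is valid up to a per-row
    additive constant that the softmax ignores.
    """
    logits = np.log(np.clip(probs, 1e-12, None))

    def nll(T):
        # Rescale logits by 1/T and score the true class under log-softmax.
        scaled = log_softmax(logits / T, axis=1)
        return -scaled[np.arange(len(labels)), labels].mean()

    res = minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded")
    return res.x
```

Because a single parameter is shared across all classes, temperature scaling is very data-efficient but cannot fix class-dependent miscalibration, which is where the more flexible methods (isotonic, beta, splines) earn their keep.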

Task

Implement the CalibrationMethod class in custom_calibration.py. The class has two methods:

  • fit(probs, labels): Learn a calibration mapping from uncalibrated probabilities and ground-truth labels on a held-out calibration set.
  • predict_proba(probs): Apply the learned mapping to produce calibrated probabilities.

For binary classification, probabilities are 1-D (positive class only). For multiclass, they are 2-D arrays where rows sum to 1.

Interface

class CalibrationMethod(BaseEstimator):
    def fit(self, probs, labels):
        # probs: (n,) for binary, (n, C) for multiclass
        # labels: (n,) integer class labels
        return self

    def predict_proba(self, probs):
        # Returns calibrated probabilities, same shape as input
        return calibrated_probs

Available imports in the template: numpy, scipy (optimize, interpolate, special), sklearn (various).
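As a shape-handling illustration, the following is one hypothetical way to fill in the interface using Platt-style logistic calibration; it is a sketch of a simple baseline, not the benchmark's reference implementation. Binary inputs (1-D) are calibrated directly on the log-odds; multiclass inputs (2-D) are calibrated per class one-vs-rest and renormalized to sum to 1.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.linear_model import LogisticRegression

class CalibrationMethod(BaseEstimator):
    """Platt-scaling sketch matching the benchmark interface (illustrative)."""

    def fit(self, probs, labels):
        probs = np.asarray(probs, dtype=float)
        labels = np.asarray(labels)
        eps = 1e-12
        if probs.ndim == 1:
            # Binary: logistic regression on the log-odds of the positive class.
            z = np.log((probs + eps) / (1 - probs + eps)).reshape(-1, 1)
            self.models_ = [LogisticRegression().fit(z, labels)]
        else:
            # Multiclass: one calibrator per class on one-vs-rest targets.
            self.models_ = []
            for c in range(probs.shape[1]):
                z = np.log(probs[:, c] + eps).reshape(-1, 1)
                y = (labels == c).astype(int)
                self.models_.append(LogisticRegression().fit(z, y))
        return self

    def predict_proba(self, probs):
        probs = np.asarray(probs, dtype=float)
        eps = 1e-12
        if probs.ndim == 1:
            z = np.log((probs + eps) / (1 - probs + eps)).reshape(-1, 1)
            return self.models_[0].predict_proba(z)[:, 1]
        cols = []
        for c, model in enumerate(self.models_):
            z = np.log(probs[:, c] + eps).reshape(-1, 1)
            cols.append(model.predict_proba(z)[:, 1])
        out = np.column_stack(cols)
        # Renormalize so rows sum to 1, as the task requires.
        return out / out.sum(axis=1, keepdims=True)
```

Note the output shape contract: 1-D in gives 1-D out, 2-D in gives 2-D out with rows summing to 1.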

Evaluation

The method is evaluated on 4 classifier-dataset combinations spanning binary and multiclass settings:

  • Random Forest on MNIST (10-class)
  • MLP on Fashion-MNIST (10-class)
  • GBM on Madelon (binary)
  • SVM on Breast Cancer (binary)

Metrics (all lower is better):

  • ECE (Expected Calibration Error): Weighted average of |accuracy - confidence| across probability bins
  • Brier Score: Mean squared error between predicted probabilities and one-hot labels
  • NLL (Negative Log-Likelihood): Cross-entropy between predicted probabilities and true labels
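The three metrics can be computed directly from calibrated probabilities. The sketch below assumes 2-D multiclass inputs and equal-width confidence bins for ECE (the benchmark's exact binning scheme is not specified here, so the bin count is an assumption).

```python
import numpy as np

def ece(probs, labels, n_bins=15):
    """Expected Calibration Error over equal-width bins of top-class confidence."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            # Weight each bin's |accuracy - confidence| gap by its population.
            total += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return total

def brier(probs, labels):
    """Mean squared error between probabilities and one-hot labels."""
    onehot = np.eye(probs.shape[1])[labels]
    return np.mean(np.sum((probs - onehot) ** 2, axis=1))

def nll(probs, labels):
    """Negative log-likelihood of the true class."""
    p_true = np.clip(probs[np.arange(len(labels)), labels], 1e-12, None)
    return -np.mean(np.log(p_true))
```

For example, a model that always predicts its top class with confidence 0.9 and is always right has ECE 0.1: confident, accurate, yet still miscalibrated.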

Code

custom_calibration.py
"""ML Calibration Benchmark.

Evaluate post-hoc probability calibration methods across different classifiers
and datasets.

FIXED: Classifier training, data loading, evaluation metrics, train/calibrate/test split.
EDITABLE: CalibrationMethod class (fit + predict_proba).

Usage:
    python scikit-learn/custom_calibration.py \
        --classifier rf --dataset mnist --seed 42
"""

import argparse
import math

Results

Each cell lists ECE / Brier / NLL (all lower is better); "-" marks combinations without reported results.

Model               | Type     | rf-mnist              | mlp-fashion mnist     | gbm-madelon           | svm-breast cancer
isotonic_regression | baseline | 0.016 / 0.075 / 0.237 | 0.012 / 0.184 / 0.386 | - / - / -             | 0.019 / 0.033 / 0.095
platt_scaling       | baseline | 0.025 / 0.075 / 0.163 | 0.012 / 0.182 / 0.359 | 0.029 / 0.145 / 0.451 | 0.049 / 0.026 / 0.101
temperature_scaling | baseline | 0.010 / 0.072 / 0.155 | 0.010 / 0.188 / 0.372 | - / - / -             | 0.030 / 0.025 / 0.086