ml-missing-data-imputation

Classical ML · scikit-learn · rigorous codebase

Description

Missing Data Imputation

Research Question

Design a novel missing data imputation method that achieves low reconstruction error and preserves downstream predictive performance across diverse tabular datasets.

Background

Missing data is ubiquitous in real-world datasets. Simple approaches like mean/median imputation ignore feature correlations, while iterative predictive methods can exploit them. This task evaluates imputation methods that:

  • Capture complex inter-feature dependencies
  • Work well on datasets of varying sizes and feature types
  • Produce imputations that preserve the statistical structure needed for downstream tasks

Task

Implement a custom imputation algorithm in the CustomImputer class in custom_imputation.py. The class follows the scikit-learn transformer interface: fit(X) learns from data with missing values (NaN), and transform(X) returns a complete matrix with no NaN values.

Interface

class CustomImputer(BaseEstimator, TransformerMixin):
    def __init__(self, random_state=42, max_iter=10):
        ...

    def fit(self, X, y=None):
        # X: numpy array (n_samples, n_features) with NaN for missing values
        # Learn imputation model
        return self

    def transform(self, X):
        # X: numpy array (n_samples, n_features) with NaN for missing values
        # Return: numpy array (n_samples, n_features) with NO NaN values
        return X_imputed

Available libraries: numpy, scipy, scikit-learn (all submodules including sklearn.impute, sklearn.ensemble, sklearn.neighbors, etc.).
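As a concrete illustration of the interface (not the benchmark's reference solution), a minimal sketch could delegate to scikit-learn's IterativeImputer with a tree-based estimator; the estimator choice and hyperparameters below are illustrative assumptions:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer


class CustomImputer(BaseEstimator, TransformerMixin):
    """Round-robin iterative imputation: each feature with missing
    values is regressed on the others with a tree ensemble."""

    def __init__(self, random_state=42, max_iter=10):
        self.random_state = random_state
        self.max_iter = max_iter

    def fit(self, X, y=None):
        self._imputer = IterativeImputer(
            estimator=ExtraTreesRegressor(
                n_estimators=50, random_state=self.random_state
            ),
            max_iter=self.max_iter,
            random_state=self.random_state,
        )
        self._imputer.fit(X)
        return self

    def transform(self, X):
        # Must return a complete matrix: no NaN values remain.
        X_imputed = self._imputer.transform(X)
        assert not np.isnan(X_imputed).any()
        return X_imputed
```

Tree-based estimators handle nonlinear inter-feature dependencies that a plain linear round-robin (classic MICE) misses, at the cost of slower fitting on larger datasets.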

Evaluation

Methods are evaluated on three datasets with 20% MCAR (Missing Completely At Random) missing values:

  • Breast Cancer Wisconsin (569 samples, 30 features, binary classification)
  • Wine (178 samples, 13 features, 3-class classification)
  • California Housing (5000 samples, 8 features, regression)
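An MCAR mask at a fixed rate simply hides each entry independently with the same probability. A sketch of how such a mask could be injected (the helper name is hypothetical, not part of the benchmark):

```python
import numpy as np
from sklearn.datasets import load_wine


def inject_mcar(X, rate=0.2, random_state=0):
    """Mask each entry independently with probability `rate` (MCAR)."""
    rng = np.random.default_rng(random_state)
    mask = rng.random(size=X.shape) < rate
    X_missing = X.astype(float).copy()
    X_missing[mask] = np.nan
    return X_missing, mask


# Example: 20% MCAR on the Wine dataset (178 x 13).
X, _ = load_wine(return_X_y=True)
X_missing, mask = inject_mcar(X, rate=0.2)
```

Because the mask is independent of both observed and unobserved values, MCAR is the easiest missingness mechanism; MAR or MNAR patterns would require conditioning the mask on the data.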

Two metrics per dataset:

  • RMSE: Root Mean Squared Error between imputed and true values (lower is better)
  • downstream_score: Classification accuracy (breast_cancer, wine) or R^2 (california) using GradientBoosting on the imputed data (higher is better)
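The two metrics above can be sketched as follows; the exact evaluation protocol (cross-validation folds, preprocessing) is an assumption, not the benchmark's published harness:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score


def masked_rmse(X_true, X_imputed, mask):
    """RMSE computed only over the entries that were actually missing."""
    diff = X_imputed[mask] - X_true[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


def downstream_score(X_imputed, y):
    """Cross-validated accuracy of GradientBoosting on the imputed data
    (for the regression dataset, a GradientBoostingRegressor scored with
    R^2 would be used instead)."""
    clf = GradientBoostingClassifier(random_state=0)
    return float(cross_val_score(clf, X_imputed, y, cv=5).mean())
```

Restricting the RMSE to masked entries matters: including the observed (unchanged) entries would dilute the error toward zero and make all methods look similar.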

Code

custom_imputation.py
"""Custom missing data imputation benchmark.

This script evaluates a missing data imputation method across multiple datasets
with artificially introduced missing values. The agent should modify the EDITABLE
section to implement a novel imputation algorithm.

Datasets (selected by $ENV):
 - breast_cancer: Classification, 569 samples x 30 features (binary)
 - wine: Classification, 178 samples x 13 features (3-class)
 - california: Regression, 20640 samples x 8 features (continuous target)

Missing patterns: MCAR (Missing Completely At Random) at 20% rate.

Metrics:
 - rmse: Root Mean Squared Error of imputed vs true values (lower is better)

Results

Model                           Type      RMSE (breast cancer)  Downstream (breast cancer)  RMSE (wine)  Downstream (wine)
gain                            baseline  0.476                 0.950                       0.727        0.936
knn                             baseline  0.586                 0.950                       0.796        0.942
mean_impute                     baseline  0.994                 0.946                       1.033        0.927
mice                            baseline  0.414                 0.963                       0.931        0.940
missforest                      baseline  0.478                 0.954                       0.740        0.927
anthropic/claude-opus-4.6       vanilla   0.394                 0.953                       0.797        0.933
deepseek-reasoner               vanilla   0.507                 0.960                       0.847        0.904
google/gemini-3.1-pro-preview   vanilla   0.514                 0.940                       0.806        0.916
openai/gpt-5.4                  vanilla   0.473                 0.944                       0.809        0.927
qwen/qwen3.6-plus               vanilla   0.502                 0.946                       0.798        0.921
anthropic/claude-opus-4.6       agent     0.394                 0.953                       0.797        0.933
deepseek-reasoner               agent     0.407                 0.946                       0.792        0.910
google/gemini-3.1-pro-preview   agent     0.414                 0.947                       1.096        0.933
openai/gpt-5.4                  agent     0.452                 0.953                       0.823        0.938
qwen/qwen3.6-plus               agent     0.502                 0.946                       0.798        0.921

Agent Conversations