ml-missing-data-imputation

Classical ML · scikit-learn · rigorous codebase

Description

Missing Data Imputation

Research Question

Design a novel missing data imputation method that achieves low reconstruction error and preserves downstream predictive performance across diverse tabular datasets.

Background

Missing data is ubiquitous in real-world datasets. Simple approaches like mean/median imputation ignore feature correlations, while iterative predictive methods can exploit them. This task evaluates imputation methods that:

  • Capture complex inter-feature dependencies
  • Work well on datasets of varying sizes and feature types
  • Produce imputations that preserve the statistical structure needed for downstream tasks

Task

Implement a custom imputation algorithm in the CustomImputer class in custom_imputation.py. The class follows the scikit-learn transformer interface: fit(X) learns from data with missing values (NaN), and transform(X) returns a complete matrix with no NaN values.

Interface

class CustomImputer(BaseEstimator, TransformerMixin):
    def __init__(self, random_state=42, max_iter=10):
        ...

    def fit(self, X, y=None):
        # X: numpy array (n_samples, n_features) with NaN for missing values
        # Learn imputation model
        return self

    def transform(self, X):
        # X: numpy array (n_samples, n_features) with NaN for missing values
        # Return: numpy array (n_samples, n_features) with NO NaN values
        return X_imputed

Available libraries: numpy, scipy, scikit-learn (all submodules including sklearn.impute, sklearn.ensemble, sklearn.neighbors, etc.).
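As a concrete illustration of the interface (not the benchmark's reference solution), a minimal sketch could delegate to scikit-learn's IterativeImputer with a tree-based estimator; the estimator choice and hyperparameters below are illustrative assumptions:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer


class CustomImputer(BaseEstimator, TransformerMixin):
    """Round-robin iterative imputation: each feature with missing
    values is regressed on the others with a tree ensemble."""

    def __init__(self, random_state=42, max_iter=10):
        self.random_state = random_state
        self.max_iter = max_iter

    def fit(self, X, y=None):
        self._imputer = IterativeImputer(
            estimator=ExtraTreesRegressor(
                n_estimators=50, random_state=self.random_state
            ),
            max_iter=self.max_iter,
            random_state=self.random_state,
        )
        self._imputer.fit(X)
        return self

    def transform(self, X):
        # Must return a complete matrix: no NaN values remain.
        X_imputed = self._imputer.transform(X)
        assert not np.isnan(X_imputed).any()
        return X_imputed
```

Tree-based estimators handle nonlinear inter-feature dependencies that a plain linear round-robin (classic MICE) misses, at the cost of slower fitting on larger datasets.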

Evaluation

Methods are evaluated on three datasets with 20% MCAR (Missing Completely At Random) missing values:

  • Breast Cancer Wisconsin (569 samples, 30 features, binary classification)
  • Wine (178 samples, 13 features, 3-class classification)
  • California Housing (5000 samples, 8 features, regression)
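An MCAR mask at a fixed rate simply hides each entry independently with the same probability. A sketch of how such a mask could be injected (the helper name is hypothetical, not part of the benchmark):

```python
import numpy as np
from sklearn.datasets import load_wine


def inject_mcar(X, rate=0.2, random_state=0):
    """Mask each entry independently with probability `rate` (MCAR)."""
    rng = np.random.default_rng(random_state)
    mask = rng.random(size=X.shape) < rate
    X_missing = X.astype(float).copy()
    X_missing[mask] = np.nan
    return X_missing, mask


# Example: 20% MCAR on the Wine dataset (178 x 13).
X, _ = load_wine(return_X_y=True)
X_missing, mask = inject_mcar(X, rate=0.2)
```

Because the mask is independent of both observed and unobserved values, MCAR is the easiest missingness mechanism; MAR or MNAR patterns would require conditioning the mask on the data.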

Two metrics per dataset:

  • RMSE: Root Mean Squared Error between imputed and true values (lower is better)
  • downstream_score: Classification accuracy (breast_cancer, wine) or R^2 (california) using GradientBoosting on the imputed data (higher is better)
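The two metrics above can be sketched as follows; the exact evaluation protocol (cross-validation folds, preprocessing) is an assumption, not the benchmark's published harness:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score


def masked_rmse(X_true, X_imputed, mask):
    """RMSE computed only over the entries that were actually missing."""
    diff = X_imputed[mask] - X_true[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


def downstream_score(X_imputed, y):
    """Cross-validated accuracy of GradientBoosting on the imputed data
    (for the regression dataset, a GradientBoostingRegressor scored with
    R^2 would be used instead)."""
    clf = GradientBoostingClassifier(random_state=0)
    return float(cross_val_score(clf, X_imputed, y, cv=5).mean())
```

Restricting the RMSE to masked entries matters: including the observed (unchanged) entries would dilute the error toward zero and make all methods look similar.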

Code

custom_imputation.py
"""Custom missing data imputation benchmark.

This script evaluates a missing data imputation method across multiple datasets
with artificially introduced missing values. The agent should modify the EDITABLE
section to implement a novel imputation algorithm.

Datasets (selected by $ENV):
 - breast_cancer: Classification, 569 samples x 30 features (binary)
 - wine: Classification, 178 samples x 13 features (3-class)
 - california: Regression, 20640 samples x 8 features (continuous target)

Missing patterns: MCAR (Missing Completely At Random) at 20% rate.

Metrics:
 - rmse: Root Mean Squared Error of imputed vs true values (lower is better)

Results

Model                           Type      RMSE (breast cancer)  Downstream (breast cancer)  RMSE (wine)  Downstream (wine)
gain                            baseline  0.476                 0.950                       0.727        0.936
knn                             baseline  0.586                 0.950                       0.796        0.942
mean_impute                     baseline  0.994                 0.946                       1.033        0.927
mice                            baseline  0.414                 0.963                       0.931        0.940
missforest                      baseline  0.478                 0.954                       0.740        0.927
anthropic/claude-opus-4.6       vanilla   0.394                 0.953                       0.797        0.933
deepseek-reasoner               vanilla   0.507                 0.960                       0.847        0.904
google/gemini-3.1-pro-preview   vanilla   0.514                 0.940                       0.806        0.916
openai/gpt-5.4                  vanilla   0.473                 0.944                       0.809        0.927
qwen/qwen3.6-plus               vanilla   0.502                 0.946                       0.798        0.921
anthropic/claude-opus-4.6       agent     0.394                 0.953                       0.797        0.933
deepseek-reasoner               agent     0.407                 0.946                       0.792        0.910
google/gemini-3.1-pro-preview   agent     0.414                 0.947                       1.096        0.933
openai/gpt-5.4                  agent     0.452                 0.953                       0.823        0.938
qwen/qwen3.6-plus               agent     0.502                 0.946                       0.798        0.921

Agent Conversations