ml-missing-data-imputation
Description
Missing Data Imputation
Research Question
Design a novel missing data imputation method that achieves low reconstruction error and preserves downstream predictive performance across diverse tabular datasets.
Background
Missing data is ubiquitous in real-world datasets. Simple approaches like mean/median imputation ignore feature correlations, while iterative predictive methods model each feature from the others and so can exploit those correlations. This task evaluates whether an imputation method:
- Captures complex inter-feature dependencies
- Works well on datasets of varying sizes and feature types
- Produces imputations that preserve the statistical structure needed for downstream tasks
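To illustrate the contrast drawn above between mean imputation and iterative predictive methods, here is a minimal sketch on synthetic data. The two-feature dataset, the seed, and the helper name rmse_on_hidden are assumptions made for illustration; they are not part of the benchmark.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(0)

# Synthetic data (illustration only): the second feature is nearly a
# linear function of the first, so the two are strongly correlated.
x0 = rng.normal(size=500)
X_true = np.column_stack([x0, 2.0 * x0 + 0.1 * rng.normal(size=500)])

# Hide roughly 20% of the second feature's values.
X_missing = X_true.copy()
hidden = rng.random(500) < 0.2
X_missing[hidden, 1] = np.nan


def rmse_on_hidden(X_imputed):
    """RMSE over the hidden entries of the second feature only."""
    diff = X_imputed[hidden, 1] - X_true[hidden, 1]
    return float(np.sqrt(np.mean(diff ** 2)))


# Mean imputation fills every gap with the same column mean.
rmse_mean = rmse_on_hidden(SimpleImputer(strategy="mean").fit_transform(X_missing))

# Iterative imputation regresses the incomplete feature on the other one.
rmse_iter = rmse_on_hidden(IterativeImputer(random_state=0).fit_transform(X_missing))

print(f"mean imputation RMSE:      {rmse_mean:.3f}")
print(f"iterative imputation RMSE: {rmse_iter:.3f}")
```

On data like this the iterative imputer should recover the hidden values far more closely, because it can read them off the correlated feature. Real datasets with weaker or nonlinear dependencies will show a smaller gap.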
Task
Implement a custom imputation algorithm in the CustomImputer class in custom_imputation.py. The class follows the scikit-learn transformer interface: fit(X) learns from data with missing values (NaN), and transform(X) returns a complete matrix with no NaN values.
Interface
class CustomImputer(BaseEstimator, TransformerMixin):
    def __init__(self, random_state=42, max_iter=10):
        ...

    def fit(self, X, y=None):
        # X: numpy array (n_samples, n_features) with NaN for missing values
        # Learn imputation model
        return self

    def transform(self, X):
        # X: numpy array (n_samples, n_features) with NaN for missing values
        # Return: numpy array (n_samples, n_features) with NO NaN values
        return X_imputed
Available libraries: numpy, scipy, scikit-learn (all submodules including sklearn.impute, sklearn.ensemble, sklearn.neighbors, etc.).
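As a starting point, the interface can be satisfied by a thin wrapper around an existing scikit-learn imputer. The sketch below is a placeholder baseline and not a novel method: it delegates to IterativeImputer, and the fitted attribute name imputer_ and the demo array are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer


class CustomImputer(BaseEstimator, TransformerMixin):
    """Placeholder baseline that delegates to scikit-learn's IterativeImputer."""

    def __init__(self, random_state=42, max_iter=10):
        # Store constructor arguments unchanged, as scikit-learn expects.
        self.random_state = random_state
        self.max_iter = max_iter

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # imputer_ is a hypothetical attribute name for the fitted model.
        self.imputer_ = IterativeImputer(
            max_iter=self.max_iter, random_state=self.random_state
        )
        self.imputer_.fit(X)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Observed entries are kept as they are; only NaN entries are filled.
        X_imputed = self.imputer_.transform(X)
        return X_imputed


# Small demonstration on random data with about 20% of entries removed.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
X_demo[rng.random(X_demo.shape) < 0.2] = np.nan
X_filled = CustomImputer().fit(X_demo).transform(X_demo)
print(X_filled.shape, np.isnan(X_filled).any())
```

A real submission would replace the body of fit and transform with its own algorithm while keeping the same signatures and the guarantee that the output contains no NaN values.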
Evaluation
Evaluated on three datasets with 20% MCAR (Missing Completely At Random) missing values:
- Breast Cancer Wisconsin (569 samples, 30 features, binary classification)
- Wine (178 samples, 13 features, 3-class classification)
- California Housing (5000 samples, 8 features, regression)
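Under MCAR, every entry is removed with the same probability, independently of its value and of every other entry. A minimal sketch of how such a mask could be generated is shown below. The function name make_mcar, the seed, and the toy matrix are assumptions; the benchmark's own masking code may differ.

```python
import numpy as np


def make_mcar(X, missing_rate=0.2, seed=0):
    """Return a copy of X with entries set to NaN completely at random.

    Hypothetical helper: each entry is dropped independently with
    probability missing_rate, regardless of its value.
    """
    rng = np.random.default_rng(seed)
    mask = rng.random(np.shape(X)) < missing_rate  # True where a value is removed
    X_missing = np.array(X, dtype=float)           # np.array copies by default
    X_missing[mask] = np.nan
    return X_missing, mask


# Toy complete matrix: 200 samples x 10 features.
X_full = np.arange(2000, dtype=float).reshape(200, 10)
X_missing, mask = make_mcar(X_full)
print(f"observed missing rate: {mask.mean():.3f}")
```

Keeping the boolean mask alongside the corrupted matrix matters, because reconstruction error is only meaningful on the entries that were actually removed.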
Two metrics per dataset:
- RMSE: Root Mean Squared Error between imputed and true values (lower is better)
- downstream_score: Classification accuracy (breast_cancer, wine) or R^2 (california) using GradientBoosting on the imputed data (higher is better)
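The two metrics can be sketched as follows on the Wine dataset. Several details here are assumptions, since this description does not specify them: features are standardized before masking, RMSE is computed only over the masked entries, and the downstream score is a 5-fold cross-validated accuracy. Mean imputation stands in for the method under test.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

X_true, y = load_wine(return_X_y=True)
# Assumption: features are standardized so RMSE is comparable across columns.
X_true = StandardScaler().fit_transform(X_true)

# Remove about 20% of entries completely at random.
rng = np.random.default_rng(0)
mask = rng.random(X_true.shape) < 0.2
X_missing = X_true.copy()
X_missing[mask] = np.nan

# Stand-in imputer; any object with fit/transform could be used here.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X_missing)

# RMSE between imputed and true values, restricted to the removed entries.
rmse = float(np.sqrt(np.mean((X_imputed[mask] - X_true[mask]) ** 2)))

# Downstream score: accuracy of gradient boosting trained on the imputed data.
# Assumption: 5-fold cross-validation; the benchmark's split may differ.
clf = GradientBoostingClassifier(random_state=0)
downstream_score = float(cross_val_score(clf, X_imputed, y, cv=5).mean())

print(f"rmse: {rmse:.3f}  downstream_score: {downstream_score:.3f}")
```

For the regression dataset, the same pattern would apply with GradientBoostingRegressor and scoring="r2".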
Code
"""Custom missing data imputation benchmark.

This script evaluates a missing data imputation method across multiple datasets
with artificially introduced missing values. The agent should modify the EDITABLE
section to implement a novel imputation algorithm.

Datasets (selected by $ENV):
- breast_cancer: Classification, 569 samples x 30 features (binary)
- wine: Classification, 178 samples x 13 features (3-class)
- california: Regression, 20640 samples x 8 features (continuous target)

Missing patterns: MCAR (Missing Completely At Random) at 20% rate.

Metrics:
- rmse: Root Mean Squared Error of imputed vs true values (lower is better)
Results
| Model | Type | rmse breast cancer ↓ | downstream score breast cancer ↑ | rmse wine ↓ | downstream score wine ↑ |
|---|---|---|---|---|---|
| gain | baseline | 0.476 | 0.950 | 0.727 | 0.936 |
| knn | baseline | 0.586 | 0.950 | 0.796 | 0.942 |
| mean_impute | baseline | 0.994 | 0.946 | 1.033 | 0.927 |
| mice | baseline | 0.414 | 0.963 | 0.931 | 0.940 |
| missforest | baseline | 0.478 | 0.954 | 0.740 | 0.927 |
| anthropic/claude-opus-4.6 | vanilla | 0.394 | 0.953 | 0.797 | 0.933 |
| deepseek-reasoner | vanilla | 0.507 | 0.960 | 0.847 | 0.904 |
| google/gemini-3.1-pro-preview | vanilla | 0.514 | 0.940 | 0.806 | 0.916 |
| openai/gpt-5.4 | vanilla | 0.473 | 0.944 | 0.809 | 0.927 |
| qwen/qwen3.6-plus | vanilla | 0.502 | 0.946 | 0.798 | 0.921 |
| anthropic/claude-opus-4.6 | agent | 0.394 | 0.953 | 0.797 | 0.933 |
| deepseek-reasoner | agent | 0.407 | 0.946 | 0.792 | 0.910 |
| google/gemini-3.1-pro-preview | agent | 0.414 | 0.947 | 1.096 | 0.933 |
| openai/gpt-5.4 | agent | 0.452 | 0.953 | 0.823 | 0.938 |
| qwen/qwen3.6-plus | agent | 0.502 | 0.946 | 0.798 | 0.921 |