Agent Conversation: deepseek-reasoner
Task: ml-missing-data-imputation
Missing Data Imputation
Research Question
Design a novel missing data imputation method that achieves low reconstruction error and preserves downstream predictive performance across diverse tabular datasets.
Background
Missing data is ubiquitous in real-world datasets. Simple approaches like mean/median imputation ignore feature correlations, while iterative predictive methods can capture them more directly. This task evaluates imputation methods that:
- Capture complex inter-feature dependencies
- Work well on datasets of varying sizes and feature types
- Produce imputations that preserve the statistical structure needed for downstream tasks
Task
Implement a custom imputation algorithm in the CustomImputer class in custom_imputation.py. The class follows the scikit-learn transformer interface: fit(X) learns from data with missing values (NaN), and transform(X) returns a complete matrix with no NaN values.
Interface
class CustomImputer(BaseEstimator, TransformerMixin):
    def __init__(self, random_state=42, max_iter=10):
        ...

    def fit(self, X, y=None):
        # X: numpy array (n_samples, n_features) with NaN for missing values
        # Learn imputation model
        return self

    def transform(self, X):
        # X: numpy array (n_samples, n_features) with NaN for missing values
        # Return: numpy array (n_samples, n_features) with NO NaN values
        return X_imputed
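A minimal usage sketch of this interface, assuming the class is importable from custom_imputation as laid out in the task (the toy matrix is illustrative):

import numpy as np
from custom_imputation import CustomImputer

X = np.array([[1.0, 2.0, np.nan],
              [np.nan, 3.0, 5.0],
              [4.0, np.nan, 6.0]])
imputer = CustomImputer(random_state=42, max_iter=10)
X_complete = imputer.fit(X).transform(X)   # fit learns, transform fills NaNs
assert not np.isnan(X_complete).any()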
Available libraries: numpy, scipy, scikit-learn (all submodules including sklearn.impute, sklearn.ensemble, sklearn.neighbors, etc.).
Evaluation
Evaluated on three datasets with 20% MCAR (Missing Completely At Random) missing values:
- Breast Cancer Wisconsin (569 samples, 30 features, binary classification)
- Wine (178 samples, 13 features, 3-class classification)
- California Housing (5000 samples, 8 features, regression)
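As a rough illustration (not the benchmark's actual masking code), 20% MCAR corruption can be produced by masking each entry independently:

import numpy as np

def add_mcar(X, rate=0.2, seed=0):
    # Hypothetical helper: mask each cell independently with probability `rate` (MCAR).
    rng = np.random.default_rng(seed)
    X_missing = X.astype(float)
    X_missing[rng.random(X.shape) < rate] = np.nan
    return X_missing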
Two metrics per dataset:
- RMSE: Root Mean Squared Error between imputed and true values (lower is better)
- downstream_score: Classification accuracy (breast_cancer, wine) or R^2 (california) using GradientBoosting on the imputed data (higher is better)
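A sketch of how these two metrics are plausibly computed (the helper name and CV details are assumptions; the benchmark's actual harness is not shown here):

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def score_imputation(imputer, X_true, X_missing, y):
    mask = np.isnan(X_missing)                    # cells that were masked out
    X_imp = imputer.fit(X_missing).transform(X_missing)
    rmse = np.sqrt(np.mean((X_imp[mask] - X_true[mask]) ** 2))
    clf = GradientBoostingClassifier(random_state=42)
    downstream = cross_val_score(clf, X_imp, y, cv=5).mean()
    return rmse, downstream

For california, GradientBoostingRegressor with its default R^2 score would take the place of the classifier and accuracy.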
scikit-learn/custom_imputation.py [EDITABLE — lines 36–131 only]
1: """Custom missing data imputation benchmark.
2:
3: This script evaluates a missing data imputation method across multiple datasets
4: with artificially introduced missing values. The agent should modify the EDITABLE
5: section to implement a novel imputation algorithm.
6:
7: Datasets (selected by $ENV):
8: - breast_cancer: Classification, 569 samples x 30 features (binary)
9: - wine: Classification, 178 samples x 13 features (3-class)
10: - california: Regression, 20640 samples x 8 features (continuous target)
11:
12: Missing patterns: MCAR (Missing Completely At Random) at 20% rate.
13:
14: Metrics:
15: - rms
Tool Input
1{
2 "op": "replace",
3 "filename": "scikit-learn/custom_imputation.py",
4 "start_line": 41,
5 "end_line": 131,
6 "content": "class CustomImputer(BaseEstimator, TransformerMixin):\n \"\"\"Correlation-Enhanced Iterative Forest Imputation (CEIFI).\n\n Algorithm:\n 1. Compute pairwise feature correlations from observed data.\n 2. For each feature, select top-k most correlated features as predictors.\n 3. Initialize missing values with column means.\n 4. Iterate up to max_iter:\n a. For each feature with missing values (sorted by missingness):\n - Use selected predictors to train a RandomForest regressor.\n - Predict missing values.\n b. Check convergence (normalized difference < tol).\n 5. Return imputed matrix.\n\n This approach leverages feature correlations to reduce noise and improve\n imputation accuracy, especially when features have varying relevance.\n \"\"\"\n\n def __init__(self, random_state=42, max_iter=10):\n self.random_state = random_state\n self.max_iter = max_iter\n self.n_estimators = 100\n self.top_k = 5 # number of correlated features to use\n self.tol = 1e-4\n\n def fit(self, X, y=None):\n # Store correlation matrix and predictor indices\n self._compute_correlations(X)\n # Fit by running imputation on X (store state)\n self._X_fitted = X.copy()\n self._fit_transform_internal(X)\n return self\n\n def transform(self, X):\n return self._fit_transform_internal(X)\n\n def fit_transform(self, X, y=None):\n return self._fit_transform_internal(X)\n\n def _compute_correlations(self, X):\n \"\"\"Compute pairwise correlations and select top-k predictors per feature.\"\"\"\n n_features = X.shape[1]\n # Use helper function to compute correlations\n corr = compute_feature_correlations(X)\n self.corr_matrix_ = corr\n self.predictor_indices_ = []\n for j in range(n_features):\n # Get absolute correlations, exclude self\n abs_corr = np.abs(corr[j])\n abs_corr[j] = -1 # ignore self\n # Top k indices (largest absolute correlation)\n top_indices = np.argsort(abs_corr)[-self.top_k:]\n self.predictor_indices_.append(top_indices)\n\n def _fit_transform_internal(self, X):\n from sklearn.ensemble import RandomForestRegressor\n\n X_imp = X.copy()\n n_samples, n_features = X_imp.shape\n\n # Step 1: Initial imputation with column means\n col_means = np.nanmean(X_imp, axis=0)\n for j in range(n_features):\n mask_j = np.isnan(X_imp[:, j])\n X_imp[mask_j, j] = col_means[j]\n\n # If correlations not computed yet, compute them\n if not hasattr(self, 'predictor_indices_'):\n self._compute_correlations(X_imp)\n\n # Identify which features have missing values and sort by missingness\n miss_count = np.isnan(X).sum(axis=0)\n features_with_missing = np.where(miss_count > 0)[0]\n # Sort by number of missing values (ascending)\n features_with_missing = features_with_missing[\n np.argsort(miss_count[features_with_missing])\n ]\n\n if len(features_with_missing) == 0:\n return X_imp\n\n # Step 2: Iterative imputation\n for iteration in range(self.max_iter):\n X_prev = X_imp.copy()\n\n for j in features_with_missing:\n # Observed and missing indices for feature j\n obs_mask = ~np.isnan(X[:, j])\n mis_mask = np.isnan(X[:, j])\n\n if mis_mask.sum() == 0:\n continue\n\n # Predictor features (top correlated ones)\n pred_indices = self.predictor_indices_[j]\n # Ensure predictors do not include j (should be already excluded)\n pred_indices = [k for k in pred_indices if k != j]\n # If no predictors left, use all other features\n if len(pred_indices) == 0:\n pred_indices = [k for k in range(n_features) if k != j]\n\n X_train = X_imp[obs_mask][:, pred_indices]\n y_train = X[obs_mask, j] # Use original observed values\n X_pred = 
X_imp[mis_mask][:, pred_indices]\n\n # Train random forest and predict\n rf = RandomForestRegressor(\n n_estimators=self.n_estimators,\n max_features=\"sqrt\",\n random_state=self.random_state,\n n_jobs=-1,\n )\n rf.fit(X_train, y_train)\n X_imp[mis_mask, j] = rf.predict(X_pred)\n\n # Step 3: Check convergence\n diff = np.sum((X_imp - X_prev) ** 2)\n denom = np.sum(X_imp ** 2)\n if denom > 0 and diff / denom < self.tol:\n break\n\n return X_imp\n\n\ndef compute_feature_correlations(X):\n \"\"\"Compute pairwise correlations, ignoring NaN pairs.\n\n Args:\n X: array of shape (n_samples, n_features) with possible NaN values\n\n Returns:\n corr: array of shape (n_features, n_features) with correlation coefficients\n \"\"\"\n n_features = X.shape[1]\n corr = np.eye(n_features)\n for i in range(n_features):\n for j in range(i + 1, n_features):\n mask = ~(np.isnan(X[:, i]) | np.isnan(X[:, j]))\n if mask.sum() > 2:\n c = np.corrcoef(X[mask, i], X[mask, j])[0, 1]\n corr[i, j] = corr[j, i] = c if not np.isnan(c) else 0.0\n return corr"
7}OK: Replaced lines 41..131 with 142 line(s) in scikit-learn/custom_imputation.py. Editable range: 36–182.
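The predictor-selection step from the CEIFI docstring above, isolated as a standalone sketch (the helper name is illustrative):

import numpy as np

def top_k_predictors(corr, k=5):
    # For each feature j, pick the k features with the largest |correlation|,
    # excluding j itself (mirrors _compute_correlations in the tool call).
    abs_corr = np.abs(corr).copy()
    np.fill_diagonal(abs_corr, -1.0)
    return [np.argsort(abs_corr[j])[-k:] for j in range(corr.shape[0])]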
[Current file: scikit-learn/custom_imputation.py | editable: 36–182 | total: 312 lines]
36 # ================================================================
37 # EDITABLE -- agent modifies this section (lines 36 to 142)
38 # ================================================================
... (141 more lines) ...
180 c = np.corrcoef(X[mask, i], X[mask, j])[0, 1]
181 corr[i, j] = corr[j, i] = c if not np.isnan(c) else 0.0
182 return corr
[Test #1] (2 tests remaining; call submit(n=N) to choose which test result to submit as final)
### breast_cancer (scripts/breast_cancer.sh)
Training metrics (last lines):
TRAIN_METRICS stage=fitting
TRAIN_METRICS stage=done
TRAIN_METRICS rmse=0.506569
TRAIN_METRICS downstream_score=0.959587
TRAIN_METRICS baseline_no_missing=0.949076
Test results (breast_cancer):
rmse: 0.506569
downstream_score: 0.959587
### wine (scripts/wine.sh)
Training metrics (last lines):
TRAIN_METRICS stage=fitting
TRAIN_METRICS stage=done
TRAIN_METRICS rmse=0.846945
TRAIN_METRICS downstream_score=0.904286
TRAIN_METRICS baseline_no_missing=0.921429
Test results (wine):
rmse: 0.846945
downstream_score: 0.904286
[Leaderboard] Results saved: {'rmse_breast_cancer': 0.506569, 'downstream_score_breast_cancer': 0.959587, 'elapsed_breast_cancer': 61, 'rmse_wine': 0.846945, 'downstream_score_wine': 0.904286, 'elapsed_wine': 61, 'rmse_california': 0.758811, 'downstream_score_california': 0.671729, 'elapsed_calif
Tool Input
1{
2 "op": "replace",
3 "filename": "scikit-learn/custom_imputation.py",
4 "start_line": 41,
5 "end_line": 182,
6 "content": "class CustomImputer(BaseEstimator, TransformerMixin):\n \"\"\"SoftImpute: Matrix completion via nuclear norm regularization.\n\n Implements the SoftImpute algorithm (Mazumder et al., 2010):\n 1. Initialize missing entries with column means.\n 2. Repeat until convergence:\n a. Compute SVD of current completed matrix.\n b. Apply soft-thresholding to singular values: shrink(s, lambda) = max(s - lambda, 0).\n c. Reconstruct low-rank matrix.\n d. Replace missing entries with reconstructed values; keep observed entries fixed.\n e. Check convergence (relative change in imputed values).\n\n The nuclear norm regularization encourages low-rank solutions, which is appropriate\n when features are correlated and the data matrix is approximately low-rank.\n \"\"\"\n\n def __init__(self, random_state=42, max_iter=10):\n self.random_state = random_state\n self.max_iter = max_iter\n self.lambda_ = None # will be set based on data\n self.tol = 1e-4\n\n def fit(self, X, y=None):\n # Fit by running imputation on X (store state)\n self._X_fitted = X.copy()\n self._fit_transform_internal(X)\n return self\n\n def transform(self, X):\n return self._fit_transform_internal(X)\n\n def fit_transform(self, X, y=None):\n return self._fit_transform_internal(X)\n\n def _fit_transform_internal(self, X):\n X_imp = X.copy()\n n_samples, n_features = X_imp.shape\n\n # Step 1: Initial imputation with column means\n col_means = np.nanmean(X_imp, axis=0)\n for j in range(n_features):\n mask_j = np.isnan(X_imp[:, j])\n X_imp[mask_j, j] = col_means[j]\n\n # Determine lambda (regularization parameter) if not set\n if self.lambda_ is None:\n # Heuristic: lambda = 0.5 * sqrt(n_samples * n_features) * sigma_est\n # Estimate noise variance via column variances\n col_vars = np.nanvar(X, axis=0)\n sigma_est = np.sqrt(np.nanmean(col_vars))\n self.lambda_ = 0.5 * np.sqrt(max(n_samples, n_features)) * sigma_est\n\n # Identify missing positions (fixed across iterations)\n missing_mask = np.isnan(X)\n observed_mask = ~missing_mask\n\n # Step 2: Iterative soft-thresholded SVD\n for iteration in range(self.max_iter):\n X_prev = X_imp.copy()\n\n # Compute SVD of current completed matrix\n # Use randomized SVD for efficiency\n from sklearn.utils.extmath import randomized_svd\n U, s, Vt = randomized_svd(X_imp, n_components=min(n_samples, n_features, 50),\n random_state=self.random_state)\n\n # Soft-threshold singular values\n s_thresh = np.maximum(s - self.lambda_, 0.0)\n\n # Reconstruct low-rank matrix\n X_lr = U @ np.diag(s_thresh) @ Vt\n\n # Update missing entries with reconstructed values\n X_imp[missing_mask] = X_lr[missing_mask]\n\n # Keep observed entries unchanged (they are already unchanged because we didn't modify them)\n\n # Check convergence\n diff = np.linalg.norm(X_imp[missing_mask] - X_prev[missing_mask])\n norm = np.linalg.norm(X_imp[missing_mask])\n if norm > 0 and diff / norm < self.tol:\n break\n\n return X_imp\n\n\ndef compute_feature_correlations(X):\n \"\"\"Compute pairwise correlations, ignoring NaN pairs.\n\n Args:\n X: array of shape (n_samples, n_features) with possible NaN values\n\n Returns:\n corr: array of shape (n_features, n_features) with correlation coefficients\n \"\"\"\n n_features = X.shape[1]\n corr = np.eye(n_features)\n for i in range(n_features):\n for j in range(i + 1, n_features):\n mask = ~(np.isnan(X[:, i]) | np.isnan(X[:, j]))\n if mask.sum() > 2:\n c = np.corrcoef(X[mask, i], X[mask, j])[0, 1]\n corr[i, j] = corr[j, i] = c if not np.isnan(c) else 0.0\n return corr"
7}
OK: Replaced lines 41..182 with 104 line(s) in scikit-learn/custom_imputation.py. Editable range: 36–144.
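The core update from the SoftImpute docstring, restated as a standalone sketch using a full SVD instead of the randomized one in the tool call (the helper name is illustrative):

import numpy as np

def soft_impute_step(X_filled, missing_mask, lam):
    # One iteration: SVD, soft-threshold singular values, refill missing cells.
    U, s, Vt = np.linalg.svd(X_filled, full_matrices=False)
    X_lowrank = (U * np.maximum(s - lam, 0.0)) @ Vt
    X_next = X_filled.copy()
    X_next[missing_mask] = X_lowrank[missing_mask]  # observed cells stay fixed
    return X_next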
[Current file: scikit-learn/custom_imputation.py | editable: 36–144 | total: 274 lines]
36 # ================================================================
37 # EDITABLE -- agent modifies this section (lines 36 to 142)
38 # ================================================================
... (103 more lines) ...
142 c = np.corrcoef(X[mask, i], X[mask, j])[0, 1]
143 corr[i, j] = corr[j, i] = c if not np.isnan(c) else 0.0
144 return corr
[Test #2] (1 test remaining; call submit(n=N) to choose which test result to submit as final)
### breast_cancer (scripts/breast_cancer.sh)
Training metrics (last lines):
TRAIN_METRICS stage=fitting
TRAIN_METRICS stage=done
TRAIN_METRICS rmse=0.555873
TRAIN_METRICS downstream_score=0.952569
TRAIN_METRICS baseline_no_missing=0.949076
Test results (breast_cancer):
rmse: 0.555873
downstream_score: 0.952569
### wine (scripts/wine.sh)
Training metrics (last lines):
TRAIN_METRICS stage=fitting
TRAIN_METRICS stage=done
TRAIN_METRICS rmse=0.840925
TRAIN_METRICS downstream_score=0.932698
TRAIN_METRICS baseline_no_missing=0.921429
Test results (wine):
rmse: 0.840925
downstream_score: 0.932698
[Leaderboard] Results saved: {'rmse_breast_cancer': 0.555873, 'downstream_score_breast_cancer': 0.952569, 'elapsed_breast_cancer': 6, 'rmse_wine': 0.840925, 'downstream_score_wine': 0.932698, 'elapsed_wine': 6, 'rmse_california': 0.839398, 'downstream_score_california': 0.626803, 'elapsed_californ
Tool Input
1{
2 "op": "replace",
3 "filename": "scikit-learn/custom_imputation.py",
4 "start_line": 41,
5 "end_line": 144,
6 "content": "class CustomImputer(BaseEstimator, TransformerMixin):\n \"\"\"Adaptive Iterative Imputation with Ensemble of Regressors (AIER).\n\n Algorithm:\n 1. Initialize missing values with column means.\n 2. For each iteration until convergence:\n a. For each feature with missing values (sorted by missingness):\n - Use all other features as predictors.\n - Train three base regressors: BayesianRidge, RandomForest, KNN.\n - Compute cross-validation errors on observed data for each regressor.\n - Set weights inversely proportional to errors.\n - Train each regressor on all observed data and predict missing values.\n - Combine predictions using weighted average.\n b. Update missing entries.\n c. Check convergence.\n\n This adaptively blends linear, tree-based, and neighbor-based models,\n leveraging their complementary strengths for different feature relationships.\n \"\"\"\n\n def __init__(self, random_state=42, max_iter=10):\n self.random_state = random_state\n self.max_iter = max_iter\n self.n_estimators = 100\n self.n_neighbors = 5\n self.tol = 1e-4\n\n def fit(self, X, y=None):\n # Store the fitted state by running fit_transform internally\n self._X_fitted = X.copy()\n self._fit_transform_internal(X)\n return self\n\n def transform(self, X):\n return self._fit_transform_internal(X)\n\n def fit_transform(self, X, y=None):\n return self._fit_transform_internal(X)\n\n def _fit_transform_internal(self, X):\n from sklearn.linear_model import BayesianRidge\n from sklearn.ensemble import RandomForestRegressor\n from sklearn.neighbors import KNeighborsRegressor\n from sklearn.model_selection import KFold\n\n X_imp = X.copy()\n n_samples, n_features = X_imp.shape\n\n # Step 1: Initial imputation with column means\n col_means = np.nanmean(X_imp, axis=0)\n for j in range(n_features):\n mask_j = np.isnan(X_imp[:, j])\n X_imp[mask_j, j] = col_means[j]\n\n # Identify which features have missing values and sort by missingness\n miss_count = np.isnan(X).sum(axis=0)\n features_with_missing = np.where(miss_count > 0)[0]\n # Sort by number of missing values (ascending)\n features_with_missing = features_with_missing[\n np.argsort(miss_count[features_with_missing])\n ]\n\n if len(features_with_missing) == 0:\n return X_imp\n\n # Prepare base regressors\n br = BayesianRidge()\n rf = RandomForestRegressor(\n n_estimators=self.n_estimators,\n max_features=\"sqrt\",\n random_state=self.random_state,\n n_jobs=-1\n )\n knn = KNeighborsRegressor(\n n_neighbors=self.n_neighbors,\n weights='distance'\n )\n regressors = [br, rf, knn]\n n_reg = len(regressors)\n\n # Step 2: Iterative imputation\n for iteration in range(self.max_iter):\n X_prev = X_imp.copy()\n\n for j in features_with_missing:\n obs_mask = ~np.isnan(X[:, j])\n mis_mask = np.isnan(X[:, j])\n if mis_mask.sum() == 0:\n continue\n\n # Predictor features (all except j)\n other_features = [k for k in range(n_features) if k != j]\n X_all = X_imp[:, other_features]\n y_true = X[:, j] # original with NaNs\n\n # Observed data\n X_obs = X_all[obs_mask]\n y_obs = y_true[obs_mask]\n\n # If too few observations, fall back to mean\n if len(y_obs) < 5:\n X_imp[mis_mask, j] = col_means[j]\n continue\n\n # Compute cross-validation errors for weighting\n cv = KFold(n_splits=min(5, len(y_obs)), shuffle=True, random_state=self.random_state)\n errors = []\n for reg in regressors:\n preds = []\n truths = []\n for train_idx, val_idx in cv.split(X_obs):\n reg_clone = self._clone_regressor(reg)\n reg_clone.fit(X_obs[train_idx], y_obs[train_idx])\n pred = 
reg_clone.predict(X_obs[val_idx])\n preds.extend(pred)\n truths.extend(y_obs[val_idx])\n mse = np.mean((np.array(preds) - np.array(truths)) ** 2)\n errors.append(mse + 1e-8) # avoid zero\n\n # Weights inversely proportional to errors\n inv_errors = 1.0 / np.array(errors)\n weights = inv_errors / inv_errors.sum()\n\n # Train each regressor on full observed data and predict missing values\n predictions = np.zeros((mis_mask.sum(), n_reg))\n for idx, reg in enumerate(regressors):\n reg_clone = self._clone_regressor(reg)\n reg_clone.fit(X_obs, y_obs)\n X_mis = X_all[mis_mask]\n predictions[:, idx] = reg_clone.predict(X_mis)\n\n # Weighted average\n weighted_pred = np.dot(predictions, weights)\n X_imp[mis_mask, j] = weighted_pred\n\n # Check convergence\n diff = np.sum((X_imp - X_prev) ** 2)\n denom = np.sum(X_imp ** 2)\n if denom > 0 and diff / denom < self.tol:\n break\n\n return X_imp\n\n def _clone_regressor(self, reg):\n \"\"\"Return a fresh copy of the regressor.\"\"\"\n from sklearn.base import clone\n return clone(reg)\n\n\ndef compute_feature_correlations(X):\n \"\"\"Compute pairwise correlations, ignoring NaN pairs.\n\n Args:\n X: array of shape (n_samples, n_features) with possible NaN values\n\n Returns:\n corr: array of shape (n_features, n_features) with correlation coefficients\n \"\"\"\n n_features = X.shape[1]\n corr = np.eye(n_features)\n for i in range(n_features):\n for j in range(i + 1, n_features):\n mask = ~(np.isnan(X[:, i]) | np.isnan(X[:, j]))\n if mask.sum() > 2:\n c = np.corrcoef(X[mask, i], X[mask, j])[0, 1]\n corr[i, j] = corr[j, i] = c if not np.isnan(c) else 0.0\n return corr"
7}OK: Replaced lines 41..144 with 167 line(s) in scikit-learn/custom_imputation.py. Editable range: 36–207.
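The cross-validated weighting scheme from the AIER docstring, shown in isolation; cross_val_score is an equivalent but more compact alternative to the manual KFold loop in the tool call (the helper name is illustrative):

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import cross_val_score

def inverse_error_weights(regressors, X_obs, y_obs):
    # Weight each base regressor inversely to its cross-validated MSE.
    errors = []
    for reg in regressors:
        mse = -cross_val_score(clone(reg), X_obs, y_obs,
                               scoring="neg_mean_squared_error", cv=5).mean()
        errors.append(mse + 1e-8)                 # guard against division by zero
    inv = 1.0 / np.array(errors)
    return inv / inv.sum()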
[Current file: scikit-learn/custom_imputation.py | editable: 36–207 | total: 337 lines]
36 # ================================================================
37 # EDITABLE -- agent modifies this section (lines 36 to 142)
38 # ================================================================
... (166 more lines) ...
205 c = np.corrcoef(X[mask, i], X[mask, j])[0, 1]
206 corr[i, j] = corr[j, i] = c if not np.isnan(c) else 0.0
207 return corr
[Test #3] (0 tests remaining; call submit(n=N) to choose which test result to submit as final)
[NOTE] This was your last test. You MUST now call submit(n=X) to choose which test result to submit as your final answer.
### breast_cancer (scripts/breast_cancer.sh)
Training metrics (last lines):
TRAIN_METRICS stage=fitting
TRAIN_METRICS stage=done
TRAIN_METRICS rmse=0.407222
TRAIN_METRICS downstream_score=0.945552
TRAIN_METRICS baseline_no_missing=0.949076
Test results (breast_cancer):
rmse: 0.407222
downstream_score: 0.945552
### wine (scripts/wine.sh)
Training metrics (last lines):
TRAIN_METRICS stage=fitting
TRAIN_METRICS stage=done
TRAIN_METRICS rmse=0.791544
TRAIN_METRICS downstream_score=0.909524
TRAIN_METRICS baseline_no_missing=0.921429
Test results (wine):
rmse: 0.791544
downstream_score: 0.909524
### california (scripts/california.sh)
Training metrics (last lines):
TRAIN_METRICS stage=fitting
TRAIN_METRICS stage=done
TRAIN_METRICS rmse=0.732366
TRAIN_METRICS downstrea
Tool Input
1{
2 "n": 3
3}
[submit] Submitting result from test #3 as final.
### breast_cancer (scripts/breast_cancer.sh)
Training metrics (last lines):
TRAIN_METRICS stage=fitting
TRAIN_METRICS stage=done
TRAIN_METRICS rmse=0.407222
TRAIN_METRICS downstream_score=0.945552
TRAIN_METRICS baseline_no_missing=0.949076
Test results (breast_cancer):
rmse: 0.407222
downstream_score: 0.945552
### wine (scripts/wine.sh)
Training metrics (last lines):
TRAIN_METRICS stage=fitting
TRAIN_METRICS stage=done
TRAIN_METRICS rmse=0.791544
TRAIN_METRICS downstream_score=0.909524
TRAIN_METRICS baseline_no_missing=0.921429
Test results (wine):
rmse: 0.791544
downstream_score: 0.909524
### california (scripts/california.sh)
Training metrics (last lines):
TRAIN_METRICS stage=fitting
TRAIN_METRICS stage=done
TRAIN_METRICS rmse=0.732366
TRAIN_METRICS downstream_score=0.672041
TRAIN_METRICS baseline_no_missing=0.762435
Test results (california):
rmse: 0.732366
downstream_score: 0.672041
[submit] Finalized test #3 as fina