Agent Conversation: google/gemini-3.1-pro-preview
Task: ml-missing-data-imputation
Missing Data Imputation
Research Question
Design a novel missing data imputation method that achieves low reconstruction error and preserves downstream predictive performance across diverse tabular datasets.
Background
Missing data is ubiquitous in real-world datasets. Simple approaches like mean/median imputation ignore feature correlations, while iterative predictive methods can capture them more directly. This task evaluates imputation methods that:
- Capture complex inter-feature dependencies
- Work well on datasets of varying sizes and feature types
- Produce imputations that preserve the statistical structure needed for downstream tasks
Task
Implement a custom imputation algorithm in the CustomImputer class in custom_imputation.py. The class follows the scikit-learn transformer interface: fit(X) learns from data with missing values (NaN), and transform(X) returns a complete matrix with no NaN values.
Interface
class CustomImputer(BaseEstimator, TransformerMixin):
    def __init__(self, random_state=42, max_iter=10):
        ...

    def fit(self, X, y=None):
        # X: numpy array (n_samples, n_features) with NaN for missing values
        # Learn imputation model
        return self

    def transform(self, X):
        # X: numpy array (n_samples, n_features) with NaN for missing values
        # Return: numpy array (n_samples, n_features) with NO NaN values
        return X_imputed
Available libraries: numpy, scipy, scikit-learn (all submodules including sklearn.impute, sklearn.ensemble, sklearn.neighbors, etc.).
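As a minimal illustration of the required interface (a baseline, not a competitive method), a column-mean imputer might look like the following; `MeanImputer` is a hypothetical name chosen here for illustration:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanImputer(BaseEstimator, TransformerMixin):
    """Minimal sketch of the fit/transform contract: fill NaNs with column means."""

    def __init__(self, random_state=42, max_iter=10):
        self.random_state = random_state
        self.max_iter = max_iter

    def fit(self, X, y=None):
        # Learn per-column means, ignoring NaN entries
        self.col_means_ = np.nanmean(X, axis=0)
        return self

    def transform(self, X):
        X_imputed = X.copy()
        mask = np.isnan(X_imputed)
        # Fill each missing entry with the mean of its column
        X_imputed[mask] = np.take(self.col_means_, np.nonzero(mask)[1])
        return X_imputed
```

`fit_transform` is inherited from `TransformerMixin`, so the class drops directly into the harness described above.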
Evaluation
Submissions are evaluated on three datasets with 20% MCAR (Missing Completely At Random) missing values:
- Breast Cancer Wisconsin (569 samples, 30 features, binary classification)
- Wine (178 samples, 13 features, 3-class classification)
- California Housing (5000 samples, 8 features, regression)
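Under MCAR, each entry is masked independently of both observed and unobserved values. A sketch of the corruption process is below; `add_mcar_missingness` is an illustrative helper, not part of the benchmark harness:

```python
import numpy as np

def add_mcar_missingness(X, rate=0.2, seed=0):
    """Mask each entry independently with probability `rate` (MCAR sketch)."""
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) < rate  # True where the entry becomes missing
    X_missing = X.astype(float).copy()
    X_missing[mask] = np.nan
    return X_missing, mask
```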
Two metrics per dataset:
- RMSE: Root Mean Squared Error between imputed and true values (lower is better)
- downstream_score: Classification accuracy (breast_cancer, wine) or R^2 (california) using GradientBoosting on the imputed data (higher is better)
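The RMSE metric is computed over the artificially masked entries only. A sketch of that computation follows; the harness's exact implementation (e.g. any per-feature scaling) is not shown in this transcript, so treat this as illustrative:

```python
import numpy as np

def imputation_rmse(X_true, X_imputed, mask):
    """Root mean squared error restricted to the masked (imputed) entries."""
    diff = X_imputed[mask] - X_true[mask]
    return float(np.sqrt(np.mean(diff ** 2)))
```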
scikit-learn/custom_imputation.py [EDITABLE — lines 36–131 only]
1: """Custom missing data imputation benchmark.
2:
3: This script evaluates a missing data imputation method across multiple datasets
4: with artificially introduced missing values. The agent should modify the EDITABLE
5: section to implement a novel imputation algorithm.
6:
7: Datasets (selected by $ENV):
8: - breast_cancer: Classification, 569 samples x 30 features (binary)
9: - wine: Classification, 178 samples x 13 features (3-class)
10: - california: Regression, 20640 samples x 8 features (continuous target)
11:
12: Missing patterns: MCAR (Missing Completely At Random) at 20% rate.
13:
14: Metrics:
15: - rms
Tool Input
1{
2 "filename": "custom_imputation.py",
3 "end_line": 131,
4 "content": "class CustomImputer(BaseEstimator, TransformerMixin):\n \"\"\"Latent and Missingness Augmented Iterative Imputation.\n \n Combines global low-rank structure extraction (SoftImpute/SVD) with \n nonlinear local feature modeling (ExtraTrees). \n \n Motivation:\n Tree-based regressors capture local nonlinear interactions well but struggle \n with global multicollinearity and low-rank representations. Linear methods\n (like Matrix Factorization or MICE) create excellent global representations \n but ignore nonlinear local structures.\n At each iteration, we decompose the current imputed matrix to a low-rank\n soft-thresholded proxy. These low-rank features encapsulate global \n covariation and are appended to the raw conditional features, along with \n the missingness mask. The extra trees effectively combine global context, \n missingness structure, and local conditional variables.\n \"\"\"\n\n def __init__(self, random_state=42, max_iter=10, shrinkage_ratio=0.1, n_estimators=60):\n self.random_state = random_state\n self.max_iter = max_iter\n self.shrinkage_ratio = shrinkage_ratio\n self.n_estimators = n_estimators\n self.tol = 1e-4\n\n def fit(self, X, y=None):\n return self\n\n def transform(self, X):\n from sklearn.ensemble import ExtraTreesRegressor\n \n X_imp = X.copy()\n n_samples, n_features = X_imp.shape\n \n # Step 1: Initial imputation with column means\n col_means = np.nanmean(X_imp, axis=0)\n col_means[np.isnan(col_means)] = 0.0 # fallback\n \n mask = np.isnan(X)\n for j in range(n_features):\n mask_j = mask[:, j]\n if mask_j.any():\n X_imp[mask_j, j] = col_means[j]\n\n miss_count = mask.sum(axis=0)\n features_with_missing = np.where(miss_count > 0)[0]\n # Sort by number of missing values (ascending)\n features_with_missing = features_with_missing[\n np.argsort(miss_count[features_with_missing])\n ]\n \n if len(features_with_missing) == 0:\n return X_imp\n\n for iteration in range(self.max_iter):\n X_prev = X_imp.copy()\n \n # Step 2: Global 
Low-Rank Context via SoftImpute Step\n X_centered = X_imp - col_means\n U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)\n lambda_val = self.shrinkage_ratio * s[0] if len(s) > 0 else 0\n s_shrunk = np.maximum(s - lambda_val, 0)\n X_soft = (U * s_shrunk) @ Vt + col_means\n \n for j in features_with_missing:\n mis_mask = mask[:, j]\n obs_mask = ~mis_mask\n \n if mis_mask.sum() == 0:\n continue\n \n other_features = [k for k in range(n_features) if k != j]\n \n # Step 3: Latent and Missingness Augmentation\n X_train_raw = X_imp[obs_mask][:, other_features]\n X_train_soft = X_soft[obs_mask][:, other_features]\n X_train_mask = mask[obs_mask][:, other_features].astype(float)\n \n X_train = np.hstack((X_train_raw, X_train_soft, X_train_mask))\n y_train = X[obs_mask, j]\n \n X_pred_raw = X_imp[mis_mask][:, other_features]\n X_pred_soft = X_soft[mis_mask][:, other_features]\n X_pred_mask = mask[mis_mask][:, other_features].astype(float)\n \n X_pred = np.hstack((X_pred_raw, X_pred_soft, X_pred_mask))\n \n rf = ExtraTreesRegressor(\n n_estimators=self.n_estimators,\n max_features=\"sqrt\",\n random_state=self.random_state + iteration,\n n_jobs=-1,\n )\n rf.fit(X_train, y_train)\n X_imp[mis_mask, j] = rf.predict(X_pred)\n \n # Convergence check\n diff = np.sum((X_imp - X_prev) ** 2)\n denom = np.sum(X_imp ** 2)\n if denom > 0 and diff / denom < self.tol:\n break\n \n return X_imp\n\n def fit_transform(self, X, y=None):\n return self.fit(X, y).transform(X)\n\n\ndef compute_feature_correlations(X):\n n_features = X.shape[1]\n corr = np.eye(n_features)\n for i in range(n_features):\n for j in range(i + 1, n_features):\n mask = ~(np.isnan(X[:, i]) | np.isnan(X[:, j]))\n if mask.sum() > 2:\n c = np.corrcoef(X[mask, i], X[mask, j])[0, 1]\n corr[i, j] = corr[j, i] = c if not np.isnan(c) else 0.0\n return corr\n",
5 "op": "replace",
6 "after_line": 131,
7 "start_line": 40
8}
ERROR: Package 'custom_imputation.py' is not in allowed packages
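The SoftImpute-style low-rank step described in the tool call above can be sketched in isolation. This reproduces only the soft-thresholded SVD proxy (singular values shrunk toward zero to extract global covariation), not the full iterative imputer:

```python
import numpy as np

def soft_impute_step(X_filled, shrinkage_ratio=0.1):
    """One soft-thresholded SVD step on an already-filled matrix.

    Shrinking singular values yields a low-rank proxy that carries global
    covariation; the transcript's imputer appends these columns as extra
    features for its per-column tree models.
    """
    col_means = X_filled.mean(axis=0)
    U, s, Vt = np.linalg.svd(X_filled - col_means, full_matrices=False)
    lam = shrinkage_ratio * s[0] if s.size else 0.0
    s_shrunk = np.maximum(s - lam, 0.0)  # soft-threshold the spectrum
    return (U * s_shrunk) @ Vt + col_means
```

With `shrinkage_ratio=0` the step is an identity reconstruction; larger ratios zero out the tail of the spectrum.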
Tool Input
1{
2 "end_line": 131,
3 "content": "class CustomImputer(BaseEstimator, TransformerMixin):\n \"\"\"Latent and Missingness Augmented Iterative Imputation.\n \n Combines global low-rank structure extraction (SoftImpute/SVD) with \n nonlinear local feature modeling (ExtraTrees). \n \n Motivation:\n Tree-based regressors capture local nonlinear interactions well but struggle \n with global multicollinearity and low-rank representations. Linear methods\n (like Matrix Factorization or MICE) create excellent global representations \n but ignore nonlinear local structures.\n At each iteration, we decompose the current imputed matrix to a low-rank\n soft-thresholded proxy. These low-rank features encapsulate global \n covariation and are appended to the raw conditional features, along with \n the missingness mask. The extra trees effectively combine global context, \n missingness structure, and local conditional variables.\n \"\"\"\n\n def __init__(self, random_state=42, max_iter=10, shrinkage_ratio=0.1, n_estimators=60):\n self.random_state = random_state\n self.max_iter = max_iter\n self.shrinkage_ratio = shrinkage_ratio\n self.n_estimators = n_estimators\n self.tol = 1e-4\n\n def fit(self, X, y=None):\n return self\n\n def transform(self, X):\n from sklearn.ensemble import ExtraTreesRegressor\n \n X_imp = X.copy()\n n_samples, n_features = X_imp.shape\n \n # Step 1: Initial imputation with column means\n col_means = np.nanmean(X_imp, axis=0)\n col_means[np.isnan(col_means)] = 0.0 # fallback\n \n mask = np.isnan(X)\n for j in range(n_features):\n mask_j = mask[:, j]\n if mask_j.any():\n X_imp[mask_j, j] = col_means[j]\n\n miss_count = mask.sum(axis=0)\n features_with_missing = np.where(miss_count > 0)[0]\n # Sort by number of missing values (ascending)\n features_with_missing = features_with_missing[\n np.argsort(miss_count[features_with_missing])\n ]\n \n if len(features_with_missing) == 0:\n return X_imp\n\n for iteration in range(self.max_iter):\n X_prev = X_imp.copy()\n \n # Step 2: Global 
Low-Rank Context via SoftImpute Step\n X_centered = X_imp - col_means\n U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)\n lambda_val = self.shrinkage_ratio * s[0] if len(s) > 0 else 0\n s_shrunk = np.maximum(s - lambda_val, 0)\n X_soft = (U * s_shrunk) @ Vt + col_means\n \n for j in features_with_missing:\n mis_mask = mask[:, j]\n obs_mask = ~mis_mask\n \n if mis_mask.sum() == 0:\n continue\n \n other_features = [k for k in range(n_features) if k != j]\n \n # Step 3: Latent and Missingness Augmentation\n X_train_raw = X_imp[obs_mask][:, other_features]\n X_train_soft = X_soft[obs_mask][:, other_features]\n X_train_mask = mask[obs_mask][:, other_features].astype(float)\n \n X_train = np.hstack((X_train_raw, X_train_soft, X_train_mask))\n y_train = X[obs_mask, j]\n \n X_pred_raw = X_imp[mis_mask][:, other_features]\n X_pred_soft = X_soft[mis_mask][:, other_features]\n X_pred_mask = mask[mis_mask][:, other_features].astype(float)\n \n X_pred = np.hstack((X_pred_raw, X_pred_soft, X_pred_mask))\n \n rf = ExtraTreesRegressor(\n n_estimators=self.n_estimators,\n max_features=\"sqrt\",\n random_state=self.random_state + iteration,\n n_jobs=-1,\n )\n rf.fit(X_train, y_train)\n X_imp[mis_mask, j] = rf.predict(X_pred)\n \n # Convergence check\n diff = np.sum((X_imp - X_prev) ** 2)\n denom = np.sum(X_imp ** 2)\n if denom > 0 and diff / denom < self.tol:\n break\n \n return X_imp\n\n def fit_transform(self, X, y=None):\n return self.fit(X, y).transform(X)\n\n\ndef compute_feature_correlations(X):\n n_features = X.shape[1]\n corr = np.eye(n_features)\n for i in range(n_features):\n for j in range(i + 1, n_features):\n mask = ~(np.isnan(X[:, i]) | np.isnan(X[:, j]))\n if mask.sum() > 2:\n c = np.corrcoef(X[mask, i], X[mask, j])[0, 1]\n corr[i, j] = corr[j, i] = c if not np.isnan(c) else 0.0\n return corr",
4 "start_line": 40,
5 "op": "replace",
6 "after_line": 131,
7 "filename": "scikit-learn/custom_imputation.py"
8}
OK: Replaced lines 40..131 with 118 line(s) in scikit-learn/custom_imputation.py. Editable range: 36–157.
[Current file: scikit-learn/custom_imputation.py | editable: 36–157 | total: 287 lines]
36 # ================================================================
37 # EDITABLE -- agent modifies this section (lines 36 to 142)
38 # ================================================================
... (116 more lines) ...
155 c = np.corrcoef(X[mask, i], X[mask, j])[0, 1]
156 corr[i, j] = corr[j, i] = c if not np.isnan(c) else 0.0
157 return corr
[Test #1] (2 tests remaining; call submit(n=N) to choose which test result to submit as final)
### breast_cancer (scripts/breast_cancer.sh)
Training metrics (last lines):
TRAIN_METRICS stage=fitting
TRAIN_METRICS stage=done
TRAIN_METRICS rmse=0.514071
TRAIN_METRICS downstream_score=0.940273
TRAIN_METRICS baseline_no_missing=0.949076
Test results (breast_cancer):
rmse: 0.514071
downstream_score: 0.940273
### wine (scripts/wine.sh)
Training metrics (last lines):
TRAIN_METRICS stage=fitting
TRAIN_METRICS stage=done
TRAIN_METRICS rmse=0.806119
TRAIN_METRICS downstream_score=0.915556
TRAIN_METRICS baseline_no_missing=0.921429
Test results (wine):
rmse: 0.806119
downstream_score: 0.915556
[Leaderboard] Results saved: {'rmse_breast_cancer': 0.514071, 'downstream_score_breast_cancer': 0.940273, 'elapsed_breast_cancer': 45, 'rmse_wine': 0.806119, 'downstream_score_wine': 0.915556, 'elapsed_wine': 45, 'rmse_california': 0.734353, 'downstream_score_california': 0.662919, 'elapsed_calif
Tool Input
1{
2 "after_line": 111,
3 "end_line": 153,
4 "op": "replace",
5 "content": "class CustomImputer(BaseEstimator, TransformerMixin):\n \"\"\"Residual-Boosted Hybrid Iterative Imputation.\n \n Combines strong global linear modeling (Ridge regression) with robust \n non-linear modeling (HistGradientBoosting) on the residuals.\n \n Motivation:\n Linear models (like BayesianRidge in MICE) are highly effective at \n capturing multicollinearity and exact linear dependencies (which dominates \n some datasets like Breast Cancer). However, they fail on complex non-linear \n features. Tree-based models (like ExtraTrees in MissForest) excel at \n non-linear local interactions but struggle to capture global linear \n covariance exactly. \n This hybrid regressor first fits a linear model, then applies gradient \n boosting to the residuals. We wrap this in an iterative imputation scheme \n with target clipping to prevent divergence. We also supply original \n missingness masks to the regressors to exploit structure in the missingness.\n \"\"\"\n\n def __init__(self, random_state=42, max_iter=10, alpha=1.0):\n self.random_state = random_state\n self.max_iter = max_iter\n self.alpha = alpha\n self.tol = 1e-4\n\n def fit(self, X, y=None):\n return self\n\n def transform(self, X):\n from sklearn.linear_model import Ridge\n from sklearn.ensemble import HistGradientBoostingRegressor\n \n X_imp = X.copy()\n n_samples, n_features = X_imp.shape\n \n # Step 1: Initial imputation with column means\n col_means = np.nanmean(X_imp, axis=0)\n col_means[np.isnan(col_means)] = 0.0 # fallback\n col_mins = np.nanmin(X, axis=0)\n col_maxs = np.nanmax(X, axis=0)\n \n mask = np.isnan(X)\n for j in range(n_features):\n mask_j = mask[:, j]\n if mask_j.any():\n X_imp[mask_j, j] = col_means[j]\n\n miss_count = mask.sum(axis=0)\n features_with_missing = np.where(miss_count > 0)[0]\n features_with_missing = features_with_missing[\n np.argsort(miss_count[features_with_missing])\n ]\n \n if len(features_with_missing) == 0:\n return X_imp\n\n # Missingness mask array to 
augment features\n mask_f = mask.astype(float)\n\n for iteration in range(self.max_iter):\n X_prev = X_imp.copy()\n \n for j in features_with_missing:\n mis_mask = mask[:, j]\n obs_mask = ~mis_mask\n \n if mis_mask.sum() == 0:\n continue\n \n other_features = [k for k in range(n_features) if k != j]\n \n # Combine other features and missingness masks for predictors\n X_train_base = X_imp[obs_mask][:, other_features]\n X_train_mask = mask_f[obs_mask][:, other_features]\n X_train = np.hstack((X_train_base, X_train_mask))\n y_train = X[obs_mask, j]\n \n X_pred_base = X_imp[mis_mask][:, other_features]\n X_pred_mask = mask_f[mis_mask][:, other_features]\n X_pred = np.hstack((X_pred_base, X_pred_mask))\n \n # 1. Linear Fit\n ridge = Ridge(alpha=self.alpha)\n ridge.fit(X_train, y_train)\n y_pred_linear = ridge.predict(X_pred)\n y_train_linear = ridge.predict(X_train)\n \n # 2. Residual Tree Fit\n res = y_train - y_train_linear\n if np.std(res) > 1e-5:\n tree = HistGradientBoostingRegressor(\n max_iter=50, \n max_depth=5, \n random_state=self.random_state + iteration,\n min_samples_leaf=3\n )\n tree.fit(X_train, res)\n y_pred_tree = tree.predict(X_pred)\n else:\n y_pred_tree = 0.0\n \n final_pred = y_pred_linear + y_pred_tree\n \n # Clip to min/max observed to ensure stability\n final_pred = np.clip(final_pred, col_mins[j], col_maxs[j])\n \n X_imp[mis_mask, j] = final_pred\n \n # Convergence check\n diff = np.sum((X_imp - X_prev) ** 2)\n denom = np.sum(X_imp ** 2)\n if denom > 0 and diff / denom < self.tol:\n break\n \n return X_imp\n\n def fit_transform(self, X, y=None):\n return self.fit(X, y).transform(X)\n\n\ndef compute_feature_correlations(X):\n n_features = X.shape[1]\n corr = np.eye(n_features)\n for i in range(n_features):\n for j in range(i + 1, n_features):\n mask = ~(np.isnan(X[:, i]) | np.isnan(X[:, j]))\n if mask.sum() > 2:\n c = np.corrcoef(X[mask, i], X[mask, j])[0, 1]\n corr[i, j] = corr[j, i] = c if not np.isnan(c) else 0.0\n return corr",
6 "filename": "scikit-learn/custom_imputation.py",
7 "start_line": 40
8}
OK: Replaced lines 40..153 with 130 line(s) in scikit-learn/custom_imputation.py. Editable range: 36–173.
[Current file: scikit-learn/custom_imputation.py | editable: 36–173 | total: 303 lines]
36 # ================================================================
37 # EDITABLE -- agent modifies this section (lines 36 to 142)
38 # ================================================================
... (132 more lines) ...
171 c = np.corrcoef(X[mask, i], X[mask, j])[0, 1]
172 corr[i, j] = corr[j, i] = c if not np.isnan(c) else 0.0
173 return corr
[Test #2] (1 test remaining; call submit(n=N) to choose which test result to submit as final)
### breast_cancer (scripts/breast_cancer.sh)
[COMMAND FAILED exit=1]
File "/srv/home/bohanlyu/MLS-Bench/vendor/workspace/ml-missing-data-imputation/google_gemini-3.1-pro-preview_20260414_110948/scikit-learn/custom_imputation.py", line 170
if mask.sum() > 2:
IndentationError: unexpected indent
ERROR conda.cli.main_run:execute(142): `conda run bash /srv/home/bohanlyu/MLS-Bench/vendor/workspace/ml-missing-data-imputation/google_gemini-3.1-pro-preview_20260414_110948/.local_scripts/scripts/breast_cancer.sh` failed. (See above for error)
### wine (scripts/wine.sh)
[COMMAND FAILED exit=1]
File "/srv/home/bohanlyu/MLS-Bench/vendor/workspace/ml-missing-data-imputation/google_gemini-3.1-pro-preview_20260414_110948/scikit-learn/custom_imputation.py", line 170
if mask.sum() > 2:
IndentationError: unexpected indent
ERROR conda.cli.main_run:execute(142): `conda run bash /srv/home/bohanlyu/ML
Tool Input
1{
2 "op": "replace",
3 "after_line": 164,
4 "content": "def compute_feature_correlations(X):\n n_features = X.shape[1]\n corr = np.eye(n_features)\n for i in range(n_features):\n for j in range(i + 1, n_features):\n mask = ~(np.isnan(X[:, i]) | np.isnan(X[:, j]))\n if mask.sum() > 2:\n c = np.corrcoef(X[mask, i], X[mask, j])[0, 1]\n corr[i, j] = corr[j, i] = c if not np.isnan(c) else 0.0\n return corr",
5 "start_line": 164,
6 "end_line": 173,
7 "filename": "scikit-learn/custom_imputation.py"
8}
OK: Replaced lines 164..173 with 10 line(s) in scikit-learn/custom_imputation.py. Editable range: 36–173.
[Current file: scikit-learn/custom_imputation.py | editable: 36–173 | total: 303 lines]
36 # ================================================================
37 # EDITABLE -- agent modifies this section (lines 36 to 142)
38 # ================================================================
... (132 more lines) ...
171 c = np.corrcoef(X[mask, i], X[mask, j])[0, 1]
172 corr[i, j] = corr[j, i] = c if not np.isnan(c) else 0.0
173 return corr
Tool Input
1{
2 "filename": "scikit-learn/custom_imputation.py",
3 "start_line": 164,
4 "after_line": 164,
5 "content": "def compute_feature_correlations(X):\n n_features = X.shape[1]\n corr = np.eye(n_features)\n for i in range(n_features):\n for j in range(i + 1, n_features):\n mask = ~(np.isnan(X[:, i]) | np.isnan(X[:, j]))\n if mask.sum() > 2:\n c = np.corrcoef(X[mask, i], X[mask, j])[0, 1]\n corr[i, j] = corr[j, i] = c if not np.isnan(c) else 0.0\n return corr",
6 "end_line": 173,
7 "op": "replace"
8}
OK: Replaced lines 164..173 with 10 line(s) in scikit-learn/custom_imputation.py. Editable range: 36–173.
[Current file: scikit-learn/custom_imputation.py | editable: 36–173 | total: 303 lines]
36 # ================================================================
37 # EDITABLE -- agent modifies this section (lines 36 to 142)
38 # ================================================================
... (132 more lines) ...
171 c = np.corrcoef(X[mask, i], X[mask, j])[0, 1]
172 corr[i, j] = corr[j, i] = c if not np.isnan(c) else 0.0
173 return corr
Tool Input
1{
2 "content": "class CustomImputer(BaseEstimator, TransformerMixin):\n def __init__(self, random_state=42, max_iter=10, alpha=1.0):\n self.random_state = random_state\n self.max_iter = max_iter\n self.alpha = alpha\n self.tol = 1e-4\n\n def fit(self, X, y=None):\n return self\n\n def transform(self, X):\n from sklearn.linear_model import Ridge\n from sklearn.ensemble import HistGradientBoostingRegressor\n \n X_imp = X.copy()\n n_samples, n_features = X_imp.shape\n \n col_means = np.nanmean(X_imp, axis=0)\n col_means[np.isnan(col_means)] = 0.0\n col_mins = np.nanmin(X, axis=0)\n col_maxs = np.nanmax(X, axis=0)\n \n mask = np.isnan(X)\n for j in range(n_features):\n mask_j = mask[:, j]\n if mask_j.any():\n X_imp[mask_j, j] = col_means[j]\n\n miss_count = mask.sum(axis=0)\n features_with_missing = np.where(miss_count > 0)[0]\n features_with_missing = features_with_missing[np.argsort(miss_count[features_with_missing])]\n \n if len(features_with_missing) == 0:\n return X_imp\n\n mask_f = mask.astype(float)\n\n for iteration in range(self.max_iter):\n X_prev = X_imp.copy()\n \n for j in features_with_missing:\n mis_mask = mask[:, j]\n obs_mask = ~mis_mask\n \n if mis_mask.sum() == 0:\n continue\n \n other_features = [k for k in range(n_features) if k != j]\n \n X_train_base = X_imp[obs_mask][:, other_features]\n X_train_mask = mask_f[obs_mask][:, other_features]\n X_train = np.hstack((X_train_base, X_train_mask))\n y_train = X[obs_mask, j]\n \n X_pred_base = X_imp[mis_mask][:, other_features]\n X_pred_mask = mask_f[mis_mask][:, other_features]\n X_pred = np.hstack((X_pred_base, X_pred_mask))\n \n ridge = Ridge(alpha=self.alpha)\n ridge.fit(X_train, y_train)\n y_pred_linear = ridge.predict(X_pred)\n y_train_linear = ridge.predict(X_train)\n \n res = y_train - y_train_linear\n if np.std(res) > 1e-5:\n tree = HistGradientBoostingRegressor(\n max_iter=50, \n max_depth=5, \n random_state=self.random_state + iteration,\n min_samples_leaf=3\n )\n tree.fit(X_train, res)\n 
y_pred_tree = tree.predict(X_pred)\n else:\n y_pred_tree = 0.0\n \n final_pred = y_pred_linear + y_pred_tree\n final_pred = np.clip(final_pred, col_mins[j], col_maxs[j])\n \n X_imp[mis_mask, j] = final_pred\n \n diff = np.sum((X_imp - X_prev) ** 2)\n denom = np.sum(X_imp ** 2)\n if denom > 0 and diff / denom < self.tol:\n break\n \n return X_imp\n\n def fit_transform(self, X, y=None):\n return self.fit(X, y).transform(X)\n\ndef compute_feature_correlations(X):\n n_features = X.shape[1]\n corr = np.eye(n_features)\n for i in range(n_features):\n for j in range(i + 1, n_features):\n mask = ~(np.isnan(X[:, i]) | np.isnan(X[:, j]))\n if mask.sum() > 2:\n c = np.corrcoef(X[mask, i], X[mask, j])[0, 1]\n corr[i, j] = corr[j, i] = c if not np.isnan(c) else 0.0\n return corr\n",
3 "after_line": 38,
4 "end_line": 173,
5 "filename": "scikit-learn/custom_imputation.py",
6 "start_line": 39,
7 "op": "replace"
8}
OK: Replaced lines 39..173 with 101 line(s) in scikit-learn/custom_imputation.py. Editable range: 36–139.
[Current file: scikit-learn/custom_imputation.py | editable: 36–139 | total: 269 lines]
36 # ================================================================
37 # EDITABLE -- agent modifies this section (lines 36 to 142)
38 # ================================================================
... (98 more lines) ...
137 c = np.corrcoef(X[mask, i], X[mask, j])[0, 1]
138 corr[i, j] = corr[j, i] = c if not np.isnan(c) else 0.0
139 return corr
Tool Input
1{
2 "after_line": 38,
3 "end_line": 139,
4 "start_line": 39,
5 "op": "replace",
6 "filename": "scikit-learn/custom_imputation.py",
7 "content": "class CustomImputer(BaseEstimator, TransformerMixin):\n \"\"\"Residual-Boosted Hybrid Iterative Imputation.\n \n Combines strong global linear modeling (Bayesian Ridge regression) with \n robust non-linear modeling (Extremely Randomized Trees) on the residuals.\n \n Motivation:\n Linear models (like BayesianRidge in MICE) are highly effective at \n capturing multicollinearity and exact linear dependencies (which dominates \n some datasets like Breast Cancer). However, they fail on complex non-linear \n relationships. Tree-based models (like ExtraTrees in MissForest) excel at \n non-linear local interactions but struggle to capture global linear \n covariance patterns cleanly. \n This hybrid regressor first fits a regularized linear model, then applies\n ExtraTrees to its residuals. We wrap this in an iterative imputation \n scheme with target clipping to prevent divergence. We also supply original \n missingness indicator masks to the regressors, which is crucial for \n handling correlated missingness patterns.\n \"\"\"\n\n def __init__(self, random_state=42, max_iter=10):\n self.random_state = random_state\n self.max_iter = max_iter\n self.tol = 1e-4\n\n def fit(self, X, y=None):\n return self\n\n def transform(self, X):\n from sklearn.linear_model import BayesianRidge\n from sklearn.ensemble import ExtraTreesRegressor\n \n X_imp = X.copy()\n n_samples, n_features = X_imp.shape\n \n col_means = np.nanmean(X_imp, axis=0)\n col_means[np.isnan(col_means)] = 0.0\n col_mins = np.nanmin(X, axis=0)\n col_maxs = np.nanmax(X, axis=0)\n \n mask = np.isnan(X)\n for j in range(n_features):\n mask_j = mask[:, j]\n if mask_j.any():\n X_imp[mask_j, j] = col_means[j]\n\n miss_count = mask.sum(axis=0)\n features_with_missing = np.where(miss_count > 0)[0]\n features_with_missing = features_with_missing[np.argsort(miss_count[features_with_missing])]\n \n if len(features_with_missing) == 0:\n return X_imp\n\n mask_f = mask.astype(float)\n\n for iteration in 
range(self.max_iter):\n X_prev = X_imp.copy()\n \n for j in features_with_missing:\n mis_mask = mask[:, j]\n obs_mask = ~mis_mask\n \n if mis_mask.sum() == 0:\n continue\n \n other_features = [k for k in range(n_features) if k != j]\n \n X_train_base = X_imp[obs_mask][:, other_features]\n X_train_mask = mask_f[obs_mask][:, other_features]\n X_train = np.hstack((X_train_base, X_train_mask))\n y_train = X[obs_mask, j]\n \n X_pred_base = X_imp[mis_mask][:, other_features]\n X_pred_mask = mask_f[mis_mask][:, other_features]\n X_pred = np.hstack((X_pred_base, X_pred_mask))\n \n # 1. Global Linear Structure\n br = BayesianRidge()\n br.fit(X_train, y_train)\n y_pred_linear = br.predict(X_pred)\n y_train_linear = br.predict(X_train)\n \n # 2. Local Non-linear corrections on residuals\n res = y_train - y_train_linear\n if np.std(res) > 1e-5:\n tree = ExtraTreesRegressor(\n n_estimators=50, \n max_features=\"sqrt\",\n min_samples_leaf=2,\n random_state=self.random_state + iteration,\n n_jobs=-1\n )\n tree.fit(X_train, res)\n y_pred_tree = tree.predict(X_pred)\n else:\n y_pred_tree = 0.0\n \n final_pred = y_pred_linear + y_pred_tree\n final_pred = np.clip(final_pred, col_mins[j], col_maxs[j])\n \n X_imp[mis_mask, j] = final_pred\n \n # Convergence check\n diff = np.sum((X_imp - X_prev) ** 2)\n denom = np.sum(X_imp ** 2)\n if denom > 0 and diff / denom < self.tol:\n break\n \n return X_imp\n\n def fit_transform(self, X, y=None):\n return self.fit(X, y).transform(X)\n\ndef compute_feature_correlations(X):\n n_features = X.shape[1]\n corr = np.eye(n_features)\n for i in range(n_features):\n for j in range(i + 1, n_features):\n mask = ~(np.isnan(X[:, i]) | np.isnan(X[:, j]))\n if mask.sum() > 2:\n c = np.corrcoef(X[mask, i], X[mask, j])[0, 1]\n corr[i, j] = corr[j, i] = c if not np.isnan(c) else 0.0\n return corr"
8}
OK: Replaced lines 39..173 with 123 line(s) in scikit-learn/custom_imputation.py. Editable range: 36–161.
[Current file: scikit-learn/custom_imputation.py | editable: 36–161 | total: 291 lines]
36 # ================================================================
37 # EDITABLE -- agent modifies this section (lines 36 to 142)
38 # ================================================================
... (120 more lines) ...
159 c = np.corrcoef(X[mask, i], X[mask, j])[0, 1]
160 corr[i, j] = corr[j, i] = c if not np.isnan(c) else 0.0
161 return corr