Agent Conversation: openai/gpt-5.4
Task: ml-missing-data-imputation
Missing Data Imputation
Research Question
Design a novel missing data imputation method that achieves low reconstruction error and preserves downstream predictive performance across diverse tabular datasets.
Background
Missing data is ubiquitous in real-world datasets. Simple approaches like mean/median imputation ignore feature correlations, while iterative predictive methods can capture them more directly. This task evaluates imputation methods that:
- Capture complex inter-feature dependencies
- Work well on datasets of varying sizes and feature types
- Produce imputations that preserve the statistical structure needed for downstream tasks
Task
Implement a custom imputation algorithm in the CustomImputer class in custom_imputation.py. The class follows the scikit-learn transformer interface: fit(X) learns from data with missing values (NaN), and transform(X) returns a complete matrix with no NaN values.
Interface
class CustomImputer(BaseEstimator, TransformerMixin):
    def __init__(self, random_state=42, max_iter=10):
        ...

    def fit(self, X, y=None):
        # X: numpy array (n_samples, n_features) with NaN for missing values
        # Learn imputation model
        return self

    def transform(self, X):
        # X: numpy array (n_samples, n_features) with NaN for missing values
        # Return: numpy array (n_samples, n_features) with NO NaN values
        return X_imputed
Available libraries: numpy, scipy, scikit-learn (all submodules including sklearn.impute, sklearn.ensemble, sklearn.neighbors, etc.).
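For orientation, a minimal transformer that satisfies this interface is sketched below. It is a column-mean baseline, not the submitted method; the SimpleMeanImputer name is illustrative.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class SimpleMeanImputer(BaseEstimator, TransformerMixin):
    """Illustrative baseline: fill every NaN with its column mean."""

    def __init__(self, random_state=42, max_iter=10):
        self.random_state = random_state
        self.max_iter = max_iter

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Column means over observed entries; all-NaN columns fall back to 0.
        self.statistics_ = np.nanmean(X, axis=0)
        self.statistics_ = np.where(np.isnan(self.statistics_), 0.0, self.statistics_)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float).copy()
        mask = np.isnan(X)
        # Map each missing entry's column index to its learned column mean.
        X[mask] = np.take(self.statistics_, np.where(mask)[1])
        return X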
Evaluation
Evaluated on three datasets with 20% MCAR (Missing Completely At Random) missing values:
- Breast Cancer Wisconsin (569 samples, 30 features, binary classification)
- Wine (178 samples, 13 features, 3-class classification)
- California Housing (5000-sample subset, 8 features, regression)
Two metrics per dataset:
- RMSE: Root Mean Squared Error between imputed and true values (lower is better)
- downstream_score: Classification accuracy (breast_cancer, wine) or R^2 (california) using GradientBoosting on the imputed data (higher is better)
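Concretely, the two metrics amount to the sketch below. The benchmark harness itself is not shown in this transcript, so the mask handling, cross-validation setup, and model settings here are assumptions.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def imputation_rmse(X_true, X_imputed, missing_mask):
    # RMSE only over the artificially masked entries.
    diff = X_imputed[missing_mask] - X_true[missing_mask]
    return float(np.sqrt(np.mean(diff ** 2)))

def downstream_score(X_imputed, y):
    # Cross-validated accuracy of GradientBoosting on the imputed matrix
    # (swap in GradientBoostingRegressor and R^2 for the regression dataset).
    model = GradientBoostingClassifier(random_state=42)
    return float(cross_val_score(model, X_imputed, y, cv=5).mean())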
scikit-learn/custom_imputation.py [EDITABLE — lines 36–131 only]
1: """Custom missing data imputation benchmark.
2:
3: This script evaluates a missing data imputation method across multiple datasets
4: with artificially introduced missing values. The agent should modify the EDITABLE
5: section to implement a novel imputation algorithm.
6:
7: Datasets (selected by $ENV):
8: - breast_cancer: Classification, 569 samples x 30 features (binary)
9: - wine: Classification, 178 samples x 13 features (3-class)
10: - california: Regression, 20640 samples x 8 features (continuous target)
11:
12: Missing patterns: MCAR (Missing Completely At Random) at 20% rate.
13:
14: Metrics:
15: - rms
Tool Input
1{
2 "op": "replace",
3 "filename": "scikit-learn/custom_imputation.py",
4 "content": "class CustomImputer(BaseEstimator, TransformerMixin):\n \"\"\"Self-calibrated hybrid imputer.\n\n Combines a correlation-restricted Bayesian regression update (global structure)\n with distance-weighted donor averaging (local structure). The blend weight is\n learned per feature via self-supervised masking of observed entries.\n \"\"\"\n\n def __init__(self, random_state=42, max_iter=10):\n self.random_state = random_state\n self.max_iter = max_iter\n self.max_predictors = 8\n self.damping = 0.7\n\n def _fill(self, X):\n Xf = np.asarray(X, dtype=float).copy()\n mask = np.isnan(Xf)\n if mask.any():\n Xf[mask] = np.take(self.statistics_, np.where(mask)[1])\n return Xf\n\n def _pred_set(self, corr, j):\n idx = np.delete(np.arange(corr.shape[0]), j)\n k = min(self.max_predictors, len(idx))\n return idx[np.argsort(-np.abs(corr[j, idx]))[:k]]\n\n def _knn(self, Xtr, ytr, Xte):\n from sklearn.neighbors import NearestNeighbors\n\n if Xte.shape[0] == 0:\n return np.zeros(0), np.zeros(0)\n if Xtr is None or len(ytr) == 0:\n return np.full(Xte.shape[0], np.mean(ytr) if len(ytr) else 0.0), np.zeros(Xte.shape[0])\n k = min(15, max(3, int(np.sqrt(len(ytr)))), len(ytr))\n nn = NearestNeighbors(n_neighbors=k)\n nn.fit(Xtr)\n d, idx = nn.kneighbors(Xte)\n w = 1.0 / (d + 0.1)\n w /= np.maximum(w.sum(axis=1, keepdims=True), 1e-12)\n return (w * ytr[idx]).sum(axis=1), np.exp(-np.median(d, axis=1))\n\n def _calibrate(self, X):\n from sklearn.linear_model import BayesianRidge\n\n rng = np.random.RandomState(self.random_state)\n obs = ~np.isnan(X)\n cal = np.zeros_like(obs, dtype=bool)\n for j in range(X.shape[1]):\n idx = np.where(obs[:, j])[0]\n m = min(20, max(1, len(idx) // 8), max(0, len(idx) - 5))\n if m > 0:\n cal[rng.choice(idx, size=m, replace=False), j] = True\n if not cal.any():\n return np.full(X.shape[1], 0.5)\n Xc = X.copy()\n Xc[cal] = np.nan\n Xs = self._fill(Xc)\n corr = np.nan_to_num(np.corrcoef(Xs, rowvar=False))\n np.fill_diagonal(corr, 1.0)\n weights = np.full(X.shape[1], 0.5)\n for j in range(X.shape[1]):\n test = cal[:, j]\n obsj = ~np.isnan(Xc[:, j])\n if test.sum() == 0 or obsj.sum() < 5:\n continue\n pred = self._pred_set(corr, j)\n model = BayesianRidge().fit(Xs[obsj][:, pred], X[obsj, j])\n g = model.predict(Xs[test][:, pred])\n l, _ = self._knn(Xs[obsj][:, pred], X[obsj, j], Xs[test][:, pred])\n eg = np.sqrt(np.mean((g - X[test, j]) ** 2)) + 1e-6\n el = np.sqrt(np.mean((l - X[test, j]) ** 2)) + 1e-6\n weights[j] = (1.0 / el) / ((1.0 / el) + (1.0 / eg))\n return np.clip(0.15 + 0.7 * weights, 0.15, 0.85)\n\n def _fit_impute(self, X):\n from sklearn.linear_model import BayesianRidge\n\n miss = np.isnan(X)\n Ximp = self._fill(X)\n order = np.argsort(miss.mean(axis=0))\n for _ in range(self.max_iter):\n prev = Ximp.copy()\n corr = np.nan_to_num(np.corrcoef(Ximp, rowvar=False))\n np.fill_diagonal(corr, 1.0)\n for j in order:\n mis, obs = miss[:, j], ~miss[:, j]\n if not mis.any() or obs.sum() < 5:\n continue\n pred = self._pred_set(corr, j)\n model = BayesianRidge().fit(Ximp[obs][:, pred], X[obs, j])\n g = model.predict(Ximp[mis][:, pred])\n l, closeness = self._knn(Ximp[obs][:, pred], X[obs, j], Ximp[mis][:, pred])\n w = np.clip(self.weights_[j] * (0.3 + 0.7 * closeness), 0.05, 0.95)\n vals = np.clip((1.0 - w) * g + w * l, self.lo_[j], self.hi_[j])\n Ximp[mis, j] = (1.0 - self.damping) * Ximp[mis, j] + self.damping * vals\n if miss.any() and np.sqrt(np.mean((Ximp[miss] - prev[miss]) ** 2)) < 1e-3:\n break\n return Ximp\n\n def _fit_models(self, X):\n from 
sklearn.linear_model import BayesianRidge\n\n corr = np.nan_to_num(np.corrcoef(self.X_train_imputed_, rowvar=False))\n np.fill_diagonal(corr, 1.0)\n self.predictors_, self.models_, self.donor_X_, self.donor_y_ = [], [], [], []\n for j in range(X.shape[1]):\n pred = self._pred_set(corr, j)\n obs = ~np.isnan(X[:, j])\n self.predictors_.append(pred)\n if obs.sum() < 2:\n self.models_.append(None)\n self.donor_X_.append(None)\n self.donor_y_.append(np.array([]))\n continue\n self.models_.append(BayesianRidge().fit(self.X_train_imputed_[obs][:, pred], X[obs, j]))\n self.donor_X_.append(self.X_train_imputed_[obs][:, pred])\n self.donor_y_.append(X[obs, j])\n\n def fit(self, X, y=None):\n X = np.asarray(X, dtype=float)\n self.statistics_ = np.nanmean(X, axis=0)\n self.statistics_ = np.where(np.isnan(self.statistics_), 0.0, self.statistics_)\n self.lo_ = np.nanpercentile(X, 1, axis=0)\n self.hi_ = np.nanpercentile(X, 99, axis=0)\n self.weights_ = self._calibrate(X)\n self.X_train_imputed_ = self._fit_impute(X)\n self._fit_models(X)\n return self\n\n def transform(self, X):\n X = np.asarray(X, dtype=float)\n miss = np.isnan(X)\n Ximp = self._fill(X)\n for _ in range(self.max_iter):\n prev = Ximp.copy()\n for j, pred in enumerate(self.predictors_):\n mis = miss[:, j]\n if not mis.any() or self.models_[j] is None:\n continue\n g = self.models_[j].predict(Ximp[mis][:, pred])\n l, closeness = self._knn(self.donor_X_[j], self.donor_y_[j], Ximp[mis][:, pred])\n w = np.clip(self.weights_[j] * (0.3 + 0.7 * closeness), 0.05, 0.95)\n vals = np.clip((1.0 - w) * g + w * l, self.lo_[j], self.hi_[j])\n Ximp[mis, j] = (1.0 - self.damping) * Ximp[mis, j] + self.damping * vals\n if miss.any() and np.sqrt(np.mean((Ximp[miss] - prev[miss]) ** 2)) < 1e-3:\n break\n return Ximp\n\n def fit_transform(self, X, y=None):\n return self.fit(X, y).X_train_imputed_.copy()\n\n\n# Helper functions for the custom imputer (optional, agent may add more)\ndef compute_feature_correlations(X):\n n_features = X.shape[1]\n corr = np.eye(n_features)\n for i in range(n_features):\n for j in range(i + 1, n_features):\n mask = ~(np.isnan(X[:, i]) | np.isnan(X[:, j]))\n if mask.sum() > 2:\n c = np.corrcoef(X[mask, i], X[mask, j])[0, 1]\n corr[i, j] = corr[j, i] = c if not np.isnan(c) else 0.0\n return corr\n",
5 "after_line": 40,
6 "start_line": 41,
7 "end_line": 131
8}
OK: Replaced lines 41..131 with 163 line(s) in scikit-learn/custom_imputation.py. Editable range: 36–203.
[Current file: scikit-learn/custom_imputation.py | editable: 36–203 | total: 333 lines]
36 # ================================================================
37 # EDITABLE -- agent modifies this section (lines 36 to 142)
38 # ================================================================
... (162 more lines) ...
201 c = np.corrcoef(X[mask, i], X[mask, j])[0, 1]
202 corr[i, j] = corr[j, i] = c if not np.isnan(c) else 0.0
203 return corr
[Test #1] (2 tests remaining; call submit(n=N) to choose which test result to submit as final)
### breast_cancer (scripts/breast_cancer.sh)
Training metrics (last lines):
TRAIN_METRICS stage=fitting
TRAIN_METRICS stage=done
TRAIN_METRICS rmse=0.472828
TRAIN_METRICS downstream_score=0.943782
TRAIN_METRICS baseline_no_missing=0.949076
Test results (breast_cancer):
rmse: 0.472828
downstream_score: 0.943782
### wine (scripts/wine.sh)
Training metrics (last lines):
TRAIN_METRICS stage=fitting
TRAIN_METRICS stage=done
TRAIN_METRICS rmse=0.808852
TRAIN_METRICS downstream_score=0.927143
TRAIN_METRICS baseline_no_missing=0.921429
Test results (wine):
rmse: 0.808852
downstream_score: 0.927143
[Leaderboard] Results saved: {'rmse_breast_cancer': 0.472828, 'downstream_score_breast_cancer': 0.943782, 'elapsed_breast_cancer': 7, 'rmse_wine': 0.808852, 'downstream_score_wine': 0.927143, 'elapsed_wine': 7, 'rmse_california': 0.76177, 'downstream_score_california': 0.65629, 'elapsed_californi
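The "self-supervised masking" in the docstring above boils down to scoring the global (BayesianRidge) and local (kNN donor) predictions on deliberately re-hidden observed entries and weighting each expert by inverse RMSE. A standalone sketch of that calibration step, with illustrative names:

import numpy as np

def blend_weight(y_true, global_pred, local_pred, eps=1e-6):
    # Score each expert on the pseudo-masked entries.
    e_global = np.sqrt(np.mean((global_pred - y_true) ** 2)) + eps
    e_local = np.sqrt(np.mean((local_pred - y_true) ** 2)) + eps
    # Weight proportional to inverse error, then squash into [0.15, 0.85]
    # so that neither expert is ever silenced (as in the patch above).
    w_local = (1.0 / e_local) / ((1.0 / e_local) + (1.0 / e_global))
    return np.clip(0.15 + 0.7 * w_local, 0.15, 0.85)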
Tool Input
1{
2 "op": "replace",
3 "filename": "scikit-learn/custom_imputation.py",
4 "content": "class CustomImputer(BaseEstimator, TransformerMixin):\n \"\"\"Self-calibrated hybrid imputer.\n\n Per feature, it learns a three-part update:\n 1) correlation-restricted Bayesian regression,\n 2) correlation-restricted ExtraTrees regression,\n 3) distance-weighted donor averaging.\n\n Pseudo-masking of observed entries learns feature-wise mixture weights for the\n global linear/nonlinear experts and for the global/local blend.\n \"\"\"\n\n def __init__(self, random_state=42, max_iter=10):\n self.random_state = random_state\n self.max_iter = max_iter\n self.max_predictors = 8\n self.damping = 0.7\n self.n_trees = 40\n\n def _fill(self, X):\n Xf = np.asarray(X, dtype=float).copy()\n mask = np.isnan(Xf)\n if mask.any():\n Xf[mask] = np.take(self.statistics_, np.where(mask)[1])\n return Xf\n\n def _pred_set(self, corr, j):\n idx = np.delete(np.arange(corr.shape[0]), j)\n k = min(max(1, int(np.sqrt(corr.shape[0]) * 2)), self.max_predictors, len(idx))\n return idx[np.argsort(-np.abs(corr[j, idx]))[:k]]\n\n def _make_tree(self):\n from sklearn.ensemble import ExtraTreesRegressor\n\n return ExtraTreesRegressor(\n n_estimators=self.n_trees,\n max_features=\"sqrt\",\n random_state=self.random_state,\n n_jobs=-1,\n )\n\n def _fit_global_models(self, Xtr, ytr):\n from sklearn.linear_model import BayesianRidge\n\n return BayesianRidge().fit(Xtr, ytr), self._make_tree().fit(Xtr, ytr)\n\n def _global_pred(self, models, Xte, j):\n pr = models[0].predict(Xte)\n pt = models[1].predict(Xte)\n return (1.0 - self.tree_weights_[j]) * pr + self.tree_weights_[j] * pt\n\n def _knn(self, Xtr, ytr, Xte):\n from sklearn.neighbors import NearestNeighbors\n\n if Xte.shape[0] == 0:\n return np.zeros(0), np.zeros(0)\n if Xtr is None or len(ytr) == 0:\n return np.full(Xte.shape[0], np.mean(ytr) if len(ytr) else 0.0), np.zeros(Xte.shape[0])\n k = min(15, max(3, int(np.sqrt(len(ytr)))), len(ytr))\n nn = NearestNeighbors(n_neighbors=k)\n nn.fit(Xtr)\n d, idx = nn.kneighbors(Xte)\n w = 1.0 / (d + 0.1)\n w /= np.maximum(w.sum(axis=1, keepdims=True), 1e-12)\n return (w * ytr[idx]).sum(axis=1), np.exp(-np.median(d, axis=1))\n\n def _calibrate(self, X):\n rng = np.random.RandomState(self.random_state)\n obs = ~np.isnan(X)\n cal = np.zeros_like(obs, dtype=bool)\n for j in range(X.shape[1]):\n idx = np.where(obs[:, j])[0]\n m = min(20, max(1, len(idx) // 8), max(0, len(idx) - 5))\n if m > 0:\n cal[rng.choice(idx, size=m, replace=False), j] = True\n if not cal.any():\n base = np.full(X.shape[1], 0.5)\n return base, base\n Xc = X.copy()\n Xc[cal] = np.nan\n Xs = self._fill(Xc)\n corr = np.nan_to_num(np.corrcoef(Xs, rowvar=False))\n np.fill_diagonal(corr, 1.0)\n local_w = np.full(X.shape[1], 0.5)\n tree_w = np.full(X.shape[1], 0.5)\n for j in range(X.shape[1]):\n test = cal[:, j]\n obsj = ~np.isnan(Xc[:, j])\n if test.sum() == 0 or obsj.sum() < 8:\n continue\n pred = self._pred_set(corr, j)\n models = self._fit_global_models(Xs[obsj][:, pred], X[obsj, j])\n pr = models[0].predict(Xs[test][:, pred])\n pt = models[1].predict(Xs[test][:, pred])\n er = np.sqrt(np.mean((pr - X[test, j]) ** 2)) + 1e-6\n et = np.sqrt(np.mean((pt - X[test, j]) ** 2)) + 1e-6\n tw = (1.0 / et) / ((1.0 / er) + (1.0 / et))\n g = (1.0 - tw) * pr + tw * pt\n l, _ = self._knn(Xs[obsj][:, pred], X[obsj, j], Xs[test][:, pred])\n eg = np.sqrt(np.mean((g - X[test, j]) ** 2)) + 1e-6\n el = np.sqrt(np.mean((l - X[test, j]) ** 2)) + 1e-6\n tree_w[j] = tw\n local_w[j] = (1.0 / el) / ((1.0 / el) + (1.0 / eg))\n return np.clip(0.1 + 0.8 * local_w, 0.1, 0.9), 
np.clip(0.1 + 0.8 * tree_w, 0.1, 0.9)\n\n def _fit_impute(self, X):\n miss = np.isnan(X)\n Ximp = self._fill(X)\n order = np.argsort(miss.mean(axis=0))\n for _ in range(self.max_iter):\n prev = Ximp.copy()\n corr = np.nan_to_num(np.corrcoef(Ximp, rowvar=False))\n np.fill_diagonal(corr, 1.0)\n for j in order:\n mis, obs = miss[:, j], ~miss[:, j]\n if not mis.any() or obs.sum() < 8:\n continue\n pred = self._pred_set(corr, j)\n models = self._fit_global_models(Ximp[obs][:, pred], X[obs, j])\n g = self._global_pred(models, Ximp[mis][:, pred], j)\n l, closeness = self._knn(Ximp[obs][:, pred], X[obs, j], Ximp[mis][:, pred])\n w = np.clip(self.local_weights_[j] * (0.25 + 0.75 * closeness), 0.02, 0.98)\n vals = np.clip((1.0 - w) * g + w * l, self.lo_[j], self.hi_[j])\n Ximp[mis, j] = (1.0 - self.damping) * Ximp[mis, j] + self.damping * vals\n if miss.any() and np.sqrt(np.mean((Ximp[miss] - prev[miss]) ** 2)) < 1e-3:\n break\n return Ximp\n\n def _fit_models(self, X):\n corr = np.nan_to_num(np.corrcoef(self.X_train_imputed_, rowvar=False))\n np.fill_diagonal(corr, 1.0)\n self.predictors_, self.models_, self.donor_X_, self.donor_y_ = [], [], [], []\n for j in range(X.shape[1]):\n pred = self._pred_set(corr, j)\n obs = ~np.isnan(X[:, j])\n self.predictors_.append(pred)\n if obs.sum() < 2:\n self.models_.append(None)\n self.donor_X_.append(None)\n self.donor_y_.append(np.array([]))\n continue\n self.models_.append(self._fit_global_models(self.X_train_imputed_[obs][:, pred], X[obs, j]))\n self.donor_X_.append(self.X_train_imputed_[obs][:, pred])\n self.donor_y_.append(X[obs, j])\n\n def fit(self, X, y=None):\n X = np.asarray(X, dtype=float)\n self.statistics_ = np.nanmean(X, axis=0)\n self.statistics_ = np.where(np.isnan(self.statistics_), 0.0, self.statistics_)\n self.lo_ = np.nanpercentile(X, 0.5, axis=0)\n self.hi_ = np.nanpercentile(X, 99.5, axis=0)\n self.local_weights_, self.tree_weights_ = self._calibrate(X)\n self.X_train_imputed_ = self._fit_impute(X)\n self._fit_models(X)\n return self\n\n def transform(self, X):\n X = np.asarray(X, dtype=float)\n miss = np.isnan(X)\n Ximp = self._fill(X)\n for _ in range(self.max_iter):\n prev = Ximp.copy()\n for j, pred in enumerate(self.predictors_):\n mis = miss[:, j]\n if not mis.any() or self.models_[j] is None:\n continue\n g = self._global_pred(self.models_[j], Ximp[mis][:, pred], j)\n l, closeness = self._knn(self.donor_X_[j], self.donor_y_[j], Ximp[mis][:, pred])\n w = np.clip(self.local_weights_[j] * (0.25 + 0.75 * closeness), 0.02, 0.98)\n vals = np.clip((1.0 - w) * g + w * l, self.lo_[j], self.hi_[j])\n Ximp[mis, j] = (1.0 - self.damping) * Ximp[mis, j] + self.damping * vals\n if miss.any() and np.sqrt(np.mean((Ximp[miss] - prev[miss]) ** 2)) < 1e-3:\n break\n return Ximp\n\n def fit_transform(self, X, y=None):\n return self.fit(X, y).X_train_imputed_.copy()\n\n\n# Helper functions for the custom imputer (optional, agent may add more)\ndef compute_feature_correlations(X):\n n_features = X.shape[1]\n corr = np.eye(n_features)\n for i in range(n_features):\n for j in range(i + 1, n_features):\n mask = ~(np.isnan(X[:, i]) | np.isnan(X[:, j]))\n if mask.sum() > 2:\n c = np.corrcoef(X[mask, i], X[mask, j])[0, 1]\n corr[i, j] = corr[j, i] = c if not np.isnan(c) else 0.0\n return corr\n",
5 "after_line": 40,
6 "start_line": 41,
7 "end_line": 203
8}
OK: Replaced lines 41..203 with 190 line(s) in scikit-learn/custom_imputation.py. Editable range: 36–230.
[Current file: scikit-learn/custom_imputation.py | editable: 36–230 | total: 360 lines]
36 # ================================================================
37 # EDITABLE -- agent modifies this section (lines 36 to 142)
38 # ================================================================
... (189 more lines) ...
228 c = np.corrcoef(X[mask, i], X[mask, j])[0, 1]
229 corr[i, j] = corr[j, i] = c if not np.isnan(c) else 0.0
230 return corr
[Test #2] (1 test remaining; call submit(n=N) to choose which test result to submit as final)
### breast_cancer (scripts/breast_cancer.sh)
Training metrics (last lines):
TRAIN_METRICS stage=fitting
TRAIN_METRICS stage=done
TRAIN_METRICS rmse=0.477058
TRAIN_METRICS downstream_score=0.947275
TRAIN_METRICS baseline_no_missing=0.949076
Test results (breast_cancer):
rmse: 0.477058
downstream_score: 0.947275
### wine (scripts/wine.sh)
Training metrics (last lines):
TRAIN_METRICS stage=fitting
TRAIN_METRICS stage=done
TRAIN_METRICS rmse=0.823064
TRAIN_METRICS downstream_score=0.910476
TRAIN_METRICS baseline_no_missing=0.921429
Test results (wine):
rmse: 0.823064
downstream_score: 0.910476
[Leaderboard] Results saved: {'rmse_breast_cancer': 0.477058, 'downstream_score_breast_cancer': 0.947275, 'elapsed_breast_cancer': 47, 'rmse_wine': 0.823064, 'downstream_score_wine': 0.910476, 'elapsed_wine': 47, 'rmse_california': 0.759335, 'downstream_score_california': 0.657425, 'elapsed_califo
Tool Input
1{
2 "n": 1
3}
Undo complete: Restored: scikit-learn/custom_imputation.py
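The third attempt (next tool call) reverts to the BayesianRidge/kNN hybrid of attempt 1 but widens the predictor set to 12, tightens the clipping percentiles to 0.5/99.5, and additionally scales the local weight by the regression's in-sample R^2, so confident linear fits pull the blend toward the global prediction. The effective weight rule, extracted for readability:

import numpy as np

def effective_local_weight(w_feature, closeness, conf):
    # w_feature: calibrated local/global weight for this feature
    # closeness: exp(-median kNN distance), one value per missing row
    # conf:      BayesianRidge in-sample R^2, clipped to [0, 1]
    return np.clip(w_feature * (0.25 + 0.75 * closeness) * (1.0 - 0.5 * conf),
                   0.05, 0.95)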
Tool Input
1{
2 "op": "replace",
3 "filename": "scikit-learn/custom_imputation.py",
4 "content": "class CustomImputer(BaseEstimator, TransformerMixin):\n \"\"\"Self-calibrated hybrid imputer.\n\n Combines a correlation-restricted Bayesian regression update (global structure)\n with distance-weighted donor averaging (local structure). The local/global\n blend is learned per feature via pseudo-masking of observed entries and then\n modulated by regression confidence and neighbor closeness.\n \"\"\"\n\n def __init__(self, random_state=42, max_iter=10):\n self.random_state = random_state\n self.max_iter = max_iter\n self.max_predictors = 12\n self.damping = 0.7\n\n def _fill(self, X):\n Xf = np.asarray(X, dtype=float).copy()\n mask = np.isnan(Xf)\n if mask.any():\n Xf[mask] = np.take(self.statistics_, np.where(mask)[1])\n return Xf\n\n def _pred_set(self, corr, j):\n idx = np.delete(np.arange(corr.shape[0]), j)\n k = min(max(1, int(np.sqrt(corr.shape[0]) * 2)), self.max_predictors, len(idx))\n return idx[np.argsort(-np.abs(corr[j, idx]))[:k]]\n\n def _knn(self, Xtr, ytr, Xte):\n from sklearn.neighbors import NearestNeighbors\n\n if Xte.shape[0] == 0:\n return np.zeros(0), np.zeros(0)\n if Xtr is None or len(ytr) == 0:\n return np.full(Xte.shape[0], np.mean(ytr) if len(ytr) else 0.0), np.zeros(Xte.shape[0])\n k = min(15, max(3, int(np.sqrt(len(ytr)))), len(ytr))\n nn = NearestNeighbors(n_neighbors=k)\n nn.fit(Xtr)\n d, idx = nn.kneighbors(Xte)\n w = 1.0 / (d + 0.1)\n w /= np.maximum(w.sum(axis=1, keepdims=True), 1e-12)\n return (w * ytr[idx]).sum(axis=1), np.exp(-np.median(d, axis=1))\n\n def _calibrate(self, X):\n from sklearn.linear_model import BayesianRidge\n\n rng = np.random.RandomState(self.random_state)\n obs = ~np.isnan(X)\n cal = np.zeros_like(obs, dtype=bool)\n for j in range(X.shape[1]):\n idx = np.where(obs[:, j])[0]\n m = min(20, max(1, len(idx) // 8), max(0, len(idx) - 5))\n if m > 0:\n cal[rng.choice(idx, size=m, replace=False), j] = True\n if not cal.any():\n return np.full(X.shape[1], 0.5)\n Xc = X.copy()\n Xc[cal] = np.nan\n Xs = self._fill(Xc)\n corr = np.nan_to_num(np.corrcoef(Xs, rowvar=False))\n np.fill_diagonal(corr, 1.0)\n weights = np.full(X.shape[1], 0.5)\n for j in range(X.shape[1]):\n test = cal[:, j]\n obsj = ~np.isnan(Xc[:, j])\n if test.sum() == 0 or obsj.sum() < 5:\n continue\n pred = self._pred_set(corr, j)\n model = BayesianRidge().fit(Xs[obsj][:, pred], X[obsj, j])\n g = model.predict(Xs[test][:, pred])\n l, _ = self._knn(Xs[obsj][:, pred], X[obsj, j], Xs[test][:, pred])\n eg = np.sqrt(np.mean((g - X[test, j]) ** 2)) + 1e-6\n el = np.sqrt(np.mean((l - X[test, j]) ** 2)) + 1e-6\n weights[j] = (1.0 / el) / ((1.0 / el) + (1.0 / eg))\n return np.clip(0.15 + 0.7 * weights, 0.15, 0.85)\n\n def _fit_impute(self, X):\n from sklearn.linear_model import BayesianRidge\n\n miss = np.isnan(X)\n Ximp = self._fill(X)\n order = np.argsort(miss.mean(axis=0))\n for _ in range(self.max_iter):\n prev = Ximp.copy()\n corr = np.nan_to_num(np.corrcoef(Ximp, rowvar=False))\n np.fill_diagonal(corr, 1.0)\n for j in order:\n mis, obs = miss[:, j], ~miss[:, j]\n if not mis.any() or obs.sum() < 5:\n continue\n pred = self._pred_set(corr, j)\n model = BayesianRidge().fit(Ximp[obs][:, pred], X[obs, j])\n conf = np.clip(model.score(Ximp[obs][:, pred], X[obs, j]), 0.0, 1.0)\n g = model.predict(Ximp[mis][:, pred])\n l, closeness = self._knn(Ximp[obs][:, pred], X[obs, j], Ximp[mis][:, pred])\n w = np.clip(self.weights_[j] * (0.25 + 0.75 * closeness) * (1.0 - 0.5 * conf), 0.05, 0.95)\n vals = np.clip((1.0 - w) * g + w * l, self.lo_[j], self.hi_[j])\n Ximp[mis, j] = (1.0 - 
self.damping) * Ximp[mis, j] + self.damping * vals\n if miss.any() and np.sqrt(np.mean((Ximp[miss] - prev[miss]) ** 2)) < 1e-3:\n break\n return Ximp\n\n def _fit_models(self, X):\n from sklearn.linear_model import BayesianRidge\n\n corr = np.nan_to_num(np.corrcoef(self.X_train_imputed_, rowvar=False))\n np.fill_diagonal(corr, 1.0)\n self.predictors_, self.models_, self.donor_X_, self.donor_y_, self.model_conf_ = [], [], [], [], []\n for j in range(X.shape[1]):\n pred = self._pred_set(corr, j)\n obs = ~np.isnan(X[:, j])\n self.predictors_.append(pred)\n if obs.sum() < 2:\n self.models_.append(None)\n self.donor_X_.append(None)\n self.donor_y_.append(np.array([]))\n self.model_conf_.append(0.0)\n continue\n model = BayesianRidge().fit(self.X_train_imputed_[obs][:, pred], X[obs, j])\n self.models_.append(model)\n self.donor_X_.append(self.X_train_imputed_[obs][:, pred])\n self.donor_y_.append(X[obs, j])\n self.model_conf_.append(float(np.clip(model.score(self.X_train_imputed_[obs][:, pred], X[obs, j]), 0.0, 1.0)))\n\n def fit(self, X, y=None):\n X = np.asarray(X, dtype=float)\n self.statistics_ = np.nanmean(X, axis=0)\n self.statistics_ = np.where(np.isnan(self.statistics_), 0.0, self.statistics_)\n self.lo_ = np.nanpercentile(X, 0.5, axis=0)\n self.hi_ = np.nanpercentile(X, 99.5, axis=0)\n self.weights_ = self._calibrate(X)\n self.X_train_imputed_ = self._fit_impute(X)\n self._fit_models(X)\n return self\n\n def transform(self, X):\n X = np.asarray(X, dtype=float)\n miss = np.isnan(X)\n Ximp = self._fill(X)\n for _ in range(self.max_iter):\n prev = Ximp.copy()\n for j, pred in enumerate(self.predictors_):\n mis = miss[:, j]\n if not mis.any() or self.models_[j] is None:\n continue\n g = self.models_[j].predict(Ximp[mis][:, pred])\n l, closeness = self._knn(self.donor_X_[j], self.donor_y_[j], Ximp[mis][:, pred])\n w = np.clip(self.weights_[j] * (0.25 + 0.75 * closeness) * (1.0 - 0.5 * self.model_conf_[j]), 0.05, 0.95)\n vals = np.clip((1.0 - w) * g + w * l, self.lo_[j], self.hi_[j])\n Ximp[mis, j] = (1.0 - self.damping) * Ximp[mis, j] + self.damping * vals\n if miss.any() and np.sqrt(np.mean((Ximp[miss] - prev[miss]) ** 2)) < 1e-3:\n break\n return Ximp\n\n def fit_transform(self, X, y=None):\n return self.fit(X, y).X_train_imputed_.copy()\n\n\n# Helper functions for the custom imputer (optional, agent may add more)\ndef compute_feature_correlations(X):\n n_features = X.shape[1]\n corr = np.eye(n_features)\n for i in range(n_features):\n for j in range(i + 1, n_features):\n mask = ~(np.isnan(X[:, i]) | np.isnan(X[:, j]))\n if mask.sum() > 2:\n c = np.corrcoef(X[mask, i], X[mask, j])[0, 1]\n corr[i, j] = corr[j, i] = c if not np.isnan(c) else 0.0\n return corr\n",
5 "after_line": 40,
6 "start_line": 41,
7 "end_line": 203
8}
OK: Replaced lines 41..203 with 168 line(s) in scikit-learn/custom_imputation.py. Editable range: 36–208.
[Current file: scikit-learn/custom_imputation.py | editable: 36–208 | total: 338 lines]
36 # ================================================================
37 # EDITABLE -- agent modifies this section (lines 36 to 142)
38 # ================================================================
... (167 more lines) ...
206 c = np.corrcoef(X[mask, i], X[mask, j])[0, 1]
207 corr[i, j] = corr[j, i] = c if not np.isnan(c) else 0.0
208 return corr
[Test #3] (0 tests remaining; call submit(n=N) to choose which test result to submit as final)
[NOTE] This was your last test. You MUST now call submit(n=X) to choose which test result to submit as your final answer.
### breast_cancer (scripts/breast_cancer.sh)
Training metrics (last lines):
TRAIN_METRICS stage=fitting
TRAIN_METRICS stage=done
TRAIN_METRICS rmse=0.451992
TRAIN_METRICS downstream_score=0.952554
TRAIN_METRICS baseline_no_missing=0.949076
Test results (breast_cancer):
rmse: 0.451992
downstream_score: 0.952554
### wine (scripts/wine.sh)
Training metrics (last lines):
TRAIN_METRICS stage=fitting
TRAIN_METRICS stage=done
TRAIN_METRICS rmse=0.822901
TRAIN_METRICS downstream_score=0.938413
TRAIN_METRICS baseline_no_missing=0.921429
Test results (wine):
rmse: 0.822901
downstream_score: 0.938413
### california (scripts/california.sh)
Training metrics (last lines):
TRAIN_METRICS stage=fitting
TRAIN_METRICS stage=done
TRAIN_METRICS rmse=0.780218
TRAIN_METRICS downstrea
Tool Input
1{
2 "n": 3
3}
[submit] Submitting result from test #3 as final.
### breast_cancer (scripts/breast_cancer.sh)
Training metrics (last lines):
TRAIN_METRICS stage=fitting
TRAIN_METRICS stage=done
TRAIN_METRICS rmse=0.451992
TRAIN_METRICS downstream_score=0.952554
TRAIN_METRICS baseline_no_missing=0.949076
Test results (breast_cancer):
rmse: 0.451992
downstream_score: 0.952554
### wine (scripts/wine.sh)
Training metrics (last lines):
TRAIN_METRICS stage=fitting
TRAIN_METRICS stage=done
TRAIN_METRICS rmse=0.822901
TRAIN_METRICS downstream_score=0.938413
TRAIN_METRICS baseline_no_missing=0.921429
Test results (wine):
rmse: 0.822901
downstream_score: 0.938413
### california (scripts/california.sh)
Training metrics (last lines):
TRAIN_METRICS stage=fitting
TRAIN_METRICS stage=done
TRAIN_METRICS rmse=0.780218
TRAIN_METRICS downstream_score=0.656164
TRAIN_METRICS baseline_no_missing=0.762435
Test results (california):
rmse: 0.780218
downstream_score: 0.656164
[submit] Finalized test #3 as fina