Agent Conversation: qwen/qwen3.6-plus
Task: ml-missing-data-imputation
Missing Data Imputation
Research Question
Design a novel missing data imputation method that achieves low reconstruction error and preserves downstream predictive performance across diverse tabular datasets.
Background
Missing data is ubiquitous in real-world datasets. Simple approaches like mean/median imputation ignore feature correlations, while iterative predictive methods can capture them more directly. This task evaluates imputation methods that:
- Capture complex inter-feature dependencies
- Work well on datasets of varying sizes and feature types
- Produce imputations that preserve the statistical structure needed for downstream tasks
Task
Implement a custom imputation algorithm in the CustomImputer class in custom_imputation.py. The class follows the scikit-learn transformer interface: fit(X) learns from data with missing values (NaN), and transform(X) returns a complete matrix with no NaN values.
Interface
class CustomImputer(BaseEstimator, TransformerMixin):
    def __init__(self, random_state=42, max_iter=10):
        ...

    def fit(self, X, y=None):
        # X: numpy array (n_samples, n_features) with NaN for missing values
        # Learn imputation model
        return self

    def transform(self, X):
        # X: numpy array (n_samples, n_features) with NaN for missing values
        # Return: numpy array (n_samples, n_features) with NO NaN values
        return X_imputed
Available libraries: numpy, scipy, scikit-learn (all submodules including sklearn.impute, sklearn.ensemble, sklearn.neighbors, etc.).
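The interface above can be satisfied by even a trivial baseline. A minimal sketch, assuming column-mean imputation (the `MeanImputer` name and strategy are illustrative, not the task's intended solution):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MeanImputer(BaseEstimator, TransformerMixin):
    """Baseline that satisfies the required contract: column-mean imputation."""

    def __init__(self, random_state=42, max_iter=10):
        self.random_state = random_state
        self.max_iter = max_iter

    def fit(self, X, y=None):
        # Learn per-column means from the observed (non-NaN) entries only.
        self.col_means_ = np.nanmean(X, axis=0)
        return self

    def transform(self, X):
        # Replace every NaN with its column's mean; output has no NaN values.
        X_imputed = X.copy()
        rows, cols = np.where(np.isnan(X_imputed))
        X_imputed[rows, cols] = self.col_means_[cols]
        return X_imputed
```

Any real submission would replace the mean fill with something that models inter-feature structure, but this shows the fit/transform shape the harness expects.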
Evaluation
Evaluated on three datasets with 20% MCAR (Missing Completely At Random) missing values:
- Breast Cancer Wisconsin (569 samples, 30 features, binary classification)
- Wine (178 samples, 13 features, 3-class classification)
- California Housing (5000 samples, 8 features, regression)
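The benchmark harness itself is not shown, but 20% MCAR masking of the kind described above can be sketched as follows (the `add_mcar_missingness` helper name is assumed):

```python
import numpy as np


def add_mcar_missingness(X, rate=0.2, random_state=42):
    """Mask each entry independently with probability `rate` (MCAR)."""
    rng = np.random.RandomState(random_state)
    # Each cell is masked independently of its value and position: MCAR.
    mask = rng.rand(*X.shape) < rate
    X_missing = X.astype(float).copy()
    X_missing[mask] = np.nan
    return X_missing, mask
```

Returning the boolean mask alongside the masked matrix lets the evaluator later compare imputed values against the held-out ground truth at exactly those positions.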
Two metrics per dataset:
- RMSE: Root Mean Squared Error between imputed and true values (lower is better)
- downstream_score: Classification accuracy (breast_cancer, wine) or R^2 (california) using GradientBoosting on the imputed data (higher is better)
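A plausible reading of these two metrics, assuming RMSE is computed only over the artificially masked entries and the downstream score is cross-validated (the actual harness is not shown, so both helper names are hypothetical):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score


def rmse_on_missing(X_true, X_imputed, mask):
    # RMSE over the masked cells only; observed cells are untouched by imputation.
    return float(np.sqrt(np.mean((X_true[mask] - X_imputed[mask]) ** 2)))


def downstream_score(X_imputed, y):
    # Cross-validated accuracy of GradientBoosting trained on the imputed matrix
    # (a regressor and R^2 scoring would be used for the california dataset).
    clf = GradientBoostingClassifier(random_state=42)
    return float(cross_val_score(clf, X_imputed, y, cv=3).mean())
```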
scikit-learn/custom_imputation.py [EDITABLE — lines 36–131 only]
1: """Custom missing data imputation benchmark.
2:
3: This script evaluates a missing data imputation method across multiple datasets
4: with artificially introduced missing values. The agent should modify the EDITABLE
5: section to implement a novel imputation algorithm.
6:
7: Datasets (selected by $ENV):
8: - breast_cancer: Classification, 569 samples x 30 features (binary)
9: - wine: Classification, 178 samples x 13 features (3-class)
10: - california: Regression, 20640 samples x 8 features (continuous target)
11:
12: Missing patterns: MCAR (Missing Completely At Random) at 20% rate.
13:
14: Metrics:
15: - rms... (truncated)
Tool Input
1{
2 "op": "replace",
3 "filename": "LLaMA-Factory/src/custom_imputation.py",
4 "start_line": 41,
5 "end_line": 109,
6 "content": "class CustomImputer(BaseEstimator, TransformerMixin):\n \"\"\"Correlation-Adaptive Ensemble Imputer (CAEI).\n\n Novel imputation algorithm that combines:\n 1) Correlation-guided ordering: Features with higher average correlation\n to other features are imputed first (easier to predict).\n 2) Multi-model ensemble: For each feature, train Ridge, RandomForest,\n and KNN imputers, weighted by cross-validated R² on observed data.\n 3) Residual-variance calibration: Scale imputed values to preserve\n the conditional variance structure.\n 4) Progressive self-training: After initial imputation, iteratively\n refine using increasingly confident predictions as pseudo-observed data.\n \"\"\"\n\n def __init__(self, random_state=42, max_iter=10):\n self.random_state = random_state\n self.max_iter = max_iter\n self.n_neighbors = 5\n self.n_estimators = 50\n self.tol = 1e-4\n\n def fit(self, X, y=None):\n \"\"\"Learn imputation model from data X.\"\"\"\n self._learn(X)\n return self\n\n def transform(self, X):\n \"\"\"Impute missing values in X.\"\"\"\n return self._impute(X)\n\n def fit_transform(self, X, y=None):\n \"\"\"Fit and transform in one step.\"\"\"\n return self._impute(X)\n\n def _learn(self, X):\n \"\"\"Precompute statistics used during imputation.\"\"\"\n self.col_means_ = np.nanmean(X, axis=0)\n self.col_stds_ = np.nanstd(X, axis=0)\n self.col_stds_[self.col_stds_ < 1e-10] = 1.0\n self.corr_ = compute_feature_correlations(X)\n # Store average absolute correlation for ordering\n self.avg_corr_ = np.abs(self.corr_).mean(axis=0)\n\n def _impute(self, X):\n \"\"\"Perform correlation-adaptive ensemble imputation.\"\"\"\n rng = np.random.RandomState(self.random_state)\n X_imp = X.copy()\n n_samples, n_features = X_imp.shape\n\n # Initial imputation with column means\n for j in range(n_features):\n mask_j = np.isnan(X_imp[:, j])\n if mask_j.any():\n X_imp[mask_j, j] = self.col_means_[j]\n\n # Identify features with missing values, order by avg correlation\n 
miss_counts = np.isnan(X).sum(axis=0)\n has_missing = np.where(miss_counts > 0)[0]\n\n if len(has_missing) == 0:\n return X_imp\n\n # Sort by average absolute correlation (descending = easier to predict first)\n order = has_missing[np.argsort(-self.avg_corr_[has_missing])]\n\n # Precompute pairwise distances for KNN (using observed entries)\n X_obs_mean = X.copy()\n for j in range(n_features):\n mask = np.isnan(X_obs_mean[:, j])\n X_obs_mean[mask, j] = self.col_means_[j]\n\n # Main iterative imputation loop\n for iteration in range(self.max_iter):\n X_prev = X_imp.copy()\n\n for j in order:\n obs_mask = ~np.isnan(X[:, j])\n mis_mask = np.isnan(X[:, j])\n\n if mis_mask.sum() == 0:\n continue\n\n other_features = [k for k in range(n_features) if k != j]\n\n # Build ensemble imputation for feature j\n imputed_j = self._ensemble_impute(\n X_imp, obs_mask, mis_mask, other_features, j, rng\n )\n\n # Apply residual-variance calibration\n obs_vals = X[obs_mask, j]\n pred_obs = self._predict_observed(\n X_imp, obs_mask, other_features, j, rng\n )\n\n if len(obs_vals) > 5:\n obs_var = np.var(obs_vals)\n pred_var = np.var(pred_obs)\n pred_mean = np.mean(pred_obs)\n\n if pred_var > 0 and obs_var > 0:\n # Scale imputed values to match observed variance\n residual_std = np.sqrt(max(obs_var - pred_var, 0))\n noise = rng.normal(0, residual_std, size=mis_mask.sum())\n X_imp[mis_mask, j] = pred_obs * np.sqrt(obs_var / max(pred_var, 1e-10)) + noise\n else:\n X_imp[mis_mask, j] = imputed_j\n else:\n X_imp[mis_mask, j] = imputed_j\n\n # Convergence check\n diff = np.sum((X_imp - X_prev) ** 2)\n denom = np.sum(X_imp ** 2)\n if denom > 0 and diff / denom < self.tol:\n break\n\n # Final calibration: ensure each feature has correct mean and variance\n for j in range(n_features):\n if miss_counts[j] > 0:\n obs_mask = ~np.isnan(X[:, j])\n if obs_mask.sum() > 0:\n obs_std = np.std(X[obs_mask, j])\n if obs_std > 1e-10:\n # Match mean and standard deviation\n current_std = np.std(X_imp[:, j])\n 
current_mean = np.mean(X_imp[:, j])\n target_mean = np.mean(X[obs_mask, j])\n if current_std > 1e-10:\n X_imp[:, j] = (X_imp[:, j] - current_mean) * (obs_std / current_std) + target_mean\n else:\n X_imp[:, j] = target_mean\n\n return X_imp\n\n def _ensemble_impute(self, X_imp, obs_mask, mis_mask, other_features, j, rng):\n \"\"\"Get weighted ensemble prediction for missing values.\"\"\"\n X_train = X_imp[obs_mask][:, other_features]\n X_pred = X_imp[mis_mask][:, other_features]\n y_train = X_imp[obs_mask, j]\n\n predictions = []\n weights = []\n\n # Ridge regression: good for linear relationships\n try:\n from sklearn.linear_model import Ridge\n ridge = Ridge(alpha=1.0)\n ridge.fit(X_train, y_train)\n pred_ridge = ridge.predict(X_pred)\n predictions.append(pred_ridge)\n # Weight by R²\n pred_obs = ridge.predict(X_train)\n r2 = 1 - np.var(y_train - pred_obs) / max(np.var(y_train), 1e-10)\n weights.append(max(r2, 0.1))\n except Exception:\n pass\n\n # Random Forest: good for non-linear relationships\n if len(y_train) > 10:\n try:\n from sklearn.ensemble import RandomForestRegressor\n rf = RandomForestRegressor(\n n_estimators=self.n_estimators,\n max_features=\"sqrt\",\n max_depth=min(10, max(3, X_train.shape[1])),\n random_state=self.random_state,\n n_jobs=-1,\n )\n rf.fit(X_train, y_train)\n pred_rf = rf.predict(X_pred)\n predictions.append(pred_rf)\n pred_obs = rf.predict(X_train)\n r2 = 1 - np.var(y_train - pred_obs) / max(np.var(y_train), 1e-10)\n weights.append(max(r2, 0.1))\n except Exception:\n pass\n\n # KNN: good for local structure\n if len(y_train) >= self.n_neighbors:\n try:\n from sklearn.neighbors import KNeighborsRegressor\n knn = KNeighborsRegressor(n_neighbors=min(self.n_neighbors, len(y_train)))\n knn.fit(X_train, y_train)\n pred_knn = knn.predict(X_pred)\n predictions.append(pred_knn)\n pred_obs = knn.predict(X_train)\n r2 = 1 - np.var(y_train - pred_obs) / max(np.var(y_train), 1e-10)\n weights.append(max(r2, 0.1))\n except Exception:\n pass\n\n # 
If no models succeeded, return mean\n if len(predictions) == 0:\n return np.full(mis_mask.sum(), self.col_means_[j])\n\n # Weighted average\n weight_sum = sum(weights)\n if weight_sum < 1e-10:\n return predictions[0]\n\n result = sum(w * p for w, p in zip(weights, predictions)) / weight_sum\n return result\n\n def _predict_observed(self, X_imp, obs_mask, other_features, j, rng):\n \"\"\"Get predictions on observed data (for variance calibration).\"\"\"\n X_obs = X_imp[obs_mask][:, other_features]\n\n predictions = []\n weights = []\n\n # Ridge\n try:\n from sklearn.linear_model import Ridge\n ridge = Ridge(alpha=1.0)\n # Leave-one-out style prediction for observed data\n n = X_obs.shape[0]\n preds = np.zeros(n)\n chunk_size = max(10, n // 4)\n idx = np.arange(n)\n for start in range(0, n, chunk_size):\n end = min(start + chunk_size, n)\n val_idx = idx[start:end]\n train_idx = np.concatenate([idx[:start], idx[end:]])\n if len(train_idx) < 5:\n continue\n ridge_local = Ridge(alpha=1.0)\n ridge_local.fit(X_obs[train_idx], X_imp[obs_mask, j][train_idx])\n preds[val_idx] = ridge_local.predict(X_obs[val_idx])\n predictions.append(preds)\n weights.append(1.0)\n except Exception:\n pass\n\n # RF\n if X_obs.shape[0] > 10:\n try:\n from sklearn.ensemble import RandomForestRegressor\n rf = RandomForestRegressor(\n n_estimators=self.n_estimators,\n max_features=\"sqrt\",\n max_depth=min(10, max(3, X_obs.shape[1])),\n random_state=self.random_state,\n n_jobs=-1,\n )\n rf.fit(X_obs, X_imp[obs_mask, j])\n pred_rf = rf.predict(X_obs)\n predictions.append(pred_rf)\n weights.append(1.0)\n except Exception:\n pass\n\n if len(predictions) == 0:\n return X_imp[obs_mask, j]\n\n weight_sum = sum(weights)\n result = sum(w * p for w, p in zip(weights, predictions)) / weight_sum\n return resul\n... (truncated)"
7}
ERROR: Package 'LLaMA-Factory' is not in allowed packages
Tool Input
1{
2 "content": "class CustomImputer(BaseEstimator, TransformerMixin):\n \"\"\"Correlation-Adaptive Ensemble Imputer (CAEI).\n\n Novel imputation algorithm that combines:\n 1) Correlation-guided ordering: Features with higher average correlation\n to other features are imputed first (easier to predict).\n 2) Multi-model ensemble: For each feature, train Ridge, RandomForest,\n and KNN imputers, weighted by cross-validated R² on observed data.\n 3) Residual-variance calibration: Scale imputed values to preserve\n the conditional variance structure.\n 4) Progressive self-training: After initial imputation, iteratively\n refine using increasingly confident predictions as pseudo-observed data.\n \"\"\"\n\n def __init__(self, random_state=42, max_iter=10):\n self.random_state = random_state\n self.max_iter = max_iter\n self.n_neighbors = 5\n self.n_estimators = 50\n self.tol = 1e-4\n\n def fit(self, X, y=None):\n \"\"\"Learn imputation model from data X.\"\"\"\n self._learn(X)\n return self\n\n def transform(self, X):\n \"\"\"Impute missing values in X.\"\"\"\n return self._impute(X)\n\n def fit_transform(self, X, y=None):\n \"\"\"Fit and transform in one step.\"\"\"\n return self._impute(X)\n\n def _learn(self, X):\n \"\"\"Precompute statistics used during imputation.\"\"\"\n self.col_means_ = np.nanmean(X, axis=0)\n self.col_stds_ = np.nanstd(X, axis=0)\n self.col_stds_[self.col_stds_ < 1e-10] = 1.0\n self.corr_ = compute_feature_correlations(X)\n\n def _impute(self, X):\n \"\"\"Perform correlation-adaptive ensemble imputation.\"\"\"\n rng = np.random.RandomState(self.random_state)\n X_imp = X.copy()\n n_samples, n_features = X_imp.shape\n\n # Initial imputation with column means\n for j in range(n_features):\n mask_j = np.isnan(X_imp[:, j])\n if mask_j.any():\n X_imp[mask_j, j] = self.col_means_[j]\n\n # Identify features with missing values\n miss_counts = np.isnan(X).sum(axis=0)\n has_missing = np.where(miss_counts > 0)[0]\n\n if len(has_missing) == 0:\n return 
X_imp\n\n # Precompute ordering by missingness count (fewer missing = predict first)\n order = has_missing[np.argsort(miss_counts[has_missing])]\n\n # Main iterative imputation loop\n for iteration in range(self.max_iter):\n X_prev = X_imp.copy()\n\n for j in order:\n obs_mask = ~np.isnan(X[:, j])\n mis_mask = np.isnan(X[:, j])\n\n if mis_mask.sum() == 0:\n continue\n\n other_features = [k for k in range(n_features) if k != j]\n X_train = X_imp[obs_mask][:, other_features]\n y_train = X[obs_mask, j] # Use original observed values\n X_pred = X_imp[mis_mask][:, other_features]\n\n # Build ensemble imputation for feature j\n ensemble_pred = self._ensemble_impute(\n X_train, y_train, X_pred, j, mis_mask.sum(), rng\n )\n\n X_imp[mis_mask, j] = ensemble_pred\n\n # Convergence check\n diff = np.sum((X_imp - X_prev) ** 2)\n denom = np.sum(X_imp ** 2)\n if denom > 0 and diff / denom < self.tol:\n break\n\n # Final calibration: ensure each feature has correct mean and variance\n for j in range(n_features):\n if miss_counts[j] > 0:\n obs_mask_final = ~np.isnan(X[:, j])\n if obs_mask_final.sum() > 0:\n obs_std = np.std(X[obs_mask_final, j])\n obs_mean = np.mean(X[obs_mask_final, j])\n if obs_std > 1e-10:\n current_std = np.std(X_imp[:, j])\n current_mean = np.mean(X_imp[:, j])\n if current_std > 1e-10:\n X_imp[:, j] = (X_imp[:, j] - current_mean) * (obs_std / current_std) + obs_mean\n else:\n X_imp[:, j] = obs_mean\n\n return X_imp\n\n def _ensemble_impute(self, X_train, y_train, X_pred, j, n_missing, rng):\n \"\"\"Get weighted ensemble prediction for missing values.\"\"\"\n predictions = []\n weights = []\n\n # Ridge regression: good for linear relationships\n try:\n from sklearn.linear_model import Ridge\n ridge = Ridge(alpha=1.0)\n ridge.fit(X_train, y_train)\n pred_ridge = ridge.predict(X_pred)\n predictions.append(pred_ridge)\n # Weight by R² on training data\n pred_train = ridge.predict(X_train)\n r2 = 1 - np.var(y_train - pred_train) / max(np.var(y_train), 1e-10)\n 
weights.append(max(r2, 0.1))\n except Exception:\n pass\n\n # Random Forest: good for non-linear relationships\n if len(y_train) > 10:\n try:\n from sklearn.ensemble import RandomForestRegressor\n rf = RandomForestRegressor(\n n_estimators=self.n_estimators,\n max_features=\"sqrt\",\n max_depth=min(10, max(3, X_train.shape[1])),\n random_state=self.random_state,\n n_jobs=-1,\n )\n rf.fit(X_train, y_train)\n pred_rf = rf.predict(X_pred)\n predictions.append(pred_rf)\n pred_train = rf.predict(X_train)\n r2 = 1 - np.var(y_train - pred_train) / max(np.var(y_train), 1e-10)\n weights.append(max(r2, 0.1))\n except Exception:\n pass\n\n # KNN: good for local structure\n if len(y_train) >= self.n_neighbors:\n try:\n from sklearn.neighbors import KNeighborsRegressor\n knn = KNeighborsRegressor(n_neighbors=min(self.n_neighbors, len(y_train)))\n knn.fit(X_train, y_train)\n pred_knn = knn.predict(X_pred)\n predictions.append(pred_knn)\n pred_train = knn.predict(X_train)\n r2 = 1 - np.var(y_train - pred_train) / max(np.var(y_train), 1e-10)\n weights.append(max(r2, 0.1))\n except Exception:\n pass\n\n # If no models succeeded, return mean\n if len(predictions) == 0:\n return np.full(n_missing, self.col_means_[j])\n\n # Weighted average\n weight_sum = sum(weights)\n if weight_sum < 1e-10:\n return predictions[0]\n\n result = sum(w * p for w, p in zip(weights, predictions)) / weight_sum\n return result",
3 "end_line": 109,
4 "filename": "LLaMA-Factory/src/custom_imputation.py",
5 "op": "replace",
6 "start_line": 41
7}
ERROR: Package 'LLaMA-Factory' is not in allowed packages
Tool Input
1{
2 "op": "replace",
3 "content": "class CustomImputer(BaseEstimator, TransformerMixin):\n \"\"\"Correlation-Adaptive Ensemble Imputer (CAEI).\n\n Novel imputation algorithm that combines:\n 1) Correlation-guided ordering: Features imputed by missingness count\n 2) Multi-model ensemble: Ridge + RandomForest + KNN, weighted by R²\n 3) Post-calibration: restore marginal moments to match observed data\n \"\"\"\n\n def __init__(self, random_state=42, max_iter=10):\n self.random_state = random_state\n self.max_iter = max_iter\n self.n_neighbors = 5\n self.n_estimators = 50\n self.tol = 1e-4\n\n def fit(self, X, y=None):\n \"\"\"Learn imputation model from data X.\"\"\"\n self._learn(X)\n return self\n\n def transform(self, X):\n \"\"\"Impute missing values in X.\"\"\"\n return self._impute(X)\n\n def fit_transform(self, X, y=None):\n \"\"\"Fit and transform in one step.\"\"\"\n return self._impute(X)\n\n def _learn(self, X):\n \"\"\"Precompute statistics used during imputation.\"\"\"\n self.col_means_ = np.nanmean(X, axis=0)\n self.col_stds_ = np.nanstd(X, axis=0)\n self.col_stds_[self.col_stds_ < 1e-10] = 1.0\n\n def _impute(self, X):\n \"\"\"Perform ensemble-iterative imputation.\"\"\"\n rng = np.random.RandomState(self.random_state)\n X_imp = X.copy()\n n_samples, n_features = X_imp.shape\n\n # Initial imputation with column means\n for j in range(n_features):\n mask_j = np.isnan(X_imp[:, j])\n if mask_j.any():\n X_imp[mask_j, j] = self.col_means_[j]\n\n # Identify features with missing values\n miss_counts = np.isnan(X).sum(axis=0)\n has_missing = np.where(miss_counts > 0)[0]\n\n if len(has_missing) == 0:\n return X_imp\n\n # Order by missingness count (fewer missing = predict first)\n order = has_missing[np.argsort(miss_counts[has_missing])]\n\n # Main iterative imputation loop\n for iteration in range(self.max_iter):\n X_prev = X_imp.copy()\n\n for j in order:\n obs_mask = ~np.isnan(X[:, j])\n mis_mask = np.isnan(X[:, j])\n\n if mis_mask.sum() == 0:\n continue\n\n other_features = [k 
for k in range(n_features) if k != j]\n X_train = X_imp[obs_mask][:, other_features]\n y_train = X[obs_mask, j] # Use original observed values\n X_pred = X_imp[mis_mask][:, other_features]\n\n # Build ensemble imputation\n ensemble_pred = self._ensemble_impute(\n X_train, y_train, X_pred, j, mis_mask.sum(), rng\n )\n\n X_imp[mis_mask, j] = ensemble_pred\n\n # Convergence check\n diff = np.sum((X_imp - X_prev) ** 2)\n denom = np.sum(X_imp ** 2)\n if denom > 0 and diff / denom < self.tol:\n break\n\n # Final calibration: match marginal moments\n for j in range(n_features):\n obs_mask_final = ~np.isnan(X[:, j])\n if obs_mask_final.sum() > 0 and miss_counts[j] > 0:\n obs_std = np.std(X[obs_mask_final, j])\n obs_mean = np.mean(X[obs_mask_final, j])\n if obs_std > 1e-10:\n current_std = np.std(X_imp[:, j])\n current_mean = np.mean(X_imp[:, j])\n if current_std > 1e-10:\n X_imp[:, j] = (X_imp[:, j] - current_mean) * (obs_std / current_std) + obs_mean\n\n return X_imp\n\n def _ensemble_impute(self, X_train, y_train, X_pred, j, n_missing, rng):\n \"\"\"Weighted ensemble of Ridge + RandomForest + KNN.\"\"\"\n predictions = []\n weights = []\n\n # Ridge regression\n try:\n from sklearn.linear_model import Ridge\n ridge = Ridge(alpha=1.0)\n ridge.fit(X_train, y_train)\n pred_ridge = ridge.predict(X_pred)\n predictions.append(pred_ridge)\n pred_train = ridge.predict(X_train)\n r2 = 1 - np.var(y_train - pred_train) / max(np.var(y_train), 1e-10)\n weights.append(max(r2, 0.1))\n except Exception:\n pass\n\n # Random Forest\n if len(y_train) > 10:\n try:\n from sklearn.ensemble import RandomForestRegressor\n rf = RandomForestRegressor(\n n_estimators=self.n_estimators,\n max_features=\"sqrt\",\n max_depth=min(10, max(3, X_train.shape[1])),\n random_state=self.random_state,\n n_jobs=-1,\n )\n rf.fit(X_train, y_train)\n pred_rf = rf.predict(X_pred)\n predictions.append(pred_rf)\n pred_train = rf.predict(X_train)\n r2 = 1 - np.var(y_train - pred_train) / max(np.var(y_train), 1e-10)\n 
weights.append(max(r2, 0.1))\n except Exception:\n pass\n\n # KNN\n if len(y_train) >= self.n_neighbors:\n try:\n from sklearn.neighbors import KNeighborsRegressor\n knn = KNeighborsRegressor(n_neighbors=min(self.n_neighbors, len(y_train)))\n knn.fit(X_train, y_train)\n pred_knn = knn.predict(X_pred)\n predictions.append(pred_knn)\n pred_train = knn.predict(X_train)\n r2 = 1 - np.var(y_train - pred_train) / max(np.var(y_train), 1e-10)\n weights.append(max(r2, 0.1))\n except Exception:\n pass\n\n if len(predictions) == 0:\n return np.full(n_missing, self.col_means_[j])\n\n weight_sum = sum(weights)\n if weight_sum < 1e-10:\n return predictions[0]\n\n result = sum(w * p for w, p in zip(weights, predictions)) / weight_sum\n return result",
4 "end_line": 109,
5 "filename": "scikit_learn/custom_imputation.py",
6 "start_line": 41
7}
ERROR: File not editable: scikit_learn/custom_imputation.py
Tool Input
1{
2 "content": "class CustomImputer(BaseEstimator, TransformerMixin):\n \"\"\"Correlation-Adaptive Ensemble Imputer (CAEI).\n\n Novel imputation algorithm that combines:\n 1) Multi-model ensemble: Ridge + RandomForest + KNN, weighted by R²\n 2) Post-calibration: restore marginal moments to match observed data\n 3) Iterative refinement with convergence checking\n \"\"\"\n\n def __init__(self, random_state=42, max_iter=10):\n self.random_state = random_state\n self.max_iter = max_iter\n self.n_neighbors = 5\n self.n_estimators = 50\n self.tol = 1e-4\n\n def fit(self, X, y=None):\n \"\"\"Learn imputation model from data X.\"\"\"\n self._learn(X)\n return self\n\n def transform(self, X):\n \"\"\"Impute missing values in X.\"\"\"\n return self._impute(X)\n\n def fit_transform(self, X, y=None):\n \"\"\"Fit and transform in one step.\"\"\"\n return self._impute(X)\n\n def _learn(self, X):\n \"\"\"Precompute statistics used during imputation.\"\"\"\n self.col_means_ = np.nanmean(X, axis=0)\n\n def _impute(self, X):\n \"\"\"Perform ensemble-iterative imputation.\"\"\"\n rng = np.random.RandomState(self.random_state)\n X_imp = X.copy()\n n_samples, n_features = X_imp.shape\n\n # Initial imputation with column means\n for j in range(n_features):\n mask_j = np.isnan(X_imp[:, j])\n if mask_j.any():\n X_imp[mask_j, j] = self.col_means_[j]\n\n # Identify features with missing values\n miss_counts = np.isnan(X).sum(axis=0)\n has_missing = np.where(miss_counts > 0)[0]\n\n if len(has_missing) == 0:\n return X_imp\n\n # Order by missingness count (fewer missing = predict first)\n order = has_missing[np.argsort(miss_counts[has_missing])]\n\n # Main iterative imputation loop\n for iteration in range(self.max_iter):\n X_prev = X_imp.copy()\n\n for j in order:\n obs_mask = ~np.isnan(X[:, j])\n mis_mask = np.isnan(X[:, j])\n\n if mis_mask.sum() == 0:\n continue\n\n other_features = [k for k in range(n_features) if k != j]\n X_train = X_imp[obs_mask][:, other_features]\n y_train = 
X[obs_mask, j]\n X_pred = X_imp[mis_mask][:, other_features]\n\n # Build ensemble imputation\n ensemble_pred = self._ensemble_impute(\n X_train, y_train, X_pred, j, mis_mask.sum(), rng\n )\n\n X_imp[mis_mask, j] = ensemble_pred\n\n # Convergence check\n diff = np.sum((X_imp - X_prev) ** 2)\n denom = np.sum(X_imp ** 2)\n if denom > 0 and diff / denom < self.tol:\n break\n\n # Final calibration: match marginal moments\n for j in range(n_features):\n obs_mask_final = ~np.isnan(X[:, j])\n if obs_mask_final.sum() > 0 and miss_counts[j] > 0:\n obs_std = np.std(X[obs_mask_final, j])\n obs_mean = np.mean(X[obs_mask_final, j])\n if obs_std > 1e-10:\n current_std = np.std(X_imp[:, j])\n current_mean = np.mean(X_imp[:, j])\n if current_std > 1e-10:\n X_imp[:, j] = (X_imp[:, j] - current_mean) * (obs_std / current_std) + obs_mean\n\n return X_imp\n\n def _ensemble_impute(self, X_train, y_train, X_pred, j, n_missing, rng):\n \"\"\"Weighted ensemble of Ridge + RandomForest + KNN.\"\"\"\n predictions = []\n weights = []\n\n # Ridge regression\n try:\n from sklearn.linear_model import Ridge\n ridge = Ridge(alpha=1.0)\n ridge.fit(X_train, y_train)\n pred_ridge = ridge.predict(X_pred)\n predictions.append(pred_ridge)\n pred_train = ridge.predict(X_train)\n r2 = 1 - np.var(y_train - pred_train) / max(np.var(y_train), 1e-10)\n weights.append(max(r2, 0.1))\n except Exception:\n pass\n\n # Random Forest\n if len(y_train) > 10:\n try:\n from sklearn.ensemble import RandomForestRegressor\n rf = RandomForestRegressor(\n n_estimators=self.n_estimators,\n max_features=\"sqrt\",\n max_depth=min(10, max(3, X_train.shape[1])),\n random_state=self.random_state,\n n_jobs=-1,\n )\n rf.fit(X_train, y_train)\n pred_rf = rf.predict(X_pred)\n predictions.append(pred_rf)\n pred_train = rf.predict(X_train)\n r2 = 1 - np.var(y_train - pred_train) / max(np.var(y_train), 1e-10)\n weights.append(max(r2, 0.1))\n except Exception:\n pass\n\n # KNN\n if len(y_train) >= self.n_neighbors:\n try:\n from 
sklearn.neighbors import KNeighborsRegressor\n knn = KNeighborsRegressor(n_neighbors=min(self.n_neighbors, len(y_train)))\n knn.fit(X_train, y_train)\n pred_knn = knn.predict(X_pred)\n predictions.append(pred_knn)\n pred_train = knn.predict(X_train)\n r2 = 1 - np.var(y_train - pred_train) / max(np.var(y_train), 1e-10)\n weights.append(max(r2, 0.1))\n except Exception:\n pass\n\n if len(predictions) == 0:\n return np.full(n_missing, self.col_means_[j])\n\n weight_sum = sum(weights)\n if weight_sum < 1e-10:\n return predictions[0]\n\n result = sum(w * p for w, p in zip(weights, predictions)) / weight_sum\n return result",
3 "end_line": 109,
4 "filename": "custom_imputation.py",
5 "op": "replace",
6 "start_line": 41
7}
ERROR: Package 'custom_imputation.py' is not in allowed packages
Tool Input
1{
2 "content": "class CustomImputer(BaseEstimator, TransformerMixin):\n \"\"\"Correlation-Adaptive Ensemble Imputer (CAEI).\n\n Novel imputation algorithm that combines:\n 1) Multi-model ensemble: Ridge + RandomForest + KNN, weighted by R²\n 2) Post-calibration: restore marginal moments to match observed data\n 3) Iterative refinement with convergence checking\n \"\"\"\n\n def __init__(self, random_state=42, max_iter=10):\n self.random_state = random_state\n self.max_iter = max_iter\n self.n_neighbors = 5\n self.n_estimators = 50\n self.tol = 1e-4\n\n def fit(self, X, y=None):\n \"\"\"Learn imputation model from data X.\"\"\"\n self.col_means_ = np.nanmean(X, axis=0)\n return self\n\n def transform(self, X):\n \"\"\"Impute missing values in X.\"\"\"\n return self._impute(X)\n\n def fit_transform(self, X, y=None):\n \"\"\"Fit and transform in one step.\"\"\"\n return self._impute(X)\n\n def _impute(self, X):\n \"\"\"Perform ensemble-iterative imputation.\"\"\"\n rng = np.random.RandomState(self.random_state)\n X_imp = X.copy()\n n_samples, n_features = X_imp.shape\n\n # Initial imputation with column means\n for j in range(n_features):\n mask_j = np.isnan(X_imp[:, j])\n if mask_j.any():\n X_imp[mask_j, j] = self.col_means_[j]\n\n # Identify features with missing values\n miss_counts = np.isnan(X).sum(axis=0)\n has_missing = np.where(miss_counts > 0)[0]\n\n if len(has_missing) == 0:\n return X_imp\n\n order = has_missing[np.argsort(miss_counts[has_missing])]\n\n for iteration in range(self.max_iter):\n X_prev = X_imp.copy()\n\n for j in order:\n obs_mask = ~np.isnan(X[:, j])\n mis_mask = np.isnan(X[:, j])\n\n if mis_mask.sum() == 0:\n continue\n\n other_features = [k for k in range(n_features) if k != j]\n X_train = X_imp[obs_mask][:, other_features]\n y_train = X[obs_mask, j]\n X_pred = X_imp[mis_mask][:, other_features]\n\n ensemble_pred = self._ensemble_impute(\n X_train, y_train, X_pred, j, mis_mask.sum(), rng\n )\n X_imp[mis_mask, j] = ensemble_pred\n\n diff = 
np.sum((X_imp - X_prev) ** 2)\n denom = np.sum(X_imp ** 2)\n if denom > 0 and diff / denom < self.tol:\n break\n\n # Final calibration: match marginal moments\n for j in range(n_features):\n obs_mask_final = ~np.isnan(X[:, j])\n if obs_mask_final.sum() > 0 and miss_counts[j] > 0:\n obs_std = np.std(X[obs_mask_final, j])\n obs_mean = np.mean(X[obs_mask_final, j])\n if obs_std > 1e-10:\n current_std = np.std(X_imp[:, j])\n current_mean = np.mean(X_imp[:, j])\n if current_std > 1e-10:\n X_imp[:, j] = (X_imp[:, j] - current_mean) * (obs_std / current_std) + obs_mean\n\n return X_imp\n\n def _ensemble_impute(self, X_train, y_train, X_pred, j, n_missing, rng):\n \"\"\"Weighted ensemble of Ridge + RandomForest + KNN.\"\"\"\n predictions = []\n weights = []\n\n # Ridge\n try:\n from sklearn.linear_model import Ridge\n ridge = Ridge(alpha=1.0)\n ridge.fit(X_train, y_train)\n predictions.append(ridge.predict(X_pred))\n pred_train = ridge.predict(X_train)\n r2 = 1 - np.var(y_train - pred_train) / max(np.var(y_train), 1e-10)\n weights.append(max(r2, 0.1))\n except Exception:\n pass\n\n # RandomForest\n if len(y_train) > 10:\n try:\n from sklearn.ensemble import RandomForestRegressor\n rf = RandomForestRegressor(\n n_estimators=self.n_estimators,\n max_features=\"sqrt\",\n max_depth=min(10, max(3, X_train.shape[1])),\n random_state=self.random_state,\n n_jobs=-1,\n )\n rf.fit(X_train, y_train)\n predictions.append(rf.predict(X_pred))\n pred_train = rf.predict(X_train)\n r2 = 1 - np.var(y_train - pred_train) / max(np.var(y_train), 1e-10)\n weights.append(max(r2, 0.1))\n except Exception:\n pass\n\n # KNN\n if len(y_train) >= self.n_neighbors:\n try:\n from sklearn.neighbors import KNeighborsRegressor\n knn = KNeighborsRegressor(n_neighbors=min(self.n_neighbors, len(y_train)))\n knn.fit(X_train, y_train)\n predictions.append(knn.predict(X_pred))\n pred_train = knn.predict(X_train)\n r2 = 1 - np.var(y_train - pred_train) / max(np.var(y_train), 1e-10)\n weights.append(max(r2, 
0.1))\n except Exception:\n pass\n\n if len(predictions) == 0:\n return np.full(n_missing, self.col_means_[j])\n\n weight_sum = sum(weights)\n if weight_sum < 1e-10:\n return predictions[0]\n\n return sum(w * p for w, p in zip(weights, predictions)) / weight_sum",
3 "end_line": 109,
4 "filename": "LLaMA-Factory/src/custom_imputation.py",
5 "op": "replace",
6 "start_line": 41
7}
ERROR: Package 'LLaMA-Factory' is not in allowed packages
Tool Input
1{
2 "filename": "custom_imputation.py",
3 "op": "create",
4 "content": "class CustomImputer(BaseEstimator, TransformerMixin):\n \"\"\"Correlation-Adaptive Ensemble Imputer (CAEI).\n\n Novel imputation algorithm that combines:\n 1) Multi-model ensemble: Ridge + RandomForest + KNN, weighted by R²\n 2) Post-calibration: restore marginal moments to match observed data\n 3) Iterative refinement with convergence checking\n \"\"\"\n\n def __init__(self, random_state=42, max_iter=10):\n self.random_state = random_state\n self.max_iter = max_iter\n self.n_neighbors = 5\n self.n_estimators = 50\n self.tol = 1e-4\n\n def fit(self, X, y=None):\n \"\"\"Learn imputation model from data X.\"\"\"\n self.col_means_ = np.nanmean(X, axis=0)\n return self\n\n def transform(self, X):\n \"\"\"Impute missing values in X.\"\"\"\n return self._impute(X)\n\n def fit_transform(self, X, y=None):\n \"\"\"Fit and transform in one step.\"\"\"\n return self._impute(X)\n\n def _impute(self, X):\n \"\"\"Perform ensemble-iterative imputation.\"\"\"\n rng = np.random.RandomState(self.random_state)\n X_imp = X.copy()\n n_samples, n_features = X_imp.shape\n\n # Initial imputation with column means\n for j in range(n_features):\n mask_j = np.isnan(X_imp[:, j])\n if mask_j.any():\n X_imp[mask_j, j] = self.col_means_[j]\n\n # Identify features with missing values\n miss_counts = np.isnan(X).sum(axis=0)\n has_missing = np.where(miss_counts > 0)[0]\n\n if len(has_missing) == 0:\n return X_imp\n\n order = has_missing[np.argsort(miss_counts[has_missing])]\n\n for iteration in range(self.max_iter):\n X_prev = X_imp.copy()\n\n for j in order:\n obs_mask = ~np.isnan(X[:, j])\n mis_mask = np.isnan(X[:, j])\n\n if mis_mask.sum() == 0:\n continue\n\n other_features = [k for k in range(n_features) if k != j]\n X_train = X_imp[obs_mask][:, other_features]\n y_train = X[obs_mask, j]\n X_pred = X_imp[mis_mask][:, other_features]\n\n ensemble_pred = self._ensemble_impute(\n X_train, y_train, X_pred, j, mis_mask.sum(), rng\n )\n X_imp[mis_mask, j] = ensemble_pred\n\n diff = 
np.sum((X_imp - X_prev) ** 2)\n denom = np.sum(X_imp ** 2)\n if denom > 0 and diff / denom < self.tol:\n break\n\n # Final calibration: match marginal moments\n for j in range(n_features):\n obs_mask_final = ~np.isnan(X[:, j])\n if obs_mask_final.sum() > 0 and miss_counts[j] > 0:\n obs_std = np.std(X[obs_mask_final, j])\n obs_mean = np.mean(X[obs_mask_final, j])\n if obs_std > 1e-10:\n current_std = np.std(X_imp[:, j])\n current_mean = np.mean(X_imp[:, j])\n if current_std > 1e-10:\n X_imp[:, j] = (X_imp[:, j] - current_mean) * (obs_std / current_std) + obs_mean\n\n return X_imp\n\n def _ensemble_impute(self, X_train, y_train, X_pred, j, n_missing, rng):\n \"\"\"Weighted ensemble of Ridge + RandomForest + KNN.\"\"\"\n predictions = []\n weights = []\n\n # Ridge\n try:\n from sklearn.linear_model import Ridge\n ridge = Ridge(alpha=1.0)\n ridge.fit(X_train, y_train)\n predictions.append(ridge.predict(X_pred))\n pred_train = ridge.predict(X_train)\n r2 = 1 - np.var(y_train - pred_train) / max(np.var(y_train), 1e-10)\n weights.append(max(r2, 0.1))\n except Exception:\n pass\n\n # RandomForest\n if len(y_train) > 10:\n try:\n from sklearn.ensemble import RandomForestRegressor\n rf = RandomForestRegressor(\n n_estimators=self.n_estimators,\n max_features=\"sqrt\",\n max_depth=min(10, max(3, X_train.shape[1])),\n random_state=self.random_state,\n n_jobs=-1,\n )\n rf.fit(X_train, y_train)\n predictions.append(rf.predict(X_pred))\n pred_train = rf.predict(X_train)\n r2 = 1 - np.var(y_train - pred_train) / max(np.var(y_train), 1e-10)\n weights.append(max(r2, 0.1))\n except Exception:\n pass\n\n # KNN\n if len(y_train) >= self.n_neighbors:\n try:\n from sklearn.neighbors import KNeighborsRegressor\n knn = KNeighborsRegressor(n_neighbors=min(self.n_neighbors, len(y_train)))\n knn.fit(X_train, y_train)\n predictions.append(knn.predict(X_pred))\n pred_train = knn.predict(X_train)\n r2 = 1 - np.var(y_train - pred_train) / max(np.var(y_train), 1e-10)\n weights.append(max(r2, 
0.1))\n except Exception:\n pass\n\n if len(predictions) == 0:\n return np.full(n_missing, self.col_means_[j])\n\n weight_sum = sum(weights)\n if weight_sum < 1e-10:\n return predictions[0]\n\n return sum(w * p for w, p in zip(weights, predictions)) / weight_sum\n"
5}
ERROR: Package 'custom_imputation.py' is not in allowed packages
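The docstring above names R²-weighted ensembling as the first ingredient of CAEI: each base model's predictions are blended in proportion to its in-sample R², floored at 0.1. A standalone sketch of just that weighting (the helper names `r2_weight` and `combine` are illustrative, not part of the tool input):

```python
import numpy as np

def r2_weight(y_train, pred_train, floor=0.1):
    """In-sample R^2, floored so every fitted model keeps a small vote."""
    r2 = 1 - np.var(y_train - pred_train) / max(np.var(y_train), 1e-10)
    return max(r2, floor)

def combine(preds, weights):
    """Convex combination of per-model predictions."""
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, preds)) / total

y = np.array([1.0, 2.0, 3.0, 4.0])
good = np.array([1.1, 1.9, 3.0, 4.0])  # near-perfect fit -> weight close to 1
bad = np.full(4, y.mean())             # mean predictor -> R^2 = 0, floored to 0.1
w_good, w_bad = r2_weight(y, good), r2_weight(y, bad)
blended = combine([good, bad], [w_good, w_bad])
```

The floor keeps a poorly fitting model from being zeroed out entirely; whether 0.1 is the right floor is a tuning question the transcript never revisits. Note also that in-sample R² favors models that overfit (the RandomForest will score near 1 on its own training rows), so these weights are optimistic rather than cross-validated.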
Tool Input
1{
2 "content": "class CustomImputer(BaseEstimator, TransformerMixin):\n \"\"\"Correlation-Adaptive Ensemble Imputer (CAEI).\n\n Novel imputation algorithm that combines:\n 1) Multi-model ensemble: Ridge + RandomForest + KNN, weighted by R²\n 2) Post-calibration: restore marginal moments to match observed data\n 3) Iterative refinement with convergence checking\n \"\"\"\n\n def __init__(self, random_state=42, max_iter=10):\n self.random_state = random_state\n self.max_iter = max_iter\n self.n_neighbors = 5\n self.n_estimators = 50\n self.tol = 1e-4\n\n def fit(self, X, y=None):\n \"\"\"Learn imputation model from data X.\"\"\"\n self.col_means_ = np.nanmean(X, axis=0)\n return self\n\n def transform(self, X):\n \"\"\"Impute missing values in X.\"\"\"\n return self._impute(X)\n\n def fit_transform(self, X, y=None):\n \"\"\"Fit and transform in one step.\"\"\"\n return self._impute(X)\n\n def _impute(self, X):\n \"\"\"Perform ensemble-iterative imputation.\"\"\"\n rng = np.random.RandomState(self.random_state)\n X_imp = X.copy()\n n_samples, n_features = X_imp.shape\n\n for j in range(n_features):\n mask_j = np.isnan(X_imp[:, j])\n if mask_j.any():\n X_imp[mask_j, j] = self.col_means_[j]\n\n miss_counts = np.isnan(X).sum(axis=0)\n has_missing = np.where(miss_counts > 0)[0]\n\n if len(has_missing) == 0:\n return X_imp\n\n order = has_missing[np.argsort(miss_counts[has_missing])]\n\n for iteration in range(self.max_iter):\n X_prev = X_imp.copy()\n\n for j in order:\n obs_mask = ~np.isnan(X[:, j])\n mis_mask = np.isnan(X[:, j])\n\n if mis_mask.sum() == 0:\n continue\n\n other_features = [k for k in range(n_features) if k != j]\n X_train = X_imp[obs_mask][:, other_features]\n y_train = X[obs_mask, j]\n X_pred = X_imp[mis_mask][:, other_features]\n\n ensemble_pred = self._ensemble_impute(\n X_train, y_train, X_pred, j, mis_mask.sum(), rng\n )\n X_imp[mis_mask, j] = ensemble_pred\n\n diff = np.sum((X_imp - X_prev) ** 2)\n denom = np.sum(X_imp ** 2)\n if denom > 0 and diff / 
denom < self.tol:\n break\n\n for j in range(n_features):\n obs_mask_final = ~np.isnan(X[:, j])\n if obs_mask_final.sum() > 0 and miss_counts[j] > 0:\n obs_std = np.std(X[obs_mask_final, j])\n obs_mean = np.mean(X[obs_mask_final, j])\n if obs_std > 1e-10:\n current_std = np.std(X_imp[:, j])\n current_mean = np.mean(X_imp[:, j])\n if current_std > 1e-10:\n X_imp[:, j] = (X_imp[:, j] - current_mean) * (obs_std / current_std) + obs_mean\n\n return X_imp\n\n def _ensemble_impute(self, X_train, y_train, X_pred, j, n_missing, rng):\n \"\"\"Weighted ensemble of Ridge + RandomForest + KNN.\"\"\"\n predictions = []\n weights = []\n\n try:\n from sklearn.linear_model import Ridge\n ridge = Ridge(alpha=1.0)\n ridge.fit(X_train, y_train)\n predictions.append(ridge.predict(X_pred))\n pred_train = ridge.predict(X_train)\n r2 = 1 - np.var(y_train - pred_train) / max(np.var(y_train), 1e-10)\n weights.append(max(r2, 0.1))\n except Exception:\n pass\n\n if len(y_train) > 10:\n try:\n from sklearn.ensemble import RandomForestRegressor\n rf = RandomForestRegressor(\n n_estimators=self.n_estimators,\n max_features=\"sqrt\",\n max_depth=min(10, max(3, X_train.shape[1])),\n random_state=self.random_state,\n n_jobs=-1,\n )\n rf.fit(X_train, y_train)\n predictions.append(rf.predict(X_pred))\n pred_train = rf.predict(X_train)\n r2 = 1 - np.var(y_train - pred_train) / max(np.var(y_train), 1e-10)\n weights.append(max(r2, 0.1))\n except Exception:\n pass\n\n if len(y_train) >= self.n_neighbors:\n try:\n from sklearn.neighbors import KNeighborsRegressor\n knn = KNeighborsRegressor(n_neighbors=min(self.n_neighbors, len(y_train)))\n knn.fit(X_train, y_train)\n predictions.append(knn.predict(X_pred))\n pred_train = knn.predict(X_train)\n r2 = 1 - np.var(y_train - pred_train) / max(np.var(y_train), 1e-10)\n weights.append(max(r2, 0.1))\n except Exception:\n pass\n\n if len(predictions) == 0:\n return np.full(n_missing, self.col_means_[j])\n\n weight_sum = sum(weights)\n if weight_sum < 1e-10:\n 
return predictions[0]\n\n return sum(w * p for w, p in zip(weights, predictions)) / weight_sum\n\n\n",
3 "end_line": 131,
4 "filename": "scikit-learn/custom_imputation.py",
5 "op": "replace",
6 "start_line": 36
7}
OK: Replaced lines 36..131 with 142 line(s) in scikit-learn/custom_imputation.py. Editable range: 36–177.
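The final loop in the tool input above is the "post-calibration" step: after imputation, each affected column is shifted and rescaled so its overall mean and standard deviation match those of the observed entries, countering the variance shrinkage that regression-based imputation introduces. A standalone sketch (the helper name `calibrate_column` is illustrative):

```python
import numpy as np

def calibrate_column(col_imputed, col_with_nan, eps=1e-10):
    """Rescale a fully imputed column so its mean/std match the observed entries."""
    obs = col_with_nan[~np.isnan(col_with_nan)]
    obs_mean, obs_std = obs.mean(), obs.std()
    cur_mean, cur_std = col_imputed.mean(), col_imputed.std()
    if obs_std <= eps or cur_std <= eps:
        return col_imputed  # near-constant column: leave untouched
    return (col_imputed - cur_mean) * (obs_std / cur_std) + obs_mean

rng = np.random.default_rng(0)
col = rng.normal(size=200)
raw = col.copy()
raw[rng.random(200) < 0.2] = np.nan            # 20% MCAR in one column
imputed = np.where(np.isnan(raw), np.nanmean(raw), raw)  # mean fill shrinks variance
fixed = calibrate_column(imputed, raw)
```

Note that, exactly as in the tool input, the whole column is rescaled, observed entries included; a more conservative variant would adjust only the imputed positions so observed values stay untouched.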
[Current file: scikit-learn/custom_imputation.py | editable: 36–177 | total: 307 lines]
36 class CustomImputer(BaseEstimator, TransformerMixin):
37 """Correlation-Adaptive Ensemble Imputer (CAEI).
38
... (136 more lines) ...
175 return sum(w * p for w, p in zip(weights, predictions)) / weight_sum
176
177
[Test #1] (2 tests remaining; call submit(n=N) to choose which test result to submit as final)
### breast_cancer (scripts/breast_cancer.sh)
Training metrics (last lines):
TRAIN_METRICS stage=fitting
### wine (scripts/wine.sh)
Training metrics (last lines):
TRAIN_METRICS stage=fitting
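The test scripts themselves are not shown in the transcript. A minimal sketch of the evaluation the task describes, with a 20% MCAR mask and RMSE measured only on the masked cells, and with `SimpleImputer` standing in for `CustomImputer`:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(42)
X = load_wine().data                      # 178 samples, 13 features
mask = rng.random(X.shape) < 0.2          # 20% MCAR missingness
X_miss = X.copy()
X_miss[mask] = np.nan

imputer = SimpleImputer(strategy="mean")  # stand-in for CustomImputer
X_hat = imputer.fit_transform(X_miss)

# RMSE is computed only on the cells that were masked out,
# comparing imputed values against the held-back ground truth.
rmse = np.sqrt(np.mean((X_hat[mask] - X[mask]) ** 2))
```

The downstream metric would then train a GradientBoosting model on `X_hat` and score accuracy (or R² for california), but the exact harness in `scripts/*.sh` is not visible here.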
Tool Input
1{
2 "content": "class CustomImputer(BaseEstimator, TransformerMixin):\n \"\"\"Correlation-Adaptive Ensemble Imputer (CAEI).\n\n Novel imputation algorithm that combines:\n 1) Multi-model ensemble: Ridge + RandomForest + KNN, weighted by R²\n 2) Post-calibration: restore marginal moments to match observed data\n 3) Iterative refinement with convergence checking\n \"\"\"\n\n def __init__(self, random_state=42, max_iter=10):\n self.random_state = random_state\n self.max_iter = max_iter\n self.n_neighbors = 5\n self.n_estimators = 50\n self.tol = 1e-4\n\n def fit(self, X, y=None):\n \"\"\"Learn imputation model from data X.\"\"\"\n self.col_means_ = np.nanmean(X, axis=0)\n return self\n\n def transform(self, X):\n \"\"\"Impute missing values in X.\"\"\"\n return self._impute(X)\n\n def fit_transform(self, X, y=None):\n \"\"\"Fit and transform in one step.\"\"\"\n return self._impute(X)\n\n def _impute(self, X):\n \"\"\"Perform ensemble-iterative imputation.\"\"\"\n rng = np.random.RandomState(self.random_state)\n X_imp = X.copy()\n n_samples, n_features = X_imp.shape\n\n for j in range(n_features):\n mask_j = np.isnan(X_imp[:, j])\n if mask_j.any():\n X_imp[mask_j, j] = self.col_means_[j]\n\n miss_counts = np.isnan(X).sum(axis=0)\n has_missing = np.where(miss_counts > 0)[0]\n\n if len(has_missing) == 0:\n return X_imp\n\n order = has_missing[np.argsort(miss_counts[has_missing])]\n\n for iteration in range(self.max_iter):\n X_prev = X_imp.copy()\n\n for j in order:\n obs_mask = ~np.isnan(X[:, j])\n mis_mask = np.isnan(X[:, j])\n\n if mis_mask.sum() == 0:\n continue\n\n other_features = [k for k in range(n_features) if k != j]\n X_train = X_imp[obs_mask][:, other_features]\n y_train = X[obs_mask, j]\n X_pred = X_imp[mis_mask][:, other_features]\n\n ensemble_pred = self._ensemble_impute(\n X_train, y_train, X_pred, j, mis_mask.sum(), rng\n )\n X_imp[mis_mask, j] = ensemble_pred\n\n diff = np.sum((X_imp - X_prev) ** 2)\n denom = np.sum(X_imp ** 2)\n if denom > 0 and diff / 
denom < self.tol:\n break\n\n for j in range(n_features):\n obs_mask_final = ~np.isnan(X[:, j])\n if obs_mask_final.sum() > 0 and miss_counts[j] > 0:\n obs_std = np.std(X[obs_mask_final, j])\n obs_mean = np.mean(X[obs_mask_final, j])\n if obs_std > 1e-10:\n current_std = np.std(X_imp[:, j])\n current_mean = np.mean(X_imp[:, j])\n if current_std > 1e-10:\n X_imp[:, j] = (X_imp[:, j] - current_mean) * (obs_std / current_std) + obs_mean\n\n return X_imp\n\n def _ensemble_impute(self, X_train, y_train, X_pred, j, n_missing, rng):\n \"\"\"Weighted ensemble of Ridge + RandomForest + KNN.\"\"\"\n predictions = []\n weights = []\n\n try:\n from sklearn.linear_model import Ridge\n ridge = Ridge(alpha=1.0)\n ridge.fit(X_train, y_train)\n predictions.append(ridge.predict(X_pred))\n pred_train = ridge.predict(X_train)\n r2 = 1 - np.var(y_train - pred_train) / max(np.var(y_train), 1e-10)\n weights.append(max(r2, 0.1))\n except Exception:\n pass\n\n if len(y_train) > 10:\n try:\n from sklearn.ensemble import RandomForestRegressor\n rf = RandomForestRegressor(\n n_estimators=self.n_estimators,\n max_features=\"sqrt\",\n max_depth=min(10, max(3, X_train.shape[1])),\n random_state=self.random_state,\n )\n rf.fit(X_train, y_train)\n predictions.append(rf.predict(X_pred))\n pred_train = rf.predict(X_train)\n r2 = 1 - np.var(y_train - pred_train) / max(np.var(y_train), 1e-10)\n weights.append(max(r2, 0.1))\n except Exception:\n pass\n\n if len(y_train) >= self.n_neighbors:\n try:\n from sklearn.neighbors import KNeighborsRegressor\n knn = KNeighborsRegressor(n_neighbors=min(self.n_neighbors, len(y_train)))\n knn.fit(X_train, y_train)\n predictions.append(knn.predict(X_pred))\n pred_train = knn.predict(X_train)\n r2 = 1 - np.var(y_train - pred_train) / max(np.var(y_train), 1e-10)\n weights.append(max(r2, 0.1))\n except Exception:\n pass\n\n if len(predictions) == 0:\n return np.full(n_missing, self.col_means_[j])\n\n weight_sum = sum(weights)\n if weight_sum < 1e-10:\n return 
predictions[0]\n\n return sum(w * p for w, p in zip(weights, predictions)) / weight_sum\n\n\n",
3 "end_line": 177,
4 "filename": "scikit-learn/custom_imputation.py",
5 "op": "replace",
6 "start_line": 36
7}
OK: Replaced lines 36..177 with 141 line(s) in scikit-learn/custom_imputation.py. Editable range: 36–176.
[Current file: scikit-learn/custom_imputation.py | editable: 36–176 | total: 306 lines]
36 class CustomImputer(BaseEstimator, TransformerMixin):
37 """Correlation-Adaptive Ensemble Imputer (CAEI).
38
... (135 more lines) ...
174 return sum(w * p for w, p in zip(weights, predictions)) / weight_sum
175
176
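The early-stopping test inside `_impute` compares successive rounds via the relative squared change `diff / denom` against `tol`. As a standalone sketch (the function name `converged` is illustrative):

```python
import numpy as np

def converged(X_new, X_old, tol=1e-4):
    """Stop when the squared change between rounds, relative to the
    squared magnitude of the current estimate, falls below tol."""
    denom = np.sum(X_new ** 2)
    if denom <= 0:
        return False  # degenerate all-zero matrix: keep iterating
    return np.sum((X_new - X_old) ** 2) / denom < tol

A = np.ones((3, 3))
assert converged(A, A)            # identical rounds: change is exactly 0
assert not converged(A, A + 1.0)  # relative change of 1.0 >> tol
```

Normalizing by the current estimate's magnitude makes the criterion scale-free, so the same `tol=1e-4` behaves comparably on unstandardized columns of very different ranges.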