Agent Conversation: google/gemini-3.1-pro-preview
Task: ml-selective-deferral
Selective Deferral Under Subgroup Shift
Research Question
Design a practical selective prediction and deferral policy for high-stakes tabular decisions.
The task isolates one modular question: given a fixed base classifier, what acceptance / deferral rule best trades off selective risk, subgroup fairness, and overall discrimination?
Background
Selective prediction systems should be able to say "I do not know" when the classifier is uncertain. In high-stakes settings, that deferral can be handed to a human reviewer or a slower backup process. The benchmark studies whether a policy can:
- keep selective risk low at a fixed target coverage,
- avoid concentrating deferrals on one subgroup,
- preserve AUROC as a confidence ranking signal, and
- remain simple enough to run offline on modest compute.
Task
Modify the SelectivePolicy class in custom_selective.py. The rest of the pipeline is fixed: dataset loading, train / calibration / test splitting, base model training, and metric computation.
The policy receives calibration-time base-model probabilities and subgroup labels, then decides whether each test example should be accepted or deferred. You may implement a single global threshold, a learned deferral score, subgroup-specific thresholds, or any other compact policy that fits the interface.
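A minimal instance of such a policy is a single global threshold on the base model's top-class probability, tuned on the calibration split so that roughly the target fraction of examples is accepted. The sketch below is illustrative only: the method names mirror the SelectivePolicy shown later in this log, but the exact benchmark interface may differ.

```python
import numpy as np


class GlobalThresholdPolicy:
    """Sketch of the simplest compact policy: one confidence threshold
    chosen so that ~target_coverage of calibration examples are accepted.
    (Class and method names are illustrative, not the benchmark's API.)"""

    def __init__(self, target_coverage: float = 0.8):
        self.target_coverage = float(target_coverage)
        self.threshold_ = 0.5

    def fit(self, probs: np.ndarray, y_true: np.ndarray, groups: np.ndarray):
        # Acceptance score = top-class probability. Accepting everything at
        # or above the (1 - coverage) quantile yields ~coverage acceptance.
        scores = np.max(probs, axis=1)
        q = float(np.clip(1.0 - self.target_coverage, 0.0, 1.0))
        self.threshold_ = float(np.quantile(scores, q))
        return self

    def predict_accept(self, probs: np.ndarray, groups: np.ndarray) -> np.ndarray:
        return np.max(probs, axis=1) >= self.threshold_
```

Subgroup labels are accepted but unused here; groupwise variants replace the single quantile with per-group quantiles.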
Evaluation
The benchmark runs on four offline tabular proxies from scikit-learn:
- breast_cancer
- diabetes (binarized around the training-set median)
- california (binarized around the training-set median)
- madelon
Each dataset is split into train / calibration / test partitions. Subgroups are formed from a stable feature threshold so that worst-group behavior can be measured.
Metrics:
- selective_risk_at80: classification error on accepted examples at 80% target coverage
- worst_group_selective_risk: worst subgroup error on accepted examples
- deferral_rate_gap: max subgroup deferral rate minus min subgroup deferral rate
- auroc: AUROC of the acceptance score for predicting correctness
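Under the natural definitions implied by the names above, the non-AUROC metrics reduce to a few masked means. The sketch below shows those assumed definitions (AUROC is omitted because it needs the continuous acceptance score, not the binary accept mask); the benchmark's exact implementations may differ in edge-case handling.

```python
import numpy as np


def selective_metrics(y_true: np.ndarray, y_pred: np.ndarray,
                      accept: np.ndarray, groups: np.ndarray) -> dict:
    """Assumed definitions of the headline metrics (illustrative only)."""
    accept = accept.astype(bool)
    # Selective risk: error rate restricted to accepted examples.
    risk = float(np.mean(y_pred[accept] != y_true[accept])) if accept.any() else 0.0
    worst, deferral_rates = 0.0, []
    for g in np.unique(groups):
        m = groups == g
        deferral_rates.append(1.0 - float(np.mean(accept[m])))
        acc_m = m & accept
        if acc_m.any():
            worst = max(worst, float(np.mean(y_pred[acc_m] != y_true[acc_m])))
    return {
        "selective_risk": risk,
        "worst_group_selective_risk": worst,
        "deferral_rate_gap": max(deferral_rates) - min(deferral_rates),
    }
```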
Baselines
- confidence_thresholding: tune one confidence threshold to hit the target coverage
- conformal_abstention: split-conformal abstention with a coverage target
- learned_deferral: train a compact meta-model that predicts whether the base model will be correct
- groupwise_thresholding: subgroup-specific thresholds as a stronger reference baseline
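The groupwise_thresholding baseline admits a compact sketch under one plausible reading: pick a separate quantile threshold per subgroup so every group is accepted at the same rate, which drives the deferral-rate gap toward zero (at the cost of accepting lower-confidence examples in weaker groups). The actual baseline in the repository may differ.

```python
import numpy as np


def groupwise_thresholds(scores: np.ndarray, groups: np.ndarray,
                         target_coverage: float = 0.8) -> dict:
    """Per-group acceptance thresholds equalizing coverage (assumed reading).

    Each group's threshold is the (1 - coverage) quantile of that group's
    acceptance scores, so all groups are accepted at ~target_coverage.
    """
    q = float(np.clip(1.0 - target_coverage, 0.0, 1.0))
    return {int(g): float(np.quantile(scores[groups == g], q))
            for g in np.unique(groups)}
```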
Practical Notes
This task intentionally uses datasets that are already available offline in the repository's scikit-learn package setup, so no shared package edit is required.
If you want the exact adult / ACSIncome / COMPAS / Law School datasets from the prompt, that would require a shared package-data follow-up, which I have not implemented here.
scikit-learn/custom_selective.py [EDITABLE — lines 161–195 only]
1: """Selective prediction / deferral benchmark.
2:
3: Fixed:
4: - offline dataset loa…

Tool Input
1{
2 "filename": "scikit-learn/custom_selective.py",
3 "end_line": 196,
4 "op": "replace",
5 "after_line": 160,
6 "start_line": 161,
7 "content": "class SelectivePolicy:\n \"\"\"Improved policy with learned correctness score and balanced thresholding.\"\"\"\n\n def __init__(self, target_coverage: float = TARGET_COVERAGE_DEFAULT, random_state: int = 0):\n self.target_coverage = float(target_coverage)\n self.random_state = int(random_state)\n self.threshold_: float = 0.5\n self.group_thresholds_: dict[int, float] = {}\n self.meta_model_ = None\n self.strategy_name = \"mixed_learned_deferral\"\n\n def _get_features(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n p1 = probs[:, 1]\n p0 = probs[:, 0]\n max_prob = np.maximum(p0, p1)\n margin = np.abs(p1 - p0)\n eps = 1e-12\n entropy = -(p0 * np.log(np.clip(p0, eps, 1.0)) + p1 * np.log(np.clip(p1, eps, 1.0)))\n \n feats = [p1, max_prob, margin, entropy, groups.astype(float)]\n if X is not None:\n # Prevent dimension blowup on small calibration set\n if X.shape[1] > 50:\n # Basic PCA or Random Projection-like subset without external fit\n feats.append(X[:, :50])\n else:\n feats.append(X)\n return np.column_stack(feats)\n\n def fit(self, probs: np.ndarray, y_true: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> \"SelectivePolicy\":\n from sklearn.ensemble import RandomForestClassifier\n features = self._get_features(probs, groups, X)\n correct = (np.argmax(probs, axis=1) == y_true).astype(int)\n \n self.meta_model_ = RandomForestClassifier(\n n_estimators=100,\n max_depth=5,\n min_samples_leaf=2,\n class_weight=\"balanced\",\n random_state=self.random_state,\n )\n self.meta_model_.fit(features, correct)\n \n scores = self.meta_model_.predict_proba(features)[:, 1]\n quantile = float(np.clip(1.0 - self.target_coverage, 0.0, 1.0))\n self.threshold_ = float(np.quantile(scores, quantile))\n \n alpha = 0.5\n self.group_thresholds_ = {}\n for group_id in np.unique(groups):\n mask = groups == group_id\n if np.any(mask):\n group_t = float(np.quantile(scores[mask], quantile))\n 
self.group_thresholds_[int(group_id)] = alpha * self.threshold_ + (1 - alpha) * group_t\n \n return self\n\n def acceptance_score(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n if self.meta_model_ is None:\n return np.max(probs, axis=1)\n features = self._get_features(probs, groups, X)\n return self.meta_model_.predict_proba(features)[:, 1]\n\n def predict_accept(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n scores = self.acceptance_score(probs, groups, X)\n thresholds = np.asarray([self.group_thresholds_.get(int(group), self.threshold_) for group in groups], dtype=float)\n return scores >= thresholds\n\n def calibration_summary(self) -> dict[str, float]:\n summary = {\"threshold\": float(self.threshold_)}\n for group_id, threshold in self.group_thresholds_.items():\n summary[f\"threshold_group_{group_id}\"] = float(threshold)\n return summary\n"
8}

ERROR: Lines 161..196 exceed the editable range. You may only edit lines 161–195.
[Current file: scikit-learn/custom_selective.py | editable: 161–195 | total: 339 lines]
161 class SelectivePolicy:
162 """Policy that maps calibration outputs to accept / defer decisions.
163
... (29 more lines) ...
193 return {
194 "threshold": float(self.threshold_),
195         }

Tool Input
1{
2 "content": "class SelectivePolicy:\n \"\"\"Improved policy with learned correctness score and balanced thresholding.\"\"\"\n\n def __init__(self, target_coverage: float = TARGET_COVERAGE_DEFAULT, random_state: int = 0):\n self.target_coverage = float(target_coverage)\n self.random_state = int(random_state)\n self.threshold_: float = 0.5\n self.group_thresholds_: dict[int, float] = {}\n self.meta_model_ = None\n self.strategy_name = \"mixed_learned_deferral\"\n\n def _get_features(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n p1 = probs[:, 1]\n p0 = probs[:, 0]\n max_prob = np.maximum(p0, p1)\n margin = np.abs(p1 - p0)\n eps = 1e-12\n entropy = -(p0 * np.log(np.clip(p0, eps, 1.0)) + p1 * np.log(np.clip(p1, eps, 1.0)))\n \n feats = [p1, max_prob, margin, entropy, groups.astype(float)]\n if X is not None:\n if X.shape[1] > 50:\n feats.append(X[:, :50])\n else:\n feats.append(X)\n return np.column_stack(feats)\n\n def fit(self, probs: np.ndarray, y_true: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> \"SelectivePolicy\":\n from sklearn.ensemble import RandomForestClassifier\n features = self._get_features(probs, groups, X)\n correct = (np.argmax(probs, axis=1) == y_true).astype(int)\n \n self.meta_model_ = RandomForestClassifier(\n n_estimators=100,\n max_depth=5,\n min_samples_leaf=2,\n class_weight=\"balanced\",\n random_state=self.random_state,\n )\n self.meta_model_.fit(features, correct)\n \n scores = self.meta_model_.predict_proba(features)[:, 1]\n quantile = float(np.clip(1.0 - self.target_coverage, 0.0, 1.0))\n self.threshold_ = float(np.quantile(scores, quantile))\n \n alpha = 0.5\n self.group_thresholds_ = {}\n for group_id in np.unique(groups):\n mask = groups == group_id\n if np.any(mask):\n group_t = float(np.quantile(scores[mask], quantile))\n self.group_thresholds_[int(group_id)] = alpha * self.threshold_ + (1 - alpha) * group_t\n \n return self\n\n def acceptance_score(self, probs: 
np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n if self.meta_model_ is None:\n return np.max(probs, axis=1)\n features = self._get_features(probs, groups, X)\n return self.meta_model_.predict_proba(features)[:, 1]\n\n def predict_accept(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n scores = self.acceptance_score(probs, groups, X)\n thresholds = np.asarray([self.group_thresholds_.get(int(group), self.threshold_) for group in groups], dtype=float)\n return scores >= thresholds\n\n def calibration_summary(self) -> dict[str, float]:\n summary = {\"threshold\": float(self.threshold_)}\n for group_id, threshold in self.group_thresholds_.items():\n summary[f\"threshold_group_{group_id}\"] = float(threshold)\n return summary",
3 "end_line": 195,
4 "filename": "scikit-learn/custom_selective.py",
5 "op": "replace",
6 "after_line": 160,
7 "start_line": 161
8}

OK: Replaced lines 161..195 with 71 line(s) in scikit-learn/custom_selective.py. Editable range: 161–231.
[Current file: scikit-learn/custom_selective.py | editable: 161–231 | total: 375 lines]
161 class SelectivePolicy:
162 """Improved policy with learned correctness score and balanced thresholding."""
163
... (65 more lines) ...
229 for group_id, threshold in self.group_thresholds_.items():
230 summary[f"threshold_group_{group_id}"] = float(threshold)
231             return summary

[Test #1] (2 tests remaining; call submit(n=N) to choose which test result to submit as final)

### breast_cancer (scripts/breast_cancer.sh)
Training progress:
TRAIN_METRICS: train_accuracy=0.991202 cal_accuracy=0.956140 cal_coverage=0.798246 policy_threshold=0.980706
Final metrics (breast_cancer): selective_risk_at80_breast_cancer=0.000000, coverage_at80_breast_cancer=0.719298, worst_group_selective_risk_breast_cancer=0.000000, deferral_rate_gap_breast_cancer=0.117647, auroc_breast_cancer=0.947727, target_coverage_breast_cancer=0.800000, actual_coverage_breast_cancer=0.719298, group_threshold_breast_cancer=13.270000

### diabetes (scripts/diabetes.sh)
Training progress:
TRAIN_METRICS: train_accuracy=0.750000 cal_accuracy=0.752809 cal_coverage=0.786517 policy_threshold=0.383176
Final metrics (diabetes): selective_risk_at80_diabetes=0.209877, coverage_at80_diabetes=0.910112, worst_group_selective_risk_diabetes=0.230769, deferral_rate_gap_diabetes=0.076220, auroc_diabetes=0.759104, targe
Tool Input
1{
2 "content": "class SelectivePolicy:\n \"\"\"Robust learned deferral balancing coverage gap and selective risk.\"\"\"\n\n def __init__(self, target_coverage: float = TARGET_COVERAGE_DEFAULT, random_state: int = 0):\n self.target_coverage = float(target_coverage)\n self.random_state = int(random_state)\n self.threshold_: float = 0.5\n self.group_thresholds_: dict[int, float] = {}\n self.meta_model_ = None\n self.strategy_name = \"unified_balanced_deferral\"\n\n def _get_features(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n p1 = probs[:, 1]\n p0 = probs[:, 0]\n max_prob = np.maximum(p0, p1)\n margin = np.abs(p1 - p0)\n eps = 1e-12\n entropy = -(p0 * np.log(np.clip(p0, eps, 1.0)) + p1 * np.log(np.clip(p1, eps, 1.0)))\n \n feats = [p1, max_prob, margin, entropy, groups.astype(float)]\n if X is not None:\n feats.append(X)\n return np.column_stack(feats)\n\n def fit(self, probs: np.ndarray, y_true: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> \"SelectivePolicy\":\n from sklearn.ensemble import HistGradientBoostingClassifier\n \n features = self._get_features(probs, groups, X)\n correct = (np.argmax(probs, axis=1) == y_true).astype(int)\n \n # Calculate sample weights to balance classes\n classes, counts = np.unique(correct, return_counts=True)\n # Avoid division by zero if all classes are the same\n class_weight = {c: len(correct) / (len(classes) * count) if count > 0 else 1.0 for c, count in zip(classes, counts)}\n sample_weight = np.array([class_weight[c] for c in correct])\n \n self.meta_model_ = HistGradientBoostingClassifier(\n max_iter=150,\n max_depth=5,\n l2_regularization=0.5,\n learning_rate=0.1,\n random_state=self.random_state,\n )\n self.meta_model_.fit(features, correct, sample_weight=sample_weight)\n \n scores = self.meta_model_.predict_proba(features)[:, 1]\n # Tie breaker using the base classifier's confidence\n scores = scores + 1e-9 * np.max(probs, axis=1)\n \n quantile = 
float(np.clip(1.0 - self.target_coverage, 0.0, 1.0))\n self.threshold_ = float(np.quantile(scores, quantile))\n \n self.group_thresholds_ = {}\n # Alpha parameter: 0.0 means global threshold (high coverage gap, low risk gap)\n # 1.0 means exactly equal coverage (0 gap, higher risk gap)\n alpha = 0.6\n \n for group_id in np.unique(groups):\n mask = groups == group_id\n if not np.any(mask):\n continue\n \n group_scores = scores[mask]\n cov_global = float(np.mean(group_scores >= self.threshold_))\n \n # Interpolate the TARGET COVERAGE for this group\n target_cov_g = alpha * self.target_coverage + (1.0 - alpha) * cov_global\n \n # Find the local threshold that hits target_cov_g\n q_g = float(np.clip(1.0 - target_cov_g, 0.0, 1.0))\n self.group_thresholds_[int(group_id)] = float(np.quantile(group_scores, q_g))\n \n return self\n\n def acceptance_score(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n if self.meta_model_ is None:\n return np.max(probs, axis=1)\n features = self._get_features(probs, groups, X)\n scores = self.meta_model_.predict_proba(features)[:, 1]\n scores = scores + 1e-9 * np.max(probs, axis=1)\n return scores\n\n def predict_accept(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n scores = self.acceptance_score(probs, groups, X)\n thresholds = np.asarray([self.group_thresholds_.get(int(group), self.threshold_) for group in groups], dtype=float)\n return scores >= thresholds\n\n def calibration_summary(self) -> dict[str, float]:\n summary = {\"threshold\": float(self.threshold_)}\n for group_id, threshold in self.group_thresholds_.items():\n summary[f\"threshold_group_{group_id}\"] = float(threshold)\n return summary",
3 "end_line": 231,
4 "start_line": 161,
5 "filename": "scikit-learn/custom_selective.py",
6 "op": "replace",
7 "after_line": 160
8}

OK: Replaced lines 161..231 with 92 line(s) in scikit-learn/custom_selective.py. Editable range: 161–252.
[Current file: scikit-learn/custom_selective.py | editable: 161–252 | total: 396 lines]
161 class SelectivePolicy:
162 """Robust learned deferral balancing coverage gap and selective risk."""
163
... (86 more lines) ...
250 for group_id, threshold in self.group_thresholds_.items():
251 summary[f"threshold_group_{group_id}"] = float(threshold)
252             return summary

[Test #2] (1 test remaining; call submit(n=N) to choose which test result to submit as final)

### breast_cancer (scripts/breast_cancer.sh)
Training progress:
TRAIN_METRICS: train_accuracy=0.991202 cal_accuracy=0.956140 cal_coverage=0.798246 policy_threshold=0.986469
Final metrics (breast_cancer): selective_risk_at80_breast_cancer=0.012658, coverage_at80_breast_cancer=0.692982, worst_group_selective_risk_breast_cancer=0.023256, deferral_rate_gap_breast_cancer=0.023343, auroc_breast_cancer=0.913636, target_coverage_breast_cancer=0.800000, actual_coverage_breast_cancer=0.692982, group_threshold_breast_cancer=13.270000

### diabetes (scripts/diabetes.sh)
Training progress:
TRAIN_METRICS: train_accuracy=0.750000 cal_accuracy=0.752809 cal_coverage=0.786517 policy_threshold=0.256343
Final metrics (diabetes): selective_risk_at80_diabetes=0.166667, coverage_at80_diabetes=0.808989, worst_group_selective_risk_diabetes=0.187500, deferral_rate_gap_diabetes=0.052846, auroc_diabetes=0.751401, target
Tool Input
1{
2 "op": "replace",
3 "end_line": 252,
4 "content": "class SelectivePolicy:\n \"\"\"Robust learned deferral balancing coverage gap and selective risk.\"\"\"\n\n def __init__(self, target_coverage: float = TARGET_COVERAGE_DEFAULT, random_state: int = 0):\n self.target_coverage = float(target_coverage)\n self.random_state = int(random_state)\n self.threshold_: float = 0.5\n self.group_thresholds_: dict[int, float] = {}\n self.meta_model_ = None\n self.strategy_name = \"unified_balanced_deferral\"\n\n def _get_features(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n p1 = probs[:, 1]\n p0 = probs[:, 0]\n max_prob = np.maximum(p0, p1)\n margin = np.abs(p1 - p0)\n eps = 1e-12\n entropy = -(p0 * np.log(np.clip(p0, eps, 1.0)) + p1 * np.log(np.clip(p1, eps, 1.0)))\n \n feats = [p1, max_prob, margin, entropy, groups.astype(float)]\n if X is not None:\n feats.append(X)\n return np.column_stack(feats)\n\n def fit(self, probs: np.ndarray, y_true: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> \"SelectivePolicy\":\n from sklearn.compose import ColumnTransformer\n from sklearn.preprocessing import PolynomialFeatures, StandardScaler\n from sklearn.pipeline import Pipeline\n from sklearn.linear_model import LogisticRegression\n \n features = self._get_features(probs, groups, X)\n correct = (np.argmax(probs, axis=1) == y_true).astype(int)\n \n conf_indices = [0, 1, 2, 3, 4] # p1, max_prob, margin, entropy, group\n x_indices = list(range(5, features.shape[1]))\n \n transformers = [\n ('conf', Pipeline([\n ('scale', StandardScaler()),\n ('poly', PolynomialFeatures(degree=2, include_bias=False))\n ]), conf_indices)\n ]\n if len(x_indices) > 0:\n transformers.append(('x', StandardScaler(), x_indices))\n \n preprocessor = ColumnTransformer(transformers=transformers, remainder='drop')\n \n self.meta_model_ = Pipeline([\n ('pre', preprocessor),\n ('clf', LogisticRegression(\n C=0.1,\n class_weight='balanced',\n solver='lbfgs',\n max_iter=1000,\n 
random_state=self.random_state\n ))\n ])\n \n self.meta_model_.fit(features, correct)\n scores = self.acceptance_score(probs, groups, X)\n \n quantile = float(np.clip(1.0 - self.target_coverage, 0.0, 1.0))\n self.threshold_ = float(np.quantile(scores, quantile))\n \n self.group_thresholds_ = {}\n alpha = 0.70 # Force gap smaller\n \n for group_id in np.unique(groups):\n mask = groups == group_id\n if not np.any(mask):\n continue\n \n group_scores = scores[mask]\n cov_global = float(np.mean(group_scores >= self.threshold_))\n target_cov_g = alpha * self.target_coverage + (1.0 - alpha) * cov_global\n \n q_g = float(np.clip(1.0 - target_cov_g, 0.0, 1.0))\n self.group_thresholds_[int(group_id)] = float(np.quantile(group_scores, q_g))\n \n return self\n\n def acceptance_score(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n if self.meta_model_ is None:\n return np.max(probs, axis=1)\n features = self._get_features(probs, groups, X)\n scores = self.meta_model_.predict_proba(features)[:, 1]\n scores = scores + 1e-9 * np.max(probs, axis=1)\n return scores\n\n def predict_accept(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n scores = self.acceptance_score(probs, groups, X)\n thresholds = np.asarray([self.group_thresholds_.get(int(group), self.threshold_) for group in groups], dtype=float)\n return scores >= thresholds\n\n def calibration_summary(self) -> dict[str, float]:\n summary = {\"threshold\": float(self.threshold_)}\n for group_id, threshold in self.group_thresholds_.items():\n summary[f\"threshold_group_{group_id}\"] = float(threshold)\n return summary",
5 "after_line": 160,
6 "start_line": 161,
7 "filename": "scikit-learn/custom_selective.py"
8}

OK: Replaced lines 161..252 with 99 line(s) in scikit-learn/custom_selective.py. Editable range: 161–259.
[Current file: scikit-learn/custom_selective.py | editable: 161–259 | total: 403 lines]
161 class SelectivePolicy:
162 """Robust learned deferral balancing coverage gap and selective risk."""
163
... (93 more lines) ...
257 for group_id, threshold in self.group_thresholds_.items():
258 summary[f"threshold_group_{group_id}"] = float(threshold)
259             return summary

[Test #3] (0 tests remaining; call submit(n=N) to choose which test result to submit as final)
[NOTE] This was your last test. You MUST now call submit(n=X) to choose which test result to submit as your final answer.

### breast_cancer (scripts/breast_cancer.sh)
Training progress:
TRAIN_METRICS: train_accuracy=0.991202 cal_accuracy=0.956140 cal_coverage=0.798246 policy_threshold=0.983487
Final metrics (breast_cancer): selective_risk_at80_breast_cancer=0.000000, coverage_at80_breast_cancer=0.736842, worst_group_selective_risk_breast_cancer=0.000000, deferral_rate_gap_breast_cancer=0.126984, auroc_breast_cancer=0.968182, target_coverage_breast_cancer=0.800000, actual_coverage_breast_cancer=0.736842, group_threshold_breast_cancer=13.270000

### diabetes (scripts/diabetes.sh)
Training progress:
TRAIN_METRICS: train_accuracy=0.750000 cal_accuracy=0.752809 cal_coverage=0.786517 policy_threshold=0.322932
Final metrics (diabetes): selective_risk_at80_diabetes=0.160000, coverage_at80_diabetes=
Tool Input
1{
2 "op": "replace",
3 "end_line": 259,
4 "start_line": 161,
5 "content": "class SelectivePolicy:\n \"\"\"Robust learned deferral balancing coverage gap and selective risk.\"\"\"\n\n def __init__(self, target_coverage: float = TARGET_COVERAGE_DEFAULT, random_state: int = 0):\n self.target_coverage = float(target_coverage)\n self.random_state = int(random_state)\n self.threshold_: float = 0.5\n self.group_thresholds_: dict[int, float] = {}\n self.meta_model_ = None\n self.strategy_name = \"unified_balanced_deferral\"\n\n def _get_features(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n p1 = probs[:, 1]\n p0 = probs[:, 0]\n max_prob = np.maximum(p0, p1)\n margin = np.abs(p1 - p0)\n eps = 1e-12\n entropy = -(p0 * np.log(np.clip(p0, eps, 1.0)) + p1 * np.log(np.clip(p1, eps, 1.0)))\n \n feats = [p1, max_prob, margin, entropy, groups.astype(float)]\n if X is not None:\n feats.append(X)\n return np.column_stack(feats)\n\n def fit(self, probs: np.ndarray, y_true: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> \"SelectivePolicy\":\n from sklearn.compose import ColumnTransformer\n from sklearn.preprocessing import PolynomialFeatures, StandardScaler\n from sklearn.pipeline import Pipeline\n from sklearn.linear_model import LogisticRegression\n from sklearn.kernel_approximation import Nystroem\n \n features = self._get_features(probs, groups, X)\n correct = (np.argmax(probs, axis=1) == y_true).astype(int)\n \n conf_indices = [0, 1, 2, 3, 4]\n x_indices = list(range(5, features.shape[1]))\n \n transformers = [\n ('conf', Pipeline([\n ('scale', StandardScaler()),\n ('poly', PolynomialFeatures(degree=2, include_bias=False))\n ]), conf_indices)\n ]\n if len(x_indices) > 0:\n transformers.append(('x', Pipeline([\n ('scale', StandardScaler()),\n # Adds non-linear capacity uniformly across datasets\n ('nys', Nystroem(n_components=100, random_state=self.random_state))\n ]), x_indices))\n \n preprocessor = ColumnTransformer(transformers=transformers, remainder='drop')\n \n 
self.meta_model_ = Pipeline([\n ('pre', preprocessor),\n ('clf', LogisticRegression(\n C=0.1,\n class_weight='balanced',\n solver='lbfgs',\n max_iter=1000,\n random_state=self.random_state\n ))\n ])\n \n self.meta_model_.fit(features, correct)\n scores = self.acceptance_score(probs, groups, X)\n \n quantile = float(np.clip(1.0 - self.target_coverage, 0.0, 1.0))\n self.threshold_ = float(np.quantile(scores, quantile))\n \n self.group_thresholds_ = {}\n alpha = 0.70\n \n for group_id in np.unique(groups):\n mask = groups == group_id\n if not np.any(mask):\n continue\n \n group_scores = scores[mask]\n cov_global = float(np.mean(group_scores >= self.threshold_))\n target_cov_g = alpha * self.target_coverage + (1.0 - alpha) * cov_global\n \n q_g = float(np.clip(1.0 - target_cov_g, 0.0, 1.0))\n self.group_thresholds_[int(group_id)] = float(np.quantile(group_scores, q_g))\n \n return self\n\n def acceptance_score(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n if self.meta_model_ is None:\n return np.max(probs, axis=1)\n features = self._get_features(probs, groups, X)\n scores = self.meta_model_.predict_proba(features)[:, 1]\n scores = scores + 1e-9 * np.max(probs, axis=1)\n return scores\n\n def predict_accept(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n scores = self.acceptance_score(probs, groups, X)\n thresholds = np.asarray([self.group_thresholds_.get(int(group), self.threshold_) for group in groups], dtype=float)\n return scores >= thresholds\n\n def calibration_summary(self) -> dict[str, float]:\n summary = {\"threshold\": float(self.threshold_)}\n for group_id, threshold in self.group_thresholds_.items():\n summary[f\"threshold_group_{group_id}\"] = float(threshold)\n return summary",
6 "filename": "scikit-learn/custom_selective.py",
7 "after_line": 160
8}

OK: Replaced lines 161..259 with 104 line(s) in scikit-learn/custom_selective.py. Editable range: 161–264.
[Current file: scikit-learn/custom_selective.py | editable: 161–264 | total: 408 lines]
161 class SelectivePolicy:
162 """Robust learned deferral balancing coverage gap and selective risk."""
163
... (98 more lines) ...
262 for group_id, threshold in self.group_thresholds_.items():
263 summary[f"threshold_group_{group_id}"] = float(threshold)
264             return summary

ERROR: Test budget exhausted (3/3). You MUST call submit(n=N) to choose which test result to submit as final.