Agent Conversation: google/gemini-3.1-pro-preview

Model: google/gemini-3.1-pro-preview
Total steps: 11
Messages: 24
UserStep 0

Task: ml-selective-deferral

Selective Deferral Under Subgroup Shift

Research Question

Design a practical selective prediction and deferral policy for high-stakes tabular decisions.

The task isolates one modular question: given a fixed base classifier, what acceptance / deferral rule best trades off selective risk, subgroup fairness, and overall discrimination?

Background

Selective prediction systems should be able to say "I do not know" when the classifier is uncertain. In high-stakes settings, that deferral can be handed to a human reviewer or a slower backup process. The benchmark studies whether a policy can:

  • keep selective risk low at a fixed target coverage,
  • avoid concentrating deferrals on one subgroup,
  • preserve AUROC as a confidence ranking signal, and
  • remain simple enough to run offline on modest compute.

Task

Modify the SelectivePolicy class in custom_selective.py. The rest of the pipeline is fixed: dataset loading, train / calibration / test splitting, base model training, and metric computation.

The policy receives calibration-time base-model probabilities and subgroup labels, then decides whether each test example should be accepted or deferred. You may implement a single global threshold, a learned deferral score, subgroup-specific thresholds, or any other compact policy that fits the interface.

Evaluation

The benchmark runs on four offline tabular proxies from scikit-learn:

  • breast_cancer
  • diabetes (binarized around the training-set median)
  • california (binarized around the training-set median)
  • madelon

Each dataset is split into train / calibration / test partitions. Subgroups are formed from a stable feature threshold so that worst-group behavior can be measured.

Metrics:

  • selective_risk_at80: classification error on accepted examples at 80% target coverage
  • worst_group_selective_risk: worst subgroup error on accepted examples
  • deferral_rate_gap: max subgroup deferral rate minus min subgroup deferral rate
  • auroc: AUROC of the acceptance score for predicting correctness

Baselines

  • confidence_thresholding: tune one confidence threshold to hit the target coverage
  • conformal_abstention: split-conformal abstention with a coverage target
  • learned_deferral: train a compact meta-model that predicts whether the base model will be correct
  • groupwise_thresholding: subgroup-specific thresholds as a stronger reference baseline

Practical Notes

This task intentionally uses datasets that are already available offline in the repository's scikit-learn package setup, so no shared package edit is required.

If you want the exact adult / ACSIncome / COMPAS / Law School datasets from the prompt, that would require a shared package-data follow-up, which I have not implemented here.

scikit-learn/custom_selective.py [EDITABLE — lines 161–195 only]

     1: """Selective prediction / deferral benchmark.
     2: 
     3: Fixed:
     4: - offline dataset loa
AssistantStep 1edit
Tool Input
1{
2  "filename": "scikit-learn/custom_selective.py",
3  "end_line": 196,
4  "op": "replace",
5  "after_line": 160,
6  "start_line": 161,
7  "content": "class SelectivePolicy:\n    \"\"\"Improved policy with learned correctness score and balanced thresholding.\"\"\"\n\n    def __init__(self, target_coverage: float = TARGET_COVERAGE_DEFAULT, random_state: int = 0):\n        self.target_coverage = float(target_coverage)\n        self.random_state = int(random_state)\n        self.threshold_: float = 0.5\n        self.group_thresholds_: dict[int, float] = {}\n        self.meta_model_ = None\n        self.strategy_name = \"mixed_learned_deferral\"\n\n    def _get_features(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n        p1 = probs[:, 1]\n        p0 = probs[:, 0]\n        max_prob = np.maximum(p0, p1)\n        margin = np.abs(p1 - p0)\n        eps = 1e-12\n        entropy = -(p0 * np.log(np.clip(p0, eps, 1.0)) + p1 * np.log(np.clip(p1, eps, 1.0)))\n        \n        feats = [p1, max_prob, margin, entropy, groups.astype(float)]\n        if X is not None:\n            # Prevent dimension blowup on small calibration set\n            if X.shape[1] > 50:\n                # Basic PCA or Random Projection-like subset without external fit\n                feats.append(X[:, :50])\n            else:\n                feats.append(X)\n        return np.column_stack(feats)\n\n    def fit(self, probs: np.ndarray, y_true: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> \"SelectivePolicy\":\n        from sklearn.ensemble import RandomForestClassifier\n        features = self._get_features(probs, groups, X)\n        correct = (np.argmax(probs, axis=1) == y_true).astype(int)\n        \n        self.meta_model_ = RandomForestClassifier(\n            n_estimators=100,\n            max_depth=5,\n            min_samples_leaf=2,\n            class_weight=\"balanced\",\n            random_state=self.random_state,\n        )\n        self.meta_model_.fit(features, correct)\n        \n        scores = self.meta_model_.predict_proba(features)[:, 1]\n        quantile = float(np.clip(1.0 - self.target_coverage, 0.0, 1.0))\n        self.threshold_ = float(np.quantile(scores, quantile))\n        \n        alpha = 0.5\n        self.group_thresholds_ = {}\n        for group_id in np.unique(groups):\n            mask = groups == group_id\n            if np.any(mask):\n                group_t = float(np.quantile(scores[mask], quantile))\n                self.group_thresholds_[int(group_id)] = alpha * self.threshold_ + (1 - alpha) * group_t\n                \n        return self\n\n    def acceptance_score(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n        if self.meta_model_ is None:\n            return np.max(probs, axis=1)\n        features = self._get_features(probs, groups, X)\n        return self.meta_model_.predict_proba(features)[:, 1]\n\n    def predict_accept(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n        scores = self.acceptance_score(probs, groups, X)\n        thresholds = np.asarray([self.group_thresholds_.get(int(group), self.threshold_) for group in groups], dtype=float)\n        return scores >= thresholds\n\n    def calibration_summary(self) -> dict[str, float]:\n        summary = {\"threshold\": float(self.threshold_)}\n        for group_id, threshold in self.group_thresholds_.items():\n            summary[f\"threshold_group_{group_id}\"] = float(threshold)\n        return summary\n"
8}
Tool ResultStep 1
ERROR: Lines 161..196 exceed the editable range. You may only edit lines 161–195.

[Current file: scikit-learn/custom_selective.py | editable: 161–195 | total: 339 lines]
   161  class SelectivePolicy:
   162      """Policy that maps calibration outputs to accept / defer decisions.
   163  
       ... (29 more lines) ...
   193          return {
   194              "threshold": float(self.threshold_),
   195          }
AssistantStep 2edit
Tool Input
1{
2  "content": "class SelectivePolicy:\n    \"\"\"Improved policy with learned correctness score and balanced thresholding.\"\"\"\n\n    def __init__(self, target_coverage: float = TARGET_COVERAGE_DEFAULT, random_state: int = 0):\n        self.target_coverage = float(target_coverage)\n        self.random_state = int(random_state)\n        self.threshold_: float = 0.5\n        self.group_thresholds_: dict[int, float] = {}\n        self.meta_model_ = None\n        self.strategy_name = \"mixed_learned_deferral\"\n\n    def _get_features(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n        p1 = probs[:, 1]\n        p0 = probs[:, 0]\n        max_prob = np.maximum(p0, p1)\n        margin = np.abs(p1 - p0)\n        eps = 1e-12\n        entropy = -(p0 * np.log(np.clip(p0, eps, 1.0)) + p1 * np.log(np.clip(p1, eps, 1.0)))\n        \n        feats = [p1, max_prob, margin, entropy, groups.astype(float)]\n        if X is not None:\n            if X.shape[1] > 50:\n                feats.append(X[:, :50])\n            else:\n                feats.append(X)\n        return np.column_stack(feats)\n\n    def fit(self, probs: np.ndarray, y_true: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> \"SelectivePolicy\":\n        from sklearn.ensemble import RandomForestClassifier\n        features = self._get_features(probs, groups, X)\n        correct = (np.argmax(probs, axis=1) == y_true).astype(int)\n        \n        self.meta_model_ = RandomForestClassifier(\n            n_estimators=100,\n            max_depth=5,\n            min_samples_leaf=2,\n            class_weight=\"balanced\",\n            random_state=self.random_state,\n        )\n        self.meta_model_.fit(features, correct)\n        \n        scores = self.meta_model_.predict_proba(features)[:, 1]\n        quantile = float(np.clip(1.0 - self.target_coverage, 0.0, 1.0))\n        self.threshold_ = float(np.quantile(scores, quantile))\n        \n        alpha = 0.5\n        self.group_thresholds_ = {}\n        for group_id in np.unique(groups):\n            mask = groups == group_id\n            if np.any(mask):\n                group_t = float(np.quantile(scores[mask], quantile))\n                self.group_thresholds_[int(group_id)] = alpha * self.threshold_ + (1 - alpha) * group_t\n                \n        return self\n\n    def acceptance_score(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n        if self.meta_model_ is None:\n            return np.max(probs, axis=1)\n        features = self._get_features(probs, groups, X)\n        return self.meta_model_.predict_proba(features)[:, 1]\n\n    def predict_accept(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n        scores = self.acceptance_score(probs, groups, X)\n        thresholds = np.asarray([self.group_thresholds_.get(int(group), self.threshold_) for group in groups], dtype=float)\n        return scores >= thresholds\n\n    def calibration_summary(self) -> dict[str, float]:\n        summary = {\"threshold\": float(self.threshold_)}\n        for group_id, threshold in self.group_thresholds_.items():\n            summary[f\"threshold_group_{group_id}\"] = float(threshold)\n        return summary",
3  "end_line": 195,
4  "filename": "scikit-learn/custom_selective.py",
5  "op": "replace",
6  "after_line": 160,
7  "start_line": 161
8}
Tool ResultStep 2
OK: Replaced lines 161..195 with 71 line(s) in scikit-learn/custom_selective.py. Editable range: 161–231.

[Current file: scikit-learn/custom_selective.py | editable: 161–231 | total: 375 lines]
   161  class SelectivePolicy:
   162      """Improved policy with learned correctness score and balanced thresholding."""
   163  
       ... (65 more lines) ...
   229          for group_id, threshold in self.group_thresholds_.items():
   230              summary[f"threshold_group_{group_id}"] = float(threshold)
   231          return summary
AssistantStep 3test
Tool ResultStep 3
[Test #1] (2 tests remaining; call submit(n=N) to choose which test result to submit as final)

### breast_cancer (scripts/breast_cancer.sh)
Training progress:
TRAIN_METRICS: train_accuracy=0.991202 cal_accuracy=0.956140 cal_coverage=0.798246 policy_threshold=0.980706
Final metrics (breast_cancer): selective_risk_at80_breast_cancer=0.000000, coverage_at80_breast_cancer=0.719298, worst_group_selective_risk_breast_cancer=0.000000, deferral_rate_gap_breast_cancer=0.117647, auroc_breast_cancer=0.947727, target_coverage_breast_cancer=0.800000, actual_coverage_breast_cancer=0.719298, group_threshold_breast_cancer=13.270000

### diabetes (scripts/diabetes.sh)
Training progress:
TRAIN_METRICS: train_accuracy=0.750000 cal_accuracy=0.752809 cal_coverage=0.786517 policy_threshold=0.383176
Final metrics (diabetes): selective_risk_at80_diabetes=0.209877, coverage_at80_diabetes=0.910112, worst_group_selective_risk_diabetes=0.230769, deferral_rate_gap_diabetes=0.076220, auroc_diabetes=0.759104, targe
AssistantStep 4edit
Tool Input
1{
2  "content": "class SelectivePolicy:\n    \"\"\"Robust learned deferral balancing coverage gap and selective risk.\"\"\"\n\n    def __init__(self, target_coverage: float = TARGET_COVERAGE_DEFAULT, random_state: int = 0):\n        self.target_coverage = float(target_coverage)\n        self.random_state = int(random_state)\n        self.threshold_: float = 0.5\n        self.group_thresholds_: dict[int, float] = {}\n        self.meta_model_ = None\n        self.strategy_name = \"unified_balanced_deferral\"\n\n    def _get_features(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n        p1 = probs[:, 1]\n        p0 = probs[:, 0]\n        max_prob = np.maximum(p0, p1)\n        margin = np.abs(p1 - p0)\n        eps = 1e-12\n        entropy = -(p0 * np.log(np.clip(p0, eps, 1.0)) + p1 * np.log(np.clip(p1, eps, 1.0)))\n        \n        feats = [p1, max_prob, margin, entropy, groups.astype(float)]\n        if X is not None:\n            feats.append(X)\n        return np.column_stack(feats)\n\n    def fit(self, probs: np.ndarray, y_true: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> \"SelectivePolicy\":\n        from sklearn.ensemble import HistGradientBoostingClassifier\n        \n        features = self._get_features(probs, groups, X)\n        correct = (np.argmax(probs, axis=1) == y_true).astype(int)\n        \n        # Calculate sample weights to balance classes\n        classes, counts = np.unique(correct, return_counts=True)\n        # Avoid division by zero if all classes are the same\n        class_weight = {c: len(correct) / (len(classes) * count) if count > 0 else 1.0 for c, count in zip(classes, counts)}\n        sample_weight = np.array([class_weight[c] for c in correct])\n        \n        self.meta_model_ = HistGradientBoostingClassifier(\n            max_iter=150,\n            max_depth=5,\n            l2_regularization=0.5,\n            learning_rate=0.1,\n            random_state=self.random_state,\n        )\n        self.meta_model_.fit(features, correct, sample_weight=sample_weight)\n        \n        scores = self.meta_model_.predict_proba(features)[:, 1]\n        # Tie breaker using the base classifier's confidence\n        scores = scores + 1e-9 * np.max(probs, axis=1)\n        \n        quantile = float(np.clip(1.0 - self.target_coverage, 0.0, 1.0))\n        self.threshold_ = float(np.quantile(scores, quantile))\n        \n        self.group_thresholds_ = {}\n        # Alpha parameter: 0.0 means global threshold (high coverage gap, low risk gap)\n        #                  1.0 means exactly equal coverage (0 gap, higher risk gap)\n        alpha = 0.6\n        \n        for group_id in np.unique(groups):\n            mask = groups == group_id\n            if not np.any(mask):\n                continue\n            \n            group_scores = scores[mask]\n            cov_global = float(np.mean(group_scores >= self.threshold_))\n            \n            # Interpolate the TARGET COVERAGE for this group\n            target_cov_g = alpha * self.target_coverage + (1.0 - alpha) * cov_global\n            \n            # Find the local threshold that hits target_cov_g\n            q_g = float(np.clip(1.0 - target_cov_g, 0.0, 1.0))\n            self.group_thresholds_[int(group_id)] = float(np.quantile(group_scores, q_g))\n                \n        return self\n\n    def acceptance_score(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n        if self.meta_model_ is None:\n            return np.max(probs, axis=1)\n        features = self._get_features(probs, groups, X)\n        scores = self.meta_model_.predict_proba(features)[:, 1]\n        scores = scores + 1e-9 * np.max(probs, axis=1)\n        return scores\n\n    def predict_accept(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n        scores = self.acceptance_score(probs, groups, X)\n        thresholds = np.asarray([self.group_thresholds_.get(int(group), self.threshold_) for group in groups], dtype=float)\n        return scores >= thresholds\n\n    def calibration_summary(self) -> dict[str, float]:\n        summary = {\"threshold\": float(self.threshold_)}\n        for group_id, threshold in self.group_thresholds_.items():\n            summary[f\"threshold_group_{group_id}\"] = float(threshold)\n        return summary",
3  "end_line": 231,
4  "start_line": 161,
5  "filename": "scikit-learn/custom_selective.py",
6  "op": "replace",
7  "after_line": 160
8}
Tool ResultStep 4
OK: Replaced lines 161..231 with 92 line(s) in scikit-learn/custom_selective.py. Editable range: 161–252.

[Current file: scikit-learn/custom_selective.py | editable: 161–252 | total: 396 lines]
   161  class SelectivePolicy:
   162      """Robust learned deferral balancing coverage gap and selective risk."""
   163  
       ... (86 more lines) ...
   250          for group_id, threshold in self.group_thresholds_.items():
   251              summary[f"threshold_group_{group_id}"] = float(threshold)
   252          return summary
AssistantStep 5test
Tool ResultStep 5
[Test #2] (1 test remaining; call submit(n=N) to choose which test result to submit as final)

### breast_cancer (scripts/breast_cancer.sh)
Training progress:
TRAIN_METRICS: train_accuracy=0.991202 cal_accuracy=0.956140 cal_coverage=0.798246 policy_threshold=0.986469
Final metrics (breast_cancer): selective_risk_at80_breast_cancer=0.012658, coverage_at80_breast_cancer=0.692982, worst_group_selective_risk_breast_cancer=0.023256, deferral_rate_gap_breast_cancer=0.023343, auroc_breast_cancer=0.913636, target_coverage_breast_cancer=0.800000, actual_coverage_breast_cancer=0.692982, group_threshold_breast_cancer=13.270000

### diabetes (scripts/diabetes.sh)
Training progress:
TRAIN_METRICS: train_accuracy=0.750000 cal_accuracy=0.752809 cal_coverage=0.786517 policy_threshold=0.256343
Final metrics (diabetes): selective_risk_at80_diabetes=0.166667, coverage_at80_diabetes=0.808989, worst_group_selective_risk_diabetes=0.187500, deferral_rate_gap_diabetes=0.052846, auroc_diabetes=0.751401, target
AssistantStep 6edit
Tool Input
1{
2  "op": "replace",
3  "end_line": 252,
4  "content": "class SelectivePolicy:\n    \"\"\"Robust learned deferral balancing coverage gap and selective risk.\"\"\"\n\n    def __init__(self, target_coverage: float = TARGET_COVERAGE_DEFAULT, random_state: int = 0):\n        self.target_coverage = float(target_coverage)\n        self.random_state = int(random_state)\n        self.threshold_: float = 0.5\n        self.group_thresholds_: dict[int, float] = {}\n        self.meta_model_ = None\n        self.strategy_name = \"unified_balanced_deferral\"\n\n    def _get_features(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n        p1 = probs[:, 1]\n        p0 = probs[:, 0]\n        max_prob = np.maximum(p0, p1)\n        margin = np.abs(p1 - p0)\n        eps = 1e-12\n        entropy = -(p0 * np.log(np.clip(p0, eps, 1.0)) + p1 * np.log(np.clip(p1, eps, 1.0)))\n        \n        feats = [p1, max_prob, margin, entropy, groups.astype(float)]\n        if X is not None:\n            feats.append(X)\n        return np.column_stack(feats)\n\n    def fit(self, probs: np.ndarray, y_true: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> \"SelectivePolicy\":\n        from sklearn.compose import ColumnTransformer\n        from sklearn.preprocessing import PolynomialFeatures, StandardScaler\n        from sklearn.pipeline import Pipeline\n        from sklearn.linear_model import LogisticRegression\n        \n        features = self._get_features(probs, groups, X)\n        correct = (np.argmax(probs, axis=1) == y_true).astype(int)\n        \n        conf_indices = [0, 1, 2, 3, 4] # p1, max_prob, margin, entropy, group\n        x_indices = list(range(5, features.shape[1]))\n        \n        transformers = [\n            ('conf', Pipeline([\n                ('scale', StandardScaler()),\n                ('poly', PolynomialFeatures(degree=2, include_bias=False))\n            ]), conf_indices)\n        ]\n        if len(x_indices) > 0:\n            transformers.append(('x', StandardScaler(), x_indices))\n            \n        preprocessor = ColumnTransformer(transformers=transformers, remainder='drop')\n        \n        self.meta_model_ = Pipeline([\n            ('pre', preprocessor),\n            ('clf', LogisticRegression(\n                C=0.1,\n                class_weight='balanced',\n                solver='lbfgs',\n                max_iter=1000,\n                random_state=self.random_state\n            ))\n        ])\n        \n        self.meta_model_.fit(features, correct)\n        scores = self.acceptance_score(probs, groups, X)\n        \n        quantile = float(np.clip(1.0 - self.target_coverage, 0.0, 1.0))\n        self.threshold_ = float(np.quantile(scores, quantile))\n        \n        self.group_thresholds_ = {}\n        alpha = 0.70  # Force gap smaller\n        \n        for group_id in np.unique(groups):\n            mask = groups == group_id\n            if not np.any(mask):\n                continue\n            \n            group_scores = scores[mask]\n            cov_global = float(np.mean(group_scores >= self.threshold_))\n            target_cov_g = alpha * self.target_coverage + (1.0 - alpha) * cov_global\n            \n            q_g = float(np.clip(1.0 - target_cov_g, 0.0, 1.0))\n            self.group_thresholds_[int(group_id)] = float(np.quantile(group_scores, q_g))\n                \n        return self\n\n    def acceptance_score(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n        if self.meta_model_ is None:\n            return np.max(probs, axis=1)\n        features = self._get_features(probs, groups, X)\n        scores = self.meta_model_.predict_proba(features)[:, 1]\n        scores = scores + 1e-9 * np.max(probs, axis=1)\n        return scores\n\n    def predict_accept(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n        scores = self.acceptance_score(probs, groups, X)\n        thresholds = np.asarray([self.group_thresholds_.get(int(group), self.threshold_) for group in groups], dtype=float)\n        return scores >= thresholds\n\n    def calibration_summary(self) -> dict[str, float]:\n        summary = {\"threshold\": float(self.threshold_)}\n        for group_id, threshold in self.group_thresholds_.items():\n            summary[f\"threshold_group_{group_id}\"] = float(threshold)\n        return summary",
5  "after_line": 160,
6  "start_line": 161,
7  "filename": "scikit-learn/custom_selective.py"
8}
Tool ResultStep 6
OK: Replaced lines 161..252 with 99 line(s) in scikit-learn/custom_selective.py. Editable range: 161–259.

[Current file: scikit-learn/custom_selective.py | editable: 161–259 | total: 403 lines]
   161  class SelectivePolicy:
   162      """Robust learned deferral balancing coverage gap and selective risk."""
   163  
       ... (93 more lines) ...
   257          for group_id, threshold in self.group_thresholds_.items():
   258              summary[f"threshold_group_{group_id}"] = float(threshold)
   259          return summary
AssistantStep 7test
Tool ResultStep 7
[Test #3] (0 tests remaining; call submit(n=N) to choose which test result to submit as final)

[NOTE] This was your last test. You MUST now call submit(n=X) to choose which test result to submit as your final answer.

### breast_cancer (scripts/breast_cancer.sh)
Training progress:
TRAIN_METRICS: train_accuracy=0.991202 cal_accuracy=0.956140 cal_coverage=0.798246 policy_threshold=0.983487
Final metrics (breast_cancer): selective_risk_at80_breast_cancer=0.000000, coverage_at80_breast_cancer=0.736842, worst_group_selective_risk_breast_cancer=0.000000, deferral_rate_gap_breast_cancer=0.126984, auroc_breast_cancer=0.968182, target_coverage_breast_cancer=0.800000, actual_coverage_breast_cancer=0.736842, group_threshold_breast_cancer=13.270000

### diabetes (scripts/diabetes.sh)
Training progress:
TRAIN_METRICS: train_accuracy=0.750000 cal_accuracy=0.752809 cal_coverage=0.786517 policy_threshold=0.322932
Final metrics (diabetes): selective_risk_at80_diabetes=0.160000, coverage_at80_diabetes=
AssistantStep 8edit
Tool Input
1{
2  "op": "replace",
3  "end_line": 259,
4  "start_line": 161,
5  "content": "class SelectivePolicy:\n    \"\"\"Robust learned deferral balancing coverage gap and selective risk.\"\"\"\n\n    def __init__(self, target_coverage: float = TARGET_COVERAGE_DEFAULT, random_state: int = 0):\n        self.target_coverage = float(target_coverage)\n        self.random_state = int(random_state)\n        self.threshold_: float = 0.5\n        self.group_thresholds_: dict[int, float] = {}\n        self.meta_model_ = None\n        self.strategy_name = \"unified_balanced_deferral\"\n\n    def _get_features(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n        p1 = probs[:, 1]\n        p0 = probs[:, 0]\n        max_prob = np.maximum(p0, p1)\n        margin = np.abs(p1 - p0)\n        eps = 1e-12\n        entropy = -(p0 * np.log(np.clip(p0, eps, 1.0)) + p1 * np.log(np.clip(p1, eps, 1.0)))\n        \n        feats = [p1, max_prob, margin, entropy, groups.astype(float)]\n        if X is not None:\n            feats.append(X)\n        return np.column_stack(feats)\n\n    def fit(self, probs: np.ndarray, y_true: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> \"SelectivePolicy\":\n        from sklearn.compose import ColumnTransformer\n        from sklearn.preprocessing import PolynomialFeatures, StandardScaler\n        from sklearn.pipeline import Pipeline\n        from sklearn.linear_model import LogisticRegression\n        from sklearn.kernel_approximation import Nystroem\n        \n        features = self._get_features(probs, groups, X)\n        correct = (np.argmax(probs, axis=1) == y_true).astype(int)\n        \n        conf_indices = [0, 1, 2, 3, 4]\n        x_indices = list(range(5, features.shape[1]))\n        \n        transformers = [\n            ('conf', Pipeline([\n                ('scale', StandardScaler()),\n                ('poly', PolynomialFeatures(degree=2, include_bias=False))\n            ]), conf_indices)\n        ]\n        if len(x_indices) > 0:\n            transformers.append(('x', Pipeline([\n                ('scale', StandardScaler()),\n                # Adds non-linear capacity uniformly across datasets\n                ('nys', Nystroem(n_components=100, random_state=self.random_state))\n            ]), x_indices))\n            \n        preprocessor = ColumnTransformer(transformers=transformers, remainder='drop')\n        \n        self.meta_model_ = Pipeline([\n            ('pre', preprocessor),\n            ('clf', LogisticRegression(\n                C=0.1,\n                class_weight='balanced',\n                solver='lbfgs',\n                max_iter=1000,\n                random_state=self.random_state\n            ))\n        ])\n        \n        self.meta_model_.fit(features, correct)\n        scores = self.acceptance_score(probs, groups, X)\n        \n        quantile = float(np.clip(1.0 - self.target_coverage, 0.0, 1.0))\n        self.threshold_ = float(np.quantile(scores, quantile))\n        \n        self.group_thresholds_ = {}\n        alpha = 0.70\n        \n        for group_id in np.unique(groups):\n            mask = groups == group_id\n            if not np.any(mask):\n                continue\n            \n            group_scores = scores[mask]\n            cov_global = float(np.mean(group_scores >= self.threshold_))\n            target_cov_g = alpha * self.target_coverage + (1.0 - alpha) * cov_global\n            \n            q_g = float(np.clip(1.0 - target_cov_g, 0.0, 1.0))\n            self.group_thresholds_[int(group_id)] = float(np.quantile(group_scores, q_g))\n                \n        return self\n\n    def acceptance_score(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n        if self.meta_model_ is None:\n            return np.max(probs, axis=1)\n        features = self._get_features(probs, groups, X)\n        scores = self.meta_model_.predict_proba(features)[:, 1]\n        scores = scores + 1e-9 * np.max(probs, axis=1)\n        return scores\n\n    def predict_accept(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n        scores = self.acceptance_score(probs, groups, X)\n        thresholds = np.asarray([self.group_thresholds_.get(int(group), self.threshold_) for group in groups], dtype=float)\n        return scores >= thresholds\n\n    def calibration_summary(self) -> dict[str, float]:\n        summary = {\"threshold\": float(self.threshold_)}\n        for group_id, threshold in self.group_thresholds_.items():\n            summary[f\"threshold_group_{group_id}\"] = float(threshold)\n        return summary",
6  "filename": "scikit-learn/custom_selective.py",
7  "after_line": 160
8}
Tool ResultStep 8
OK: Replaced lines 161..259 with 104 line(s) in scikit-learn/custom_selective.py. Editable range: 161–264.

[Current file: scikit-learn/custom_selective.py | editable: 161–264 | total: 408 lines]
   161  class SelectivePolicy:
   162      """Robust learned deferral balancing coverage gap and selective risk."""
   163  
       ... (98 more lines) ...
   262          for group_id, threshold in self.group_thresholds_.items():
   263              summary[f"threshold_group_{group_id}"] = float(threshold)
   264          return summary
AssistantStep 9test
Tool ResultStep 9
ERROR: Test budget exhausted (3/3). You MUST call submit(n=N) to choose which test result to submit as final.