Agent Conversation: google/gemini-3.1-pro-preview

Model: google/gemini-3.1-pro-preview

Total steps: 11

Messages: 24

UserStep 0

Task: ml-selective-deferral

Selective Deferral Under Subgroup Shift

Research Question

Design a practical selective prediction and deferral policy for high-stakes tabular decisions.

The task isolates one modular question: given a fixed base classifier, what acceptance / deferral rule best trades off selective risk, subgroup fairness, and overall discrimination?

Background

Selective prediction systems should be able to say "I do not know" when the classifier is uncertain. In high-stakes settings, that deferral can be handed to a human reviewer or a slower backup process. The benchmark studies whether a policy can:

keep selective risk low at a fixed target coverage,
avoid concentrating deferrals on one subgroup,
preserve AUROC as a confidence ranking signal, and
remain simple enough to run offline on modest compute.

Task

Modify the SelectivePolicy class in custom_selective.py. The rest of the pipeline is fixed: dataset loading, train / calibration / test splitting, base model training, and metric computation.

The policy receives calibration-time base-model probabilities and subgroup labels, then decides whether each test example should be accepted or deferred. You may implement a single global threshold, a learned deferral score, subgroup-specific thresholds, or any other compact policy that fits the interface.

Evaluation

The benchmark runs on four offline tabular proxies from scikit-learn:

breast_cancer
diabetes (binarized around the training-set median)
california (binarized around the training-set median)
madelon

Each dataset is split into train / calibration / test partitions. Subgroups are formed from a stable feature threshold so that worst-group behavior can be measured.

Metrics:

selective_risk_at80: classification error on accepted examples at 80% target coverage
worst_group_selective_risk: worst subgroup error on accepted examples
deferral_rate_gap: max subgroup deferral rate minus min subgroup deferral rate
auroc: AUROC of the acceptance score for predicting correctness

Baselines

confidence_thresholding: tune one confidence threshold to hit the target coverage
conformal_abstention: split-conformal abstention with a coverage target
learned_deferral: train a compact meta-model that predicts whether the base model will be correct
groupwise_thresholding: subgroup-specific thresholds as a stronger reference baseline

Practical Notes

This task intentionally uses datasets that are already available offline in the repository's scikit-learn package setup, so no shared package edit is required.

If you want the exact adult / ACSIncome / COMPAS / Law School datasets from the prompt, that would require a shared package-data follow-up, which I have not implemented here.

scikit-learn/custom_selective.py [EDITABLE — lines 161–195 only]

     1: """Selective prediction / deferral benchmark.
     2: 
     3: Fixed:
     4: - offline dataset loa

AssistantStep 1edit

Tool Input

1{
2  "filename": "scikit-learn/custom_selective.py",
3  "end_line": 196,
4  "op": "replace",
5  "after_line": 160,
6  "start_line": 161,
7  "content": "class SelectivePolicy:\n    \"\"\"Improved policy with learned correctness score and balanced thresholding.\"\"\"\n\n    def __init__(self, target_coverage: float = TARGET_COVERAGE_DEFAULT, random_state: int = 0):\n        self.target_coverage = float(target_coverage)\n        self.random_state = int(random_state)\n        self.threshold_: float = 0.5\n        self.group_thresholds_: dict[int, float] = {}\n        self.meta_model_ = None\n        self.strategy_name = \"mixed_learned_deferral\"\n\n    def _get_features(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n        p1 = probs[:, 1]\n        p0 = probs[:, 0]\n        max_prob = np.maximum(p0, p1)\n        margin = np.abs(p1 - p0)\n        eps = 1e-12\n        entropy = -(p0 * np.log(np.clip(p0, eps, 1.0)) + p1 * np.log(np.clip(p1, eps, 1.0)))\n        \n        feats = [p1, max_prob, margin, entropy, groups.astype(float)]\n        if X is not None:\n            # Prevent dimension blowup on small calibration set\n            if X.shape[1] > 50:\n                # Basic PCA or Random Projection-like subset without external fit\n                feats.append(X[:, :50])\n            else:\n                feats.append(X)\n        return np.column_stack(feats)\n\n    def fit(self, probs: np.ndarray, y_true: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> \"SelectivePolicy\":\n        from sklearn.ensemble import RandomForestClassifier\n        features = self._get_features(probs, groups, X)\n        correct = (np.argmax(probs, axis=1) == y_true).astype(int)\n        \n        self.meta_model_ = RandomForestClassifier(\n            n_estimators=100,\n            max_depth=5,\n            min_samples_leaf=2,\n            class_weight=\"balanced\",\n            random_state=self.random_state,\n        )\n        self.meta_model_.fit(features, correct)\n        \n        scores = self.meta_model_.predict_proba(features)[:, 1]\n        quantile = float(np.clip(1.0 - self.target_coverage, 0.0, 1.0))\n        self.threshold_ = float(np.quantile(scores, quantile))\n        \n        alpha = 0.5\n        self.group_thresholds_ = {}\n        for group_id in np.unique(groups):\n            mask = groups == group_id\n            if np.any(mask):\n                group_t = float(np.quantile(scores[mask], quantile))\n                self.group_thresholds_[int(group_id)] = alpha * self.threshold_ + (1 - alpha) * group_t\n                \n        return self\n\n    def acceptance_score(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n        if self.meta_model_ is None:\n            return np.max(probs, axis=1)\n        features = self._get_features(probs, groups, X)\n        return self.meta_model_.predict_proba(features)[:, 1]\n\n    def predict_accept(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n        scores = self.acceptance_score(probs, groups, X)\n        thresholds = np.asarray([self.group_thresholds_.get(int(group), self.threshold_) for group in groups], dtype=float)\n        return scores >= thresholds\n\n    def calibration_summary(self) -> dict[str, float]:\n        summary = {\"threshold\": float(self.threshold_)}\n        for group_id, threshold in self.group_thresholds_.items():\n            summary[f\"threshold_group_{group_id}\"] = float(threshold)\n        return summary\n"
8}

Tool ResultStep 1

ERROR: Lines 161..196 exceed the editable range. You may only edit lines 161–195.

[Current file: scikit-learn/custom_selective.py | editable: 161–195 | total: 339 lines]
   161  class SelectivePolicy:
   162      """Policy that maps calibration outputs to accept / defer decisions.
   163  
       ... (29 more lines) ...
   193          return {
   194              "threshold": float(self.threshold_),
   195          }

AssistantStep 2edit

Tool Input

1{
2  "content": "class SelectivePolicy:\n    \"\"\"Improved policy with learned correctness score and balanced thresholding.\"\"\"\n\n    def __init__(self, target_coverage: float = TARGET_COVERAGE_DEFAULT, random_state: int = 0):\n        self.target_coverage = float(target_coverage)\n        self.random_state = int(random_state)\n        self.threshold_: float = 0.5\n        self.group_thresholds_: dict[int, float] = {}\n        self.meta_model_ = None\n        self.strategy_name = \"mixed_learned_deferral\"\n\n    def _get_features(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n        p1 = probs[:, 1]\n        p0 = probs[:, 0]\n        max_prob = np.maximum(p0, p1)\n        margin = np.abs(p1 - p0)\n        eps = 1e-12\n        entropy = -(p0 * np.log(np.clip(p0, eps, 1.0)) + p1 * np.log(np.clip(p1, eps, 1.0)))\n        \n        feats = [p1, max_prob, margin, entropy, groups.astype(float)]\n        if X is not None:\n            if X.shape[1] > 50:\n                feats.append(X[:, :50])\n            else:\n                feats.append(X)\n        return np.column_stack(feats)\n\n    def fit(self, probs: np.ndarray, y_true: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> \"SelectivePolicy\":\n        from sklearn.ensemble import RandomForestClassifier\n        features = self._get_features(probs, groups, X)\n        correct = (np.argmax(probs, axis=1) == y_true).astype(int)\n        \n        self.meta_model_ = RandomForestClassifier(\n            n_estimators=100,\n            max_depth=5,\n            min_samples_leaf=2,\n            class_weight=\"balanced\",\n            random_state=self.random_state,\n        )\n        self.meta_model_.fit(features, correct)\n        \n        scores = self.meta_model_.predict_proba(features)[:, 1]\n        quantile = float(np.clip(1.0 - self.target_coverage, 0.0, 1.0))\n        self.threshold_ = float(np.quantile(scores, quantile))\n        \n        alpha = 0.5\n        self.group_thresholds_ = {}\n        for group_id in np.unique(groups):\n            mask = groups == group_id\n            if np.any(mask):\n                group_t = float(np.quantile(scores[mask], quantile))\n                self.group_thresholds_[int(group_id)] = alpha * self.threshold_ + (1 - alpha) * group_t\n                \n        return self\n\n    def acceptance_score(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n        if self.meta_model_ is None:\n            return np.max(probs, axis=1)\n        features = self._get_features(probs, groups, X)\n        return self.meta_model_.predict_proba(features)[:, 1]\n\n    def predict_accept(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n        scores = self.acceptance_score(probs, groups, X)\n        thresholds = np.asarray([self.group_thresholds_.get(int(group), self.threshold_) for group in groups], dtype=float)\n        return scores >= thresholds\n\n    def calibration_summary(self) -> dict[str, float]:\n        summary = {\"threshold\": float(self.threshold_)}\n        for group_id, threshold in self.group_thresholds_.items():\n            summary[f\"threshold_group_{group_id}\"] = float(threshold)\n        return summary",
3  "end_line": 195,
4  "filename": "scikit-learn/custom_selective.py",
5  "op": "replace",
6  "after_line": 160,
7  "start_line": 161
8}

Tool ResultStep 2

OK: Replaced lines 161..195 with 71 line(s) in scikit-learn/custom_selective.py. Editable range: 161–231.

[Current file: scikit-learn/custom_selective.py | editable: 161–231 | total: 375 lines]
   161  class SelectivePolicy:
   162      """Improved policy with learned correctness score and balanced thresholding."""
   163  
       ... (65 more lines) ...
   229          for group_id, threshold in self.group_thresholds_.items():
   230              summary[f"threshold_group_{group_id}"] = float(threshold)
   231          return summary

AssistantStep 3test

Tool ResultStep 3

[Test #1] (2 tests remaining; call submit(n=N) to choose which test result to submit as final)

### breast_cancer (scripts/breast_cancer.sh)
Training progress:
TRAIN_METRICS: train_accuracy=0.991202 cal_accuracy=0.956140 cal_coverage=0.798246 policy_threshold=0.980706
Final metrics (breast_cancer): selective_risk_at80_breast_cancer=0.000000, coverage_at80_breast_cancer=0.719298, worst_group_selective_risk_breast_cancer=0.000000, deferral_rate_gap_breast_cancer=0.117647, auroc_breast_cancer=0.947727, target_coverage_breast_cancer=0.800000, actual_coverage_breast_cancer=0.719298, group_threshold_breast_cancer=13.270000

### diabetes (scripts/diabetes.sh)
Training progress:
TRAIN_METRICS: train_accuracy=0.750000 cal_accuracy=0.752809 cal_coverage=0.786517 policy_threshold=0.383176
Final metrics (diabetes): selective_risk_at80_diabetes=0.209877, coverage_at80_diabetes=0.910112, worst_group_selective_risk_diabetes=0.230769, deferral_rate_gap_diabetes=0.076220, auroc_diabetes=0.759104, targe

AssistantStep 4edit

Tool Input

1{
2  "content": "class SelectivePolicy:\n    \"\"\"Robust learned deferral balancing coverage gap and selective risk.\"\"\"\n\n    def __init__(self, target_coverage: float = TARGET_COVERAGE_DEFAULT, random_state: int = 0):\n        self.target_coverage = float(target_coverage)\n        self.random_state = int(random_state)\n        self.threshold_: float = 0.5\n        self.group_thresholds_: dict[int, float] = {}\n        self.meta_model_ = None\n        self.strategy_name = \"unified_balanced_deferral\"\n\n    def _get_features(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n        p1 = probs[:, 1]\n        p0 = probs[:, 0]\n        max_prob = np.maximum(p0, p1)\n        margin = np.abs(p1 - p0)\n        eps = 1e-12\n        entropy = -(p0 * np.log(np.clip(p0, eps, 1.0)) + p1 * np.log(np.clip(p1, eps, 1.0)))\n        \n        feats = [p1, max_prob, margin, entropy, groups.astype(float)]\n        if X is not None:\n            feats.append(X)\n        return np.column_stack(feats)\n\n    def fit(self, probs: np.ndarray, y_true: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> \"SelectivePolicy\":\n        from sklearn.ensemble import HistGradientBoostingClassifier\n        \n        features = self._get_features(probs, groups, X)\n        correct = (np.argmax(probs, axis=1) == y_true).astype(int)\n        \n        # Calculate sample weights to balance classes\n        classes, counts = np.unique(correct, return_counts=True)\n        # Avoid division by zero if all classes are the same\n        class_weight = {c: len(correct) / (len(classes) * count) if count > 0 else 1.0 for c, count in zip(classes, counts)}\n        sample_weight = np.array([class_weight[c] for c in correct])\n        \n        self.meta_model_ = HistGradientBoostingClassifier(\n            max_iter=150,\n            max_depth=5,\n            l2_regularization=0.5,\n            learning_rate=0.1,\n            random_state=self.random_state,\n        )\n        self.meta_model_.fit(features, correct, sample_weight=sample_weight)\n        \n        scores = self.meta_model_.predict_proba(features)[:, 1]\n        # Tie breaker using the base classifier's confidence\n        scores = scores + 1e-9 * np.max(probs, axis=1)\n        \n        quantile = float(np.clip(1.0 - self.target_coverage, 0.0, 1.0))\n        self.threshold_ = float(np.quantile(scores, quantile))\n        \n        self.group_thresholds_ = {}\n        # Alpha parameter: 0.0 means global threshold (high coverage gap, low risk gap)\n        #                  1.0 means exactly equal coverage (0 gap, higher risk gap)\n        alpha = 0.6\n        \n        for group_id in np.unique(groups):\n            mask = groups == group_id\n            if not np.any(mask):\n                continue\n            \n            group_scores = scores[mask]\n            cov_global = float(np.mean(group_scores >= self.threshold_))\n            \n            # Interpolate the TARGET COVERAGE for this group\n            target_cov_g = alpha * self.target_coverage + (1.0 - alpha) * cov_global\n            \n            # Find the local threshold that hits target_cov_g\n            q_g = float(np.clip(1.0 - target_cov_g, 0.0, 1.0))\n            self.group_thresholds_[int(group_id)] = float(np.quantile(group_scores, q_g))\n                \n        return self\n\n    def acceptance_score(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n        if self.meta_model_ is None:\n            return np.max(probs, axis=1)\n        features = self._get_features(probs, groups, X)\n        scores = self.meta_model_.predict_proba(features)[:, 1]\n        scores = scores + 1e-9 * np.max(probs, axis=1)\n        return scores\n\n    def predict_accept(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n        scores = self.acceptance_score(probs, groups, X)\n        thresholds = np.asarray([self.group_thresholds_.get(int(group), self.threshold_) for group in groups], dtype=float)\n        return scores >= thresholds\n\n    def calibration_summary(self) -> dict[str, float]:\n        summary = {\"threshold\": float(self.threshold_)}\n        for group_id, threshold in self.group_thresholds_.items():\n            summary[f\"threshold_group_{group_id}\"] = float(threshold)\n        return summary",
3  "end_line": 231,
4  "start_line": 161,
5  "filename": "scikit-learn/custom_selective.py",
6  "op": "replace",
7  "after_line": 160
8}

Tool ResultStep 4

OK: Replaced lines 161..231 with 92 line(s) in scikit-learn/custom_selective.py. Editable range: 161–252.

[Current file: scikit-learn/custom_selective.py | editable: 161–252 | total: 396 lines]
   161  class SelectivePolicy:
   162      """Robust learned deferral balancing coverage gap and selective risk."""
   163  
       ... (86 more lines) ...
   250          for group_id, threshold in self.group_thresholds_.items():
   251              summary[f"threshold_group_{group_id}"] = float(threshold)
   252          return summary

AssistantStep 5test

Tool ResultStep 5

[Test #2] (1 test remaining; call submit(n=N) to choose which test result to submit as final)

### breast_cancer (scripts/breast_cancer.sh)
Training progress:
TRAIN_METRICS: train_accuracy=0.991202 cal_accuracy=0.956140 cal_coverage=0.798246 policy_threshold=0.986469
Final metrics (breast_cancer): selective_risk_at80_breast_cancer=0.012658, coverage_at80_breast_cancer=0.692982, worst_group_selective_risk_breast_cancer=0.023256, deferral_rate_gap_breast_cancer=0.023343, auroc_breast_cancer=0.913636, target_coverage_breast_cancer=0.800000, actual_coverage_breast_cancer=0.692982, group_threshold_breast_cancer=13.270000

### diabetes (scripts/diabetes.sh)
Training progress:
TRAIN_METRICS: train_accuracy=0.750000 cal_accuracy=0.752809 cal_coverage=0.786517 policy_threshold=0.256343
Final metrics (diabetes): selective_risk_at80_diabetes=0.166667, coverage_at80_diabetes=0.808989, worst_group_selective_risk_diabetes=0.187500, deferral_rate_gap_diabetes=0.052846, auroc_diabetes=0.751401, target

AssistantStep 6edit

Tool Input

1{
2  "op": "replace",
3  "end_line": 252,
4  "content": "class SelectivePolicy:\n    \"\"\"Robust learned deferral balancing coverage gap and selective risk.\"\"\"\n\n    def __init__(self, target_coverage: float = TARGET_COVERAGE_DEFAULT, random_state: int = 0):\n        self.target_coverage = float(target_coverage)\n        self.random_state = int(random_state)\n        self.threshold_: float = 0.5\n        self.group_thresholds_: dict[int, float] = {}\n        self.meta_model_ = None\n        self.strategy_name = \"unified_balanced_deferral\"\n\n    def _get_features(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n        p1 = probs[:, 1]\n        p0 = probs[:, 0]\n        max_prob = np.maximum(p0, p1)\n        margin = np.abs(p1 - p0)\n        eps = 1e-12\n        entropy = -(p0 * np.log(np.clip(p0, eps, 1.0)) + p1 * np.log(np.clip(p1, eps, 1.0)))\n        \n        feats = [p1, max_prob, margin, entropy, groups.astype(float)]\n        if X is not None:\n            feats.append(X)\n        return np.column_stack(feats)\n\n    def fit(self, probs: np.ndarray, y_true: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> \"SelectivePolicy\":\n        from sklearn.compose import ColumnTransformer\n        from sklearn.preprocessing import PolynomialFeatures, StandardScaler\n        from sklearn.pipeline import Pipeline\n        from sklearn.linear_model import LogisticRegression\n        \n        features = self._get_features(probs, groups, X)\n        correct = (np.argmax(probs, axis=1) == y_true).astype(int)\n        \n        conf_indices = [0, 1, 2, 3, 4] # p1, max_prob, margin, entropy, group\n        x_indices = list(range(5, features.shape[1]))\n        \n        transformers = [\n            ('conf', Pipeline([\n                ('scale', StandardScaler()),\n                ('poly', PolynomialFeatures(degree=2, include_bias=False))\n            ]), conf_indices)\n        ]\n        if len(x_indices) > 0:\n            transformers.append(('x', StandardScaler(), x_indices))\n            \n        preprocessor = ColumnTransformer(transformers=transformers, remainder='drop')\n        \n        self.meta_model_ = Pipeline([\n            ('pre', preprocessor),\n            ('clf', LogisticRegression(\n                C=0.1,\n                class_weight='balanced',\n                solver='lbfgs',\n                max_iter=1000,\n                random_state=self.random_state\n            ))\n        ])\n        \n        self.meta_model_.fit(features, correct)\n        scores = self.acceptance_score(probs, groups, X)\n        \n        quantile = float(np.clip(1.0 - self.target_coverage, 0.0, 1.0))\n        self.threshold_ = float(np.quantile(scores, quantile))\n        \n        self.group_thresholds_ = {}\n        alpha = 0.70  # Force gap smaller\n        \n        for group_id in np.unique(groups):\n            mask = groups == group_id\n            if not np.any(mask):\n                continue\n            \n            group_scores = scores[mask]\n            cov_global = float(np.mean(group_scores >= self.threshold_))\n            target_cov_g = alpha * self.target_coverage + (1.0 - alpha) * cov_global\n            \n            q_g = float(np.clip(1.0 - target_cov_g, 0.0, 1.0))\n            self.group_thresholds_[int(group_id)] = float(np.quantile(group_scores, q_g))\n                \n        return self\n\n    def acceptance_score(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n        if self.meta_model_ is None:\n            return np.max(probs, axis=1)\n        features = self._get_features(probs, groups, X)\n        scores = self.meta_model_.predict_proba(features)[:, 1]\n        scores = scores + 1e-9 * np.max(probs, axis=1)\n        return scores\n\n    def predict_accept(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n        scores = self.acceptance_score(probs, groups, X)\n        thresholds = np.asarray([self.group_thresholds_.get(int(group), self.threshold_) for group in groups], dtype=float)\n        return scores >= thresholds\n\n    def calibration_summary(self) -> dict[str, float]:\n        summary = {\"threshold\": float(self.threshold_)}\n        for group_id, threshold in self.group_thresholds_.items():\n            summary[f\"threshold_group_{group_id}\"] = float(threshold)\n        return summary",
5  "after_line": 160,
6  "start_line": 161,
7  "filename": "scikit-learn/custom_selective.py"
8}

Tool ResultStep 6

OK: Replaced lines 161..252 with 99 line(s) in scikit-learn/custom_selective.py. Editable range: 161–259.

[Current file: scikit-learn/custom_selective.py | editable: 161–259 | total: 403 lines]
   161  class SelectivePolicy:
   162      """Robust learned deferral balancing coverage gap and selective risk."""
   163  
       ... (93 more lines) ...
   257          for group_id, threshold in self.group_thresholds_.items():
   258              summary[f"threshold_group_{group_id}"] = float(threshold)
   259          return summary

AssistantStep 7test

Tool ResultStep 7

[Test #3] (0 tests remaining; call submit(n=N) to choose which test result to submit as final)

[NOTE] This was your last test. You MUST now call submit(n=X) to choose which test result to submit as your final answer.

### breast_cancer (scripts/breast_cancer.sh)
Training progress:
TRAIN_METRICS: train_accuracy=0.991202 cal_accuracy=0.956140 cal_coverage=0.798246 policy_threshold=0.983487
Final metrics (breast_cancer): selective_risk_at80_breast_cancer=0.000000, coverage_at80_breast_cancer=0.736842, worst_group_selective_risk_breast_cancer=0.000000, deferral_rate_gap_breast_cancer=0.126984, auroc_breast_cancer=0.968182, target_coverage_breast_cancer=0.800000, actual_coverage_breast_cancer=0.736842, group_threshold_breast_cancer=13.270000

### diabetes (scripts/diabetes.sh)
Training progress:
TRAIN_METRICS: train_accuracy=0.750000 cal_accuracy=0.752809 cal_coverage=0.786517 policy_threshold=0.322932
Final metrics (diabetes): selective_risk_at80_diabetes=0.160000, coverage_at80_diabetes=

AssistantStep 8edit

Tool Input

1{
2  "op": "replace",
3  "end_line": 259,
4  "start_line": 161,
5  "content": "class SelectivePolicy:\n    \"\"\"Robust learned deferral balancing coverage gap and selective risk.\"\"\"\n\n    def __init__(self, target_coverage: float = TARGET_COVERAGE_DEFAULT, random_state: int = 0):\n        self.target_coverage = float(target_coverage)\n        self.random_state = int(random_state)\n        self.threshold_: float = 0.5\n        self.group_thresholds_: dict[int, float] = {}\n        self.meta_model_ = None\n        self.strategy_name = \"unified_balanced_deferral\"\n\n    def _get_features(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n        p1 = probs[:, 1]\n        p0 = probs[:, 0]\n        max_prob = np.maximum(p0, p1)\n        margin = np.abs(p1 - p0)\n        eps = 1e-12\n        entropy = -(p0 * np.log(np.clip(p0, eps, 1.0)) + p1 * np.log(np.clip(p1, eps, 1.0)))\n        \n        feats = [p1, max_prob, margin, entropy, groups.astype(float)]\n        if X is not None:\n            feats.append(X)\n        return np.column_stack(feats)\n\n    def fit(self, probs: np.ndarray, y_true: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> \"SelectivePolicy\":\n        from sklearn.compose import ColumnTransformer\n        from sklearn.preprocessing import PolynomialFeatures, StandardScaler\n        from sklearn.pipeline import Pipeline\n        from sklearn.linear_model import LogisticRegression\n        from sklearn.kernel_approximation import Nystroem\n        \n        features = self._get_features(probs, groups, X)\n        correct = (np.argmax(probs, axis=1) == y_true).astype(int)\n        \n        conf_indices = [0, 1, 2, 3, 4]\n        x_indices = list(range(5, features.shape[1]))\n        \n        transformers = [\n            ('conf', Pipeline([\n                ('scale', StandardScaler()),\n                ('poly', PolynomialFeatures(degree=2, include_bias=False))\n            ]), conf_indices)\n        ]\n        if len(x_indices) > 0:\n            transformers.append(('x', Pipeline([\n                ('scale', StandardScaler()),\n                # Adds non-linear capacity uniformly across datasets\n                ('nys', Nystroem(n_components=100, random_state=self.random_state))\n            ]), x_indices))\n            \n        preprocessor = ColumnTransformer(transformers=transformers, remainder='drop')\n        \n        self.meta_model_ = Pipeline([\n            ('pre', preprocessor),\n            ('clf', LogisticRegression(\n                C=0.1,\n                class_weight='balanced',\n                solver='lbfgs',\n                max_iter=1000,\n                random_state=self.random_state\n            ))\n        ])\n        \n        self.meta_model_.fit(features, correct)\n        scores = self.acceptance_score(probs, groups, X)\n        \n        quantile = float(np.clip(1.0 - self.target_coverage, 0.0, 1.0))\n        self.threshold_ = float(np.quantile(scores, quantile))\n        \n        self.group_thresholds_ = {}\n        alpha = 0.70\n        \n        for group_id in np.unique(groups):\n            mask = groups == group_id\n            if not np.any(mask):\n                continue\n            \n            group_scores = scores[mask]\n            cov_global = float(np.mean(group_scores >= self.threshold_))\n            target_cov_g = alpha * self.target_coverage + (1.0 - alpha) * cov_global\n            \n            q_g = float(np.clip(1.0 - target_cov_g, 0.0, 1.0))\n            self.group_thresholds_[int(group_id)] = float(np.quantile(group_scores, q_g))\n                \n        return self\n\n    def acceptance_score(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n        if self.meta_model_ is None:\n            return np.max(probs, axis=1)\n        features = self._get_features(probs, groups, X)\n        scores = self.meta_model_.predict_proba(features)[:, 1]\n        scores = scores + 1e-9 * np.max(probs, axis=1)\n        return scores\n\n    def predict_accept(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n        scores = self.acceptance_score(probs, groups, X)\n        thresholds = np.asarray([self.group_thresholds_.get(int(group), self.threshold_) for group in groups], dtype=float)\n        return scores >= thresholds\n\n    def calibration_summary(self) -> dict[str, float]:\n        summary = {\"threshold\": float(self.threshold_)}\n        for group_id, threshold in self.group_thresholds_.items():\n            summary[f\"threshold_group_{group_id}\"] = float(threshold)\n        return summary",
6  "filename": "scikit-learn/custom_selective.py",
7  "after_line": 160
8}

Tool ResultStep 8

OK: Replaced lines 161..259 with 104 line(s) in scikit-learn/custom_selective.py. Editable range: 161–264.

[Current file: scikit-learn/custom_selective.py | editable: 161–264 | total: 408 lines]
   161  class SelectivePolicy:
   162      """Robust learned deferral balancing coverage gap and selective risk."""
   163  
       ... (98 more lines) ...
   262          for group_id, threshold in self.group_thresholds_.items():
   263              summary[f"threshold_group_{group_id}"] = float(threshold)
   264          return summary

AssistantStep 9test

Tool ResultStep 9

ERROR: Test budget exhausted (3/3). You MUST call submit(n=N) to choose which test result to submit as final.