Agent Conversation: qwen/qwen3.6-plus
Task: ml-selective-deferral
Selective Deferral Under Subgroup Shift
Research Question
Design a practical selective prediction and deferral policy for high-stakes tabular decisions.
The task isolates one modular question: given a fixed base classifier, what acceptance / deferral rule best trades off selective risk, subgroup fairness, and overall discrimination?
Background
Selective prediction systems should be able to say "I do not know" when the classifier is uncertain. In high-stakes settings, that deferral can be handed to a human reviewer or a slower backup process. The benchmark studies whether a policy can:
- keep selective risk low at a fixed target coverage,
- avoid concentrating deferrals on one subgroup,
- preserve AUROC as a confidence ranking signal, and
- remain simple enough to run offline on modest compute.
Task
Modify the SelectivePolicy class in custom_selective.py. The rest of the pipeline is fixed: dataset loading, train / calibration / test splitting, base model training, and metric computation.
The policy receives calibration-time base-model probabilities and subgroup labels, then decides whether each test example should be accepted or deferred. You may implement a single global threshold, a learned deferral score, subgroup-specific thresholds, or any other compact policy that fits the interface.
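The simplest policy that fits this contract is a single global confidence threshold tuned on calibration scores. The sketch below is illustrative, not the benchmark's implementation; the `fit` / `acceptance_score` / `predict_accept` method names follow the interface exercised later in this log, and the class name is made up:

```python
import numpy as np

class GlobalThresholdPolicy:
    """Minimal sketch of the policy interface: one global confidence threshold."""

    def __init__(self, target_coverage: float = 0.8):
        self.target_coverage = float(target_coverage)
        self.threshold_: float = 0.5

    def fit(self, probs, y_true, groups, X=None):
        # Accept the most-confident `target_coverage` fraction of the
        # calibration set, scored by max predicted class probability.
        scores = np.max(probs, axis=1)
        self.threshold_ = float(np.quantile(scores, 1.0 - self.target_coverage))
        return self

    def acceptance_score(self, probs, groups, X=None):
        return np.max(probs, axis=1)

    def predict_accept(self, probs, groups, X=None):
        return self.acceptance_score(probs, groups, X) >= self.threshold_
```

This is essentially the confidence_thresholding baseline; everything stronger replaces the score, the threshold rule, or both.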
Evaluation
The benchmark runs on four offline tabular proxies from scikit-learn:
- breast_cancer
- diabetes (binarized around the training-set median)
- california (binarized around the training-set median)
- madelon
Each dataset is split into train / calibration / test partitions. Subgroups are formed from a stable feature threshold so that worst-group behavior can be measured.
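One plausible way that subgroup construction could look (the feature choice per dataset is an assumption here, as is the helper name): threshold a fixed feature at its training-set median, so the split is deterministic and identical across the train / calibration / test partitions.

```python
import numpy as np

def make_subgroups(X_train: np.ndarray, X_other: np.ndarray, feature_idx: int = 0) -> np.ndarray:
    """Binary subgroups from one feature thresholded at the training median.

    Using the *training* median keeps the group definition stable across
    partitions; `feature_idx` stands in for whatever stable feature the
    benchmark fixes per dataset.
    """
    cut = np.median(X_train[:, feature_idx])
    return (X_other[:, feature_idx] > cut).astype(int)
```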
Metrics:
- selective_risk_at80: classification error on accepted examples at 80% target coverage
- worst_group_selective_risk: worst subgroup error on accepted examples
- deferral_rate_gap: max subgroup deferral rate minus min subgroup deferral rate
- auroc: AUROC of the acceptance score for predicting correctness
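As a rough sketch of how these four metrics fit together (the benchmark's exact definitions, e.g. how the 80% coverage point is enforced on test data, may differ):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def selective_metrics(scores, y_pred, y_true, groups, coverage=0.8):
    """Illustrative metric computation, not the benchmark's own code.

    Accepts the `coverage` fraction with the highest acceptance scores, then
    reports selective risk, worst-group risk on accepted examples, the
    deferral-rate gap across subgroups, and AUROC of the score against
    base-model correctness.
    """
    thr = np.quantile(scores, 1.0 - coverage)
    accept = scores >= thr
    correct = (y_pred == y_true)

    group_risks, deferral_rates = [], []
    for g in np.unique(groups):
        m = groups == g
        acc_g = accept[m]
        deferral_rates.append(1.0 - acc_g.mean())
        if acc_g.any():
            group_risks.append(1.0 - correct[m][acc_g].mean())
    return {
        "selective_risk": float(1.0 - correct[accept].mean()),
        "worst_group_selective_risk": float(max(group_risks)),
        "deferral_rate_gap": float(max(deferral_rates) - min(deferral_rates)),
        "auroc": float(roc_auc_score(correct.astype(int), scores)),
    }
```

Note the tension these metrics encode: a score that ranks correctness well (high AUROC) may still defer unevenly across groups, and equalizing deferral rates can raise worst-group risk.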
Baselines
- confidence_thresholding: tune one confidence threshold to hit the target coverage
- conformal_abstention: split-conformal abstention with a coverage target
- learned_deferral: train a compact meta-model that predicts whether the base model will be correct
- groupwise_thresholding: subgroup-specific thresholds as a stronger reference baseline
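The groupwise_thresholding idea can be sketched in a few lines: each subgroup gets its own threshold at the same quantile of its own calibration scores, which pins every group's acceptance rate near the target and drives the deferral-rate gap toward zero. Function names here are illustrative, not the benchmark's API:

```python
import numpy as np

def groupwise_thresholds(scores, groups, target_coverage=0.8):
    """Per-group thresholds at the (1 - target_coverage) quantile of each
    group's own calibration scores (assumed behavior of the baseline)."""
    q = 1.0 - target_coverage
    return {int(g): float(np.quantile(scores[groups == g], q))
            for g in np.unique(groups)}

def accept_with_groups(scores, groups, thresholds):
    # Look up each example's group threshold, then accept by comparison.
    thr = np.array([thresholds[int(g)] for g in groups])
    return scores >= thr
```

The trade-off: equal deferral rates by construction, but a group whose scores are poorly calibrated gets accepted at the same rate anyway, which can hurt worst-group selective risk.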
Practical Notes
This task intentionally uses datasets that are already available offline in the repository's scikit-learn package setup, so no shared package edit is required.
If you want the exact adult / ACSIncome / COMPAS / Law School datasets from the prompt, that would require a shared package-data follow-up, which I have not implemented here.
scikit-learn/custom_selective.py [EDITABLE — lines 161–195 only]
1: """Selective prediction / deferral benchmark.
2:
3: Fixed:
4: - offline dataset loa
Tool Input
1{
2 "content": "class SelectivePolicy:\n \"\"\"Fairness-aware learned deferral policy.\n\n Combines a meta-learned correctness predictor with subgroup-aware\n threshold calibration to balance selective risk and deferral fairness.\n\n Key innovations:\n 1. Meta-model predicts base model correctness using rich features\n 2. Subgroup-specific thresholds calibrated for equal deferral rates\n 3. Blended acceptance score that incorporates learned meta-predictions\n with calibration-aware margin features\n 4. Fairness-regularized threshold adjustment to minimize deferral rate gap\n \"\"\"\n\n def __init__(self, target_coverage: float = TARGET_COVERAGE_DEFAULT, random_state: int = 0):\n self.target_coverage = float(target_coverage)\n self.random_state = int(random_state)\n self.threshold_: float = 0.5\n self.group_thresholds_: dict[int, float] = {}\n self.meta_model_ = None\n self.strategy_name = \"fairness_aware_learned_deferral\"\n # Regularization parameter for fairness (hardcoded)\n self.fairness_lambda: float = 0.3\n\n def fit(self, probs: np.ndarray, y_true: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> \"SelectivePolicy\":\n # Step 1: Train meta-model to predict correctness\n features = _confidence_features(probs, groups, X)\n correct = (np.argmax(probs, axis=1) == y_true).astype(int)\n self.meta_model_ = Pipeline(\n steps=[\n (\"scale\", StandardScaler()),\n (\n \"clf\",\n LogisticRegression(\n max_iter=1000,\n solver=\"lbfgs\",\n class_weight=\"balanced\",\n random_state=self.random_state,\n ),\n ),\n ]\n )\n self.meta_model_.fit(features, correct)\n\n # Step 2: Compute meta-model scores on calibration set\n meta_scores = self.meta_model_.predict_proba(features)[:, 1]\n\n # Step 3: Compute base confidence features\n base_scores = self.acceptance_score(probs, groups, X)\n\n # Step 4: Compute subgroup statistics for fairness\n unique_groups = np.unique(groups)\n group_sizes = {}\n group_meta_scores = {}\n for g in unique_groups:\n mask = groups == 
g\n group_sizes[int(g)] = mask.sum()\n group_meta_scores[int(g)] = meta_scores[mask]\n\n # Step 5: Calibrate subgroup-specific thresholds for equal deferral rates\n # We want each group to have similar deferral rates while maintaining coverage\n target_deferral = 1.0 - self.target_coverage\n n_total = len(groups)\n n_groups = len(unique_groups)\n\n self.group_thresholds_ = {}\n for g in unique_groups:\n g_idx = int(g)\n scores_g = group_meta_scores[g_idx]\n # Initial threshold for this group to achieve target coverage within group\n quantile = float(np.clip(target_deferral, 0.0, 1.0))\n thresh_g = float(np.quantile(scores_g, quantile))\n self.group_thresholds_[g_idx] = thresh_g\n\n # Step 6: Apply fairness regularization - adjust thresholds to reduce deferral gap\n self._apply_fairness_adjustment(groups, meta_scores, base_scores)\n\n # Step 7: Set global threshold as weighted average of group thresholds\n weights = np.asarray([group_sizes[int(g)] for g in unique_groups], dtype=float)\n weights /= weights.sum()\n threshs = np.asarray([self.group_thresholds_[int(g)] for g in unique_groups], dtype=float)\n self.threshold_ = float(np.dot(weights, threshs))\n\n return self\n\n def _apply_fairness_adjustment(self, groups: np.ndarray, meta_scores: np.ndarray, base_scores: np.ndarray) -> None:\n \"\"\"Adjust group thresholds to minimize deferral rate disparity.\"\"\"\n unique_groups = np.unique(groups)\n lambda_fair = self.fairness_lambda\n\n for _ in range(10): # Iterative adjustment\n # Compute current deferral rates per group\n group_deferrals = {}\n for g in unique_groups:\n g_idx = int(g)\n mask = groups == g_idx\n if not mask.any():\n continue\n accept_mask = meta_scores[mask] >= self.group_thresholds_[g_idx]\n group_deferrals[g_idx] = 1.0 - accept_mask.mean()\n\n if len(group_deferrals) < 2:\n break\n\n # Find groups with highest and lowest deferral rates\n max_def_group = max(group_deferrals, key=group_deferrals.get)\n min_def_group = min(group_deferrals, 
key=group_deferrals.get)\n\n deferral_gap = group_deferrals[max_def_group] - group_deferrals[min_def_group]\n if deferral_gap < 0.02: # Already fair enough\n break\n\n # Adjust thresholds to reduce gap\n step_size = lambda_fair * deferral_gap * 0.5\n\n # Lower threshold for high-deferral group (accept more)\n self.group_thresholds_[max_def_group] -= step_size\n # Raise threshold for low-deferral group (accept less, defer more)\n self.group_thresholds_[min_def_group] += step_size\n\n # Clip thresholds to valid range [0, 1]\n for g in self.group_thresholds_:\n self.group_thresholds_[g] = float(np.clip(self.group_thresholds_[g], 0.0, 1.0))\n\n def acceptance_score(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n \"\"\"Compute blended acceptance score combining meta-model predictions and confidence.\"\"\"\n if self.meta_model_ is None:\n return np.max(probs, axis=1)\n\n features = _confidence_features(probs, groups, X)\n meta_scores = self.meta_model_.predict_proba(features)[:, 1]\n\n # Blend meta score with base confidence (meta score is primary)\n base_scores = np.max(probs, axis=1)\n blended = 0.7 * meta_scores + 0.3 * base_scores\n\n return blended\n\n def predict_accept(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n scores = self.acceptance_score(probs, groups, X)\n # Use group-specific thresholds if available\n if self.group_thresholds_:\n thresholds = np.asarray([self.group_thresholds_.get(int(group), self.threshold_) for group in groups], dtype=float)\n return scores >= thresholds\n return scores >= self.threshold_\n\n def calibration_summary(self) -> dict[str, float]:\n summary = {\"threshold\": float(self.threshold_)}\n for group_id, threshold in self.group_thresholds_.items():\n summary[f\"threshold_group_{group_id}\"] = float(threshold)\n return summary",
3 "end_line": 195,
4 "filename": "LLaMA-Factory/src/llamafactory/data/custom_selective.py",
5 "op": "replace",
6 "start_line": 161
7}
ERROR: Package 'LLaMA-Factory' is not in allowed packages
Tool Input
1{
2 "content": "class SelectivePolicy:\n \"\"\"Fairness-aware learned deferral policy.\n\n Combines a meta-learned correctness predictor with subgroup-aware\n threshold calibration to balance selective risk and deferral fairness.\n\n Key innovations:\n 1. Meta-model predicts base model correctness using rich features\n 2. Subgroup-specific thresholds calibrated for equal deferral rates\n 3. Blended acceptance score that incorporates learned meta-predictions\n with calibration-aware margin features\n 4. Fairness-regularized threshold adjustment to minimize deferral rate gap\n \"\"\"\n\n def __init__(self, target_coverage: float = TARGET_COVERAGE_DEFAULT, random_state: int = 0):\n self.target_coverage = float(target_coverage)\n self.random_state = int(random_state)\n self.threshold_: float = 0.5\n self.group_thresholds_: dict[int, float] = {}\n self.meta_model_ = None\n self.strategy_name = \"fairness_aware_learned_deferral\"\n # Regularization parameter for fairness (hardcoded)\n self.fairness_lambda: float = 0.3\n\n def fit(self, probs: np.ndarray, y_true: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> \"SelectivePolicy\":\n # Step 1: Train meta-model to predict correctness\n features = _confidence_features(probs, groups, X)\n correct = (np.argmax(probs, axis=1) == y_true).astype(int)\n self.meta_model_ = Pipeline(\n steps=[\n (\"scale\", StandardScaler()),\n (\n \"clf\",\n LogisticRegression(\n max_iter=1000,\n solver=\"lbfgs\",\n class_weight=\"balanced\",\n random_state=self.random_state,\n ),\n ),\n ]\n )\n self.meta_model_.fit(features, correct)\n\n # Step 2: Compute meta-model scores on calibration set\n meta_scores = self.meta_model_.predict_proba(features)[:, 1]\n\n # Step 3: Compute base confidence features\n base_scores = self.acceptance_score(probs, groups, X)\n\n # Step 4: Compute subgroup statistics for fairness\n unique_groups = np.unique(groups)\n group_sizes = {}\n group_meta_scores = {}\n for g in unique_groups:\n mask = groups == 
g\n group_sizes[int(g)] = mask.sum()\n group_meta_scores[int(g)] = meta_scores[mask]\n\n # Step 5: Calibrate subgroup-specific thresholds for equal deferral rates\n target_deferral = 1.0 - self.target_coverage\n n_total = len(groups)\n n_groups = len(unique_groups)\n\n self.group_thresholds_ = {}\n for g in unique_groups:\n g_idx = int(g)\n scores_g = group_meta_scores[g_idx]\n # Initial threshold for this group to achieve target coverage within group\n quantile = float(np.clip(target_deferral, 0.0, 1.0))\n thresh_g = float(np.quantile(scores_g, quantile))\n self.group_thresholds_[g_idx] = thresh_g\n\n # Step 6: Apply fairness regularization - adjust thresholds to reduce deferral gap\n self._apply_fairness_adjustment(groups, meta_scores, base_scores)\n\n # Step 7: Set global threshold as weighted average of group thresholds\n weights = np.asarray([group_sizes[int(g)] for g in unique_groups], dtype=float)\n weights /= weights.sum()\n threshs = np.asarray([self.group_thresholds_[int(g)] for g in unique_groups], dtype=float)\n self.threshold_ = float(np.dot(weights, threshs))\n\n return self\n\n def _apply_fairness_adjustment(self, groups: np.ndarray, meta_scores: np.ndarray, base_scores: np.ndarray) -> None:\n \"\"\"Adjust group thresholds to minimize deferral rate disparity.\"\"\"\n unique_groups = np.unique(groups)\n lambda_fair = self.fairness_lambda\n\n for _ in range(10): # Iterative adjustment\n # Compute current deferral rates per group\n group_deferrals = {}\n for g in unique_groups:\n g_idx = int(g)\n mask = groups == g_idx\n if not mask.any():\n continue\n accept_mask = meta_scores[mask] >= self.group_thresholds_[g_idx]\n group_deferrals[g_idx] = 1.0 - accept_mask.mean()\n\n if len(group_deferrals) < 2:\n break\n\n # Find groups with highest and lowest deferral rates\n max_def_group = max(group_deferrals, key=group_deferrals.get)\n min_def_group = min(group_deferrals, key=group_deferrals.get)\n\n deferral_gap = group_deferrals[max_def_group] - 
group_deferrals[min_def_group]\n if deferral_gap < 0.02: # Already fair enough\n break\n\n # Adjust thresholds to reduce gap\n step_size = lambda_fair * deferral_gap * 0.5\n\n # Lower threshold for high-deferral group (accept more)\n self.group_thresholds_[max_def_group] -= step_size\n # Raise threshold for low-deferral group (accept less, defer more)\n self.group_thresholds_[min_def_group] += step_size\n\n # Clip thresholds to valid range [0, 1]\n for g in self.group_thresholds_:\n self.group_thresholds_[g] = float(np.clip(self.group_thresholds_[g], 0.0, 1.0))\n\n def acceptance_score(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n \"\"\"Compute blended acceptance score combining meta-model predictions and confidence.\"\"\"\n if self.meta_model_ is None:\n return np.max(probs, axis=1)\n\n features = _confidence_features(probs, groups, X)\n meta_scores = self.meta_model_.predict_proba(features)[:, 1]\n\n # Blend meta score with base confidence (meta score is primary)\n base_scores = np.max(probs, axis=1)\n blended = 0.7 * meta_scores + 0.3 * base_scores\n\n return blended\n\n def predict_accept(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n scores = self.acceptance_score(probs, groups, X)\n # Use group-specific thresholds if available\n if self.group_thresholds_:\n thresholds = np.asarray([self.group_thresholds_.get(int(group), self.threshold_) for group in groups], dtype=float)\n return scores >= thresholds\n return scores >= self.threshold_\n\n def calibration_summary(self) -> dict[str, float]:\n summary = {\"threshold\": float(self.threshold_)}\n for group_id, threshold in self.group_thresholds_.items():\n summary[f\"threshold_group_{group_id}\"] = float(threshold)\n return summary",
3 "end_line": 195,
4 "filename": "custom_selective.py",
5 "op": "replace",
6 "start_line": 161
7}
ERROR: Package 'custom_selective.py' is not in allowed packages
Tool Input
1{
2 "content": "class SelectivePolicy:\n \"\"\"Fairness-aware learned deferral policy.\n\n Combines a meta-learned correctness predictor with subgroup-aware\n threshold calibration to balance selective risk and deferral fairness.\n\n Key innovations:\n 1. Meta-model predicts base model correctness using rich features\n 2. Subgroup-specific thresholds calibrated for equal deferral rates\n 3. Blended acceptance score that incorporates learned meta-predictions\n with calibration-aware margin features\n 4. Fairness-regularized threshold adjustment to minimize deferral rate gap\n \"\"\"\n\n def __init__(self, target_coverage: float = TARGET_COVERAGE_DEFAULT, random_state: int = 0):\n self.target_coverage = float(target_coverage)\n self.random_state = int(random_state)\n self.threshold_: float = 0.5\n self.group_thresholds_: dict[int, float] = {}\n self.meta_model_ = None\n self.strategy_name = \"fairness_aware_learned_deferral\"\n # Regularization parameter for fairness (hardcoded)\n self.fairness_lambda: float = 0.3\n\n def fit(self, probs: np.ndarray, y_true: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> \"SelectivePolicy\":\n # Step 1: Train meta-model to predict correctness\n features = _confidence_features(probs, groups, X)\n correct = (np.argmax(probs, axis=1) == y_true).astype(int)\n self.meta_model_ = Pipeline(\n steps=[\n (\"scale\", StandardScaler()),\n (\n \"clf\",\n LogisticRegression(\n max_iter=1000,\n solver=\"lbfgs\",\n class_weight=\"balanced\",\n random_state=self.random_state,\n ),\n ),\n ]\n )\n self.meta_model_.fit(features, correct)\n\n # Step 2: Compute meta-model scores on calibration set\n meta_scores = self.meta_model_.predict_proba(features)[:, 1]\n\n # Step 3: Compute subgroup statistics for fairness\n unique_groups = np.unique(groups)\n group_sizes = {}\n group_meta_scores = {}\n for g in unique_groups:\n mask = groups == g\n group_sizes[int(g)] = mask.sum()\n group_meta_scores[int(g)] = meta_scores[mask]\n\n # Step 4: 
Calibrate subgroup-specific thresholds for equal deferral rates\n target_deferral = 1.0 - self.target_coverage\n\n self.group_thresholds_ = {}\n for g in unique_groups:\n g_idx = int(g)\n scores_g = group_meta_scores[g_idx]\n # Initial threshold for this group to achieve target coverage within group\n quantile = float(np.clip(target_deferral, 0.0, 1.0))\n thresh_g = float(np.quantile(scores_g, quantile))\n self.group_thresholds_[g_idx] = thresh_g\n\n # Step 5: Apply fairness regularization - adjust thresholds to reduce deferral gap\n self._apply_fairness_adjustment(groups, meta_scores)\n\n # Step 6: Set global threshold as weighted average of group thresholds\n weights = np.asarray([group_sizes[int(g)] for g in unique_groups], dtype=float)\n weights /= weights.sum()\n threshs = np.asarray([self.group_thresholds_[int(g)] for g in unique_groups], dtype=float)\n self.threshold_ = float(np.dot(weights, threshs))\n\n return self\n\n def _apply_fairness_adjustment(self, groups: np.ndarray, meta_scores: np.ndarray) -> None:\n \"\"\"Adjust group thresholds to minimize deferral rate disparity.\"\"\"\n unique_groups = np.unique(groups)\n lambda_fair = self.fairness_lambda\n\n for _ in range(10): # Iterative adjustment\n # Compute current deferral rates per group\n group_deferrals = {}\n for g in unique_groups:\n g_idx = int(g)\n mask = groups == g_idx\n if not mask.any():\n continue\n accept_mask = meta_scores[mask] >= self.group_thresholds_[g_idx]\n group_deferrals[g_idx] = 1.0 - accept_mask.mean()\n\n if len(group_deferrals) < 2:\n break\n\n # Find groups with highest and lowest deferral rates\n max_def_group = max(group_deferrals, key=group_deferrals.get)\n min_def_group = min(group_deferrals, key=group_deferrals.get)\n\n deferral_gap = group_deferrals[max_def_group] - group_deferrals[min_def_group]\n if deferral_gap < 0.02: # Already fair enough\n break\n\n # Adjust thresholds to reduce gap\n step_size = lambda_fair * deferral_gap * 0.5\n\n # Lower threshold for 
high-deferral group (accept more)\n self.group_thresholds_[max_def_group] -= step_size\n # Raise threshold for low-deferral group (accept less, defer more)\n self.group_thresholds_[min_def_group] += step_size\n\n # Clip thresholds to valid range [0, 1]\n for g in self.group_thresholds_:\n self.group_thresholds_[g] = float(np.clip(self.group_thresholds_[g], 0.0, 1.0))\n\n def acceptance_score(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n \"\"\"Compute blended acceptance score combining meta-model predictions and confidence.\"\"\"\n if self.meta_model_ is None:\n return np.max(probs, axis=1)\n\n features = _confidence_features(probs, groups, X)\n meta_scores = self.meta_model_.predict_proba(features)[:, 1]\n\n # Blend meta score with base confidence (meta score is primary)\n base_scores = np.max(probs, axis=1)\n blended = 0.7 * meta_scores + 0.3 * base_scores\n\n return blended\n\n def predict_accept(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n scores = self.acceptance_score(probs, groups, X)\n # Use group-specific thresholds if available\n if self.group_thresholds_:\n thresholds = np.asarray([self.group_thresholds_.get(int(group), self.threshold_) for group in groups], dtype=float)\n return scores >= thresholds\n return scores >= self.threshold_\n\n def calibration_summary(self) -> dict[str, float]:\n summary = {\"threshold\": float(self.threshold_)}\n for group_id, threshold in self.group_thresholds_.items():\n summary[f\"threshold_group_{group_id}\"] = float(threshold)\n return summary",
3 "end_line": 195,
4 "filename": "scikit-learn/custom_selective.py",
5 "op": "replace",
6 "start_line": 161
7}
OK: Replaced lines 161..195 with 145 line(s) in scikit-learn/custom_selective.py. Editable range: 161–305.
[Current file: scikit-learn/custom_selective.py | editable: 161–305 | total: 449 lines]
161 class SelectivePolicy:
162 """Fairness-aware learned deferral policy.
163
... (139 more lines) ...
303 for group_id, threshold in self.group_thresholds_.items():
304 summary[f"threshold_group_{group_id}"] = float(threshold)
305         return summary

[Test #1] (2 tests remaining; call submit(n=N) to choose which test result to submit as final)

### breast_cancer (scripts/breast_cancer.sh)
Training progress:
TRAIN_METRICS: train_accuracy=0.991202 cal_accuracy=0.956140 cal_coverage=0.622807 policy_threshold=0.876279
Final metrics (breast_cancer): selective_risk_at80_breast_cancer=0.000000, coverage_at80_breast_cancer=0.596491, worst_group_selective_risk_breast_cancer=0.000000, deferral_rate_gap_breast_cancer=0.511671, auroc_breast_cancer=0.959091, target_coverage_breast_cancer=0.800000, actual_coverage_breast_cancer=0.596491, group_threshold_breast_cancer=13.270000

### diabetes (scripts/diabetes.sh)
Training progress:
TRAIN_METRICS: train_accuracy=0.750000 cal_accuracy=0.752809 cal_coverage=0.921348 policy_threshold=0.316284
Final metrics (diabetes): selective_risk_at80_diabetes=0.209302, coverage_at80_diabetes=0.966292, worst_group_selective_risk_diabetes=0.250000, deferral_rate_gap_diabetes=0.017276, auroc_diabetes=0.797619, targe
Tool Input
1{
2 "content": "class SelectivePolicy:\n \"\"\"Adaptive learned deferral with fairness-calibrated group thresholds.\n\n A two-stage policy:\n 1. A meta-learner predicts base-model correctness using calibration features\n (confidence, entropy, margin, subgroup, raw feature).\n 2. Group-specific acceptance thresholds are set so every subgroup shares\n the same acceptance rate = target_coverage. This guarantees zero\n deferral-rate gap while optimizing which examples are accepted via\n the learned score.\n \"\"\"\n\n def __init__(self, target_coverage: float = TARGET_COVERAGE_DEFAULT, random_state: int = 0):\n self.target_coverage = float(target_coverage)\n self.random_state = int(random_state)\n self.threshold_: float = 0.5\n self.group_thresholds_: dict[int, float] = {}\n self.meta_model_ = None\n self.strategy_name = \"adaptive_learned_deferral\"\n # Hardcoded blend weights\n self.meta_weight: float = 0.6\n self.base_weight: float = 0.4\n\n def fit(self, probs: np.ndarray, y_true: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> \"SelectivePolicy\":\n # --- Stage 1: train meta-model predicting correctness ---\n features = _confidence_features(probs, groups, X)\n correct = (np.argmax(probs, axis=1) == y_true).astype(int)\n self.meta_model_ = Pipeline(\n steps=[\n (\"scale\", StandardScaler()),\n (\n \"clf\",\n LogisticRegression(\n max_iter=1000,\n solver=\"lbfgs\",\n class_weight=\"balanced\",\n random_state=self.random_state,\n ),\n ),\n ]\n )\n self.meta_model_.fit(features, correct)\n\n # --- Stage 2: compute blended scores on calibration set ---\n cal_scores = self.acceptance_score(probs, groups, X)\n unique_groups = np.unique(groups)\n\n # --- Stage 3: set per-group thresholds to equalize acceptance rate ---\n # For each group, set threshold at quantile(1 - target_coverage)\n # of that group's blended score. 
This gives every group acceptance\n # rate ≈ target_coverage, yielding deferral_rate_gap ≈ 0.\n self.group_thresholds_ = {}\n quantile = float(np.clip(1.0 - self.target_coverage, 0.0, 1.0))\n\n for g in unique_groups:\n g_idx = int(g)\n mask = groups == g_idx\n if not mask.any():\n continue\n scores_g = cal_scores[mask]\n self.group_thresholds_[g_idx] = float(np.quantile(scores_g, quantile))\n\n # Global threshold: weighted average of group thresholds (for summary)\n group_weights = np.asarray([float((groups == g).sum()) for g in unique_groups], dtype=float)\n group_weights /= group_weights.sum()\n group_threshs = np.asarray([self.group_thresholds_[int(g)] for g in unique_groups], dtype=float)\n self.threshold_ = float(np.dot(group_weights, group_threshs))\n\n return self\n\n def acceptance_score(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n \"\"\"Blended score: weighted combination of meta-predicted correctness and base confidence.\"\"\"\n if self.meta_model_ is None:\n return np.max(probs, axis=1)\n\n features = _confidence_features(probs, groups, X)\n meta_scores = self.meta_model_.predict_proba(features)[:, 1]\n\n base_scores = np.max(probs, axis=1)\n blended = self.meta_weight * meta_scores + self.base_weight * base_scores\n\n return blended\n\n def predict_accept(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n scores = self.acceptance_score(probs, groups, X)\n if self.group_thresholds_:\n thresholds = np.asarray(\n [self.group_thresholds_.get(int(group), self.threshold_) for group in groups],\n dtype=float,\n )\n return scores >= thresholds\n return scores >= self.threshold_\n\n def calibration_summary(self) -> dict[str, float]:\n summary = {\"threshold\": float(self.threshold_)}\n for group_id, threshold in self.group_thresholds_.items():\n summary[f\"threshold_group_{group_id}\"] = float(threshold)\n return summary",
3 "end_line": 305,
4 "filename": "scikit-learn/custom_selective.py",
5 "op": "replace",
6 "start_line": 161
7}OK: Replaced lines 161..305 with 98 line(s) in scikit-learn/custom_selective.py. Editable range: 161–258.
[Current file: scikit-learn/custom_selective.py | editable: 161–258 | total: 402 lines]
161 class SelectivePolicy:
162 """Adaptive learned deferral with fairness-calibrated group thresholds.
163
... (92 more lines) ...
256 for group_id, threshold in self.group_thresholds_.items():
257 summary[f"threshold_group_{group_id}"] = float(threshold)
258         return summary

Tool Input
1{
2 "content": "class SelectivePolicy:\n \"\"\"Adaptive learned deferral with fairness-calibrated group thresholds.\n\n A two-stage policy:\n 1. A meta-learner predicts base-model correctness using calibration features\n (confidence, entropy, margin, subgroup, raw feature).\n 2. Per-group acceptance thresholds are calibrated so every subgroup\n targets the same acceptance rate = target_coverage, guaranteeing\n near-zero deferral-rate gap.\n \"\"\"\n\n def __init__(self, target_coverage: float = TARGET_COVERAGE_DEFAULT, random_state: int = 0):\n self.target_coverage = float(target_coverage)\n self.random_state = int(random_state)\n self.threshold_: float = 0.5\n self.group_thresholds_: dict[int, float] = {}\n self.meta_model_ = None\n self.strategy_name = \"adaptive_learned_deferral\"\n self.blended_weight: float = 0.8 # weight for meta score in blended score\n\n def fit(self, probs: np.ndarray, y_true: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> \"SelectivePolicy\":\n # --- Stage 1: train meta-model predicting correctness ---\n features = _confidence_features(probs, groups, X)\n correct = (np.argmax(probs, axis=1) == y_true).astype(int)\n self.meta_model_ = Pipeline(\n steps=[\n (\"scale\", StandardScaler()),\n (\n \"clf\",\n LogisticRegression(\n max_iter=1000,\n solver=\"lbfgs\",\n class_weight=\"balanced\",\n random_state=self.random_state,\n ),\n ),\n ]\n )\n self.meta_model_.fit(features, correct)\n\n # --- Stage 2: compute blended scores on calibration set ---\n cal_scores = self.acceptance_score(probs, groups, X)\n unique_groups = np.unique(groups)\n\n # --- Stage 3: set per-group thresholds to equalize acceptance rate ---\n # Each group threshold targets the same quantile of its own score distribution.\n self.group_thresholds_ = {}\n quantile = float(np.clip(1.0 - self.target_coverage, 0.0, 1.0))\n\n for g in unique_groups:\n g_idx = int(g)\n mask = groups == g_idx\n if not mask.any():\n continue\n scores_g = cal_scores[mask]\n 
self.group_thresholds_[g_idx] = float(np.quantile(scores_g, quantile))\n\n # Global threshold: weighted average of group thresholds\n group_weights = np.asarray([float((groups == g).sum()) for g in unique_groups], dtype=float)\n group_weights /= group_weights.sum()\n group_threshs = np.asarray([self.group_thresholds_[int(g)] for g in unique_groups], dtype=float)\n self.threshold_ = float(np.dot(group_weights, group_threshs))\n\n return self\n\n def acceptance_score(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n \"\"\"Blended score: weighted combination of meta-predicted correctness and base confidence.\"\"\"\n if self.meta_model_ is None:\n return np.max(probs, axis=1)\n\n features = _confidence_features(probs, groups, X)\n meta_scores = self.meta_model_.predict_proba(features)[:, 1]\n base_scores = np.max(probs, axis=1)\n blended = self.blended_weight * meta_scores + (1.0 - self.blended_weight) * base_scores\n return blended\n\n def predict_accept(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:\n scores = self.acceptance_score(probs, groups, X)\n if self.group_thresholds_:\n thresholds = np.asarray(\n [self.group_thresholds_.get(int(group), self.threshold_) for group in groups],\n dtype=float,\n )\n return scores >= thresholds\n return scores >= self.threshold_\n\n def calibration_summary(self) -> dict[str, float]:\n summary = {\"threshold\": float(self.threshold_)}\n for group_id, threshold in self.group_thresholds_.items():\n summary[f\"threshold_group_{group_id}\"] = float(threshold)\n return summary",
3 "end_line": 258,
4 "filename": "scikit-learn/custom_selective.py",
5 "op": "replace",
6 "start_line": 161
7}OK: Replaced lines 161..258 with 91 line(s) in scikit-learn/custom_selective.py. Editable range: 161–251.
[Current file: scikit-learn/custom_selective.py | editable: 161–251 | total: 395 lines]
161 class SelectivePolicy:
162 """Adaptive learned deferral with fairness-calibrated group thresholds.
163
... (85 more lines) ...
249 for group_id, threshold in self.group_thresholds_.items():
250 summary[f"threshold_group_{group_id}"] = float(threshold)
251         return summary

Tool Input
op: replace, file: scikit-learn/custom_selective.py, lines 161-251, content:

class SelectivePolicy:
    """Learned deferral with calibrated global threshold and fairness correction.

    Strategy:
    1. Train a meta-model on calibration features to predict base-model correctness.
    2. Form a blended acceptance score: weighted meta-score + base confidence.
    3. Calibrate a single global threshold over all calibration data to hit
       the target coverage.
    4. Apply a soft fairness post-correction that slightly adjusts per-group
       acceptance by boosting scores for examples in under-accepted groups.
    """

    def __init__(self, target_coverage: float = TARGET_COVERAGE_DEFAULT, random_state: int = 0):
        self.target_coverage = float(target_coverage)
        self.random_state = int(random_state)
        self.threshold_: float = 0.5
        self.group_thresholds_: dict[int, float] = {}
        self.meta_model_ = None
        self.strategy_name = "learned_deferral_fair"
        self.blended_weight: float = 0.7
        # Fairness correction: small additive boost per group based on cal-group acceptance gap
        self._group_offsets: dict[int, float] = {}

    def fit(self, probs: np.ndarray, y_true: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> "SelectivePolicy":
        # --- Stage 1: train meta-model predicting correctness ---
        features = _confidence_features(probs, groups, X)
        correct = (np.argmax(probs, axis=1) == y_true).astype(int)
        self.meta_model_ = Pipeline(
            steps=[
                ("scale", StandardScaler()),
                (
                    "clf",
                    LogisticRegression(
                        max_iter=1000,
                        solver="lbfgs",
                        class_weight="balanced",
                        random_state=self.random_state,
                    ),
                ),
            ]
        )
        self.meta_model_.fit(features, correct)

        # --- Stage 2: compute blended scores on calibration set ---
        cal_scores = self.acceptance_score(probs, groups, X)

        # --- Stage 3: global threshold calibration ---
        quantile = float(np.clip(1.0 - self.target_coverage, 0.0, 1.0))
        self.threshold_ = float(np.quantile(cal_scores, quantile))

        # --- Stage 4: compute per-group fairness offsets ---
        self.group_thresholds_ = {}
        self._group_offsets = {}
        unique_groups = np.unique(groups)
        for g in unique_groups:
            g_idx = int(g)
            mask = groups == g_idx
            if not mask.any():
                continue
            # Target: each group should have the same acceptance rate as overall
            # If group acceptance > target, lower its effective scores (higher offset)
            # If group acceptance < target, raise its effective scores (lower/negative offset)
            group_quantile = float(np.clip(1.0 - self.target_coverage, 0.0, 1.0))
            group_thresh = float(np.quantile(cal_scores[mask], group_quantile))
            gap = self.threshold_ - group_thresh  # positive = group needs upward boost
            self._group_offsets[g_idx] = gap * 0.3  # soft correction, 30% strength
            self.group_thresholds_[g_idx] = group_thresh

        return self

    def acceptance_score(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:
        """Compute blended acceptance score with fairness offsets applied."""
        if self.meta_model_ is None:
            return np.max(probs, axis=1)

        features = _confidence_features(probs, groups, X)
        meta_scores = self.meta_model_.predict_proba(features)[:, 1]
        base_scores = np.max(probs, axis=1)
        blended = self.blended_weight * meta_scores + (1.0 - self.blended_weight) * base_scores

        # Apply fairness offsets
        offsets = np.zeros_like(blended)
        for g, offset in self._group_offsets.items():
            mask = groups == g
            offsets[mask] += offset
        blended = blended + offsets

        return blended

    def predict_accept(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:
        scores = self.acceptance_score(probs, groups, X)
        return scores >= self.threshold_

    def calibration_summary(self) -> dict[str, float]:
        summary = {"threshold": float(self.threshold_)}
        for group_id, val in self._group_offsets.items():
            summary[f"offset_group_{group_id}"] = float(val)
        return summary

OK: Replaced lines 161..251 with 98 line(s) in scikit-learn/custom_selective.py. Editable range: 161–258.
[Current file: scikit-learn/custom_selective.py | editable: 161–258 | total: 402 lines]
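An aside on the Stage 3 step above: calibrating a single global threshold to a target coverage reduces to taking the (1 − coverage) empirical quantile of the calibration scores. A minimal standalone sketch with synthetic scores (`calibrate_threshold` is an illustrative name, not part of the benchmark interface):

```python
import numpy as np

def calibrate_threshold(cal_scores: np.ndarray, target_coverage: float) -> float:
    # Accepting examples with score >= threshold keeps roughly
    # `target_coverage` of the calibration set.
    q = float(np.clip(1.0 - target_coverage, 0.0, 1.0))
    return float(np.quantile(cal_scores, q))

rng = np.random.default_rng(0)
scores = rng.uniform(0.5, 1.0, size=1000)   # stand-in acceptance scores
thr = calibrate_threshold(scores, target_coverage=0.8)
coverage = float((scores >= thr).mean())    # ~0.8 on the calibration set
```

Note the caveat this makes visible: coverage is only guaranteed on the distribution the quantile was taken over, so any score shift applied after calibration (such as per-group offsets) can move realized coverage away from the target.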
161 class SelectivePolicy:
162 """Learned deferral with calibrated global threshold and fairness correction.
163
... (92 more lines) ...
256 for group_id, val in self._group_offsets.items():
257 summary[f"offset_group_{group_id}"] = float(val)
258         return summary

Tool Input
op: replace, file: scikit-learn/custom_selective.py, lines 161-258, content:

class SelectivePolicy:
    """Learned deferral with global threshold.

    Strategy:
    1. Train a meta-model on calibration features to predict base-model correctness.
    2. Form a blended acceptance score from the meta-model output and base confidence.
    3. Calibrate a single global threshold on the calibration set to reach target coverage.
    4. Group-specific offsets are computed during fit and applied at inference time
       to reduce deferral rate disparity between subgroups.
    """

    def __init__(self, target_coverage: float = TARGET_COVERAGE_DEFAULT, random_state: int = 0):
        self.target_coverage = float(target_coverage)
        self.random_state = int(random_state)
        self.threshold_: float = 0.5
        self.group_thresholds_: dict[int, float] = {}
        self.meta_model_ = None
        self.strategy_name = "learned_fair_deferral"
        self.blended_weight: float = 0.8
        self._group_offsets: dict[int, float] = {}

    def fit(self, probs: np.ndarray, y_true: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> "SelectivePolicy":
        # --- Train meta-model ---
        features = _confidence_features(probs, groups, X)
        correct = (np.argmax(probs, axis=1) == y_true).astype(int)
        self.meta_model_ = Pipeline(
            steps=[
                ("scale", StandardScaler()),
                (
                    "clf",
                    LogisticRegression(
                        max_iter=1000,
                        solver="lbfgs",
                        class_weight="balanced",
                        random_state=self.random_state,
                    ),
                ),
            ]
        )
        self.meta_model_.fit(features, correct)

        # --- Compute blended cal scores ---
        cal_scores = self.acceptance_score(probs, groups, X)

        # --- Global threshold at target coverage ---
        quantile = float(np.clip(1.0 - self.target_coverage, 0.0, 1.0))
        self.threshold_ = float(np.quantile(cal_scores, quantile))

        # --- Per-group offsets for fairness ---
        self._group_offsets = {}
        self.group_thresholds_ = {}
        unique_groups = np.unique(groups)
        for g in unique_groups:
            g_idx = int(g)
            mask = groups == g_idx
            if not mask.any():
                continue
            group_acceptance = float((cal_scores[mask] >= self.threshold_).mean())
            # offset > 0 means this group needs a boost
            # offset < 0 means this group is over-accepting
            self._group_offsets[g_idx] = self.target_coverage - group_acceptance

        return self

    def acceptance_score(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:
        """Blended score + fairness offsets."""
        if self.meta_model_ is None:
            return np.max(probs, axis=1)

        features = _confidence_features(probs, groups, X)
        meta_scores = self.meta_model_.predict_proba(features)[:, 1]
        base_scores = np.max(probs, axis=1)
        blended = self.blended_weight * meta_scores + (1.0 - self.blended_weight) * base_scores

        # Apply fairness offsets
        if self._group_offsets:
            offsets = np.zeros_like(blended)
            for g, offset in self._group_offsets.items():
                mask = groups == g
                offsets[mask] += offset * 0.5  # moderate strength
            blended = blended + offsets

        return blended

    def predict_accept(self, probs: np.ndarray, groups: np.ndarray, X: np.ndarray | None = None) -> np.ndarray:
        scores = self.acceptance_score(probs, groups, X)
        return scores >= self.threshold_

    def calibration_summary(self) -> dict[str, float]:
        summary = {"threshold": float(self.threshold_)}
        for group_id, val in self._group_offsets.items():
            summary[f"offset_group_{group_id}"] = float(val)
        return summary

OK: Replaced lines 161..258 with 93 line(s) in scikit-learn/custom_selective.py. Editable range: 161–253.
[Current file: scikit-learn/custom_selective.py | editable: 161–253 | total: 397 lines]
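This revision swaps the quantile-gap offsets for acceptance-rate gaps: each group's offset is the target coverage minus its acceptance rate at the global threshold, so under-accepted groups get a positive boost. A toy sketch of that bookkeeping (invented scores and groups; `group_offsets` is an illustrative helper, not the benchmark API):

```python
import numpy as np

def group_offsets(cal_scores, groups, threshold, target_coverage):
    # offset > 0: group under-accepted at the global threshold (needs a boost)
    # offset < 0: group over-accepted (its scores get pulled down)
    offsets = {}
    for g in np.unique(groups):
        mask = groups == g
        acceptance = float((cal_scores[mask] >= threshold).mean())
        offsets[int(g)] = target_coverage - acceptance
    return offsets

scores = np.array([0.90, 0.95, 0.70, 0.60, 0.92, 0.65])
groups = np.array([0, 0, 1, 1, 0, 1])
offs = group_offsets(scores, groups, threshold=0.8, target_coverage=0.8)
# group 0 is fully accepted (offset -0.2); group 1 is fully deferred (offset 0.8)
```

The 0.5 damping factor in `acceptance_score` then scales these gaps before they are added to the blended scores, trading deferral-gap reduction against distortion of the confidence ranking.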
161 class SelectivePolicy:
162 """Learned deferral with global threshold.
163
... (87 more lines) ...
251 for group_id, val in self._group_offsets.items():
252 summary[f"offset_group_{group_id}"] = float(val)
253         return summary

[Test #2] (1 test remaining; call submit(n=N) to choose which test result to submit as final)

### breast_cancer (scripts/breast_cancer.sh)

Training progress:
TRAIN_METRICS: train_accuracy=0.991202 cal_accuracy=0.956140 cal_coverage=0.394737 policy_threshold=0.985633

Final metrics (breast_cancer):
selective_risk_at80_breast_cancer=0.000000, coverage_at80_breast_cancer=0.412281, worst_group_selective_risk_breast_cancer=1.000000, deferral_rate_gap_breast_cancer=0.746032, auroc_breast_cancer=0.959091, target_coverage_breast_cancer=0.800000, actual_coverage_breast_cancer=0.412281, group_threshold_breast_cancer=13.270000

### diabetes (scripts/diabetes.sh)

Training progress:
TRAIN_METRICS: train_accuracy=0.750000 cal_accuracy=0.752809 cal_coverage=0.775281 policy_threshold=0.380632

Final metrics (diabetes):
selective_risk_at80_diabetes=0.164384, coverage_at80_diabetes=0.820225, worst_group_selective_risk_diabetes=0.171429, deferral_rate_gap_diabetes=0.061992, auroc_diabetes=0.792717, target
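For reference, the two fairness-facing metrics reported above follow the benchmark definitions in the task description: selective risk is the classification error restricted to accepted examples, and deferral_rate_gap is the max minus min subgroup deferral rate. A sketch reproducing both from an accept mask (toy data; helper names are illustrative, not the benchmark's own functions):

```python
import numpy as np

def selective_risk(y_true, y_pred, accept):
    # Error rate on accepted examples only; 0.0 if nothing is accepted.
    acc = accept.astype(bool)
    if not acc.any():
        return 0.0
    return float((y_pred[acc] != y_true[acc]).mean())

def deferral_rate_gap(accept, groups):
    # Max subgroup deferral rate minus min subgroup deferral rate.
    rates = [float((~accept[groups == g]).mean()) for g in np.unique(groups)]
    return max(rates) - min(rates)

y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1])
accept = np.array([True, True, False, True, True, False])
groups = np.array([0, 0, 1, 1, 1, 1])
risk = selective_risk(y_true, y_pred, accept)  # accepted predictions are all correct here
gap = deferral_rate_gap(accept, groups)        # group 1 defers more often than group 0
```

Seen through these definitions, the breast_cancer run above is instructive: near-zero overall selective risk can coexist with worst_group_selective_risk of 1.0 and a large deferral gap when almost all of one subgroup is deferred.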