Agent Conversation: google/gemini-3.1-pro-preview
Task: optimization-online-bandit
Online Bandits: Exploration-Exploitation Strategy Design
Objective
Design and implement a bandit policy that minimizes cumulative regret across diverse multi-armed bandit settings. Your code goes in custom_bandit.py. Three reference implementations (UCB1, Thompson Sampling, KL-UCB) are available as read-only in the SMPyBandits package.
Background
The multi-armed bandit problem is a fundamental model for the exploration-exploitation tradeoff in sequential decision-making. At each round, an agent selects one of K arms and observes a stochastic reward. The goal is to minimize cumulative regret -- the gap between the reward of the best arm (in hindsight) and the agent's actual reward.
Classic algorithms include:
- UCB1 (Auer et al., 2002): plays the arm with the highest upper confidence bound on its mean reward, achieving O(sqrt(KT log T)) minimax regret.
- Thompson Sampling (Thompson 1933; Agrawal & Goyal 2012): samples from a Bayesian posterior and plays the arm with the highest sample, achieving optimal Bayesian regret.
- KL-UCB (Garivier & Cappe 2011; Cappe et al. 2013): uses Kullback-Leibler divergence for tighter confidence bounds, provably optimal for Bernoulli bandits.
Key challenges include adapting to different reward distributions, handling contextual information, and detecting non-stationarity.
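To make the first of these concrete, here is a minimal sketch of the UCB1 index policy from the background above, run against a toy Bernoulli environment. The function name, signature, and the simulated environment are invented for this illustration; this is not the read-only SMPyBandits reference implementation.

```python
import math
import random

def ucb1(means, T, seed=0):
    """Sketch of UCB1 on a Bernoulli bandit with the given arm means.

    Returns the total reward collected over T rounds."""
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K   # number of pulls per arm
    sums = [0.0] * K   # cumulative reward per arm
    total = 0.0
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1  # play each arm once to initialize
        else:
            # UCB1 index: empirical mean + sqrt(2 ln t / n_a)
            arm = max(range(K),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2.0 * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total
```

The sqrt(2 ln t / n_a) bonus shrinks as an arm is pulled more often, so under-explored arms keep getting revisited while clearly suboptimal arms are pulled only O(log T) times.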
Task
Modify the BanditPolicy class in custom_bandit.py (the EDITABLE section). You must implement:
- __init__(K, context_dim): initialize your policy for K arms with optional context
- select_arm(t, context): choose which arm to pull at timestep t
- update(arm, reward, context): update internal state after observing a reward
- reset(): reset state for a new run
Interface
class BanditPolicy:
    def __init__(self, K: int, context_dim: int = 0): ...
    def reset(self): ...
    def select_arm(self, t: int, context: np.ndarray | None = None) -> int: ...
    def update(self, arm: int, reward: float, context: np.ndarray | None = None): ...
Available utilities (in the FIXED section):
- kl_bernoulli(p, q): KL divergence between Bernoulli distributions
- kl_ucb_bound(mu_hat, n, t, c): computes the KL-UCB upper confidence bound
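A plausible shape for these two helpers is sketched below. The clipping epsilon, the bisection tolerance, and the exact exploration level log(t) + c log(log(t)) are assumptions about the FIXED section, which may differ in detail.

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """KL(Ber(p) || Ber(q)), with p and q clipped away from 0/1 for safety."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def kl_ucb_bound(mu_hat, n, t, c=1.0, tol=1e-6):
    """Largest q >= mu_hat with n * KL(mu_hat, q) <= log(t) + c*log(log(t)),
    found by bisection -- the standard KL-UCB index."""
    if n == 0:
        return 1.0
    level = (math.log(max(t, 2)) + c * math.log(math.log(max(t, 3)))) / n
    lo, hi = mu_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if kl_bernoulli(mu_hat, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo
```

Bisection works because KL(mu_hat, q) is increasing in q for q >= mu_hat, so the feasible set is an interval and its right endpoint is the bound.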
Evaluation
Evaluated on three bandit settings (lower regret is better):
- Stochastic MAB: 10-armed Bernoulli bandit, T=10,000 rounds. Arms have fixed reward probabilities.
- Contextual: 5-armed linear contextual bandit with d=10 features, T=10,000 rounds. Expected reward is a linear function of the context.
- Non-stationary: 5-armed piece-wise stationary Bernoulli bandit with 4 abrupt changepoints, T=10,000 rounds. The best arm changes over time.
Metric: normalized cumulative regret = (cumulative regret) / T.
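For a fixed (stationary) instance, the metric can be computed from the sequence of pulled arms as in this sketch; the helper name and the example arm means are hypothetical.

```python
def normalized_regret(arm_means, pulls):
    """Normalized cumulative regret for a stationary bandit:
    sum over rounds of (best mean - pulled arm's mean), divided by T."""
    best = max(arm_means)
    T = len(pulls)
    return sum(best - arm_means[a] for a in pulls) / T

# e.g. means [0.2, 0.8], pulls [0, 1, 1, 1] -> regret (0.6 + 0 + 0 + 0) / 4
```

In the non-stationary setting the per-round best mean changes at each changepoint, so the evaluator compares against the instantaneous best arm rather than a single fixed one.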
SMPyBandits/custom_bandit.py [EDITABLE — lines 261–321 only]
1: # Custom online bandit algorithm for MLS-Bench
2: #
3: # EDITABLE section: BanditPolicy class — the exploration-
Tool Input
1{
2 "end_line": 321,
3 "filename": "custom_bandit.py",
4  "content": "class BanditPolicy:\n def __init__(self, K: int, context_dim: int = 0):\n self.K = K\n self.context_dim = context_dim\n self.rng = np.random.default_rng(42)\n \n if self.context_dim > 0:\n self._lambda = 0.4\n self._v2 = 0.01\n self._B_inv = np.array([np.eye(self.context_dim) / self._lambda for _ in range(self.K)])\n self._f = np.zeros((self.K, self.context_dim), dtype=np.float64)\n self._theta_hat = np.zeros((self.K, self.context_dim), dtype=np.float64)\n else:\n self.reset_mab()\n\n def reset_mab(self):\n self.era_t = 0\n self.history = [[] for _ in range(self.K)]\n\n def reset(self):\n if self.context_dim > 0:\n self._B_inv = np.array([np.eye(self.context_dim) / self._lambda for _ in range(self.K)])\n self._f[:] = 0.0\n self._theta_hat[:] = 0.0\n else:\n self.reset_mab()\n\n def select_arm(self, t: int, context: np.ndarray | None = None) -> int:\n if self.context_dim > 0 and context is not None:\n best_arm = 0\n best_val = -np.inf\n for a in range(self.K):\n mu_a = self._theta_hat[a]\n cov_a = self._v2 * self._B_inv[a]\n eigvals, eigvecs = np.linalg.eigh(cov_a)\n eigvals = np.maximum(eigvals, 0.0)\n z = self.rng.standard_normal(self.context_dim)\n theta_sample = mu_a + eigvecs @ (np.sqrt(eigvals) * z)\n val = np.dot(theta_sample, context)\n if val > best_val:\n best_val = val\n best_arm = a\n return int(best_arm)\n else:\n for a in range(self.K):\n if len(self.history[a]) == 0:\n return a\n \n best_arm = 0\n best_ucb = -np.inf\n for a in range(self.K):\n n_a = len(self.history[a])\n mu_a = sum(self.history[a]) / n_a\n ucb = kl_ucb_bound(mu_a, n_a, self.era_t, c=1.0)\n if ucb > best_ucb:\n best_ucb = ucb\n best_arm = a\n return int(best_arm)\n\n def update(self, arm: int, reward: float, context: np.ndarray | None = None):\n if self.context_dim > 0 and context is not None:\n x = context\n B = self._B_inv[arm]\n Bx = B @ x\n xBx = np.dot(x, Bx)\n self._B_inv[arm] = B - np.outer(Bx, Bx) / (1.0 + xBx)\n self._f[arm] += reward * x\n self._theta_hat[arm] = self._B_inv[arm] @ self._f[arm]\n else:\n self.history[arm].append(reward)\n self.era_t += 1\n \n n = len(self.history[arm])\n recent_changes = False\n for W in [20, 50, 100, 200]:\n if n > W + W // 2:\n n_recent = W\n n_past = n - W\n sum_recent = sum(self.history[arm][-W:])\n sum_past = sum(self.history[arm][:-W])\n recent_mean = sum_recent / n_recent\n past_mean = sum_past / n_past\n all_mean = (sum_recent + sum_past) / n\n \n glr = n_past * kl_bernoulli(past_mean, all_mean) + n_recent * kl_bernoulli(recent_mean, all_mean)\n if glr > 12.0:\n recent_changes = True\n break\n if recent_changes:\n self.reset_mab()\n",
5 "start_line": 261,
6 "op": "replace",
7 "after_line": 1
8}
ERROR: Package 'custom_bandit.py' is not in allowed packages
Tool Input
1{
2 "start_line": 261,
3  "content": "class BanditPolicy:\n def __init__(self, K: int, context_dim: int = 0):\n self.K = K\n self.context_dim = context_dim\n self.rng = np.random.default_rng(42)\n \n if self.context_dim > 0:\n self._lambda = 0.4\n self._v2 = 0.01\n self._B_inv = np.array([np.eye(self.context_dim) / self._lambda for _ in range(self.K)])\n self._f = np.zeros((self.K, self.context_dim), dtype=np.float64)\n self._theta_hat = np.zeros((self.K, self.context_dim), dtype=np.float64)\n else:\n self.reset_mab()\n\n def reset_mab(self):\n self.era_t = 0\n self.history = [[] for _ in range(self.K)]\n\n def reset(self):\n if self.context_dim > 0:\n self._B_inv = np.array([np.eye(self.context_dim) / self._lambda for _ in range(self.K)])\n self._f[:] = 0.0\n self._theta_hat[:] = 0.0\n else:\n self.reset_mab()\n\n def select_arm(self, t: int, context: np.ndarray | None = None) -> int:\n if self.context_dim > 0 and context is not None:\n best_arm = 0\n best_val = -np.inf\n for a in range(self.K):\n mu_a = self._theta_hat[a]\n cov_a = self._v2 * self._B_inv[a]\n eigvals, eigvecs = np.linalg.eigh(cov_a)\n eigvals = np.maximum(eigvals, 0.0)\n z = self.rng.standard_normal(self.context_dim)\n theta_sample = mu_a + eigvecs @ (np.sqrt(eigvals) * z)\n val = np.dot(theta_sample, context)\n if val > best_val:\n best_val = val\n best_arm = a\n return int(best_arm)\n else:\n for a in range(self.K):\n if len(self.history[a]) == 0:\n return a\n \n best_arm = 0\n best_ucb = -np.inf\n for a in range(self.K):\n n_a = len(self.history[a])\n mu_a = sum(self.history[a]) / n_a\n ucb = kl_ucb_bound(mu_a, n_a, self.era_t, c=1.0)\n if ucb > best_ucb:\n best_ucb = ucb\n best_arm = a\n return int(best_arm)\n\n def update(self, arm: int, reward: float, context: np.ndarray | None = None):\n if self.context_dim > 0 and context is not None:\n x = context\n B = self._B_inv[arm]\n Bx = B @ x\n xBx = np.dot(x, Bx)\n self._B_inv[arm] = B - np.outer(Bx, Bx) / (1.0 + xBx)\n self._f[arm] += reward * x\n self._theta_hat[arm] = self._B_inv[arm] @ self._f[arm]\n else:\n self.history[arm].append(reward)\n self.era_t += 1\n \n n = len(self.history[arm])\n recent_changes = False\n for W in [20, 50, 100, 200]:\n if n > W + W // 2:\n n_recent = W\n n_past = n - W\n sum_recent = sum(self.history[arm][-W:])\n sum_past = sum(self.history[arm][:-W])\n recent_mean = sum_recent / n_recent\n past_mean = sum_past / n_past\n all_mean = (sum_recent + sum_past) / n\n \n glr = n_past * kl_bernoulli(past_mean, all_mean) + n_recent * kl_bernoulli(recent_mean, all_mean)\n if glr > 12.0:\n recent_changes = True\n break\n if recent_changes:\n self.reset_mab()",
4 "filename": "SMPyBandits/custom_bandit.py",
5 "after_line": 1,
6 "end_line": 321,
7 "op": "replace"
8}
OK: Replaced lines 261..321 with 90 line(s) in SMPyBandits/custom_bandit.py. Editable range: 261–350.
[Current file: SMPyBandits/custom_bandit.py | editable: 261–350 | total: 424 lines]
258| # =====================================================================
259| # EDITABLE: BanditPolicy
260| # =====================================================================
261 class BanditPolicy:
262     def __init__(self, K: int, context_dim: int = 0):
263         self.K = K
264         self.context_dim = context_dim
265         self.rng = np.random.default_rng(42)
266
267         if self.context_dim > 0:
268             self._lambda = 0.4
269             self._v2 = 0.01
270             self._B_inv = np.array([np.eye(self.context_dim) / self._lambda for _ in range(self.K)])
271             self._f = np.zeros((self.K, self.context_dim), dtype=np.float64)
272             self._theta_hat = np.zeros((self.K, self.context_dim), d
[Test #1] (2 tests remaining; call submit(n=1) to submit this result)

### stochastic-mab (scripts/stochastic_mab.sh)
Training metrics (last steps):
TRAIN_METRICS step=6000 cumulative_regret=86.5000 normalized_regret=0.014417
TRAIN_METRICS step=7000 cumulative_regret=90.8000 normalized_regret=0.012971
TRAIN_METRICS step=8000 cumulative_regret=90.8000 normalized_regret=0.011350
TRAIN_METRICS step=9000 cumulative_regret=91.6000 normalized_regret=0.010178
TRAIN_METRICS step=10000 cumulative_regret=104.2500 normalized_regret=0.010425
Final metrics (stochastic-mab):
cumulative_regret_stochastic_mab: 104.250000
normalized_regret_stochastic_mab: 0.010425

### contextual (scripts/contextual.sh)
Training metrics (last steps):
TRAIN_METRICS step=6000 cumulative_regret=33.8003 normalized_regret=0.005633
TRAIN_METRICS step=7000 cumulative_regret=34.2721 normalized_regret=0.004896
TRAIN_METRICS step=8000 cumulative_regret=34.6856 normalized_regret=0.004336
TRAIN_METRICS step=9000 cumulative_regret=
Tool Input
1{
2 "n": 1
3}
[submit] Submitting result from test #1 as final.

### stochastic-mab (scripts/stochastic_mab.sh)
Training metrics (last steps):
TRAIN_METRICS step=6000 cumulative_regret=86.5000 normalized_regret=0.014417
TRAIN_METRICS step=7000 cumulative_regret=90.8000 normalized_regret=0.012971
TRAIN_METRICS step=8000 cumulative_regret=90.8000 normalized_regret=0.011350
TRAIN_METRICS step=9000 cumulative_regret=91.6000 normalized_regret=0.010178
TRAIN_METRICS step=10000 cumulative_regret=104.2500 normalized_regret=0.010425
Final metrics (stochastic-mab):
cumulative_regret_stochastic_mab: 104.250000
normalized_regret_stochastic_mab: 0.010425

### contextual (scripts/contextual.sh)
Training metrics (last steps):
TRAIN_METRICS step=6000 cumulative_regret=33.8003 normalized_regret=0.005633
TRAIN_METRICS step=7000 cumulative_regret=34.2721 normalized_regret=0.004896
TRAIN_METRICS step=8000 cumulative_regret=34.6856 normalized_regret=0.004336
TRAIN_METRICS step=9000 cumulative_regret=35.2350 normalized_r