optimization-online-bandit
Description
Online Bandits: Exploration-Exploitation Strategy Design
Objective
Design and implement a bandit policy that minimizes cumulative regret across diverse multi-armed bandit settings. Your code goes in custom_bandit.py. Three reference implementations (UCB1, Thompson Sampling, KL-UCB) are available as read-only in the SMPyBandits package.
Background
The multi-armed bandit problem is a fundamental model for the exploration-exploitation tradeoff in sequential decision-making. At each round, an agent selects one of K arms and observes a stochastic reward. The goal is to minimize cumulative regret -- the gap between the reward of the best arm (in hindsight) and the agent's actual reward.
Classic algorithms include:
- UCB1 (Auer et al., 2002): plays the arm with the highest upper confidence bound on its mean reward, achieving O(sqrt(KT log T)) minimax regret.
- Thompson Sampling (Thompson 1933; Agrawal & Goyal 2012): samples from a Bayesian posterior and plays the arm with the highest sample, achieving optimal Bayesian regret.
- KL-UCB (Garivier & Cappé 2011; Cappé et al. 2013): uses the Kullback-Leibler divergence to build tighter confidence bounds, and is provably asymptotically optimal for Bernoulli bandits.
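To make the index-based strategy concrete, here is a minimal sketch of UCB1 as described above: after playing each arm once, it repeatedly pulls the arm maximizing the empirical mean plus a `sqrt(2 ln t / n)` exploration bonus. This is an illustrative standalone function, not the benchmark's reference implementation.

```python
import math


def ucb1(pull, K, T):
    """Run UCB1 for T rounds on K arms.

    `pull(arm)` returns a stochastic reward in [0, 1].
    Returns the total reward and per-arm pull counts.
    """
    counts = [0] * K
    sums = [0.0] * K
    total = 0.0
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1  # initialization: play each arm once
        else:
            # Upper confidence bound: empirical mean + exploration bonus.
            arm = max(
                range(K),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r
        total += r
    return total, counts
```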
Key challenges include adapting to different reward distributions, handling contextual information, and detecting non-stationarity.
Task
Modify the BanditPolicy class in custom_bandit.py (the EDITABLE section). You must implement:
- `__init__(K, context_dim)`: initialize your policy for K arms with optional context
- `select_arm(t, context)`: choose which arm to pull at timestep t
- `update(arm, reward, context)`: update internal state after observing a reward
- `reset()`: reset state for a new run
Interface
```python
class BanditPolicy:
    def __init__(self, K: int, context_dim: int = 0): ...
    def reset(self): ...
    def select_arm(self, t: int, context: np.ndarray | None = None) -> int: ...
    def update(self, arm: int, reward: float, context: np.ndarray | None = None): ...
```
Available utilities (in the FIXED section):
- `kl_bernoulli(p, q)`: KL divergence between Bernoulli distributions
- `kl_ucb_bound(mu_hat, n, t, c)`: computes the KL-UCB upper confidence bound
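For intuition, here is one common formulation of these two utilities: the Bernoulli KL divergence with clamping, and a bisection search for the largest mean `q >= mu_hat` such that `n * kl(mu_hat, q) <= log(t) + c * log(log(t))`. The exact signatures and tolerances in the FIXED section may differ; treat this as a hedged sketch, not the benchmark's code.

```python
import math


def kl_bernoulli(p, q, eps=1e-12):
    """KL(Bernoulli(p) || Bernoulli(q)), clamped to avoid log(0)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def kl_ucb_bound(mu_hat, n, t, c=0.0, tol=1e-6):
    """Largest q in [mu_hat, 1] with n * kl(mu_hat, q) <= log t + c*log log t.

    Found by bisection; kl_bernoulli(mu_hat, q) is increasing in q on
    [mu_hat, 1], so the feasible set is an interval.
    """
    if n == 0:
        return 1.0  # unexplored arm: maximally optimistic
    level = (math.log(max(t, 2)) + c * math.log(math.log(max(t, 3)))) / n
    lo, hi = mu_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(mu_hat, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo
```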
Evaluation
Evaluated on three bandit settings (lower regret is better):
- Stochastic MAB: 10-armed Bernoulli bandit, T=10,000 rounds. Arms have fixed reward probabilities.
- Contextual: 5-armed linear contextual bandit with d=10 features, T=10,000 rounds. Expected reward is a linear function of the context.
- Non-stationary: 5-armed piecewise-stationary Bernoulli bandit with 4 abrupt changepoints, T=10,000 rounds. The best arm changes over time.
Metric: normalized cumulative regret = (cumulative regret) / T.
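As a sanity check on the metric, the normalized pseudo-regret can be computed from the true arm means and the sequence of arms played (for a stationary setting; the non-stationary case would compare against the per-round best arm). This helper is illustrative, not the evaluation harness's code.

```python
def normalized_regret(means, arms_played):
    """Normalized cumulative pseudo-regret at horizon T = len(arms_played).

    Sums the per-round gap between the best arm's mean and the mean of
    the arm actually played, then divides by T.
    """
    best = max(means)
    T = len(arms_played)
    return sum(best - means[a] for a in arms_played) / T
```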
Code
```python
# Custom online bandit algorithm for MLS-Bench
#
# EDITABLE section: BanditPolicy class — the exploration-exploitation strategy.
# FIXED sections: everything else (environments, evaluation, main loop).
#
# Three evaluation settings:
#   1. Stochastic MAB (K=10 Bernoulli arms, T=10000)
#   2. Contextual Bandits (d=10 context, K=5 linear arms, T=10000)
#   3. Non-stationary MAB (K=5 Bernoulli arms with 4 abrupt changes, T=10000)
#
# Metric: normalized cumulative regret at horizon T (lower is better).

import argparse
import math
import os
```
Results
| Model | Type | normalized regret stochastic mab ↓ | normalized regret contextual ↓ | normalized regret nonstationary ↓ |
|---|---|---|---|---|
| kl_ucb | baseline | 0.061 | 0.179 | 0.035 |
| thompson_sampling | baseline | 0.035 | 0.020 | 0.060 |
| ucb1 | baseline | 0.145 | 0.181 | 0.067 |
| anthropic/claude-opus-4.6 | vanilla | 0.008 | 0.020 | 0.084 |
| deepseek-reasoner | vanilla | 0.140 | 0.005 | 0.140 |
| google/gemini-3.1-pro-preview | vanilla | 0.010 | 0.004 | 0.015 |
| openai/gpt-5.4-pro | vanilla | 0.006 | 0.021 | 0.007 |
| qwen3.6-plus:free | vanilla | 0.103 | 0.012 | 0.050 |
| anthropic/claude-opus-4.6 | agent | 0.008 | 0.020 | 0.096 |
| anthropic/claude-opus-4.6 | agent | 0.005 | 0.012 | 0.023 |
| deepseek-reasoner | agent | - | 0.005 | 0.033 |
| google/gemini-3.1-pro-preview | agent | 0.010 | 0.004 | 0.015 |
| openai/gpt-5.4-pro | agent | 0.008 | 0.002 | 0.016 |
| qwen3.6-plus:free | agent | 0.012 | 0.013 | 0.013 |