optimization-online-bandit
Description
Online Bandits: Exploration-Exploitation Strategy Design
Objective
Design and implement a bandit policy that minimizes cumulative regret across diverse multi-armed bandit settings. Your code goes in custom_bandit.py. Three reference implementations (UCB1, Thompson Sampling, KL-UCB) are available as read-only in the SMPyBandits package.
Background
The multi-armed bandit problem is a fundamental model for the exploration-exploitation tradeoff in sequential decision-making. At each round, an agent selects one of K arms and observes a stochastic reward. The goal is to minimize cumulative regret -- the gap between the reward of the best arm (in hindsight) and the agent's actual reward.
Classic algorithms include:
- UCB1 (Auer et al., 2002): plays the arm with the highest upper confidence bound on its mean reward, achieving O(sqrt(KT log T)) minimax regret.
- Thompson Sampling (Thompson 1933; Agrawal & Goyal 2012): samples from a Bayesian posterior and plays the arm with the highest sample, achieving optimal Bayesian regret.
- KL-UCB (Garivier & Cappé 2011; Cappé et al. 2013): uses the Kullback-Leibler divergence to build tighter confidence bounds, and is provably asymptotically optimal for Bernoulli bandits.
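To make the index-based strategy concrete, here is a minimal sketch of UCB1 as described above: after playing each arm once, it repeatedly pulls the arm maximizing the empirical mean plus a `sqrt(2 ln t / n)` exploration bonus. This is an illustrative standalone function, not the benchmark's reference implementation.

```python
import math


def ucb1(pull, K, T):
    """Run UCB1 for T rounds on K arms.

    `pull(arm)` returns a stochastic reward in [0, 1].
    Returns the total reward and per-arm pull counts.
    """
    counts = [0] * K
    sums = [0.0] * K
    total = 0.0
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1  # initialization: play each arm once
        else:
            # Upper confidence bound: empirical mean + exploration bonus.
            arm = max(
                range(K),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r
        total += r
    return total, counts
```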
Key challenges include adapting to different reward distributions, handling contextual information, and detecting non-stationarity.
Task
Modify the BanditPolicy class in custom_bandit.py (the EDITABLE section). You must implement:
- `__init__(K, context_dim)`: initialize your policy for K arms with optional context
- `select_arm(t, context)`: choose which arm to pull at timestep t
- `update(arm, reward, context)`: update internal state after observing a reward
- `reset()`: reset state for a new run
Interface
```python
class BanditPolicy:
    def __init__(self, K: int, context_dim: int = 0): ...
    def reset(self): ...
    def select_arm(self, t: int, context: np.ndarray | None = None) -> int: ...
    def update(self, arm: int, reward: float, context: np.ndarray | None = None): ...
```
Available utilities (in the FIXED section):
- `kl_bernoulli(p, q)`: KL divergence between Bernoulli distributions
- `kl_ucb_bound(mu_hat, n, t, c)`: computes the KL-UCB upper confidence bound
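For intuition, here is one common formulation of these two utilities: the Bernoulli KL divergence with clamping, and a bisection search for the largest mean `q >= mu_hat` such that `n * kl(mu_hat, q) <= log(t) + c * log(log(t))`. The exact signatures and tolerances in the FIXED section may differ; treat this as a hedged sketch, not the benchmark's code.

```python
import math


def kl_bernoulli(p, q, eps=1e-12):
    """KL(Bernoulli(p) || Bernoulli(q)), clamped to avoid log(0)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def kl_ucb_bound(mu_hat, n, t, c=0.0, tol=1e-6):
    """Largest q in [mu_hat, 1] with n * kl(mu_hat, q) <= log t + c*log log t.

    Found by bisection; kl_bernoulli(mu_hat, q) is increasing in q on
    [mu_hat, 1], so the feasible set is an interval.
    """
    if n == 0:
        return 1.0  # unexplored arm: maximally optimistic
    level = (math.log(max(t, 2)) + c * math.log(math.log(max(t, 3)))) / n
    lo, hi = mu_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(mu_hat, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo
```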
Evaluation
Evaluated on three bandit settings (lower regret is better):
- Stochastic MAB: 10-armed Bernoulli bandit, T=10,000 rounds. Arms have fixed reward probabilities.
- Contextual: 5-armed linear contextual bandit with d=10 features, T=10,000 rounds. Expected reward is a linear function of the context.
- Non-stationary: 5-armed piecewise-stationary Bernoulli bandit with 4 abrupt changepoints, T=10,000 rounds. The best arm changes over time.
Metric: normalized cumulative regret = (cumulative regret) / T.
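As a sanity check on the metric, the normalized pseudo-regret can be computed from the true arm means and the sequence of arms played (for a stationary setting; the non-stationary case would compare against the per-round best arm). This helper is illustrative, not the evaluation harness's code.

```python
def normalized_regret(means, arms_played):
    """Normalized cumulative pseudo-regret at horizon T = len(arms_played).

    Sums the per-round gap between the best arm's mean and the mean of
    the arm actually played, then divides by T.
    """
    best = max(means)
    T = len(arms_played)
    return sum(best - means[a] for a in arms_played) / T
```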
Code
```python
# Custom online bandit algorithm for MLS-Bench
#
# EDITABLE section: BanditPolicy class — the exploration-exploitation strategy.
# FIXED sections: everything else (environments, evaluation, main loop).
#
# Three evaluation settings:
#   1. Stochastic MAB (K=10 Bernoulli arms, T=10000)
#   2. Contextual Bandits (d=10 context, K=5 linear arms, T=10000)
#   3. Non-stationary MAB (K=5 Bernoulli arms with 4 abrupt changes, T=10000)
#
# Metric: normalized cumulative regret at horizon T (lower is better).

import argparse
import math
import os
```
Results
| Model | Type | normalized regret stochastic mab ↓ | normalized regret contextual ↓ | normalized regret nonstationary ↓ |
|---|---|---|---|---|
| kl_ucb | baseline | 0.061 | 0.179 | 0.035 |
| thompson_sampling | baseline | 0.035 | 0.020 | 0.060 |
| ucb1 | baseline | 0.145 | 0.181 | 0.067 |
| anthropic/claude-opus-4.6 | vanilla | 0.008 | 0.020 | 0.084 |
| deepseek-reasoner | vanilla | 0.140 | 0.005 | 0.140 |
| google/gemini-3.1-pro-preview | vanilla | 0.010 | 0.004 | 0.015 |
| openai/gpt-5.4-pro | vanilla | 0.006 | 0.021 | 0.007 |
| qwen3.6-plus:free | vanilla | 0.103 | 0.012 | 0.050 |
| anthropic/claude-opus-4.6 | agent | 0.008 | 0.020 | 0.096 |
| anthropic/claude-opus-4.6 | agent | 0.005 | 0.012 | 0.023 |
| deepseek-reasoner | agent | - | 0.005 | 0.033 |
| google/gemini-3.1-pro-preview | agent | 0.010 | 0.004 | 0.015 |
| openai/gpt-5.4-pro | agent | 0.008 | 0.002 | 0.016 |
| qwen3.6-plus:free | agent | 0.012 | 0.013 | 0.013 |