optimization-online-bandit

Tags: optimization · SMPyBandits · rigorous codebase

Description

Online Bandits: Exploration-Exploitation Strategy Design

Objective

Design and implement a bandit policy that minimizes cumulative regret across diverse multi-armed bandit settings. Your code goes in custom_bandit.py. Three reference implementations (UCB1, Thompson Sampling, KL-UCB) are available as read-only in the SMPyBandits package.

Background

The multi-armed bandit problem is a fundamental model for the exploration-exploitation tradeoff in sequential decision-making. At each round, an agent selects one of K arms and observes a stochastic reward. The goal is to minimize cumulative regret -- the gap between the reward of the best arm (in hindsight) and the agent's actual reward.

Classic algorithms include:

  • UCB1 (Auer et al., 2002): plays the arm with the highest upper confidence bound on its mean reward, achieving O(sqrt(KT log T)) minimax regret.
  • Thompson Sampling (Thompson 1933; Agrawal & Goyal 2012): samples from a Bayesian posterior and plays the arm with the highest sample, achieving optimal Bayesian regret.
  • KL-UCB (Garivier & Cappe 2011; Cappe et al. 2013): uses Kullback-Leibler divergence for tighter confidence bounds, provably optimal for Bernoulli bandits.

Key challenges include adapting to different reward distributions, handling contextual information, and detecting non-stationarity.
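For concreteness, UCB1's arm index from the list above can be sketched as follows. This is a minimal illustration of the index formula, not code from the repository:

```python
import math

def ucb1_index(mean: float, pulls: int, t: int) -> float:
    # UCB1 index (Auer et al., 2002): empirical mean plus an exploration
    # bonus that grows with t and shrinks as the arm accumulates pulls.
    if pulls == 0:
        return float("inf")  # force every arm to be tried at least once
    return mean + math.sqrt(2.0 * math.log(t) / pulls)
```

At each round the agent plays the arm with the largest index, so rarely pulled arms keep a large bonus and are eventually re-explored.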

Task

Modify the BanditPolicy class in custom_bandit.py (the EDITABLE section). You must implement:

  • __init__(K, context_dim): initialize your policy for K arms with optional context
  • select_arm(t, context): choose which arm to pull at timestep t
  • update(arm, reward, context): update internal state after observing a reward
  • reset(): reset state for a new run

Interface

class BanditPolicy:
    def __init__(self, K: int, context_dim: int = 0): ...
    def reset(self): ...
    def select_arm(self, t: int, context: np.ndarray | None = None) -> int: ...
    def update(self, arm: int, reward: float, context: np.ndarray | None = None): ...

Available utilities (in the FIXED section):

  • kl_bernoulli(p, q): KL divergence between Bernoulli distributions
  • kl_ucb_bound(mu_hat, n, t, c): computes the KL-UCB upper confidence bound
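The exact implementations live in the FIXED section; a plausible sketch of these two utilities, assuming the standard KL-UCB exploration budget log(t) + c·log(log(t)) and a bisection search for the bound, is:

```python
import math

def kl_bernoulli(p: float, q: float, eps: float = 1e-12) -> float:
    # KL(Ber(p) || Ber(q)); clip both away from {0, 1} to avoid log(0).
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def kl_ucb_bound(mu_hat: float, n: int, t: int, c: float = 0.0) -> float:
    # Largest q >= mu_hat with n * KL(mu_hat, q) <= log(t) + c*log(log(t)),
    # located by bisection (Garivier & Cappe, 2011). The exact budget used
    # by the FIXED section is an assumption here.
    if n == 0:
        return 1.0
    budget = math.log(max(t, 2)) + c * math.log(math.log(max(t, 3)))
    lo, hi = mu_hat, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2.0
        if n * kl_bernoulli(mu_hat, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo
```

The bound tightens toward mu_hat as n grows, which is what makes KL-UCB asymptotically optimal for Bernoulli rewards.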

Evaluation

Policies are evaluated on three bandit settings (lower regret is better):

  1. Stochastic MAB: 10-armed Bernoulli bandit, T=10,000 rounds. Arms have fixed reward probabilities.
  2. Contextual: 5-armed linear contextual bandit with d=10 features, T=10,000 rounds. Expected reward is a linear function of the context.
  3. Non-stationary: 5-armed piece-wise stationary Bernoulli bandit with 4 abrupt changepoints, T=10,000 rounds. The best arm changes over time.

Metric: normalized cumulative regret = (cumulative regret) / T.
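As an illustration, this metric can be computed from the true arm means and the sequence of pulled arms. Note this is the expected pseudo-regret; the harness's exact accounting may differ:

```python
import numpy as np

def normalized_regret(mu: np.ndarray, pulled: np.ndarray) -> float:
    # Pseudo-regret: sum over rounds of (best mean - mean of pulled arm),
    # divided by the horizon T = len(pulled).
    gaps = mu.max() - mu[pulled]
    return float(gaps.sum() / len(pulled))
```

For example, with means [0.1, 0.9] and pulls [0, 1, 1, 1], the only suboptimal pull costs 0.8, giving a normalized regret of 0.2 over T=4 rounds.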

Code

custom_bandit.py
# Custom online bandit algorithm for MLS-Bench
#
# EDITABLE section: BanditPolicy class — the exploration-exploitation strategy.
# FIXED sections: everything else (environments, evaluation, main loop).
#
# Three evaluation settings:
# 1. Stochastic MAB (K=10 Bernoulli arms, T=10000)
# 2. Contextual Bandits (d=10 context, K=5 linear arms, T=10000)
# 3. Non-stationary MAB (K=5 Bernoulli arms with 4 abrupt changes, T=10000)
#
# Metric: normalized cumulative regret at horizon T (lower is better).

import argparse
import math
import os

Results

Normalized cumulative regret per setting (lower is better):

Model                          | Type     | Stochastic MAB | Contextual | Non-stationary
-------------------------------|----------|----------------|------------|---------------
kl_ucb                         | baseline | 0.061          | 0.179      | 0.035
thompson_sampling              | baseline | 0.035          | 0.020      | 0.060
ucb1                           | baseline | 0.145          | 0.181      | 0.067
anthropic/claude-opus-4.6      | vanilla  | 0.008          | 0.020      | 0.084
deepseek-reasoner              | vanilla  | 0.140          | 0.005      | 0.140
google/gemini-3.1-pro-preview  | vanilla  | 0.010          | 0.004      | 0.015
openai/gpt-5.4-pro             | vanilla  | 0.006          | 0.021      | 0.007
qwen3.6-plus:free              | vanilla  | 0.103          | 0.012      | 0.050
anthropic/claude-opus-4.6      | agent    | 0.008          | 0.020      | 0.096
anthropic/claude-opus-4.6      | agent    | 0.005          | 0.012      | 0.023
deepseek-reasoner              | agent    | -              | 0.005      | 0.033
google/gemini-3.1-pro-preview  | agent    | 0.010          | 0.004      | 0.015
openai/gpt-5.4-pro             | agent    | 0.008          | 0.002      | 0.016
qwen3.6-plus:free              | agent    | 0.012          | 0.013      | 0.013

Agent Conversations