
The Multi-Armed Bandit Problem

Understanding the exploration-exploitation tradeoff

What is the Multi-Armed Bandit Problem?

Imagine you're in a casino facing a row of slot machines (historically called "one-armed bandits"). Each machine has a different, unknown probability of paying out. You have a limited number of pulls—how do you maximize your total winnings?

This is the Multi-Armed Bandit (MAB) problem: a classic framework in decision-making under uncertainty. The challenge lies in balancing two competing objectives:

Exploration:

Trying different arms to learn their reward probabilities

Exploitation:

Pulling the arm that currently appears best to maximize reward

Too much exploration wastes pulls on suboptimal arms. Too much exploitation risks missing the truly best arm. Finding the optimal balance is what MAB algorithms aim to solve.

Real-World Applications

The MAB framework extends far beyond casinos. Any scenario involving sequential decisions with uncertain outcomes can be modeled as a bandit problem:

🔬

Clinical Trials

Allocate patients to treatments while learning which is most effective

📱

A/B Testing

Optimize website layouts, ads, or features in real-time

🎯

Ad Placement

Select which ads to show to maximize click-through rates

🤖

Recommendation Systems

Suggest content while learning user preferences

📡

Network Routing

Choose communication channels with unknown reliability

💰

Portfolio Optimization

Allocate investments across assets with uncertain returns

Bandit Algorithms

Several algorithms have been developed to solve the MAB problem. Each takes a different approach to the exploration-exploitation tradeoff.

Random (Baseline)

Selects arms uniformly at random, ignoring past observations. This serves as a baseline—any reasonable algorithm should outperform pure random selection.

\[P(A_t = a) = \frac{1}{K} \quad \text{for all arms } a \in \{1, \ldots, K\}\]

Pros: Simple, no parameters
Cons: Does not learn; linear regret growth

Upper Confidence Bound (UCB)

UCB follows the principle of "optimism in the face of uncertainty." It selects the arm with the highest upper confidence bound—a combination of estimated reward plus an exploration bonus that decreases as the arm is pulled more.

\[A_t = \arg\max_a \left[ \hat{\mu}_a + c \sqrt{\frac{\ln t}{N_a(t)}} \right]\]

Where \(\hat{\mu}_a\) is the empirical mean reward of arm \(a\), \(N_a(t)\) is the number of times arm \(a\) has been pulled, and \(c\) is a confidence parameter.

Pros: Strong theoretical guarantees; logarithmic regret
Cons: Can be slow to converge with many arms
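The selection rule above can be sketched as a minimal UCB loop on simulated Bernoulli arms. Function names and the environment are illustrative assumptions, not the simulator's implementation:

```python
import math
import random

def ucb_select(means, counts, t, c=2.0):
    """Pick the arm maximizing empirical mean + confidence bonus.
    Any arm not yet pulled is tried first (its bonus is effectively infinite)."""
    for a, n in enumerate(counts):
        if n == 0:
            return a
    return max(range(len(means)),
               key=lambda a: means[a] + c * math.sqrt(math.log(t) / counts[a]))

def run_ucb(true_probs, T=2000, c=2.0, seed=0):
    """Run UCB for T steps; return total reward and per-arm pull counts."""
    random.seed(seed)
    K = len(true_probs)
    means, counts = [0.0] * K, [0] * K
    total = 0
    for t in range(1, T + 1):
        a = ucb_select(means, counts, t, c)
        r = 1 if random.random() < true_probs[a] else 0
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]   # incremental mean update
        total += r
    return total, counts
```

As an arm accumulates pulls, its bonus term shrinks, so the algorithm gradually concentrates on the arm whose empirical mean genuinely holds up.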

ε-Greedy (Epsilon-Greedy)

A simple yet effective strategy: with probability \(\varepsilon\), explore by choosing a random arm; otherwise, exploit by choosing the arm with the highest observed mean reward.

\[A_t = \begin{cases} \text{random arm} & \text{with probability } \varepsilon \\ \arg\max_a \hat{\mu}_a & \text{with probability } 1 - \varepsilon \end{cases}\]

The parameter \(\varepsilon\) controls the exploration rate. Common values range from 0.01 to 0.1.

Pros: Simple to implement and tune
Cons: Fixed exploration rate may not be optimal; explores uniformly rather than targeting uncertainty
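A minimal sketch of the rule, again on simulated Bernoulli arms (names and setup are illustrative):

```python
import random

def run_epsilon_greedy(true_probs, T=2000, eps=0.1, seed=0):
    """epsilon-greedy: with prob. eps pick a random arm, else the best mean."""
    random.seed(seed)
    K = len(true_probs)
    means, counts = [0.0] * K, [0] * K
    total = 0
    for _ in range(T):
        if random.random() < eps:
            a = random.randrange(K)                    # explore
        else:
            a = max(range(K), key=lambda i: means[i])  # exploit
        r = 1 if random.random() < true_probs[a] else 0
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]         # incremental mean update
        total += r
    return total, counts
```

Note that exploration here is uniform: a clearly bad arm is sampled as often as a promising-but-uncertain one, which is the weakness the Cons line above refers to.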

Thompson Sampling

A Bayesian approach that maintains a probability distribution over each arm's true reward rate. At each step, it samples from these distributions and selects the arm with the highest sample.

For binary rewards (win/lose), we use Beta distributions:

\[\theta_a \sim \text{Beta}(\alpha_a, \beta_a)\] \[A_t = \arg\max_a \theta_a\]

After observing reward \(r \in \{0, 1\}\), update: \(\alpha_a \leftarrow \alpha_a + r\) and \(\beta_a \leftarrow \beta_a + (1 - r)\).

Pros: Often achieves best empirical performance; naturally balances exploration/exploitation
Cons: Requires specifying prior distributions
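The Beta-Bernoulli updates above can be sketched in a few lines; this assumes uniform Beta(1, 1) priors, and the function name and environment are illustrative:

```python
import random

def run_thompson(true_probs, T=2000, seed=0):
    """Beta-Bernoulli Thompson sampling with uniform Beta(1, 1) priors."""
    random.seed(seed)
    K = len(true_probs)
    alpha, beta = [1] * K, [1] * K   # Beta(1, 1) = uniform prior on each arm
    total = 0
    for _ in range(T):
        # Sample a plausible reward rate for each arm, then act greedily on it
        samples = [random.betavariate(alpha[a], beta[a]) for a in range(K)]
        a = max(range(K), key=lambda i: samples[i])
        r = 1 if random.random() < true_probs[a] else 0
        alpha[a] += r                # success increments alpha
        beta[a] += 1 - r             # failure increments beta
        total += r
    return total, alpha, beta
```

Exploration emerges automatically: an arm pulled few times has a wide Beta posterior, so it occasionally produces the highest sample and gets tried again.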

Algorithm Comparison

Algorithm | Approach             | Regret Bound        | Best For
Random    | Uniform random       | O(T) (linear)       | Baseline comparison
UCB       | Optimistic estimates | O(log T)            | When theoretical guarantees matter
ε-Greedy  | Random exploration   | O(T) or O(log T)*   | Simple problems; quick implementation
Thompson  | Bayesian sampling    | O(log T)            | Best empirical performance

*ε-Greedy achieves O(log T) regret only with a decaying exploration rate.

Key Metrics

Cumulative Reward

The total reward accumulated over all pulls. Higher is better.

\[\text{Cumulative Reward} = \sum_{t=1}^{T} R_t\]

Cumulative Regret

The difference between the optimal strategy (always pulling the best arm) and the algorithm's actual performance. Lower is better.

\[\text{Regret}(T) = T \cdot \mu^* - \sum_{t=1}^{T} R_t\]

Where \(\mu^*\) is the expected reward (for binary rewards, the success probability) of the best arm. Good algorithms have regret that grows sub-linearly (ideally logarithmically) in \(T\).
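The regret formula translates directly into code. A small helper (name and example rewards are illustrative) that turns a sequence of observed rewards into a regret curve:

```python
def cumulative_regret(rewards, mu_star):
    """Regret(t) = t * mu_star - sum of rewards observed up to step t."""
    curve, total = [], 0
    for t, r in enumerate(rewards, start=1):
        total += r
        curve.append(t * mu_star - total)
    return curve

# Example: suppose the best arm pays 0.8 on average and we observed these rewards
curve = cumulative_regret([1, 0, 1, 1, 0], mu_star=0.8)
```

On any single run the curve can dip negative (a lucky streak beats the expectation), but averaged over many runs it is non-negative, and its growth rate is what the table above compares.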

Ready to see these algorithms in action?

Try the Simulator →