Understanding the exploration-exploitation tradeoff
Imagine you're in a casino facing a row of slot machines (historically called "one-armed bandits"). Each machine has a different, unknown probability of paying out. You have a limited number of pulls—how do you maximize your total winnings?
This is the Multi-Armed Bandit (MAB) problem: a classic framework in decision-making under uncertainty. The challenge lies in balancing two competing objectives:
Exploration: trying different arms to learn their reward probabilities.
Exploitation: pulling the arm that currently appears best to maximize reward.
Too much exploration wastes pulls on suboptimal arms. Too much exploitation risks missing the truly best arm. Finding the optimal balance is what MAB algorithms aim to solve.
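The setup above can be sketched as a tiny simulator. This is a minimal illustration, not any library's API; the class name `BernoulliBandit` and the payout rates are made up for the example:

```python
import random

class BernoulliBandit:
    """A k-armed bandit where arm i pays 1 with hidden probability probs[i]."""

    def __init__(self, probs, seed=None):
        self.probs = list(probs)
        self.rng = random.Random(seed)

    def pull(self, arm):
        # Return 1 (win) or 0 (lose) according to the arm's hidden payout rate.
        return 1 if self.rng.random() < self.probs[arm] else 0

bandit = BernoulliBandit([0.2, 0.5, 0.8], seed=42)
rewards = [bandit.pull(2) for _ in range(1000)]
# The observed win rate of arm 2 should be close to its hidden rate of 0.8.
```

The player only sees the 0/1 rewards, never `probs`; every algorithm below has to estimate those rates from pulls alone.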
The MAB framework extends far beyond casinos. Any scenario involving sequential decisions with uncertain outcomes can be modeled as a bandit problem:
Clinical trials: allocate patients to treatments while learning which is most effective.
A/B testing: optimize website layouts, ads, or features in real-time.
Online advertising: select which ads to show to maximize click-through rates.
Recommendation systems: suggest content while learning user preferences.
Network routing: choose communication channels with unknown reliability.
Portfolio allocation: allocate investments across assets with uncertain returns.
Several algorithms have been developed to solve the MAB problem. Each takes a different approach to the exploration-exploitation tradeoff.
Random selection picks arms uniformly at random, ignoring past observations. It serves as a baseline: any reasonable algorithm should outperform pure random selection.
Pros: Simple, no parameters
Cons: Does not learn; linear regret growth
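As a minimal sketch, the random baseline might look like this (the payout rates and all names are illustrative):

```python
import random

rng = random.Random(0)
probs = [0.2, 0.5, 0.8]   # hidden payout rates, unknown to the player

def pull(arm):
    """Draw a Bernoulli reward from the chosen arm."""
    return 1 if rng.random() < probs[arm] else 0

def random_policy(n_arms, horizon):
    """Baseline: choose an arm uniformly at random at every step."""
    return sum(pull(rng.randrange(n_arms)) for _ in range(horizon))

total = random_policy(len(probs), 1000)
# Average reward hovers around the mean of probs (0.5) no matter how long we play,
# which is exactly the linear-regret behaviour noted above.
```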
UCB follows the principle of "optimism in the face of uncertainty." It selects the arm with the highest upper confidence bound—a combination of estimated reward plus an exploration bonus that decreases as the arm is pulled more.
At step \(t\), UCB selects

\[
a_t = \operatorname*{arg\,max}_a \left( \hat{\mu}_a + \sqrt{\frac{c \ln t}{N_a(t)}} \right)
\]

where \(\hat{\mu}_a\) is the empirical mean reward of arm \(a\), \(N_a(t)\) is the number of times arm \(a\) has been pulled, and \(c\) is a confidence parameter.
Pros: Strong theoretical guarantees; logarithmic regret
Cons: Can be slow to converge with many arms
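A sketch of UCB play against Bernoulli arms, assuming the mean-plus-bonus selection rule described above (the payout rates and names are illustrative):

```python
import math
import random

def ucb(probs, horizon, c=2.0, seed=0):
    """UCB1-style play against Bernoulli arms with hidden payout rates `probs`."""
    rng = random.Random(seed)
    n_arms = len(probs)
    counts = [0] * n_arms        # N_a(t): pulls of each arm so far
    sums = [0.0] * n_arms        # total reward of each arm so far
    total = 0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1          # initialise: pull every arm once
        else:
            # Empirical mean plus exploration bonus; the bonus shrinks as N_a grows.
            arm = max(range(n_arms),
                      key=lambda a: sums[a] / counts[a]
                                    + math.sqrt(c * math.log(t) / counts[a]))
        r = 1 if rng.random() < probs[arm] else 0
        counts[arm] += 1
        sums[arm] += r
        total += r
    return total, counts

total, counts = ucb([0.2, 0.5, 0.8], 2000)
# counts should be heavily skewed toward the best arm (index 2).
```

Note the initialisation pass: each arm must be pulled once so that `counts[a]` is never zero when the bonus is computed.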
ε-Greedy is a simple yet effective strategy: with probability \(\varepsilon\), explore by choosing a random arm; otherwise, exploit by choosing the arm with the highest observed mean reward.
The parameter \(\varepsilon\) controls the exploration rate. Common values range from 0.01 to 0.1.
Pros: Simple to implement and tune
Cons: Fixed exploration rate may not be optimal; explores uniformly rather than targeting uncertainty
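The explore/exploit coin flip can be sketched as follows, again against illustrative Bernoulli arms:

```python
import random

def epsilon_greedy(probs, horizon, epsilon=0.1, seed=0):
    """Epsilon-greedy play against Bernoulli arms with hidden payout rates `probs`."""
    rng = random.Random(seed)
    n = len(probs)
    counts = [0] * n
    means = [0.0] * n            # running mean reward per arm
    total = 0
    for _ in range(horizon):
        if rng.random() < epsilon:
            arm = rng.randrange(n)                       # explore: any arm
        else:
            arm = max(range(n), key=lambda a: means[a])  # exploit: best so far
        r = 1 if rng.random() < probs[arm] else 0
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]     # incremental mean update
        total += r
    return total
```

The incremental update avoids storing reward histories; it is algebraically the same as recomputing the mean from scratch.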
Thompson Sampling is a Bayesian approach that maintains a probability distribution over each arm's true reward rate. At each step, it samples from these distributions and selects the arm with the highest sample.
For binary rewards (win/lose), we use Beta distributions, sampling each arm's plausible reward rate as

\[
\theta_a \sim \mathrm{Beta}(\alpha_a, \beta_a)
\]
After observing reward \(r \in \{0, 1\}\), update: \(\alpha_a \leftarrow \alpha_a + r\) and \(\beta_a \leftarrow \beta_a + (1 - r)\).
Pros: Often achieves best empirical performance; naturally balances exploration/exploitation
Cons: Requires specifying prior distributions
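A sketch of Thompson Sampling with the Beta updates above, starting from uniform Beta(1, 1) priors (an assumption; the payout rates and names are illustrative):

```python
import random

def thompson(probs, horizon, seed=0):
    """Thompson Sampling for Bernoulli rewards with Beta(1, 1) priors on each arm."""
    rng = random.Random(seed)
    n = len(probs)
    alpha = [1.0] * n            # 1 + observed successes
    beta = [1.0] * n             # 1 + observed failures
    total = 0
    for _ in range(horizon):
        # Draw a plausible rate for each arm from its posterior,
        # then act greedily on the draws.
        arm = max(range(n), key=lambda a: rng.betavariate(alpha[a], beta[a]))
        r = 1 if rng.random() < probs[arm] else 0
        alpha[arm] += r          # posterior update: alpha_a <- alpha_a + r
        beta[arm] += 1 - r       # posterior update: beta_a  <- beta_a + (1 - r)
        total += r
    return total
```

Exploration is automatic: an arm pulled few times has a wide posterior, so its samples occasionally come out on top even if its current mean looks mediocre.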
| Algorithm | Approach | Regret Bound | Best For |
|---|---|---|---|
| Random | Uniform random | O(T) - Linear | Baseline comparison |
| UCB | Optimistic estimates | O(log T) | When theoretical guarantees matter |
| ε-Greedy | Random exploration | O(T) or O(log T)* | Simple problems; quick implementation |
| Thompson | Bayesian sampling | O(log T) | Best empirical performance |
*ε-Greedy achieves O(log T) regret only with a decaying exploration rate.
Cumulative reward: the total reward accumulated over all pulls. Higher is better.
Regret: the difference between the optimal strategy (always pulling the best arm) and the algorithm's actual performance. Lower is better.
\[
R(T) = T\,\mu^* - \sum_{t=1}^{T} \mathbb{E}[r_t]
\]

where \(\mu^*\) is the true probability of the best arm and \(r_t\) is the reward received at step \(t\). Good algorithms have regret that grows sub-linearly (ideally logarithmically) in \(T\).
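Given the true arm means, expected regret reduces to summing the per-pull gap to the best arm. A minimal sketch (the helper name and sample values are illustrative):

```python
def regret(true_means, arms_played):
    """Cumulative expected regret: reward lost versus always playing the best arm."""
    best = max(true_means)
    return sum(best - true_means[a] for a in arms_played)

# Playing arms 0 and 1 once each, then the best arm three times:
r = regret([0.2, 0.5, 0.8], [0, 1, 2, 2, 2])   # (0.8-0.2) + (0.8-0.5) ≈ 0.9
```

A policy that locks onto the best arm stops accumulating regret, which is why the good algorithms above achieve logarithmic rather than linear growth.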