
Complete Guide to Bandit Algorithms for Personalization


Introduction to Bandit Personalization

Bandit personalization refers to using multi-armed bandit algorithms to deliver personalized experiences to users in software applications. It's a machine learning approach that solves the classic "exploration vs exploitation" problem in personalization.



How it Works

The algorithm treats each personalization option (like different content recommendations, UI layouts, or product suggestions) as an "arm" of a slot machine. For each user interaction, it must decide whether to:

  • Exploit: Show the option that currently appears to work best for that user
  • Explore: Try a different option to potentially discover something even better

Key Advantages

Benefits over Traditional A/B Testing

  • Real-time learning: Unlike A/B tests that require weeks of data collection, bandit algorithms adapt continuously as they gather feedback from user interactions.
  • Reduced opportunity cost: Instead of showing suboptimal experiences to 50% of users during traditional A/B testing, bandits quickly shift traffic toward better-performing options.
  • Individual personalization: The algorithm can learn different preferences for different user segments or even individual users.

Common Applications

  • Content recommendations: News sites, streaming platforms, and social media feeds
  • Product recommendations: E-commerce platforms suggesting items
  • Email marketing: Optimizing subject lines, send times, or content
  • Ad targeting: Selecting which ads to show to which users
  • Website optimization: Testing different layouts, colors, or call-to-action buttons

Core Bandit Algorithms

There are three fundamental bandit algorithms that form the foundation for understanding this field:

1. Epsilon-Greedy

Algorithm Overview

Approach: Simple random exploration with fixed probability.

How it works:

  • With probability epsilon (e.g., 0.1), choose a random arm (explore)
  • Otherwise, choose the arm with the highest average reward (exploit)

Advantages

  • Extremely simple to understand and implement
  • Predictable exploration behavior
  • Good baseline for comparison

Disadvantages

  • Fixed exploration rate doesn't adapt to uncertainty
  • Explores randomly rather than intelligently

2. Thompson Sampling

Algorithm Overview

Approach: Bayesian inference using Beta distributions to model uncertainty.

How it works:

  • Maintains alpha (successes + 1) and beta (failures + 1) for each arm
  • Samples from each arm's Beta distribution
  • Chooses the arm with the highest sample

Advantages

  • Strong theoretical regret bounds (asymptotically optimal for Bernoulli rewards)
  • No hyperparameters to tune
  • Naturally balances exploration and exploitation through uncertainty
  • More exploration for uncertain arms, less for well-known arms

Key insight: The algorithm asks "Given what I've observed, what might this arm's true success rate be?" and samples a plausible answer.

3. UCB (Upper Confidence Bound)

Algorithm Overview

Approach: Uses confidence intervals to balance average reward with uncertainty.

How it works:

  • Calculates upper confidence bound: average_reward + confidence_interval
  • Confidence interval decreases as more data is collected
  • Always picks the arm with the highest UCB value

Formula: UCB = average_reward + c * sqrt(log(total_pulls) / arm_pulls)
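Plugging hypothetical numbers into the formula shows how the exploration bonus shrinks as an arm accumulates pulls (the counts below are invented for illustration):

```python
import math

def ucb_value(avg_reward, total_pulls, arm_pulls, c=2.0):
    """UCB = average reward plus a confidence bonus that shrinks with data."""
    return avg_reward + c * math.sqrt(math.log(total_pulls) / arm_pulls)

# Same average reward, different amounts of data for the arm:
lightly_sampled = ucb_value(avg_reward=0.6, total_pulls=100, arm_pulls=10)
heavily_sampled = ucb_value(avg_reward=0.6, total_pulls=100, arm_pulls=80)

print(round(lightly_sampled, 3))  # larger bonus: ~1.957
print(round(heavily_sampled, 3))  # smaller bonus: ~1.080
```

With identical averages, the lightly sampled arm wins on its larger confidence bonus, which is exactly how UCB directs exploration toward uncertain arms.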

Advantages

  • Deterministic decision making
  • Clear mathematical intuition
  • Good theoretical guarantees
  • Explicitly quantifies the exploration-exploitation tradeoff

Code Implementations

Complete Implementation of All Three Algorithms

import random
import math
import numpy as np

# ===== BANDIT ALGORITHMS =====

class EpsilonGreedyBandit:
    def __init__(self, n_arms, epsilon=0.1):
        """
        Epsilon-Greedy bandit implementation.
        
        Args:
            n_arms: Number of arms (options) to choose from
            epsilon: Probability of exploring (vs exploiting)
        """
        self.n_arms = n_arms
        self.epsilon = epsilon
        
        # Track total reward and count for each arm
        self.arm_rewards = [0.0] * n_arms
        self.arm_counts = [0] * n_arms
    
    def get_average_reward(self, arm):
        """Calculate average reward for an arm."""
        if self.arm_counts[arm] == 0:
            return 0.0
        return self.arm_rewards[arm] / self.arm_counts[arm]
    
    def select_arm(self):
        """Select an arm using epsilon-greedy strategy."""
        # Explore: choose random arm
        if random.random() < self.epsilon:
            return random.randint(0, self.n_arms - 1)
        
        # Exploit: choose arm with highest average reward
        best_arm = 0
        best_avg = self.get_average_reward(0)
        
        for arm in range(1, self.n_arms):
            avg_reward = self.get_average_reward(arm)
            if avg_reward > best_avg:
                best_avg = avg_reward
                best_arm = arm
        
        return best_arm
    
    def update(self, arm, reward):
        """Update the bandit with the reward received."""
        self.arm_counts[arm] += 1
        self.arm_rewards[arm] += reward


class ThompsonSamplingBandit:
    def __init__(self, n_arms):
        """
        Thompson Sampling bandit using Beta distributions.
        """
        self.n_arms = n_arms
        
        # Beta distribution parameters
        self.alpha = [1.0] * n_arms  # Start with uniform prior
        self.beta = [1.0] * n_arms
        
        # Track statistics for reporting
        self.arm_counts = [0] * n_arms
        self.arm_rewards = [0.0] * n_arms
    
    def select_arm(self):
        """Select an arm by sampling from Beta distributions."""
        samples = []
        
        # Sample from each arm's Beta distribution
        for arm in range(self.n_arms):
            sample = np.random.beta(self.alpha[arm], self.beta[arm])
            samples.append(sample)
        
        # Choose arm with highest sample
        return samples.index(max(samples))
    
    def update(self, arm, reward):
        """Update the Beta distribution parameters."""
        self.arm_counts[arm] += 1
        self.arm_rewards[arm] += reward
        
        if reward == 1:
            self.alpha[arm] += 1  # Success
        else:
            self.beta[arm] += 1   # Failure


class UCBBandit:
    def __init__(self, n_arms, c=2.0):
        """
        Upper Confidence Bound bandit.
        """
        self.n_arms = n_arms
        self.c = c
        
        # Track total reward and count for each arm
        self.arm_rewards = [0.0] * n_arms
        self.arm_counts = [0] * n_arms
        self.total_counts = 0
    
    def get_ucb_value(self, arm):
        """Calculate Upper Confidence Bound for an arm."""
        if self.arm_counts[arm] == 0:
            return float('inf')  # Unplayed arms have infinite UCB
        
        avg_reward = self.arm_rewards[arm] / self.arm_counts[arm]
        confidence_interval = self.c * math.sqrt(
            math.log(self.total_counts) / self.arm_counts[arm]
        )
        
        return avg_reward + confidence_interval
    
    def select_arm(self):
        """Select arm with highest Upper Confidence Bound."""
        best_arm = 0
        best_ucb = self.get_ucb_value(0)
        
        for arm in range(1, self.n_arms):
            ucb_value = self.get_ucb_value(arm)
            if ucb_value > best_ucb:
                best_ucb = ucb_value
                best_arm = arm
        
        return best_arm
    
    def update(self, arm, reward):
        """Update the bandit with the reward received."""
        self.arm_counts[arm] += 1
        self.arm_rewards[arm] += reward
        self.total_counts += 1

Usage Example

# Create a bandit with 4 options, 10% exploration rate
bandit = EpsilonGreedyBandit(n_arms=4, epsilon=0.1)

# In your application loop:
chosen_option = bandit.select_arm()
# ... show the chosen option to the user ...
# ... measure success (1 for success, 0 for failure) ...
reward = 1  # e.g., the user clicked
bandit.update(chosen_option, reward)
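The select/update loop can be exercised end to end against simulated Bernoulli feedback. This sketch inlines a compact Thompson sampler (equivalent in spirit to the class above, re-declared so the snippet is self-contained) and uses invented click-through rates:

```python
import numpy as np

def simulate(bandit, true_rates, n_rounds, rng):
    """Run one bandit against Bernoulli arms; return total reward earned."""
    total = 0
    for _ in range(n_rounds):
        arm = bandit.select_arm()
        reward = 1 if rng.random() < true_rates[arm] else 0
        bandit.update(arm, reward)
        total += reward
    return total

class MiniThompson:
    """Compact Thompson sampler with a Beta posterior per arm."""
    def __init__(self, n_arms, rng):
        self.alpha = np.ones(n_arms)
        self.beta = np.ones(n_arms)
        self.rng = rng
    def select_arm(self):
        # Draw one plausible success rate per arm; pick the highest draw.
        return int(np.argmax(self.rng.beta(self.alpha, self.beta)))
    def update(self, arm, reward):
        if reward == 1:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1

rng = np.random.default_rng(0)
true_rates = [0.05, 0.10, 0.20]   # hypothetical click-through rates
total = simulate(MiniThompson(3, rng), true_rates, n_rounds=2000, rng=rng)
print(total / 2000)               # should land near the best rate, 0.20
```

Any of the three classes above can be passed to `simulate` in place of `MiniThompson`, which makes head-to-head comparisons straightforward.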

Algorithm Comparison

Performance Characteristics

When you run comparison simulations, you'll typically observe:

Thompson Sampling

  • Often achieves the highest total reward
  • More aggressive exploration early on
  • Quickly converges to optimal arm
  • Can fold in prior knowledge through its initial Beta parameters

UCB

  • Provides consistent, reliable performance
  • Systematic approach to exploration-exploitation
  • Good theoretical guarantees
  • Deterministic behavior (reproducible results)

Epsilon-Greedy

  • Simple and predictable
  • Less efficient than the other two
  • Fixed exploration rate doesn't adapt
  • Good baseline for comparison

When to Use Each Algorithm

  • Epsilon-Greedy: When simplicity is paramount, as a baseline for comparison, when you need predictable exploration behavior
  • Thompson Sampling: When you want optimal performance, for binary reward scenarios, when you can tolerate some randomness
  • UCB: When you need deterministic behavior, for scenarios requiring theoretical guarantees, when you want to tune exploration vs exploitation

Extended Bandit Algorithm Landscape

Beyond the three core algorithms, the bandit field includes dozens of specialized approaches:

Classic Multi-Armed Bandits (Stationary)

  • UCB1, UCB2: Variants of Upper Confidence Bound
  • Optimistic Initial Values: Start with high reward estimates
  • Gradient Bandit: Uses action preferences and softmax selection
  • EXP3: Exponential-weight algorithm for adversarial settings
  • Successive Elimination: Eliminates clearly suboptimal arms
  • MOSS: Minimax Optimal Strategy in the Stochastic case

Contextual Bandits (Feature-based)

  • LinUCB: Linear UCB with feature vectors
  • LinTS: Linear Thompson Sampling
  • Neural Contextual Bandits: Deep learning approaches
  • Contextual EXP4: Adversarial contextual version
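As a taste of the contextual setting, here is a minimal sketch of disjoint LinUCB, where each arm keeps its own ridge-regression estimate over the context features (the feature dimension and `alpha` below are illustrative choices):

```python
import numpy as np

class LinUCBArm:
    """Disjoint LinUCB: one ridge-regression estimate per arm."""
    def __init__(self, n_features, alpha=1.0):
        self.alpha = alpha
        self.A = np.eye(n_features)    # regularized feature covariance
        self.b = np.zeros(n_features)  # reward-weighted feature sum

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b         # current coefficient estimate
        # Predicted reward plus a confidence bonus for this context.
        return float(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

def select_arm(arms, x):
    """Pick the arm whose upper confidence bound is highest for context x."""
    scores = [arm.ucb(x) for arm in arms]
    return scores.index(max(scores))
```

The structure mirrors plain UCB, except both the reward estimate and the confidence bonus now depend on the user's feature vector `x`.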

Adversarial Bandits

  • EXP3: Works without stochastic assumptions
  • EXP4: With expert advice
  • Hedge Algorithm: Classic adversarial approach

Non-Stationary/Restless Bandits

  • Discounted UCB: Recent rewards matter more
  • Sliding Window UCB: Only consider recent history
  • Change Detection: Detect when distributions shift
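The discounting idea behind these variants is simple to sketch: decay all accumulated statistics by a factor gamma < 1 at each update, so stale observations fade (gamma here is an illustrative choice):

```python
class DiscountedAverage:
    """Reward average where old observations decay geometrically."""
    def __init__(self, gamma=0.99):
        self.gamma = gamma
        self.weighted_reward = 0.0
        self.weighted_count = 0.0

    def update(self, reward):
        # Decay past statistics, then add the new observation.
        self.weighted_reward = self.gamma * self.weighted_reward + reward
        self.weighted_count = self.gamma * self.weighted_count + 1.0

    def average(self):
        if self.weighted_count == 0.0:
            return 0.0
        return self.weighted_reward / self.weighted_count
```

Swapping this in for the plain running totals in the epsilon-greedy or UCB classes above yields a simple discounted variant that can track drifting user preferences.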

Specialized Variants

  • Combinatorial Bandits: Select multiple arms simultaneously
  • Dueling Bandits: Pairwise comparisons between arms
  • Sleeping Bandits: Arms sometimes unavailable
  • Fair Bandits: Ensuring fairness across groups

Technical Deep Dive

Understanding np.random.beta()

The np.random.beta(alpha, beta) function samples from a Beta distribution, which is crucial for Thompson Sampling:

Beta Distribution Properties

  • Always produces values between 0 and 1
  • Perfect for modeling probabilities or success rates
  • Parameters: alpha (successes + 1), beta (failures + 1)
  • Mean = alpha / (alpha + beta)

Examples:

# Equal parameters = uniform distribution
np.random.beta(1, 1)  # Could be anywhere from 0 to 1

# More successes = skewed toward 1
np.random.beta(8, 3)  # Mean = 0.73, likely 0.6-0.9

# More failures = skewed toward 0
np.random.beta(2, 10) # Mean = 0.17, likely 0.05-0.3

# Lots of data = concentrated around mean
np.random.beta(80, 20) # Mean = 0.8, very likely 0.75-0.85

Real-World Adaptations

For production personalization systems, consider:

  • Replace binary rewards with continuous metrics like click-through rates, conversion rates, or engagement scores.
  • Add user context by maintaining separate bandits for different user segments or using contextual bandit algorithms.
  • Handle non-stationarity with discounted rewards or change detection for evolving user preferences.
  • Plan for scale in high-traffic applications with distributed state storage and approximate algorithms.
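The per-segment idea above can be as simple as a dictionary of independent bandits keyed by segment. This sketch re-inlines a minimal epsilon-greedy so it is self-contained, and the segment names are hypothetical:

```python
import random
from collections import defaultdict

class EpsilonGreedy:
    """Minimal epsilon-greedy, condensed from the full class in this guide."""
    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.rewards = [0.0] * n_arms
        self.counts = [0] * n_arms
    def select_arm(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.counts))
        avgs = [r / c if c else 0.0 for r, c in zip(self.rewards, self.counts)]
        return avgs.index(max(avgs))
    def update(self, arm, reward):
        self.counts[arm] += 1
        self.rewards[arm] += reward

# One independent bandit per user segment, created lazily on first use.
segment_bandits = defaultdict(lambda: EpsilonGreedy(n_arms=4))

def choose_for(segment):
    return segment_bandits[segment].select_arm()

def record_outcome(segment, arm, reward):
    segment_bandits[segment].update(arm, reward)
```

Each segment learns its own preferences in isolation; contextual bandits generalize this by sharing information across segments through features.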

Implementation Considerations

  • State Management: Store bandit state in databases for persistence across sessions.
  • A/B Testing Integration: Use bandits for fine-tuning after initial A/B test phases.
  • Monitoring: Track regret, convergence, and exploration rates in production.
  • Fallback Strategies: Handle edge cases like cold start problems and system failures.
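On the state-management point: bandit state is just a handful of numbers, so persistence across sessions can be a small serialize/restore pair. The JSON layout below is an illustrative choice (not a standard), using the `arm_counts`/`arm_rewards` attributes from the classes above:

```python
import json

def save_state(bandit):
    """Serialize count-based bandit state (counts and reward totals)."""
    return json.dumps({
        "arm_counts": bandit.arm_counts,
        "arm_rewards": bandit.arm_rewards,
    })

def load_state(bandit, payload):
    """Restore previously saved state into a fresh bandit instance."""
    state = json.loads(payload)
    bandit.arm_counts = state["arm_counts"]
    bandit.arm_rewards = state["arm_rewards"]
    return bandit
```

In production the JSON string would live in a database or cache keyed by user or segment; Thompson Sampling would persist `alpha` and `beta` the same way.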

Conclusion

Bandit algorithms provide a powerful framework for real-time personalization that adapts continuously to user behavior. The three core algorithms—Epsilon-Greedy, Thompson Sampling, and UCB—form the foundation for understanding this rich field.

Start with these fundamental approaches, understand their trade-offs, then explore specialized variants as your requirements evolve. The key is matching the algorithm to your specific constraints: stationary vs changing environments, available contextual features, computational resources, and required theoretical guarantees.

This guide provides both the conceptual understanding and practical implementation needed to begin applying bandit algorithms to personalization challenges in production systems.
