
Complete Guide to Bandit Algorithms for Personalization


Introduction to Bandit Personalization

Bandit personalization refers to using multi-armed bandit algorithms to deliver personalized experiences to users in software applications. It's a machine learning approach that solves the classic "exploration vs exploitation" problem in personalization.



How it Works

The algorithm treats each personalization option (like different content recommendations, UI layouts, or product suggestions) as an "arm" of a slot machine. For each user interaction, it must decide whether to:

  • Exploit: Show the option that currently appears to work best for that user
  • Explore: Try a different option to potentially discover something even better

Key Advantages

Benefits over Traditional A/B Testing

  • Real-time learning: Unlike A/B tests that require weeks of data collection, bandit algorithms adapt continuously as they gather feedback from user interactions.
  • Reduced opportunity cost: Instead of showing suboptimal experiences to 50% of users during traditional A/B testing, bandits quickly shift traffic toward better-performing options.
  • Individual personalization: The algorithm can learn different preferences for different user segments or even individual users.

Common Applications

  • Content recommendations: News sites, streaming platforms, and social media feeds
  • Product recommendations: E-commerce platforms suggesting items
  • Email marketing: Optimizing subject lines, send times, or content
  • Ad targeting: Selecting which ads to show to which users
  • Website optimization: Testing different layouts, colors, or call-to-action buttons

Core Bandit Algorithms

There are three fundamental bandit algorithms that form the foundation for understanding this field:

1. Epsilon-Greedy

Algorithm Overview

Approach: Simple random exploration with fixed probability.

How it works:

  • With probability epsilon (e.g., 0.1), choose a random arm (explore)
  • Otherwise, choose the arm with the highest average reward (exploit)

Advantages

  • Extremely simple to understand and implement
  • Predictable exploration behavior
  • Good baseline for comparison

Disadvantages

  • Fixed exploration rate doesn't adapt to uncertainty
  • Explores randomly rather than intelligently

2. Thompson Sampling

Algorithm Overview

Approach: Bayesian inference using Beta distributions to model uncertainty.

How it works:

  • Maintains alpha (successes + 1) and beta (failures + 1) for each arm
  • Samples from each arm's Beta distribution
  • Chooses the arm with the highest sample

Advantages

  • Strong theoretical regret bounds (asymptotically optimal for Bernoulli rewards)
  • No hyperparameters to tune
  • Naturally balances exploration and exploitation through uncertainty
  • More exploration for uncertain arms, less for well-known arms

Key insight: The algorithm asks "Given what I've observed, what might this arm's true success rate be?" and samples a plausible answer.

3. UCB (Upper Confidence Bound)

Algorithm Overview

Approach: Uses confidence intervals to balance average reward with uncertainty.

How it works:

  • Calculates upper confidence bound: average_reward + confidence_interval
  • Confidence interval decreases as more data is collected
  • Always picks the arm with the highest UCB value

Formula: UCB = average_reward + c * sqrt(log(total_pulls) / arm_pulls)
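Plugging hypothetical numbers into the formula shows how the exploration bonus shrinks as an arm accumulates pulls (the counts below are invented for illustration):

```python
import math

def ucb_value(avg_reward, total_pulls, arm_pulls, c=2.0):
    """UCB = average reward plus a confidence bonus that shrinks with data."""
    return avg_reward + c * math.sqrt(math.log(total_pulls) / arm_pulls)

# Same average reward, different amounts of data for the arm:
lightly_sampled = ucb_value(avg_reward=0.6, total_pulls=100, arm_pulls=10)
heavily_sampled = ucb_value(avg_reward=0.6, total_pulls=100, arm_pulls=80)

print(round(lightly_sampled, 3))  # larger bonus: ~1.957
print(round(heavily_sampled, 3))  # smaller bonus: ~1.080
```

With identical averages, the lightly sampled arm wins on its larger confidence bonus, which is exactly how UCB directs exploration toward uncertain arms.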

Advantages

  • Deterministic decision making
  • Clear mathematical intuition
  • Good theoretical guarantees
  • Explicitly quantifies the exploration-exploitation tradeoff

Code Implementations

Complete Implementation of All Three Algorithms

import random
import math
import numpy as np

# ===== BANDIT ALGORITHMS =====

class EpsilonGreedyBandit:
    def __init__(self, n_arms, epsilon=0.1):
        """
        Epsilon-Greedy bandit implementation.
        
        Args:
            n_arms: Number of arms (options) to choose from
            epsilon: Probability of exploring (vs exploiting)
        """
        self.n_arms = n_arms
        self.epsilon = epsilon
        
        # Track total reward and count for each arm
        self.arm_rewards = [0.0] * n_arms
        self.arm_counts = [0] * n_arms
    
    def get_average_reward(self, arm):
        """Calculate average reward for an arm."""
        if self.arm_counts[arm] == 0:
            return 0.0
        return self.arm_rewards[arm] / self.arm_counts[arm]
    
    def select_arm(self):
        """Select an arm using epsilon-greedy strategy."""
        # Explore: choose random arm
        if random.random() < self.epsilon:
            return random.randint(0, self.n_arms - 1)
        
        # Exploit: choose arm with highest average reward
        best_arm = 0
        best_avg = self.get_average_reward(0)
        
        for arm in range(1, self.n_arms):
            avg_reward = self.get_average_reward(arm)
            if avg_reward > best_avg:
                best_avg = avg_reward
                best_arm = arm
        
        return best_arm
    
    def update(self, arm, reward):
        """Update the bandit with the reward received."""
        self.arm_counts[arm] += 1
        self.arm_rewards[arm] += reward


class ThompsonSamplingBandit:
    def __init__(self, n_arms):
        """
        Thompson Sampling bandit using Beta distributions.
        """
        self.n_arms = n_arms
        
        # Beta distribution parameters
        self.alpha = [1.0] * n_arms  # Start with uniform prior
        self.beta = [1.0] * n_arms
        
        # Track statistics for reporting
        self.arm_counts = [0] * n_arms
        self.arm_rewards = [0.0] * n_arms
    
    def select_arm(self):
        """Select an arm by sampling from Beta distributions."""
        samples = []
        
        # Sample from each arm's Beta distribution
        for arm in range(self.n_arms):
            sample = np.random.beta(self.alpha[arm], self.beta[arm])
            samples.append(sample)
        
        # Choose arm with highest sample
        return samples.index(max(samples))
    
    def update(self, arm, reward):
        """Update the Beta distribution parameters."""
        self.arm_counts[arm] += 1
        self.arm_rewards[arm] += reward
        
        if reward == 1:
            self.alpha[arm] += 1  # Success
        else:
            self.beta[arm] += 1   # Failure


class UCBBandit:
    def __init__(self, n_arms, c=2.0):
        """
        Upper Confidence Bound bandit.
        """
        self.n_arms = n_arms
        self.c = c
        
        # Track total reward and count for each arm
        self.arm_rewards = [0.0] * n_arms
        self.arm_counts = [0] * n_arms
        self.total_counts = 0
    
    def get_ucb_value(self, arm):
        """Calculate Upper Confidence Bound for an arm."""
        if self.arm_counts[arm] == 0:
            return float('inf')  # Unplayed arms have infinite UCB
        
        avg_reward = self.arm_rewards[arm] / self.arm_counts[arm]
        confidence_interval = self.c * math.sqrt(
            math.log(self.total_counts) / self.arm_counts[arm]
        )
        
        return avg_reward + confidence_interval
    
    def select_arm(self):
        """Select arm with highest Upper Confidence Bound."""
        best_arm = 0
        best_ucb = self.get_ucb_value(0)
        
        for arm in range(1, self.n_arms):
            ucb_value = self.get_ucb_value(arm)
            if ucb_value > best_ucb:
                best_ucb = ucb_value
                best_arm = arm
        
        return best_arm
    
    def update(self, arm, reward):
        """Update the bandit with the reward received."""
        self.arm_counts[arm] += 1
        self.arm_rewards[arm] += reward
        self.total_counts += 1

Usage Example

# Create a bandit with 4 options, 10% exploration rate
bandit = EpsilonGreedyBandit(n_arms=4, epsilon=0.1)

# In your application loop:
chosen_option = bandit.select_arm()
# ... show the chosen option to the user ...
# ... measure success (1 for success, 0 for failure) ...
reward = 1  # e.g., the user clicked
bandit.update(chosen_option, reward)
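The select/update loop can be exercised end to end against simulated Bernoulli feedback. This sketch inlines a compact Thompson sampler (equivalent in spirit to the class above, re-declared so the snippet is self-contained) and uses invented click-through rates:

```python
import numpy as np

def simulate(bandit, true_rates, n_rounds, rng):
    """Run one bandit against Bernoulli arms; return total reward earned."""
    total = 0
    for _ in range(n_rounds):
        arm = bandit.select_arm()
        reward = 1 if rng.random() < true_rates[arm] else 0
        bandit.update(arm, reward)
        total += reward
    return total

class MiniThompson:
    """Compact Thompson sampler with a Beta posterior per arm."""
    def __init__(self, n_arms, rng):
        self.alpha = np.ones(n_arms)
        self.beta = np.ones(n_arms)
        self.rng = rng
    def select_arm(self):
        # Draw one plausible success rate per arm; pick the highest draw.
        return int(np.argmax(self.rng.beta(self.alpha, self.beta)))
    def update(self, arm, reward):
        if reward == 1:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1

rng = np.random.default_rng(0)
true_rates = [0.05, 0.10, 0.20]   # hypothetical click-through rates
total = simulate(MiniThompson(3, rng), true_rates, n_rounds=2000, rng=rng)
print(total / 2000)               # should land near the best rate, 0.20
```

Any of the three classes above can be passed to `simulate` in place of `MiniThompson`, which makes head-to-head comparisons straightforward.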

Algorithm Comparison

Performance Characteristics

When you run comparison simulations, you'll typically observe:

Thompson Sampling

  • Often achieves the highest total reward
  • More aggressive exploration early on
  • Quickly converges to optimal arm
  • Can fold in prior knowledge through its initial Beta parameters

UCB

  • Provides consistent, reliable performance
  • Systematic approach to exploration-exploitation
  • Good theoretical guarantees
  • Deterministic behavior (reproducible results)

Epsilon-Greedy

  • Simple and predictable
  • Less efficient than the other two
  • Fixed exploration rate doesn't adapt
  • Good baseline for comparison

When to Use Each Algorithm

  • Epsilon-Greedy: When simplicity is paramount, as a baseline for comparison, when you need predictable exploration behavior
  • Thompson Sampling: When you want optimal performance, for binary reward scenarios, when you can tolerate some randomness
  • UCB: When you need deterministic behavior, for scenarios requiring theoretical guarantees, when you want to tune exploration vs exploitation

Extended Bandit Algorithm Landscape

Beyond the three core algorithms, the bandit field includes dozens of specialized approaches:

Classic Multi-Armed Bandits (Stationary)

  • UCB1, UCB2: Variants of Upper Confidence Bound
  • Optimistic Initial Values: Start with high reward estimates
  • Gradient Bandit: Uses action preferences and softmax selection
  • EXP3: Exponential-weight algorithm for adversarial settings
  • Successive Elimination: Eliminates clearly suboptimal arms
  • MOSS: Minimax Optimal Strategy in the Stochastic case

Contextual Bandits (Feature-based)

  • LinUCB: Linear UCB with feature vectors
  • LinTS: Linear Thompson Sampling
  • Neural Contextual Bandits: Deep learning approaches
  • Contextual EXP4: Adversarial contextual version
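As a taste of the contextual setting, here is a minimal sketch of disjoint LinUCB, where each arm keeps its own ridge-regression estimate over the context features (the feature dimension and `alpha` below are illustrative choices):

```python
import numpy as np

class LinUCBArm:
    """Disjoint LinUCB: one ridge-regression estimate per arm."""
    def __init__(self, n_features, alpha=1.0):
        self.alpha = alpha
        self.A = np.eye(n_features)    # regularized feature covariance
        self.b = np.zeros(n_features)  # reward-weighted feature sum

    def ucb(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b         # current coefficient estimate
        # Predicted reward plus a confidence bonus for this context.
        return float(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x

def select_arm(arms, x):
    """Pick the arm whose upper confidence bound is highest for context x."""
    scores = [arm.ucb(x) for arm in arms]
    return scores.index(max(scores))
```

The structure mirrors plain UCB, except both the reward estimate and the confidence bonus now depend on the user's feature vector `x`.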

Adversarial Bandits

  • EXP3: Works without stochastic assumptions
  • EXP4: With expert advice
  • Hedge Algorithm: Classic adversarial approach

Non-Stationary/Restless Bandits

  • Discounted UCB: Recent rewards matter more
  • Sliding Window UCB: Only consider recent history
  • Change Detection: Detect when distributions shift
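The discounting idea behind these variants is simple to sketch: decay all accumulated statistics by a factor gamma < 1 at each update, so stale observations fade (gamma here is an illustrative choice):

```python
class DiscountedAverage:
    """Reward average where old observations decay geometrically."""
    def __init__(self, gamma=0.99):
        self.gamma = gamma
        self.weighted_reward = 0.0
        self.weighted_count = 0.0

    def update(self, reward):
        # Decay past statistics, then add the new observation.
        self.weighted_reward = self.gamma * self.weighted_reward + reward
        self.weighted_count = self.gamma * self.weighted_count + 1.0

    def average(self):
        if self.weighted_count == 0.0:
            return 0.0
        return self.weighted_reward / self.weighted_count
```

Swapping this in for the plain running totals in the epsilon-greedy or UCB classes above yields a simple discounted variant that can track drifting user preferences.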

Specialized Variants

  • Combinatorial Bandits: Select multiple arms simultaneously
  • Dueling Bandits: Pairwise comparisons between arms
  • Sleeping Bandits: Arms sometimes unavailable
  • Fair Bandits: Ensuring fairness across groups

Technical Deep Dive

Understanding np.random.beta()

The np.random.beta(alpha, beta) function samples from a Beta distribution, which is crucial for Thompson Sampling:

Beta Distribution Properties

  • Always produces values between 0 and 1
  • Perfect for modeling probabilities or success rates
  • Parameters: alpha (successes + 1), beta (failures + 1)
  • Mean = alpha / (alpha + beta)

Examples:

# Equal parameters = uniform distribution
np.random.beta(1, 1)  # Could be anywhere from 0 to 1

# More successes = skewed toward 1
np.random.beta(8, 3)  # Mean = 0.73, likely 0.6-0.9

# More failures = skewed toward 0
np.random.beta(2, 10) # Mean = 0.17, likely 0.05-0.3

# Lots of data = concentrated around mean
np.random.beta(80, 20) # Mean = 0.8, very likely 0.75-0.85

Real-World Adaptations

For production personalization systems, consider:

  • Replace binary rewards with continuous metrics like click-through rates, conversion rates, or engagement scores.
  • Add user context by maintaining separate bandits for different user segments or using contextual bandit algorithms.
  • Handle non-stationarity with discounted rewards or change detection for evolving user preferences.
  • Plan for scale in high-traffic applications with distributed state storage and approximate algorithms.
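The per-segment idea above can be as simple as a dictionary of independent bandits keyed by segment. This sketch re-inlines a minimal epsilon-greedy so it is self-contained, and the segment names are hypothetical:

```python
import random
from collections import defaultdict

class EpsilonGreedy:
    """Minimal epsilon-greedy, condensed from the full class in this guide."""
    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.rewards = [0.0] * n_arms
        self.counts = [0] * n_arms
    def select_arm(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.counts))
        avgs = [r / c if c else 0.0 for r, c in zip(self.rewards, self.counts)]
        return avgs.index(max(avgs))
    def update(self, arm, reward):
        self.counts[arm] += 1
        self.rewards[arm] += reward

# One independent bandit per user segment, created lazily on first use.
segment_bandits = defaultdict(lambda: EpsilonGreedy(n_arms=4))

def choose_for(segment):
    return segment_bandits[segment].select_arm()

def record_outcome(segment, arm, reward):
    segment_bandits[segment].update(arm, reward)
```

Each segment learns its own preferences in isolation; contextual bandits generalize this by sharing information across segments through features.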

Implementation Considerations

  • State Management: Store bandit state in databases for persistence across sessions.
  • A/B Testing Integration: Use bandits for fine-tuning after initial A/B test phases.
  • Monitoring: Track regret, convergence, and exploration rates in production.
  • Fallback Strategies: Handle edge cases like cold start problems and system failures.
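On the state-management point: bandit state is just a handful of numbers, so persistence across sessions can be a small serialize/restore pair. The JSON layout below is an illustrative choice (not a standard), using the `arm_counts`/`arm_rewards` attributes from the classes above:

```python
import json

def save_state(bandit):
    """Serialize count-based bandit state (counts and reward totals)."""
    return json.dumps({
        "arm_counts": bandit.arm_counts,
        "arm_rewards": bandit.arm_rewards,
    })

def load_state(bandit, payload):
    """Restore previously saved state into a fresh bandit instance."""
    state = json.loads(payload)
    bandit.arm_counts = state["arm_counts"]
    bandit.arm_rewards = state["arm_rewards"]
    return bandit
```

In production the JSON string would live in a database or cache keyed by user or segment; Thompson Sampling would persist `alpha` and `beta` the same way.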

Conclusion

Bandit algorithms provide a powerful framework for real-time personalization that adapts continuously to user behavior. The three core algorithms—Epsilon-Greedy, Thompson Sampling, and UCB—form the foundation for understanding this rich field.

Start with these fundamental approaches, understand their trade-offs, then explore specialized variants as your requirements evolve. The key is matching the algorithm to your specific constraints: stationary vs changing environments, available contextual features, computational resources, and required theoretical guarantees.

This guide provides both the conceptual understanding and practical implementation needed to begin applying bandit algorithms to personalization challenges in production systems.
