Blog Archives

I'm a Bandit : Multi Arm Bandits and the Explore/Exploit trade-off

12/4/2015

This article is meant to be an introduction on Multi Arm Bandits, without getting into too much into the mathematical details (I'll leave that for a followup article!). We'll talk about reinforcement learning, multi arm bandit optimization and the exploration/exploitation trade off.

What is Reinforcement Learning?
From Wikipedia : "Reinforcement learning is an area of machine learning inspired by behaviorist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward." . Let us break this definition down:

Reinforcement Learning Maps situations to actions to maximize a numerical reward.
Unlike supervised learning, the learner is not told which actions to take but must discover which actions yield the most reward by trying them.
Highly useful in cases of significant uncertainty about the environment

In Reinforcement learning an agent simultaneously attempts to acquire new knowledge (called "exploration") and optimizes its decision based on existing knowledge (called "exploitation"). Multi Arm Bandits is a class of sequential resource allocation that embody the exploration/exploitation dilemma.

What are Multi Arm Bandits (MAB)?
“Bandit problems embody in essential form a conflict evident in all human action: choosing actions which yield immediate reward vs. choosing actions (e.g. acquiring information or preparing the ground) whose benefit will come only later” - P. Whittle (1980).
MAB is best understood through this analogy:
A gambler at a row of slot machines has to decide which machines to play, how many times to play each machine and in which order to play them. When played, each machine provides a reward from a distribution specific to that machine. The objective is to maximize the sum of rewards earned through a sequence of lever pulls.

Should I keep pulling the best lever so far or should I explore a new lever?

Some real world exploration/exploitation examples:
Restaurant Selection

Exploitation : Go to your favorite restaurant
Exploration: Try a new restaurant

Oil Drilling

Exploitation : Continue using existing well
Exploration: Drill at a new location

Online Banner Advertisements

Exploitation: Show the most successful advert
Exploration :Show a different advert

Lets get formal and introduce some notation (but not too much!):
•Lets Index the arms by a, and the probability distribution over possible rewards r for each arm a can be written as pa(r) .
• We have to find the arm with the largest mean reward μa=Ea[r]
•In practice pa(r) are non-stationary

So how does one come up with an optimal (albeit approximate) strategy to explore and exploit so as to reap maximum rewards? There are two classes of well known strategies in literature:

Greedy : the best lever (based on previous trials) is always pulled except when a (uniformly) random action is taken. A popular example of a Greedy strategy is called Epsilon Greedy.
Probabilistic Matching : the number of pulls for a given lever should match its actual probability of being the optimal lever. Example: Thompson Sampling

Lets look at both these options briefly:

Epsilon Greedy
Select the best lever most of the time, pull a random lever some of the time (show random ads sometimes, and the best
ad most of the time). In this strategy we choose to pull a new lever (explore) with a frequency of epsilon (hence the name). Hence epsilon is the fraction of times we sample a lever randomly and 1- epsilon is the fraction of times we choose optimally. An epsilon value of

1.0

will cause it to always explore, while a value of

0.0

will cause it to always exploit the best preforming lever.

Key Advantages:

Very easy to implement.
Will not get stuck in some local optimal state.
The best performing arm will be used most of the time.

Disadvantages:

How do you pick the value epsilon? This is a tricky problem to solve and the wrong epsilon value could lead to either 0 exploration or too much exploration.

Thompson Sampling

Thompson Sampling is a randomized algorithm based on Bayesian ideas. The first version of this Bayesian heuristic is more than 80 years old, dating to Thompson (1933). It is a member of the family of randomized probability matching
algorithms. The basic idea is to assume a simple prior distribution on the underlying parameters of the reward distribution of every lever, and at every time step, play a lever according to its posterior probability of being the best arm.

In other words, we encode our belief about where the expected reward $μa$ is for lever $a$ in probability distribution $p(μa|dataa)$

p(μa|dataa)

$p(μa|dataa)$

p(μa|dataa)

Key Advantages:

Easy to implement.
Robust against delayed feedback: Often feedback about rewards is not immediately available. Imagine each lever being a product on sale on an eCommerce site. The reward is sold Vs not sold. The knowledge that a product was sold may not be immediately available to the bandit system. In this case, Thompson Sampling (being randomized) will keep exploring (because its randomized) instead of being stuck showing a sub performing product.

Disadvantages:

Exploration may cease to exist if the algorithm converges quickly. Continuing with the previous example, a product that gets a few sales may continue getting more sales (simply because its shown more) and may overwhelm new products for being shown. Thompson sampling hence needs careful tuning and experimentation. $p(μa|dataa)$ $p(μa|dataa)$ $p(μa|dataa)$ $p(μa|dataa)$ $p(μa|dataa)$ $a$

That's it for now! I will follow up this article with a deeper look into Thompson Sampling and performance evaluation of Bandit Algorithms.

References

Kuleshov, Volodymyr, and Doina Precup. "Algorithms for multi-armed bandit problems." arXiv preprint arXiv:1402.6028 (2014).
Chapelle, Olivier, and Lihong Li. "An empirical evaluation of thompson sampling." Advances in neural information processing systems. 2011.
Agrawal, Shipra, and Navin Goyal. "Analysis of Thompson sampling for the multi-armed bandit problem." arXiv preprint arXiv:1111.1797 (2011).
https://en.wikipedia.org/wiki/Multi-armed_bandit

1 Comment

I'm a Bandit : Multi Arm Bandits and the Explore/Exploit trade-off

Sarvesh Saran

Archives

Categories