What is Reinforcement Learning?
From Wikipedia : "Reinforcement learning is an area of machine learning inspired by behaviorist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward." . Let us break this definition down:
- Reinforcement Learning Maps situations to actions to maximize a numerical reward.
- Unlike supervised learning, the learner is not told which actions to take but must discover which actions yield the most reward by trying them.
- Highly useful in cases of significant uncertainty about the environment
What are Multi Arm Bandits (MAB)?
“Bandit problems embody in essential form a conflict evident in all human action: choosing actions which yield immediate reward vs. choosing actions (e.g. acquiring information or preparing the ground) whose benefit will come only later” - P. Whittle (1980).
MAB is best understood through this analogy:
A gambler at a row of slot machines has to decide which machines to play, how many times to play each machine and in which order to play them. When played, each machine provides a reward from a distribution specific to that machine. The objective is to maximize the sum of rewards earned through a sequence of lever pulls.
- Exploitation : Go to your favorite restaurant
- Exploration: Try a new restaurant
- Exploitation : Continue using existing well
- Exploration: Drill at a new location
- Exploitation: Show the most successful advert
- Exploration :Show a different advert
•Lets Index the arms by a, and the probability distribution over possible rewards r for each arm a can be written as pa(r) .
• We have to find the arm with the largest mean reward μa=Ea[r]
•In practice pa(r) are non-stationary
So how does one come up with an optimal (albeit approximate) strategy to explore and exploit so as to reap maximum rewards? There are two classes of well known strategies in literature:
- Greedy : the best lever (based on previous trials) is always pulled except when a (uniformly) random action is taken. A popular example of a Greedy strategy is called Epsilon Greedy.
- Probabilistic Matching : the number of pulls for a given lever should match its actual probability of being the optimal lever. Example: Thompson Sampling
Select the best lever most of the time, pull a random lever some of the time (show random ads sometimes, and the best
ad most of the time). In this strategy we choose to pull a new lever (explore) with a frequency of epsilon (hence the name). Hence epsilon is the fraction of times we sample a lever randomly and 1- epsilon is the fraction of times we choose optimally. An epsilon value of 1.0 will cause it to always explore, while a value of 0.0 will cause it to always exploit the best preforming lever.
- Very easy to implement.
- Will not get stuck in some local optimal state.
- The best performing arm will be used most of the time.
- How do you pick the value epsilon? This is a tricky problem to solve and the wrong epsilon value could lead to either 0 exploration or too much exploration.
Thompson Sampling is a randomized algorithm based on Bayesian ideas. The first version of this Bayesian heuristic is more than 80 years old, dating to Thompson (1933). It is a member of the family of randomized probability matching
algorithms. The basic idea is to assume a simple prior distribution on the underlying parameters of the reward distribution of every lever, and at every time step, play a lever according to its posterior probability of being the best arm.
In other words, we encode our belief about where the expected reward μa is for lever a in probability distribution p(μa|data). For example, if each lever is an advert, p(μa|data) could be the probability distribution of the mean CTR of the advert given historical data about the advert. When the reward is a binomial (click Vs no click) a Beta-Binomial model is usually a convenient choice.
- Easy to implement.
- Robust against delayed feedback: Often feedback about rewards is not immediately available. Imagine each lever being a product on sale on an eCommerce site. The reward is sold Vs not sold. The knowledge that a product was sold may not be immediately available to the bandit system. In this case, Thompson Sampling (being randomized) will keep exploring (because its randomized) instead of being stuck showing a sub performing product.
- Exploration may cease to exist if the algorithm converges quickly. Continuing with the previous example, a product that gets a few sales may continue getting more sales (simply because its shown more) and may overwhelm new products for being shown. Thompson sampling hence needs careful tuning and experimentation.
- Kuleshov, Volodymyr, and Doina Precup. "Algorithms for multi-armed bandit problems." arXiv preprint arXiv:1402.6028 (2014).
- Chapelle, Olivier, and Lihong Li. "An empirical evaluation of thompson sampling." Advances in neural information processing systems. 2011.
- Agrawal, Shipra, and Navin Goyal. "Analysis of Thompson sampling for the multi-armed bandit problem." arXiv preprint arXiv:1111.1797 (2011).