Empirical Evaluation of the Bayesian Learning Automaton Family

2009

The two-armed bandit problem is a classical optimization problem where a player sequentially

selects and pulls one of two arms attached to a gambling machine, and each arm pull results in

either a reward or penalty to the player. Each arm is associated with a certain reward probability

which is unknown to the player, and the player needs to sequentially select and play an arm

and receive a reward or a penalty in order to discover its true reward probability. The overall

goal for the player is reward maximization, and the player needs to balance between exploiting

existing knowledge or obtaining new knowledge by trying different arms. In the long run it

may be beneficial to risk short term loss to gain greater certainty about the reward probability

associated with each arm.

The problem as described above is the simplest case of the more general k-armed bandit problem

and is concerned with the exploration vs. exploitation dilemma. A wide range of schemes

has been proposed to solve the problem, and in this thesis we focus our attention on a series

of schemes collectively referred to as the Bayesian Learning Automaton family, originally introduced

by O-C. Granmo with the Bayesian Learning Automaton deigned for Bernoulli distributed

reward.

The Bayesian Learning Automata rely upon Bayesian statistics and the use of conjugate prior

distributions with sets of updating rules of hyperparameters associated with these conjugate

prior distributions. These updating rules make it possible to apply Bayesian statistics in an

automaton in a simple and highly computationally efficient manner.

In this thesis we extend the Bayesian Learning Automaton family with three new members

designed for Poisson and normally distributed reward.

We empirically evaluate both the Bayesian Learning Automaton family and competing schemes

in the general k-armed bandit problem with Bernoulli, Poisson and normally distributed reward.

We also empirically evaluate these players in the Goore game, iterative prisoners’ dilemma and

two-player zero-sum game from game theory.

Through extensive experiments we show that the Bayesian Learning Automaton family overall

outperforms all other comparable learning schemes in the k-armed bandit problem, and the

Bayesian Learning Automata are among the top performers in all the the games they are introduced

to in this thesis. Thus, we believe that the Bayesian Learning Automaton family is an

important addition to the field of bandit playing algorithms and is an important area for further

research.

Masteroppgave i informasjons- og kommunikasjonsteknologi 2009 – Universitetet i Agder, Grimstad

University of Agder