poker

Text-only Preview

www.sciencemag.org/content/347/6218/145/suppl/DC1
Supplementary Materials for
Heads-up limit hold'em poker is solved
Michael Bowling,* Neil Burch, Michael Johanson, Oskari Tammelin
*Corresponding author. E-mail: [email protected]
Published 9 January 2015, Science 347, 145 (2015)
DOI: 10.1126/science.1259433
This PDF file includes:
Supplementary Text
Fig. S1
References and Notes
Additional supplementary material for this manuscript includes the following:
Source code

Supplementary Materials for
Heads-up Limit Hold'em Poker is Solved
The Game of Heads-Up Limit Hold'em
Heads-up limit Texas hold'em is a two-player poker game. It is a repeated game, in which the
two players typically play a match of individual games, usually called hands, while alternating
who is the dealer. In each of the individual games, one player will win some number of chips
from the other player, and the goal is to win as many chips as possible over the course of the
match.
Each individual game begins with both players placing a number of chips in the pot: the
player in the dealer position puts in the small blind, and the other player puts in the big blind,
which is twice the small blind amount. The game then progresses through four rounds: the
preflop, flop, turn, and river. Each round consists of cards being dealt followed by player
actions in the form of wagers as to who will hold the strongest hand at the end of the game. In
the preflop, each player is given two private cards, unobserved by their opponent. In the later
rounds, cards are dealt face-up in the center of the table, called community cards. A total of five
community cards are revealed over the four rounds: three on the flop, one on the turn, and one
on the river.
After the cards for the round are dealt players alternate taking actions from among exactly
three options: fold, call, or raise. A player folds by declining to match the last opponent wager,
thus forfeiting to the opponent all chips in the pot and ending the game with no player revealing
their private cards. A player calls by adding chips into the pot to match the last opponent wager,
which causes the next round to begin. A player raises by adding chips into the pot to match the
last wager followed by adding additional chips to make a wager of their own. At the beginning
of a round when there is no opponent wager yet to match, the raise action is called bet, and
the call action is called check, which only ends the round if both players check. The size of
all wagers are fixed: in the preflop and flop rounds the size is called the small-bet and is equal
to the big blind; on the turn and river rounds the size is called the big-bet and is twice the big
blind. The non-dealer takes the first action in each round except in the pre-flop where the dealer
must decide whether to fold, call, or raise the non-dealer's big blind bet. A maximum of four
bets or raises are allowed in each round, with the initial blinds counting as a single bet in the
pre-flop. Once the fourth bet or raise is made, the responding player may only call or fold.
1

If the river round ends with no player previously folding to end the game, the outcome is
determined by a showdown. Each player reveals their two private cards and the player that can
form the strongest five-card poker hand (see "List of poker hands" on Wikipedia; accessed July
21, 2014) wins all the chips in the pot. To form their hand each player may use any cards from
their two private cards and the five community cards At the end of the game, whether ended by
fold or showdown, the players will swap who is the dealer and begin the next game.
Since the game can be played for different stakes, such as a big blind being worth $0.01
or $1 or $1000, players commonly measure their performance over a match as their average
number of big blinds won per game. Researchers have standardized on the unit milli-big-blinds
per game, or mbb/g, where one milli-big-blind is
1
of one big blind. A player that always
1000
folds will lose 750 mbb/g (by losing 1000 mbb as the big blind and 500 as the small blind).
A human rule-of-thumb is that a professional should aim to win at least 50 mbb/g from their
opponents. Milli-big-blinds per game is also used as a unit of exploitability, when it is computed
as the expected loss per game against a worst-case opponent.
Poker Glossary
big-bet The size of all bets or raises in the 3rd or 4th round. A big-bet is twice the size of a
small-bet.
big blind Initial wager made by the non-dealer before any cards are dealt. The big blind is
twice the size of the small blind.
bet Starting a new wager in a round, putting more chips into the pot.
call Putting enough chips into the pot to match the current wager; ends the round.
cap Make the fourth bet or raise in a round. In the first round, the blinds are considered the
first bet.
check Declining to start a new wager in a round.
chip Marker representing value used for wagers, comes in various denominations including the
small blind, the big blind, and possibly others.
community cards Public cards, dealt face up, visible to all players. Used in combination with
hole cards to create a hand.
dealer The player who puts the small blind into the pot. Acts first on round 1, and second
on the later rounds. Traditionally, they would distribute community cards and hole cards
from the deck.
flop The 2nd round; can refer to either the 3 revealed community cards, or the betting round
after these cards are revealed.
2

fold Give up on the current hand, forfeiting all wagers placed in the pot. Ends a player's
participation in the game.
game A single playing of poker. Ends after all but one player folds or a showdown is reached.
hand Many different meanings: the combination of the best 5 cards from the community cards
and hole cards, just the hole cards themselves, or a single game of poker (for clarity, we
avoid this final meaning).
hole cards Private cards, dealt face down, visible only to one player. Used in combination with
community cards to create a hand.
limp Start the game by calling the big blind rather than raising.
milli-big-blinds per game (mbb/g) Average winning rate over a number of hands, measured
in thousandths of big blinds.
pot The collected chips from all wagers.
preflop The 1st round; can refer to either the hole cards, or the betting round after these cards
are distributed.
raise Increasing the size of a wager in a round, putting more chips into the pot than is required
to call the current bet.
river The 4th and final round; can refer to either the 1 revealed community card, or the betting
round after this card is revealed.
showdown After the river, players who have not folded show their hole cards to determine the
player with the best hand. The player with the best hand takes all of the chips in the pot.
small-bet The size of all bets or raises in the 1st or 2nd round. A small-bet is equal in size to
the big blind.
small blind Initial wager made by the dealer before any cards are dealt. The small blind is half
the size of the big blind.
stakes The size of the blinds in a match of poker games.
turn The 3rd round; can refer to either the 1 revealed community card, or the betting round
after this card is revealed.
3

The History of Solving HULHE
Poker has been used as a research testbed for artificial intelligence and game theory for over
60 years. Over the last decade, the game of HULHE has emerged as a standard variant for
evaluating and comparing artificial intelligence algorithms for playing imperfect information
games. Our solving of HULHE builds on eleven years of research focused on the game.
The research community began to focus on HULHE as a challenge problem in 2003, when
the Computer Poker Research Group at the University of Alberta created the first strategy in-
tended to approximate a Nash equilibrium for the game (24). Using the Sequence Form Linear
Programming technique of Romanovskii (28) and Koller et al. (29, 30), HULHE was far too
large to be solved directly. Instead, Billings et al. employed the abstraction principle first
explored by Shi and Littman (32). This approach has now been formalized as the Abstraction-
Solving-Translation approach and underlies most of the subsequent game theoretic programs
applied in the poker domain. First, the full-scale poker game is modelled by a smaller, tractably-
sized abstract game. Second, this abstract game is solved to find a close approximation to an
abstract game Nash equilibrium. Third, as needed, this abstract strategy is translated to choose
actions in the real game. In order to produce an abstract game that can be tractably solved,
the abstraction process cannot perfectly model the real game, and the resulting strategy will be
exploitable in the real game. It is not currently known how exploitable this first game theoretic
strategy for HULHE by Billings et al. is; however, as it was soundly defeated in games against
later strategies, it was demonstrably not a very close equilibrium approximation for the real
game.
In 2006, research groups at the University of Alberta and Carnegie Mellon University co-
ordinated to form the Annual Computer Poker Competition (ACPC), an event held in conjunc-
tion with the Association for the Advancement of Artificial Intelligence (AAAI) conference.
HULHE was the founding event in 2006, and has been played in every year of the ACPC. Since
2006, the HULHE events have been entered by 59 research groups and a total of 120 competing
programs. Until 2013, the competition ran two HULHE events: the Bankroll event in which
players aim to maximize their total winnings against the field of opponents, and the Instant
Runoff event in which losing players are iteratively eliminated. This second competition is well
aligned with the goal of solving the game, as a Nash equilibrium would not lose on expectation
against any opponent, and so would at worst tie for first place.
Since the inaugural event of the ACPC, researchers have developed a succession of game
solving algorithms that allow for ever-larger games to be solved (36,40,52,54-58,62). The abil-
ity to solve a larger game means that a finer-grained abstraction can be used, which more closely
models the real game of HULHE. The resulting larger abstract game strategies have typically
outperformed earlier strategies based on necessarily smaller, coarser abstractions. However,
there is no theoretical guarantee that finer-grained abstractions will result in a better approxi-
mation of a Nash equilibrium, and, in fact, counterexamples exist (59).
In 2011, Johanson et al. (41) developed an accelerated best response computation that, for
the first time, made it feasible to measure a HULHE strategy's approximation quality by effi-
4

ciently computing its exploitability. This allowed researchers to revisit the previous five years
of efforts and measure the progress made over time towards approximating a Nash equilibrium.
In Figure S1, we show the exploitability of the known least exploitable poker programs in each
year from 2006 to 2013, culminating in the CFR+ solution developed in this work. Four of
these strategies are noteworthy. The strategy shown in 2006 was for a technique that preceded
CFR, called Range of Skill (62). In 2007 and 2008, the University of Alberta hosted two Man-
vs-Machine Poker Championships in which their agent, Polaris, competed against human poker
professionals. In 2007, Polaris narrowly lost its match against Phil Laak and Ali Eslami. In
2008, an improved version of Polaris narrowly won against a team of professionals that in-
cluded Matt Hawrilenko, a HULHE specialist. Although Polaris was competitive with human
professionals, we now know that it was still quite far from optimal play, being exploitable for
275.9 mbb/g and 235.3 mbb/g respectively.
With the development of the accelerated best response computation, several ACPC com-
petitors cooperated with the University of Alberta in 2011 to measure the exploitability of their
2010 agents. While the University of Alberta's agent Hyperborean placed third in the 2010 com-
petition, it was the least exploitable of the evaluated programs at 135.4 mbb/g. The first-place
program Rockhopper, developed by David Lin, was the most exploitable of the top three pro-
grams at 300.0 mbb/g, and second-place GGValuta, developed at the University of Bucharest by
Mihai Ciucu et al., was exploitable for 237.3 mbb/g. This experiment served both to benchmark
the progress of the poker research community, and to note the potential inconsistency between
winning competitions and minimizing exploitability prior to achieving the level of essentially
solved.
Finally, with the introduction of CFR+ in 2014 by Tammelin (39) and our joint effort to use
it to solve HULHE, we have concluded this 11-year effort to solve the game. Since CFR+ allows
us to scale to the size of HULHE and solve the game directly, it does not require any of the ab-
straction and translation techniques that have historically been necessary for earlier algorithms
to imperfectly approximate a Nash equilibrium. As the ACPC typically only achieves statistical
significance of 5-10 mbb/g, the strategy described in this paper would in the worst-case tie for
first in any Instant Runoff event.
The CFR+ Algorithm
The CFR+ algorithm is a variant of counterfactual regret minimization (CFR). Here we formally
describe CFR before describing the modifications for CFR+. We begin by giving mathematical
notation to the extensive form game formulation from the main text.
Notation
An extensive form game is a tuple N, H, A, Z, P, uiN , IiN . N is the set of players. The
game tree is represented as a set of histories, H. Each individual history h H is a game state
5























Figure S1: The decreasing exploitability of computer poker strategies for HULHE. Notable
datapoints are described in the text.
6

and defined as a sequence of actions from the set A. H forms a tree so that if ha H then
h H, and is called the parent of ha. We define h
h to mean that the history h is a proper
prefix of the history h , or equivalently h is an ancestor of h in the game tree. Z H is the
set of terminal histories (i.e., leaves of the tree), where z Z is not a parent of any history.
P (h) N {c} is a function assigning a player to every non-terminal history h H \ Z,
specifying who selects the next action; where c is the chance player and c(a|h) is the fixed and
known probability that chance selects action a at history h. The functions ui(z) R specify
the utility to player i N for every terminal history z Z, and is required to be bounded,
so |ui(z) - ui(z )| for all i N and z, z Z. Finally, Ii is a partition of the set of
histories Hi {h H|P (h) = i}. A block of the partition I Ii is an information set, where
two histories are in the same information set if and only if player i cannot distinguish between
the two histories. For any history h Hi, define Ih Ii to be the information set containing
history h, and for any history h Hc, define Ih h.
A game is two-player zero-sum if N = {1, 2} and u2(z) = -u1(z) for all z Z. A
game has perfect recall if and only if for all histories h, h in the same information set I for
player i, if we let pa
h and p a
h be the longest prefixes of h and h (respectively) such
that P (p) = P (p ) = i, it must be that a = a and {p, p } I for some information set
I Ii. In other words, if two histories cannot be distinguished by a player, then the last time
the same player acted, they must be similarly indistinguishable; or more simply, a player cannot
forget what they previously knew. CFR is guaranteed to converge to a Nash equilibrium in any
two-player, zero-sum game with perfect recall (36), and has been observed experimentally to
produce strong strategies in some non-two-player (60), non-zero-sum (41), and imperfect recall
games (61).
A (behavioral) strategy for player i is denoted i, where i(a|I) gives the probability of
player i choosing action a at information set I Ii under strategy i. Let i denote the set of
such strategies, and xiN i denote a strategy profile, i.e., a strategy for each player. Let
(i, - ) be a strategy profile where player i follows their strategy in and the other players
i
follow their strategy in . Finally, for any information set I Ii, define : I a to be the
strategy profile where player i selects the action a with probability 1 in information set I Ii,
but in all other histories (and all other players) choose actions according to .
Define (z)

ha z
P (h)(a|I h) to be the probability of reaching terminal history z when
players choose actions according to . We can further break down this product into one player's
contribution, (z)

i
ha z:P (h)=i
i(a|I h), and all other players' (including chance's) contri-
bution, - (z)

(z)u
i
ha z:P (h)=i
P (h)(a|I h). Now, we can write ui() =
z
i(z) to be
the expected utility to player i when players follow profile . Furthermore, we can describe an
-Nash equilibrium as a strategy profile where,
ui() ui( , ) -

i
-i
i
i
i N.
(1)
7

Counterfactual Regret Minimization (CFR)
Let be any strategy profile, the counterfactual value of player i taking action a at information
set I is defined as,
v(I, a)
u
(z):Ia(h, z).
(2)
i
i(z)
-i
i
hI zZ:
h z
In other words, it is the expected utility to player i of reaching information set I and taking
action a, under the counterfactual assumption that player i takes actions to do so, but otherwise
player i and all other players follow the strategy profile . Let {1, . . . , T } be a sequence of
strategy profiles. The counterfactual regret of player i for action a at information set I is then
defined as,
T
T
RT (I, a)
vt(I, a) -
vt(I, a )t(a |I),
(3)
i
i
i
i
t=1
t=1 a A
i.e., the amount of additional counterfactual value that player i could have attained in expecta-
tion if action a was chosen deterministically rather than the actual sequence of strategies chosen.
Counterfactual regret minimization (CFR) is an iterative self-play algorithm where the play-
ers at every information set follow a regret-minimizing algorithm to choose actions. The com-
mon choice for such an algorithm is regret-matching. In regret-matching the player i defines
their strategy at time t, t, as a function of the sequence of strategy profiles up to time t - 1,
i

(Rt-1(I,a))+
i
+

if
Rt-1(I, a )
> 0
t(a|I) =
(
a A
i
Rt-1(I,a ))+
,
(4)
i
a A
i
1

otherwise
|A|
where (x)+ max(0, x). By following regret-matching, the following is guaranteed,
RT (I, a)
|A|T ,
(5)
i
i.e., the counterfactual regret at every information set grows sublinearly with the number of
iterations, T .
Given a sequence of strategy profiles we are particularly interested in a player's overall
regret given the sequence. Overall regret is defined as,
T
T
RT max
u
)
-
u
i
i(i, t-i
i()
(6)
ii
t=1
t=1
i.e., the amount of additional utility that player i could have attained in expectation if they had
chosen the best fixed strategy in hindsight rather than the actual sequence of strategies chosen.
For a game with perfect recall, this quantity can be bounded by CFR's per-information-set
counterfactual regrets,
RT
max RT (I, a) |I
i
i
i|
|A|T .
(7)
aA
IIi
8

CFR is therefore a regret-minimizing algorithm (i.e., achieves sublinear regret) for extensive
form games.
The regret-minimizing property can be leveraged to make CFR into a game solving algo-
rithm. It is well-known that in a two-player zero-sum game with perfect recall if RT
for all
i
i {1, 2}, then the average strategy profile
is a 2 -Nash equilibrium. Each players' average
strategy,
i has to be constructed so that it achieves the same expected utility as the uniform
mixture of strategies (1, . . . , T ). This can be done by combining the strategies in the mixture
i
i
to select an action at an information set in proportion to that strategy's probability of playing to
reach that information set, so,
T
t(h) t(a|I)


t=1
hI
i
i(a|I )
,
(8)
T
t(h)
t=1
hI
i
where the term
t(h) is player i's contribution to the probability of reaching a history in
hI
i
information set I, and is thus the weighting term on t. By running CFR for enough iterations,
i
can be driven arbitrarily small.
This presentation described vanilla CFR, which exhaustively computes exact counterfactual
values and regrets for every pair of information sets and actions on each iteration. The literature
contains a number of variants that employ sampling to compute noisy estimates of the counter-
factual values and regrets. They trade off faster iterations for a larger number of iterations, often
resulting in considerably reduced overall time. CFR+ is built around exhaustive iterations, and
so we do not discuss sampling any further.
CFR+
Now that we have described CFR, CFR+ can be described quite succinctly: replace regret-
matching with regret-matching+. Regret-matching+ replaces Equation 4 for choosing the strat-
egy. The algorithm instead maintains alternative counterfactual regret values Qt(I, a) updated
as follows,
Q0(I, a) = 0
+
Qt(I, a) = Qt-1(I, a) + Rt(I, a) - Rt-1(I, a)
(9)
The value Qt(I, a) is updated with the same change to the actual counterfactual regrets, except
if this update gives a negative result then it is thresholded at zero. This prevents these alterna-
tive counterfactual regret values from ever becoming negative. Furthermore, any future positive
regret changes will immediately add to these alternative values rather than cancelling out accu-
mulated negative regret. The CFR+ algorithm then selects actions according to regret-matching
using these values,
Qt(I,a)
i
if
Qt(I, a ) > 0
t+1(a|I) =
Qt(I,a )
a A
i
a A
i
,
(10)
i
1
otherwise
|A|
9

Document Outline

  • SOM.page.vol-346.pdf
    • Heads-up limit holdem poker is solved
  • 1259433BowlingRefs.pdf
    • References and Notes