Adversarial search setting

two player

turn taking

deterministic

fully observable

time constraints

zero-sum

turn-taking, zero-sum, two-player games; both players behave rationally

fully vs. partly observable

fully-observable:

each player has all information available -> both players can compute the game theoretic best move -> strategy of opponent is known

partly-observable:

players might miss information -> cannot predict opponent's moves -> strategy of opponent is not known

Considered algorithms

MINIMAX with alpha-beta-pruning

Monte-Carlo Search, Monte-Carlo Tree Search

Zero-sum

rewards/rating of players sum up to zero

if one player wins, the other loses

simplification

given score of player1, we can compute score of player2

we only consider the score of player1, which we call MAX

player1 maximizes MAX

player2 minimizes MAX

Time constrained

assumption:

in the given time, only a small fraction of the state space can be explored

Search formulation

initial state s

actions in state: actions(s)

player: player(s)

transition model: result(s, a)

terminal test (true if game is over)

evaluate how good state s is for player p: evaluate(s, p)
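The formulation above can be sketched for a concrete toy game. The following is a minimal illustration, assuming a hypothetical Nim variant (players alternately take 1 or 2 sticks; whoever takes the last stick wins); all function and state names mirror the formulation but are otherwise my own choices.

```python
# Search formulation for a hypothetical Nim variant.
# A state s is a tuple (sticks_left, player_to_move).

def player(s):
    # whose turn it is in state s
    return s[1]

def actions(s):
    # take 1 or 2 sticks, but never more than remain
    sticks, _ = s
    return [n for n in (1, 2) if n <= sticks]

def result(s, a):
    # transition model: remove a sticks, switch the player to move
    sticks, p = s
    return (sticks - a, "MIN" if p == "MAX" else "MAX")

def terminal(s):
    # the game is over when no sticks remain
    return s[0] == 0

def evaluate(s, p):
    # in a terminal state, the player who just moved took the last stick
    mover = "MIN" if s[1] == "MAX" else "MAX"
    return 1 if mover == p else -1
```

Note the zero-sum property: `evaluate(s, "MAX") == -evaluate(s, "MIN")` in every terminal state.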

Example game tree

Minimax idea

from current state (root), perform DFS into depth h (horizon)

each time a leaf is reached:

evaluate leaf

back up values to the root:

MAX node: select maximum value of successors

MIN node: select minimum value of successors (intuition: the opponent plays as well as possible)

once time is up: select the successor of the root with the largest value

Practical Minimax

for big game trees, the search takes very long

algorithm would be stuck in the subtree of first action

improvement: use iterative deepening to search the tree

Evaluation function e(s)

maps state s to value

e(s) > 0 -> good for MAX

e(s) < 0 -> bad for MAX, good for MIN

Typical e(s)

weighted sum of features

f_i might be:

number of possible moves

number of pieces
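A weighted sum e(s) = Σ_i w_i · f_i(s) can be sketched as follows; the feature functions, weights, and state layout here are illustrative assumptions, not from any particular game.

```python
# Sketch of a weighted-sum evaluation function e(s) = sum_i w_i * f_i(s).

def evaluate(s, weights, features):
    return sum(w * f(s) for w, f in zip(weights, features))

# hypothetical features over a toy board state
mobility = lambda s: len(s["moves"])                  # number of possible moves
material = lambda s: s["my_pieces"] - s["opp_pieces"] # piece advantage

s = {"moves": [0, 1, 2], "my_pieces": 5, "opp_pieces": 3}
e = evaluate(s, [0.1, 1.0], [mobility, material])     # 0.1*3 + 1.0*2 = 2.3
```

Here e(s) > 0, i.e. the state is rated as good for MAX.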

Backed-up values

at each node, the backed up value represents the value of the best state the player can reach from this subtree

Minimax evaluation

Minimax is guaranteed to compute the game-theoretically optimal move, if

zero sum

two players

search fully explores the game tree

Minimax for a game which is non zero-sum

zero-sum games are competitive: maximizing my own values implies minimizing the value of the opponent

an example of a non zero-sum game is a cooperative game

then the opponent should not minimize our value, because this results in suboptimal outcomes

Minimax for non turn-taking games

in a non turn-taking game (e.g. rock-paper-scissors), the search yields suboptimal results

this is because, based on the choice of the first player, the second player can always win

Minimax for more than two players in a zero-sum game

e.g. consider a three-player game

in the setting of minimax player3 must either maximize or minimize the value

if p3 maximizes, the result is too optimistic

if p3 minimizes, the result is too pessimistic

in fact, p3 wants to maximize its own score, which can be realized by minimizing either p1 or p2

Minimax for partly observable environments

parts of the game are not known by the algorithm (e.g. cards of opponent)

minimax cannot determine the best strategy of the opponent

thus, the computed values for MIN nodes might be

too optimistic, e.g. the opponent's move cannot actually be as good as assumed

too pessimistic, e.g. the opponent's move can actually be better than assumed

Observation

Search tree is huge, we can not go deep into the tree in limited time

Idea: try to prune parts of the tree -> alpha-beta pruning

Alpha-Beta-Pruning graphical

MAX already found move leading to 0

MAX will not go into the branch -3

The branch with value -3 will not increase, because MIN tries to minimize this number

Next subtree under -3 can be ignored

beta(N) = -3 <= 0 = alpha(A)

MIN already found move leading to 0

MIN will not go into branch 3

Branch 3 will not decrease, because MAX tries to maximize it

Next branch under 3 can be ignored

alpha(N) = 3 >= 0 = beta(A)

Alpha-Beta-Pruning mathematical

Discontinue search below node N:

MIN: if beta(N) is <= alpha(A) of any MAX ancestor

MAX: if alpha(N) is >= beta(A) of any MIN ancestor

When does alpha-beta pruning reach its optimum?

best nodes first for each player:

MIN children of MAX node are ordered in decreasing backup values

MAX children of MIN node are ordered in increasing backup values

BUT:

if we already knew how to order the nodes, there would be no need to search them

Alpha-Beta-Pruning evaluation

worst case: O(b^h) -> no benefits

best-case: O(b^(h/2))

random ordering: O(b^(h*3/4))

-> node ordering is essential

Improvements

practical ordering: use backup values of previous iteration to order nodes

transposition tables to handle revisited states

Observation

MINIMAX requires some sort of evaluation function, which might be hard to come up with

Idea: random simulation of moves -> Monte-Carlo

results of random simulations converge over time

this then yields the best action to take

Monte-Carlo Search

recursive implementation

random DFS from current node until leaf is reached

evaluate the leaf

compute average values

choose the action with the highest average value (average values are computed for the first-level actions)
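The steps above can be sketched as flat Monte-Carlo search on a hypothetical take-1-or-2 Nim variant (last stick wins); the game, the playout count, and all names are illustrative assumptions.

```python
import random

def legal(sticks):
    # take 1 or 2 sticks, never more than remain
    return [n for n in (1, 2) if n <= sticks]

def playout(sticks, my_turn):
    # random simulation until the game ends; 1 if "we" took the last stick
    while sticks > 0:
        sticks -= random.choice(legal(sticks))
        my_turn = not my_turn
    # the flag was flipped after the winning move, so the winner is
    # the player whose turn it is NOT anymore
    return 0 if my_turn else 1

def monte_carlo_search(sticks, n_playouts=200):
    # average the playout results per first-level action, pick the best
    best_action, best_avg = None, -1.0
    for a in legal(sticks):
        wins = sum(playout(sticks - a, False) for _ in range(n_playouts))
        avg = wins / n_playouts
        if avg > best_avg:
            best_action, best_avg = a, avg
    return best_action
```

With 2 sticks left, taking both wins immediately, so `monte_carlo_search(2)` returns 2.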

Monte-Carlo Search evaluation

Good:

easy to implement

low memory complexity

no game specific knowledge needed -> no heuristics needed

handles uncertainty

Bad:

problems of being too optimistic/pessimistic

does not terminate

while MINIMAX computes the game theoretic best move, MCS does not produce correct results

Monte-Carlo Tree Search

compute average values for all nodes in tree

advantage: reuse of values in next iteration

Further improvement: UCT

idea: use stored average values to guide exploration

relies again on heuristics

speeds up convergence of the algorithm

MCTS UCT

Phases:

Select leaf of current tree

based on UCB1 heuristic

goal: balanced exploration/exploitation

Expand the node

Playout: random simulation until game is terminated

Backup: update the values in the current tree
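The select phase's UCB1 heuristic can be sketched as follows; the child statistics layout and the exploration constant c = sqrt(2) are common illustrative choices, not prescribed by the notes.

```python
import math

def ucb1_select(children, c=math.sqrt(2)):
    # children: list of (total_value, visit_count) per child node
    parent_visits = sum(n for _, n in children)

    def score(child):
        value, visits = child
        if visits == 0:
            return float("inf")  # exploration: always try unvisited children first
        # exploitation term (average value) + exploration bonus
        return value / visits + c * math.sqrt(math.log(parent_visits) / visits)

    # index of the child maximizing the UCB1 score
    return max(range(len(children)), key=lambda i: score(children[i]))
```

The first term exploits information already discovered (average value); the second grows for rarely visited children, which balances exploration against exploitation.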

Exploration vs. exploitation

UCB1 tries to balance exploitation and exploration of MCTS

exploitation: use information already discovered

select moves with highest average values

too much: suboptimal results because some moves are never tried

exploration: find new information

select moves which have not been tried so far

too much: slow convergence
