What is the goal of reinforcement learning and what are useful applications?
Learning strategies based on feedback from the environment (reinforcement)
Games like Chess, Tic Tac Toe, Go, Backgammon
What is the MENACE algorithm and how does it work?
MENACE is playing algorithm for the game of Tic Tac Toe
287 Matchboxes (one for each position)
Beads in 9 different colors (one for each square)
Select the matchbox corresponding to the current position
Randomly draw a bead from this matchbox
Play the move corresponding to the color of the drawn bead
Game lost → drawn beads are kept (negative reinforcement)
Game won → put the drawn bead back and add another one in the same color to this box (positive reinforcement)
Game drawn → drawn beads are put back (no change)
No regard of the credit assignment assignment
What is the credit assignement problem?
The learner normally knows whether it has won or lost not before the end of the game
The learner does not know which move(s) are responsible for the win / loss
What are different examples for machine learning programs in the field of games?
TD-Gammon –> learned to play Backgammon
MENACE –> learned to play Tic Tac Toe
Knight Cap –> learned to play Chess
What is the problem with games of chance and how can it be applied to machine learning ?
Problem –> one does not know what the opponents legal moves are in the future so Minimax is not longer applicable
Solution –> standard game trees are extended with chance nodes (A node where the successor is determined not by the choice of moves of one the opponents, but by a chance event)
How can we approximate expected values in a game of chance and what is a synonymous technique in Backgammon?
The higher N the better is the approximation
Roll out: Play the position many times, each time with a different (random) dice roll
What is simulation search and how does it work?
Instead of searching until a certain depth for all nodes (like Minimax) alternatively, we can limit the breadth of the search tree
estimate the expected value of each move by counting the number of wins in a series of complete games
at each chance node select one of the options at random
We need a fast algorithm for making the decisions at each node in the roll-out games
Often works well even if the move selection is not that strong
Is easily processed parallel
What is Monte-Carlo search and how does it work?
Monte-Carlo search is a extreme case of simulation search
play a large number of games where both players make their moves randomly
average the scores of these games
make the move that has the highest average score
Monte-Carlo search can also be used for deterministic games not only on games of chance
What is the multi-armed bandit problem?
There are several one-armed bandit machines
Each of them has a different average reward/loss
How can we figure out which machine to play, if we want to lose as less money as possible/win as much money as possible
What algorithm gives the optimal strategy for the mult-armed bandit problem?
The Upper Confidence Bound (UCB) algorithm
Exploitation: Pull the best arm in order to maximize your earnings
Exploration: Try if other arms are more promising
How is a Monte-Carlo search tree search constructed and how can it be used?
Monte-Carlo Search can be integrated with conventional game-tree search algorithms
incrementally build up a search tree
evaluate the leaf nodes via roll-outs
evaluate promising moves more often than bad moves
MCTS uses two poicies:
tree policy –> how is the tree traversed in order to find the best outcome
rollout policy –> how are the moves in the roll-out games selected
The tree grows deeper in regions of good moves, since these are chosen more often
What is UCT-search and how does it work?
UCT Search is a variation of UCB
It randomly adds a selected node to the game tree
First perform one iteration of Mone-Carlo Search
Then use backpropagation and adjust the values for each node
Select the move which has been visited most often
UCT Search caused breakthrough in Go
What is the history of computers playing Go?
In the early 2000s Computer Go programs could not beat a child
But then there was Alpha Go and it defeated a human champion for the first time –> this was not expected
How does the algorithm of Aplha Go work?
Aplha Go combined Monte-Carlo Search, Deep Learning and Reinforcement Learning
used a (fast) learned roll-out policy instead of random sampling
use a depth-limit in MCTS where a learned evaluation
function is used instead of real game outcomes
What are the four key learning steps of Alpha Go?
roll-out policy –> learned from expert games, inaccurate but fast
expert policy –> learned from expert games, accurate but takes a long time to compute
selection policy –> refine expert policy with reinforcement learning
utility function –> using self-play from selection policy
What is Alpha Zero and how does it work?
improved version of Alpha Go that learned to play only from self-play
Brought improvement not only to Go but also to games like Chess
What is opponent modeling?
try to predict the opponent's next move
try to predict what move the opponent predicts that your next move will be and so forth
What is the regret in the context of the multiarmed-bandit problem?
Difference between your winnings and the winnings you would have got if you had always played the (unknown) optimal arm.
What are the two types of learned networks?
Policy networks –> using move probabilities
Value networks –> using Evaluation functions
What is the key idea of reinforcement learning?
Learning of policies (action selection strategies) based on feedback from the environment (reinforcement)
How do MENACE and TD-Gammon work?
Learns to play Tic-Tac-Toe and uses 287 matchboxes filled with beads in 9 different colors (1 color for each square). While playing, one selects the matchbox of the corresponding position, draws a bead from the box and plays the move indicated by the bead. If the game is lost, all drawn beads are kept (negative reinforcement), if the game is drawn the beads are put back and if the game is won the beads are doubled when put back (positive reinforcement)
TD-Gammon: Learns to play backgammon. Improvements over MENACE:
faster convergence because of Temporal Difference (TD) Learning - backpropagating every move
NN used instead of matchboxes - allows generalization
use of positional characteristics as features
What is the exploration vs. exploitation trade-off?
Example on multi-armed bandit problem: Pull the best arm in order to maximize your earnings (exploitation) vs. try if other arms are more promising (exploration)
solved by UCB (upper confidence bound) algorithm consisting of:
exploitation term - average earning of arm i
exploration term - inverse probability of how often arm i has been played
always play the arm with highest UCB
Why does deep search not work with chance games?
Player A cannot directly maximize their gain because they don’t know what B’s legal actions will be. Standard game trees are therefore extended with chance nodes - an expected value (average) of the outcome needs to be computed at each chance node. A deep look ahead is not feasible since the probability of reaching a given node shrinks with increasing depth.
How can you approximate an expected value at a position?
If exact probabilities for the outcomes at the chance nodes are not known, values can be sampled.
we can limit the breadth of the search tree and sample some lines to the full depth
estimate the expected values of each move by counting the number of wins in a series of complete games
at each node select one of the options at random (according to probabilities) at MAX and MIN nodes make move
What is the Multi-Armed Bandit problem?
Find the arm (move) with the highest long-term utility from observations of immediate utilities with as little regret as possible
What is Monte-Carlo Search?
Extreme case of simulation search - play a large number of games where both players make their move randomly, average the scores of these games and make the move that has the highest average score. Can also be used in deterministic games.
How can Monte-Carlo Techniques be integrated with Game Tree Search?
Monte-Carlo Tree Search (MCTS):
incrementally build up a search tree, evaluate the leaf nodes via roll-outs, evaluate promising moves more often than bad moves but give bad moves also a chance (UCB algorithm) - tree grows deeper in regions of good moves
uses two policies
Tree Policy: how is the tree traversed in order to find the best leaf to be expanded
Roll-out/Default Policy: how are the moves in the roll-out games selected
UCT Search - the best-known formulation of MCTS, it combines a UCB-based tree policy with random roll-outs.
How does AlphaGo work?
AlphaGo combines MCTS with deep learning and reinforcement learning from self-play.
use a (fast) learned roll-out policy
use a depth-limit in MCTS where a learned evaluation function is used
learn from expert games a fast but inaccurate roll-out policy
learn from expert games an accurate expert policy
refine the expert policy into a more accurate selection policy
use self-play from the selection policy to train a utility function
What is the key (name giving) difference between AlphaGo and AlphaZero?
AlphaGo Zero started without any knowledge and from random behavior
What is the difference between optimal and maximal play?
For simple games we know optimal solutions, but optimal solutions are not maximal. Mind the difference between maximizing your winning (exploit your opponents’ weaknesses) and playing optimal.
Opponent Modeling - try to predict the opponent’s next move, try to predict what move the opponent predicts that your next move will be,…