What is the goal of reinforcement learning and what are useful applications?

Learning strategies based on feedback from the environment (reinforcement)

Games like Chess, Tic Tac Toe, Go, Backgammon

Elevator Dispatching

Robot Control

Job-Shop Scheduling

What is the MENACE algorithm and how does it work?

MENACE is a playing algorithm for the game of Tic Tac Toe

Initial position:

287 Matchboxes (one for each position)

Beads in 9 different colors (one for each square)

Playing algorithm:

Select the matchbox corresponding to the current position

Randomly draw a bead from this matchbox

Play the move corresponding to the color of the drawn bead

Learning algorithm:

Game lost → drawn beads are confiscated, i.e. not put back (negative reinforcement)

Game won → put each drawn bead back and add another bead of the same color to its box (positive reinforcement)

Game drawn → drawn beads are put back (no change)

The credit assignment problem is disregarded: every move of a game is reinforced equally
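The playing and learning steps above can be sketched in Python. A minimal toy, assuming positions are hashable keys and moves are arbitrary labels; `initial_beads` is an illustrative parameter, not Michie's original bead counts:

```python
import random

class Menace:
    def __init__(self, initial_beads=3):
        self.boxes = {}            # one "matchbox" (list of beads) per position
        self.initial_beads = initial_beads
        self.drawn = []            # (position, move) beads drawn this game

    def choose(self, position, legal_moves):
        if position not in self.boxes:
            # fill a new matchbox with initial_beads beads per legal move
            self.boxes[position] = list(legal_moves) * self.initial_beads
        move = random.choice(self.boxes[position])   # draw a random bead
        self.drawn.append((position, move))
        return move

    def learn(self, outcome):
        # outcome: 'win', 'loss' or 'draw'
        for position, move in self.drawn:
            if outcome == 'loss':
                # negative reinforcement: the drawn bead is not put back
                self.boxes[position].remove(move)
            elif outcome == 'win':
                # positive reinforcement: bead goes back plus one extra
                self.boxes[position].append(move)
            # draw: bead is simply put back, i.e. no change
        self.drawn = []
```

Note that no credit assignment happens: on a loss, every drawn bead is removed, regardless of which move actually caused the defeat.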

What is the credit assignment problem?

The learner normally does not know whether it has won or lost before the end of the game

The learner does not know which move(s) are responsible for the win / loss

What are different examples for machine learning programs in the field of games?

TD-Gammon –> learned to play Backgammon

MENACE –> learned to play Tic Tac Toe

KnightCap –> learned to play Chess

What is the problem with games of chance and how can it be applied to machine learning?

Problem –> one does not know which moves will be legal in the future (they depend on chance events), so Minimax is no longer applicable

Solution –> standard game trees are extended with chance nodes (a node whose successor is determined not by a move choice of one of the players, but by a chance event)
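At a chance node, the value is the probability-weighted average of its successors. A minimal expectiminimax sketch over a hand-built tree; the tuple-based node format is an assumption for illustration:

```python
# Nodes: ('max', children), ('min', children),
# ('chance', [(prob, child), ...]), or a numeric leaf value.
def expectiminimax(node):
    if isinstance(node, (int, float)):
        return node  # leaf: static evaluation
    kind, children = node
    if kind == 'max':
        return max(expectiminimax(c) for c in children)
    if kind == 'min':
        return min(expectiminimax(c) for c in children)
    # chance node: expected value (probability-weighted average)
    return sum(p * expectiminimax(c) for p, c in children)

# Example: MAX chooses between two chance events
tree = ('max', [('chance', [(0.5, 2), (0.5, 0)]),    # expected value 1.0
                ('chance', [(0.8, 1), (0.2, 0)])])   # expected value 0.8
```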

How can we approximate expected values in a game of chance and what is a synonymous technique in Backgammon?

Sample the outcome N times and average the results; the higher N, the better the approximation

Roll-out (the Backgammon term): play the position many times, each time with a different (random) dice roll
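Approximating an expectation by sampling can be sketched as follows. `simulate` stands in for any hypothetical function returning one random outcome; the die example is chosen because its true expectation (3.5) is known:

```python
import random

# Estimate an expected value by averaging N sampled outcomes (a roll-out).
def estimate_value(simulate, n):
    return sum(simulate() for _ in range(n)) / n

# Example with a known expectation: a fair six-sided die, E = 3.5
random.seed(0)
estimate = estimate_value(lambda: random.randint(1, 6), 100_000)
```

With larger `n` the estimate converges toward 3.5 (the error shrinks roughly as 1/√n).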

What is simulation search and how does it work?

Instead of searching to a certain depth for all nodes (like Minimax), we can limit the breadth of the search tree and sample only some lines, but to full depth

Algorithm:

estimate the expected value of each move by counting the number of wins in a series of complete games

at each chance node select one of the options at random

Properties:

We need a fast algorithm for making the decisions at each node in the roll-out games

Often works well even if the move selection is not that strong

Is easily parallelized

What is Monte-Carlo search and how does it work?

Monte-Carlo search is an extreme case of simulation search

play a large number of games where both players make their moves randomly

average the scores of these games

make the move that has the highest average score

Monte-Carlo search can also be used for deterministic games, not only for games of chance
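The steps above can be sketched with a hypothetical game interface (`legal_moves`, `apply_move`, `random_playout` are assumed helper functions, not from any library):

```python
# Pure Monte-Carlo search: for each legal move, play n games with both
# players moving randomly, then pick the move with the best average score.
def monte_carlo_move(position, legal_moves, apply_move, random_playout, n=100):
    def average_score(move):
        next_pos = apply_move(position, move)
        # average the scores of n random playouts from the successor
        return sum(random_playout(next_pos) for _ in range(n)) / n
    # make the move that has the highest average score
    return max(legal_moves(position), key=average_score)
```

The quality of the chosen move depends only on the playout statistics, so the same code works for deterministic games and games of chance alike.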

What is the multi-armed bandit problem?

There are several one-armed bandit machines

Each of them has a different average reward/loss

How can we figure out which machine to play if we want to lose as little money as possible / win as much money as possible?

What algorithm gives the optimal strategy for the multi-armed bandit problem?

The Upper Confidence Bound (UCB) algorithm

Exploitation: Pull the best arm in order to maximize your earnings

Exploration: Try other arms to see whether they are more promising

How is a Monte-Carlo search tree constructed and how can it be used?

Monte-Carlo Search can be integrated with conventional game-tree search algorithms

Construction:

incrementally build up a search tree

evaluate the leaf nodes via roll-outs

evaluate promising moves more often than bad moves

MCTS uses two policies:

tree policy –> how the tree is traversed in order to find the best leaf to expand

rollout policy –> how the moves in the roll-out games are selected

The tree grows deeper in regions of good moves, since these are chosen more often

What is UCT-search and how does it work?

UCT Search (UCB applied to Trees) uses UCB as the tree policy of MCTS

It descends the tree via UCB and adds the selected node to the game tree

From this node it performs one Monte-Carlo roll-out

Then it backpropagates the result and adjusts the values of each node on the path to the root

Finally, the move which has been visited most often is selected

UCT Search caused a breakthrough in Computer Go
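A minimal UCT sketch for a toy one-player maximization problem; the `children()`/`rollout()` state interface and the c = √2 exploration constant are assumptions for illustration:

```python
import math, random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.total = [], 0, 0.0

    def ucb(self, c=math.sqrt(2)):
        if self.visits == 0:
            return float('inf')  # unvisited nodes are tried first
        return self.total / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits)

def uct_search(root_state, iterations=200):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. selection: tree policy follows the highest UCB value
        while node.children:
            node = max(node.children, key=Node.ucb)
        # 2. expansion: add successors of an already-visited leaf
        if node.visits > 0:
            node.children = [Node(s, node) for s in node.state.children()]
            if node.children:
                node = random.choice(node.children)
        # 3. simulation: evaluate the leaf with a random roll-out
        score = node.state.rollout()
        # 4. backpropagation: update values on the path to the root
        while node:
            node.visits += 1
            node.total += score
            node = node.parent
    # recommend the most-visited move
    return max(root.children, key=lambda n: n.visits).state
```

Because selection prefers nodes with good averages, the tree grows deeper in regions of good moves, exactly as described above.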

What is the history of computers playing Go?

In the early 2000s Computer Go programs could not beat a child

But then AlphaGo defeated a human champion for the first time –> this was not expected

How does the algorithm of AlphaGo work?

AlphaGo combined Monte-Carlo Search, Deep Learning and Reinforcement Learning

used a (fast) learned roll-out policy instead of random sampling

use a depth-limit in MCTS where a learned evaluation function is used instead of real game outcomes

What are the four key learning steps of Alpha Go?

roll-out policy –> learned from expert games, inaccurate but fast

expert policy –> learned from expert games, accurate but takes a long time to compute

selection policy –> refine expert policy with reinforcement learning

utility function –> trained using self-play from the selection policy

What is Alpha Zero and how does it work?

An improved version of AlphaGo that learned to play purely from self-play

It brought improvements not only to Go but also to games like Chess

What is opponent modeling?

try to predict the opponent's next move

try to predict what move the opponent predicts that your next move will be and so forth

What is the regret in the context of the multi-armed bandit problem?

Difference between your winnings and the winnings you would have obtained if you had always played the (unknown) optimal arm.
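The definition above can be written out as a tiny sketch (all numbers hypothetical):

```python
# Regret after `plays` rounds: what the (unknown) optimal arm would have
# paid on average, minus the winnings actually received.
def regret(rewards_received, best_arm_mean, plays):
    return best_arm_mean * plays - sum(rewards_received)
```

For example, winning 2 units over 4 plays while the optimal arm pays 1 per play gives a regret of 2.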

What is a policy regarding reinforcement learning?

Policy: A strategy for selecting actions

What are the two types of learned networks?

Policy networks –> output move probabilities

Value networks –> output evaluation functions (position values)

What is the key idea of reinforcement learning?

Learning of policies (action selection strategies) based on feedback from the environment (reinforcement)

How do MENACE and TD-Gammon work?

MENACE:

Learns to play Tic-Tac-Toe using 287 matchboxes filled with beads in 9 different colors (one color for each square). While playing, one selects the matchbox of the current position, draws a random bead from the box and plays the move indicated by the bead's color. If the game is lost, all drawn beads are confiscated (negative reinforcement); if the game is drawn, the beads are put back; if the game is won, each drawn bead is put back together with an extra bead of the same color (positive reinforcement)

TD-Gammon: Learns to play backgammon. Improvements over MENACE:

faster convergence because of Temporal Difference (TD) Learning - values are updated after every move

a neural network (NN) is used instead of matchboxes - allows generalization

use of positional characteristics as features
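The TD idea behind TD-Gammon can be sketched as a tabular TD(0) update; TD-Gammon itself uses a neural network rather than a table, so this is only the core update rule:

```python
# TD(0): after each move, nudge the value of the previous state toward
# the observed reward plus the (discounted) value of the next state.
def td_update(values, state, next_state, reward, alpha=0.1, gamma=1.0):
    v = values.get(state, 0.0)
    target = reward + gamma * values.get(next_state, 0.0)
    values[state] = v + alpha * (target - v)
```

Because the update happens after every move rather than only at the end of the game, value estimates converge faster than with MENACE-style end-of-game reinforcement.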

What is the exploration vs. exploitation trade-off?

Example on multi-armed bandit problem: Pull the best arm in order to maximize your earnings (exploitation) vs. try if other arms are more promising (exploration)

solved by UCB (upper confidence bound) algorithm consisting of:

exploitation term - average earning of arm i

exploration term - grows with how rarely arm i has been played (inverse of its play count)

always play the arm with highest UCB

Why does deep search not work with chance games?

Player A cannot directly maximize their gain because they don’t know what B’s legal actions will be. Standard game trees are therefore extended with chance nodes - an expected value (average) of the outcome needs to be computed at each chance node. A deep look ahead is not feasible since the probability of reaching a given node shrinks with increasing depth.

How can you approximate an expected value at a position?

If exact probabilities for the outcomes at the chance nodes are not known, values can be sampled.

Simulation Search:

we can limit the breadth of the search tree and sample some lines to the full depth

estimate the expected values of each move by counting the number of wins in a series of complete games

at each chance node, select one of the options at random (according to its probability); at MAX and MIN nodes, make a move using a fast decision algorithm

What is the Multi-Armed Bandit problem?

Find the arm (move) with the highest long-term utility from observations of immediate utilities with as little regret as possible

What is Monte-Carlo Search?

Extreme case of simulation search - play a large number of games where both players make their move randomly, average the scores of these games and make the move that has the highest average score. Can also be used in deterministic games.

How can Monte-Carlo Techniques be integrated with Game Tree Search?

Monte-Carlo Tree Search (MCTS):

incrementally build up a search tree, evaluate the leaf nodes via roll-outs, evaluate promising moves more often than bad moves but give bad moves also a chance (UCB algorithm) - tree grows deeper in regions of good moves

uses two policies

Tree Policy: how is the tree traversed in order to find the best leaf to be expanded

Roll-out/Default Policy: how are the moves in the roll-out games selected

UCT Search - the best-known formulation of MCTS, it combines a UCB-based tree policy with random roll-outs.

How does AlphaGo work?

AlphaGo combines MCTS with deep learning and reinforcement learning from self-play.

use a (fast) learned roll-out policy

use a depth-limit in MCTS where a learned evaluation function is used

learn from expert games a fast but inaccurate roll-out policy

learn from expert games an accurate expert policy

refine the expert policy into a more accurate selection policy

use self-play from the selection policy to train a utility function

What is the key (name giving) difference between AlphaGo and AlphaZero?

AlphaZero ("zero") started without any human knowledge, i.e. from random behavior and pure self-play

What is the difference between optimal and maximal play?

For simple games we know optimal solutions, but optimal play is not maximal play. Mind the difference between maximizing your winnings (exploiting your opponent's weaknesses) and playing optimally.

Opponent Modeling - try to predict the opponent’s next move, try to predict what move the opponent predicts that your next move will be,…
