What elements are there in reinforcement learning?
agent
environment
action u_t
state x_t
rewards r_t
What is the cyclic structure of RL?
What is an agent in RL?
decision taking unit
-> in our case computer executing policy/Strategy
What is the environment in RL?
everything outside the agent
-> theory: universe
usually sufficient to only consider small part e.g. space in proximity of agent…
What is a reward in RL?
scalar signal that agent receives
-> depends on how it is performin gin environment
-> in our case: we design what is rewarded
What is a state in RL?
a signal describing the environment
-> e.g. positions and velocities of limbs of a robot…
often assumed to be markovian (full information)
What is an action in RL?
sth the agent decides on
-> based on the given policy
What is the goal of RL?
train agent in a way
that it receives as much rewards as possible…
What is a probability mass function?
gives probability that some discrete random variable
is exactly equal to some value
e.g. Dice:
P(x=n) = 1/6 for n element {1,…,6}
How to calculate expected value?
sum over random varibales
value of random variable * probailty that this random vairable shows…
What are some useful relations?
P(x) = Sum over i P(x,yi)
P(x,y) = P(x|y) P(y)
When is a state markov?
P(x(t+1)|x(t)) = P(x(t+1)|x(t), x(t-1), …, x(0))
=> current state only depends on directly previous state
=> markov has no memory…
What is a markov process?
sequence of random states with markov property
-> basically: state machine with probabilities to other state
-> as we only have individual states (no two times the same state with different edges) -> has no memory…
What are markov decision processes?
markov process
with additional rewards
and the possiblity to afect transition probabilties
What is the goal of markov decision processes?
find strategy which maximizes future rewards
i.e. find probabilities (traversal probabilities)
with condition that all outgoing probabilities of each state sum up to 1
maximizing rewards
=> sum over rewards over all timesteps (in our example this was infinity)
sum t gamma^t * r(t)
=> gamma is discount factor
What is the discount factor?
gamma can be between [0,1]
due to having gamma^t
-> if gamma smaller 1, then immediate rewards become more important, as in the future, rewards will be more and more discounted…
What is an additional reason to use reward factor?
markov decision process does not terminate (uness we have final state)
-> infinite sum for maximizing reward
-> discount ensures that sum converges…
What is the variable notation for markov decision processes?
thus, action u(t) ~ pi(u|x(t))
x(t+1) ~ f(x|x(t), u(t))
=> ~ probability of…
What are assumptions in markov decision process?
policy and behavior of environment
are discrete probability distributions
x is markovian state
Interpretation pi(u(i)|x(i))
probabilty that i take actoin i
when i am in state i
What are splits in actions in the markov decision process?
e.g. have policy probabiilty pi(ui|xi)
-> outcome can depend on f -> probability of environment state…
f(x|xi, ui)
-> probabiilty for different outcome states depending on previous state and action…
How do we denote things in the lecture?
replace pi(…)
-> simply with action we take
as xi is implicitly clear based on the graph…
What is a grid world?
representation of finite markov decision process with fixed start and finite state
-> each state represented by rectangular in grid
-> can move in all directions
when other rectangular there -> move to this
when “wall” (no rectangular/state) -> means no movement…
each state is assigned a reward
-1 for green and grey state
-2 for red state
goal: get to final state with maximal reward (minimal penalty…)
What is the value funciton
function depending on state and policy
=> function returns expected future reward
starting in state x and then always following policy pi
=> basically: when we have a fixed policy and start in x -> what is our expected reward?
=> can e.g. be used to evaluate current policy…
What is the action value funciton
functoin depending on state, next action and a policy
expected future reward when starting in x, next action being u and then following fixed policy
=> allows comparing next actions based on fixed state and policy
How to calcualte valie function?
(last one is Bellman equation)
polcy basierter erwartungswert von
-> nächstem reward + discount factor mal value function von nöchstem wert…
bedingt auf warhscheinlichkeit dass wir gerade in dem entsprechenden state sind
=> rekursiv…
How to calculate action value function?
erwartungswert von
nächstem zustand
plus discount factor mal dem der action value function vom nächsten zustand und der policy auf dem nächsten zustand (entspricht value fuinction auf nächstem zustand…)
bedingt dass wir in entsprechendem zustand sind und entsprechende action wählen
auch rekusriv…
What is the workflow of polcy improvement?
start at random policy
evaluate policy (i.e. calculate value function)
improve policy
goto 2
How can we calculate the value function in discrete space (grid world)?
assumptino: random strategy (direction probability always 0.25)
Again intuition value funciton?
sum over each action
probability of taking this action
*
reward of this actoin
+
discount factor times value function of the next state
=> when e.g. having environment probabiltes…
-> then split by:
probability of taking this action *
(probabiltiy of first next state *
reward of this state + discoun * VF of next state…
…
probabilty of n-th next state + discoutn * VF of n-th state
Again intuition calculating Q(x,u)?
leave policy out
-> we have fix policy thus..,.
-> reward for next state + discount* value function of next state
or in case we have different environment probabiltiies…
-> sum of these probabilities times the reward plus discount and value function…
How can we calcualte the value fucntion in the grid workld?
assume: given rewards for each field
-> simply sum up:
probabiltiy of each following state (1/4)
times full reward (-1 for grey and green, -2 for red… + reward of this filed (given or current field for wall…)
=> replace current value with this… until grid converges…
How can one solve the markov decision process for small number of states?
simpyl note down the equations of each value funciton
-> yields LGS with fix number of unknown variables
-> solve …
=> using bellman…
How can we improve the policy after doing the policy evaluation? (greedy)
create new deterministic policy
choosing in every state the action with the most expected future reward
according to the old V(x)
=> e.g. in grid world
-> choose direction to least negative neighbor…
What must not be forgotten in policy improvement?
take discount into consideration when evaluating for the next step…
=> best result = reward + discount * next value funciton…
Does policy iteration guarantee optimality? Are ther more efficient alternatives?
yes…
yes… => for simple MDPs with known transitions there are much more efficient algos…
What is the approach of the generalized policy iteration? Advantage?
reduce compuation time
=> not always necessary to let value function converge…
=> make earlier improvements to policy
=> do partial policy evaluation
What is another way to improve policy iteration (consider Q…)
calcualte action value function(s) for each state
=> e.g. in grid world -> yields four values for each state
=> allows to decide, which next aciton to take…
=> as it is interpreted as: what reward can i expect when i do x…
What are advantages of using Q funciton instad of value function?
directly contains all information we need to do policy improvement
=> no need to know the transition probabilities…
ALLOWS FOR US TO NOT KNOW THE ACTUAL MODEL TRANSITION PROBABILTIEIS…
What are disadvantages of using Q funciton instad of value function?
more memory required and needs more time to be trained
=> more memory as we have matrix with column for each action instead of single vector…
Different optimization problems for policy improvement Q and V based
What is a problem so far with the approach we have?
we need to know transition dynamics (e.g. where do we end up when we take action x…)
What is a solution when we do not actually know the model?
use data from the interacctions with the MDP…
=> assume data was generated by policy
=> randomly do some actions
-
How do we collect inforamtino if we do not know the model?
iterate over all data tuples
(current state, action, reward, next state)
=> get to know how model reacts for each action w.r.t reward and next state…
How are value function and action value function adjusted for the iterative, arndom process…?
we have value specific policy
-> (action) value function initially random…
-> new value function at state x(t)
=
(1-a) * old value function
a * (reward + discount * old value function of next state)
=> intuition:
learning rate alpha
-> decay old informatino (how much reward can i expect at current state)
add new infomation (how much reward can i expect by doing a specific action… -> got this by interaction…)
What is a neccesarry assumption fo r convergence (in model free learning)?
all states and actions have non-zero probabiltiy of being visited
cannot do greedy, as we there have deterministic policy…
learning rate is decreasing
sum over learning rates = infinity
sum over squared learning rates < infinity
we learn an infinite amount of time
=> in practice, assumptions can be relaxed and results will still be good…
How can one do model free learning while still using greedy?
use epsilon-greedy policy
-> do greedy update, but give other actions a small probabiliy too
probabiltiy greedy -> 1-epsilon
-> rest shares epsilon…
Do we actually have to reduce the learning rate?
not necessarily
-> can be
but sometimes keeping constant is enoug
-> i.e. similar to neurla networks…
Do we actually need to train infinitely in model free learning?
no
-> same as policy improvement
=> can stop at some point…
What is the advantage of Q-Learning?
does policy evaluation and improvement at the same time
How does Q-Learning operat?
only uses action value fuinction
=> similar to policy evaluation step
But directly use the best outcome (max_u Q…)
What are some considerations in Q-Learning?
some problems have increadibly many states
-> not practical / possible
combine similar states to generarlize?
are all states relevant? Do have some a very low / 0 probabiltyi?
use NN to find Q Funtion / and or policy on relevant states only and learn to genearlize…
Why not use deterministic strategy for learning?
to allow for exploration of possible action - reward situations
-> and actually learn…
Why should we not use deterministic policies when sampling from markv process?
rist to revisit same states over and over again
need all states in order to learn value funciton
all states must have non-zero probabilities of being visited in order to learn complete value function
What is q-learning?
algoritm based on action value function
algo for solving markov decision process through interaction with the process
only guaranteed to find optimal policy for non-zero probabilities for the actoins
Zuletzt geändertvor 2 Jahren