In the Robot Learning module, we explored how robots can use machine learning to tackle complex tasks. We began by working through Sutton and Barto’s book Reinforcement Learning: An Introduction, which provides a thorough introduction to the concepts and algorithms of reinforcement learning and is considered a standard reference in the field. I found it both engaging and accessible, even though I had little prior knowledge of the subject at the time.
For the exam, however, the main focus was on understanding and applying the mathematical foundations and formulas from the book and lectures. To prepare, I created a formula sheet that summarized the key concepts and equations.
Robot Learning Formula Sheet
Fundamentals
Elements of a Reinforcement Learning System
In Reinforcement Learning, an agent interacts with an environment in discrete time steps. At each time step t, the agent receives a state s_t from the environment and selects an action a_t based on its policy π. The environment then transitions to a new state s_{t+1} and provides a reward r_{t+1} to the agent.
Markov Property
The Markov Property states that the future is independent of the past given the present. In Reinforcement Learning, this means the next state and reward depend only on the current state and action, not on the sequence of events that preceded it.
\Pr(s_{t+1} = s,\, r_{t+1} = r \mid s_0, a_0, r_1, \dots, r_t, s_t, a_t)
= \Pr(s_{t+1} = s,\, r_{t+1} = r \mid s_t, a_t)
Since the indices in the formula above can be confusing, we can write out the sequence of states, actions, and rewards in question: s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, …
Return
The return at a point t of a trajectory is the sum of the discounted future rewards
R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{i=0}^{\infty} \gamma^i \, r_{t+1+i}
= r_{t+1} + \gamma R_{t+1}
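To make the two ways of writing the return concrete, here is a tiny Python sketch (my own illustration, not from the lecture) that computes R_0 once by the direct discounted sum and once by the backward recursion R_t = r_{t+1} + γR_{t+1}; both give the same number.

```python
# Discounted return: direct sum vs. backward recursion (illustrative sketch).
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 3.0]  # r_1, r_2, r_3, r_4 of a short episode

# Direct sum: R_t = sum_i gamma^i * r_{t+1+i}
def return_direct(rewards, t, gamma):
    return sum(gamma**i * r for i, r in enumerate(rewards[t:]))

# Backward recursion: R_t = r_{t+1} + gamma * R_{t+1}, with R_T = 0
returns = [0.0] * (len(rewards) + 1)
for t in reversed(range(len(rewards))):
    returns[t] = rewards[t] + gamma * returns[t + 1]

print(return_direct(rewards, 0, gamma), returns[0])  # both print 4.807
```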
RL Agent goal
The goal of a Reinforcement Learning agent is to find a policy π that maximizes the expected return from the start state s0.
\mathbb{E}_\pi[R_0]
Bellman Equations
The Bellman Equations are fundamental recursive relationships in Reinforcement Learning that relate the value of a state (or state-action pair) to the values of its successor states (or state-action pairs). They are used to compute value functions and are the basis for many RL algorithms.
The Bellman Equation
The Bellman Equation expresses that instead of calculating the discounted cumulative reward by summing up all future rewards, we can decompose it into the immediate reward plus the discounted value of the next state.
State-Value function via Bellman Equation
The value of a state is the expected return that can be acquired from that state onward when following a specific policy:
V^\pi(s) = \mathbb{E}_\pi[R_t \mid s_t = s]
= \mathbb{E}_\pi[r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s]
= \sum_a \pi(a \mid s) \sum_{s'} P^a_{ss'} \left( R^a_{ss'} + \gamma V^\pi(s') \right)
Action-value function via Bellman Equation
Q^\pi(s, a) = \mathbb{E}_\pi[R_t \mid s_t = s, a_t = a]
= \mathbb{E}_\pi[r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s, a_t = a]
= \sum_{s'} P^a_{ss'} \left( R^a_{ss'} + \gamma V^\pi(s') \right)
Bellman Principle of Optimality
The Bellman principle of optimality states that any part of an optimal policy must itself be optimal for the subproblem starting from that point.
Therefore, the value of a state under an optimal policy must equal the expected return for the best action from that state. Because subproblems of an optimal solution are themselves optimal, selecting the next action is itself such a subproblem, so the optimal action must be chosen there as well.
If a policy is optimal, the remaining decisions must also be optimal.
Bellman Optimality Equation For V
V^*(s) = \max_a Q^*(s, a)
= \max_a \mathbb{E}[r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a]
= \max_a \sum_{s'} P^a_{ss'} \left( R^a_{ss'} + \gamma V^*(s') \right)
Bellman Optimality Equation For Q
Q^*(s, a) = \mathbb{E}[r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a]
= \sum_{s'} P^a_{ss'} \left( R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \right)
Dynamic Programming Methods
Dynamic Programming methods are used to solve Markov Decision Processes (MDPs) when the model of the environment is known. They rely on the Bellman equations to iteratively compute value functions and improve policies.
Iterative Policy Evaluation
Use Bellman Equation for V to iteratively evaluate a policy.
V(s_t) = \sum_a \pi(a \mid s_t) \sum_{s'} P^a_{s_t s'} \left( R^a_{s_t s'} + \gamma V(s') \right)
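As an illustration, a minimal NumPy sketch of iterative policy evaluation on a tiny made-up MDP (the arrays P, R and the random policy are invented for this example, with the convention P[a, s, s'] and R[a, s, s']):

```python
import numpy as np

# Tiny made-up MDP: 3 states, 2 actions (all numbers purely for illustration).
P = np.array([
    [[0.8, 0.2, 0.0], [0.0, 0.6, 0.4], [0.0, 0.0, 1.0]],  # action 0
    [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]],  # action 1
])
R = np.zeros_like(P)
R[:, :, 2] = 1.0            # reward 1 for every transition into state 2
pi = np.full((3, 2), 0.5)   # uniformly random policy pi(a|s)
gamma, V = 0.9, np.zeros(3)

for _ in range(1000):
    Q = (P * (R + gamma * V)).sum(axis=2)   # Q[a, s] = sum_s' P (R + gamma V(s'))
    V_new = (pi * Q.T).sum(axis=1)          # V(s) = sum_a pi(a|s) Q(s, a)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
print(V)
```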
Policy Improvement Theorem
If the one-step lookahead using policy π′ is better than or equal to the value of the current policy π, π′ is at least as good as π.
If for two deterministic policies π and π′ it holds that
Q^\pi(s, \pi'(s)) \ge V^\pi(s) \quad \forall s
then π′ is at least as good as π.
Policy Iteration
Evaluation step:
V(s) = \sum_{s'} P^{\pi(s)}_{ss'} \left( R^{\pi(s)}_{ss'} + \gamma V(s') \right)
Improvement step:
\pi(s) = \arg\max_a Q^\pi(s, a) = \arg\max_a \sum_{s'} P^a_{ss'} \left( R^a_{ss'} + \gamma V(s') \right)
Value Iteration
V(s) = \max_a Q(s, a) = \max_a \sum_{s'} P^a_{ss'} \left( R^a_{ss'} + \gamma V(s') \right)
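A corresponding sketch of value iteration, using the same made-up array convention as the policy evaluation example above; the greedy policy extracted at the end matches the "Optimal Policy given V" formula in the next subsection.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Value iteration on a tabular MDP with P[a, s, s'] and R[a, s, s']."""
    n_states = P.shape[1]
    V = np.zeros(n_states)
    for _ in range(max_iters):
        Q = (P * (R + gamma * V)).sum(axis=2)   # Q[a, s]
        V_new = Q.max(axis=0)                   # V(s) = max_a Q(s, a)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    policy = Q.argmax(axis=0)                   # greedy policy pi*(s)
    return V, policy

# usage: V_star, pi_star = value_iteration(P, R)   # arrays from the sketch above
```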
Computing the optimal policy
Given the optimal value function, we can compute the optimal policy.
Optimal Policy given V
Take the action with the highest value for the current state.
\pi^*(s) = \arg\max_a \sum_{s'} P^a_{ss'} \left( R^a_{ss'} + \gamma V(s') \right)
Optimal Policy given Q
\pi^*(s) = \arg\max_a Q(s, a)
Temporal Difference Learning
TD(λ) update
V(s_t) = V(s_t) + \alpha \left( R_t^{(\lambda)} - V(s_t) \right)
with
R_t^{(\lambda)} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}
where
R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n})
- The factor (1 − λ) makes the weights sum to one.
- λ^{n−1} weights the n-step returns:
  - λ = 0: only the 1-step return survives, since 0^0 = 1 and 0^n = 0 for all other n.
  - λ → 1: the weight shifts towards the full-episode return, i.e. Monte Carlo.
Eligibility Traces in backward view
Eligibility traces assign credit to recently visited states. In the backward view, all states are updated with the current TD error, weighted by their eligibility trace.
The eligibility trace decays and is updated like this:
e(s) = \gamma \lambda \, e(s) + \mathbf{1}(s = s_t)
In the backward view of TD(λ), the value function is updated for all states:
V(s) = V(s) + \alpha \, e(s) \left( r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right)
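As a sketch of the backward view in code (my own illustration): env is an assumed object with a Gym-like reset()/step() interface and the behaviour policy is simply random, both placeholders.

```python
import numpy as np

def td_lambda_episode(env, V, n_actions, alpha=0.1, gamma=0.99, lam=0.9):
    """One episode of backward-view TD(lambda) on a tabular value function V.

    `env` is assumed to provide reset() -> state (int) and
    step(action) -> (next_state, reward, done).
    """
    e = np.zeros_like(V)                     # eligibility traces, one per state
    s = env.reset()
    done = False
    while not done:
        a = np.random.randint(n_actions)     # placeholder: random behaviour policy
        s_next, r, done = env.step(a)
        delta = r + gamma * V[s_next] * (not done) - V[s]   # TD error
        e *= gamma * lam                     # decay all traces
        e[s] += 1.0                          # bump the trace of the visited state
        V += alpha * delta * e               # update all states, weighted by trace
        s = s_next
    return V
```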
TD(0) update
When λ=0, we only consider the one-step return. This is called TD(0) or one-step TD.
V(s_t) = V(s_t) + \alpha \left( r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right)
Monte Carlo Update rule
V(s_t) = V(s_t) + \alpha \left( R_t - V(s_t) \right)
For α = 1/n(s), where n(s) is the number of times state s has been visited, this becomes the sample-average method (an action-value method mentioned at the beginning of the lecture).
TD Control Methods
Sarsa
Sarsa is a temporal difference learning method that uses action values. We update them using the one-step TD target, so updates can be performed incrementally and not only after the episode is finished.
Q(s_t, a_t) = Q(s_t, a_t) + \alpha \left( r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right)
In practice, eligibility traces are used to weigh the influence of the current transition on previously visited state-action pairs, decaying over time. With them, updates are applied incrementally not only to the most recent pair but to all earlier ones.
Sarsa(λ)
Extends SARSA by introducing eligibility traces to speed up learning.
Q-Learning
Q-Learning is an off-policy TD method that learns Q-values. It learns as if the optimal action will always be chosen (target policy), while it actually keeps exploring, e.g. with an epsilon-greedy behaviour policy. Compared to Sarsa, it might learn a policy that is unsafe to follow while still exploring.
Q(s_t, a_t) = Q(s_t, a_t) + \alpha \left( r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right)
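A minimal tabular Q-Learning helper as a sketch (Q is a [n_states, n_actions] array; the Sarsa update would use Q[s_next, a_next] for the actually chosen next action instead of the max):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One Q-Learning update on a tabular Q of shape [n_states, n_actions]."""
    target = r + gamma * np.max(Q[s_next]) * (not done)   # max over next actions
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

def epsilon_greedy(Q, s, epsilon=0.1):
    """Behaviour policy: explore with probability epsilon, else act greedily."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))
```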
Policy Gradient Methods
REINFORCE
Instead of learning value functions, we directly learn a parameterized policy that selects actions. Use gradient ascent to optimize the policy parameters θ to maximize the expected return.
\nabla_\theta J(\theta) = \mathbb{E}_\pi \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(s_t, a_t) \, G_t \right]
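To make the gradient concrete, here is a sketch (my own, assuming a linear softmax policy over action features φ(s, a); the episode data and step size are placeholders) of the REINFORCE gradient for one recorded episode. The lecture's example policy parameterization follows below.

```python
import numpy as np

def softmax_policy(theta, features):
    """pi_theta(a|s) for a linear softmax policy; features has shape [n_actions, n_params]."""
    logits = features @ theta
    logits -= logits.max()                   # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def reinforce_gradient(theta, episode, gamma=0.99):
    """Monte Carlo policy gradient: sum_t grad log pi(a_t|s_t) * G_t.

    `episode` is a list of (state_features, action, reward) tuples,
    where state_features has shape [n_actions, n_params].
    """
    grad = np.zeros_like(theta)
    G = 0.0
    for features, a, r in reversed(episode):         # accumulate G_t backwards
        G = r + gamma * G
        p = softmax_policy(theta, features)
        # grad log pi(a|s) = phi(s, a) - sum_b pi(b|s) phi(s, b)
        grad += (features[a] - p @ features) * G
    return grad

# usage (toy): theta += 0.01 * reinforce_gradient(theta, episode)  # gradient ascent
```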
Example for Policy Gradient Policy
\pi_\theta(s) = \arg\max_a \hat{Q}(s, a) = \arg\max_a \left( \theta_1 f_1(s, a) + \dots + \theta_n f_n(s, a) \right)
or softmax!
Actor-Critic Methods
REINFORCE has high variance because it uses the full return Gt as an estimate of the action value. Actor-Critic methods reduce this variance by using a learned value function as a baseline.
- The actor represents the policy and is updated via a gradient step, like in REINFORCE.
- The critic represents the value or action-value function and provides feedback to the actor, e.g. the TD error or an advantage estimate. It is updated using TD learning.
Improvements for Policy Gradient Methods
- Introduce a baseline to reduce variance (e.g. the value function)
- Proximal Policy Optimization (PPO): Limit the size of policy updates to ensure stable learning.
- use the probability ratio between the new policy and the old policy instead of \nabla_\theta \log \pi (written out below)
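For reference, this is the probability ratio and the clipped surrogate objective as given in the PPO paper (\hat{A}_t is an advantage estimate and \epsilon the clipping parameter; this exact form is added here for completeness and is not necessarily how it appeared in the lecture):

r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}

L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta) \hat{A}_t,\; \mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t \right) \right]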
Linear Methods
LQR
s_{t+1} = A_t s_t + B_t a_t
LQR reward function
Maximize
r(s_0, a_0) + \dots + r(s_T, a_T)
where
r(s_t, a_t) = -\left( s_t^T Q_t s_t + a_t^T R_t a_t \right)
- Q_t penalizes being far from the desired state
- R_t penalizes large actions
Optimal LQR policy
a_t = -L_t s_t = \pi^*(s_t)
where
L_t = \left( B_t^T P_{t+1} B_t - R_t \right)^{-1} B_t^T P_{t+1} A_t
Linearizing a function
Given a (possibly nonlinear) dynamics function f(s_t, a_t) that we want to bring into the form
f(s_t, a_t) = A_t s_t + B_t a_t + w_t
we can linearize it around a reference point (\bar{s}_t, \bar{a}_t) with
f(s_t, a_t) \approx f(\bar{s}_t, \bar{a}_t) + \nabla_s f(\bar{s}_t, \bar{a}_t)(s_t - \bar{s}_t) + \nabla_a f(\bar{s}_t, \bar{a}_t)(a_t - \bar{a}_t)
To recover the A_t and B_t matrices:
A_t = \nabla_s f(\bar{s}_t, \bar{a}_t)
B_t = \nabla_a f(\bar{s}_t, \bar{a}_t)
C_t = f(\bar{s}_t, \bar{a}_t) - A_t \bar{s}_t - B_t \bar{a}_t
=> Pull the constant offset C_t into the state by appending a constant 1 to the state vector (adding one dimension).
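In practice the Jacobians can be approximated numerically; here is a finite-difference sketch (the dynamics function f and the step size eps are placeholders):

```python
import numpy as np

def linearize(f, s_bar, a_bar, eps=1e-5):
    """Finite-difference linearization of dynamics s_next = f(s, a) around (s_bar, a_bar).

    Returns A, B, C so that f(s, a) ≈ A s + B a + C near the reference point.
    """
    s_bar = np.asarray(s_bar, dtype=float)
    a_bar = np.asarray(a_bar, dtype=float)
    f0 = np.asarray(f(s_bar, a_bar), dtype=float)
    n, m = len(s_bar), len(a_bar)
    A = np.zeros((len(f0), n))
    B = np.zeros((len(f0), m))
    for i in range(n):                               # d f / d s, column by column
        ds = np.zeros(n); ds[i] = eps
        A[:, i] = (f(s_bar + ds, a_bar) - f0) / eps
    for j in range(m):                               # d f / d a
        da = np.zeros(m); da[j] = eps
        B[:, j] = (f(s_bar, a_bar + da) - f0) / eps
    C = f0 - A @ s_bar - B @ a_bar                   # constant offset C_t
    return A, B, C
```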
Differential Dynamic Programming (DDP)
Similar to LQR, but can handle nonlinear dynamics and non-quadratic cost functions. Linearizes the dynamics and solves the LQR problem iteratively.
Backup Diagrams
(Backup diagrams were shown here for the Bellman equation for V, the Bellman equation for Q, the Bellman optimality equations for V and Q, and for Q-Learning.)
Bayesian Methods
Bayesian methods provide a probabilistic framework for modeling uncertainty in Reinforcement Learning. They can be used for model learning, state estimation, and decision-making under uncertainty.
Bayes' Theorem
P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}
Or with two states and measurement:
p(s_1 \mid z) = \frac{p(z \mid s_1) \cdot p(s_1)}{p(z)} = \frac{p(z \mid s_1) \cdot p(s_1)}{p(z \mid s_1) \cdot p(s_1) + p(z \mid s_2) \cdot p(s_2)}
You can calculate the updated p(s_2) after taking an action with the same kind of formula as p(z), i.e. a total-probability sum over the possible predecessor states.
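A minimal sketch of the two-state measurement update with made-up numbers for the prior and the measurement likelihoods:

```python
# Two-state Bayes update after a measurement z (all numbers are made up).
prior = {"s1": 0.5, "s2": 0.5}          # p(s1), p(s2)
likelihood = {"s1": 0.8, "s2": 0.3}     # p(z | s1), p(z | s2)

# p(z) by total probability
p_z = sum(likelihood[s] * prior[s] for s in prior)

# posterior p(s | z) = p(z | s) * p(s) / p(z)
posterior = {s: likelihood[s] * prior[s] / p_z for s in prior}
print(posterior)   # {'s1': ~0.727, 's2': ~0.273}
```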
Bayesian Inference
Instead of producing just a single “best estimate” of parameters, Bayesian inference updates probability distributions over them.
Bayesian Model Inference
Bayesian Model Inference can be used to find the model that fits the data the best:
\hat{M}, \hat{\theta} = \arg\max_{M, \theta} p(M, \theta \mid D_z)
It has two steps: Model Fitting and Model Comparison. In Model fitting, we find the parameters that explain the data best for each model. In Model comparison, we choose the model that fits the data best with its optimal parameters.
Fitting:
\hat{\theta} = \arg\max_\theta p(\theta \mid M, D_z)
Comparison:
\hat{M} = \arg\max_M p(M \mid D_z) = \arg\max_M \int p(M, \theta \mid D_z) \, d\theta
Bayesian Information Criterion BIC
The Bayesian Information Criterion (BIC) is a criterion for model selection among a finite set of models. It is based on the likelihood function and includes a penalty term for the number of parameters in the model.
BIC(M) = -2 \log p(D_z \mid M, \hat{\theta}) + k \log n
where k is the number of model parameters and n the number of data points.
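A small helper that evaluates the criterion, as a sketch (log_likelihood is the fitted log-likelihood \log p(D_z \mid M, \hat{\theta}), k the number of parameters, n the number of data points):

```python
import math

def bic(log_likelihood: float, k: int, n: int) -> float:
    """Bayesian Information Criterion: lower is better."""
    return -2.0 * log_likelihood + k * math.log(n)

# e.g. comparing two hypothetical fitted models on the same data:
# bic(-120.4, k=3, n=200) vs. bic(-118.9, k=6, n=200)
```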
Inverse Reinforcement Learning
In inverse reinforcement learning (IRL), the goal is to infer the underlying reward function that an expert is optimizing, given observations of the expert's behavior.
Feature Expectations
The feature expectations of a policy π are the expected discounted sum of feature vectors when following that policy:
\mu(\pi) = \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t \phi(s_t) \right]
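Since the expectation is usually not available in closed form, μ(π) can be estimated by averaging discounted feature sums over sampled rollouts; a sketch (rollout_features is an assumed helper that samples one episode under π and returns the visited feature vectors):

```python
import numpy as np

def estimate_feature_expectations(rollout_features, n_rollouts=100, gamma=0.99):
    """Monte Carlo estimate of mu(pi) = E[ sum_t gamma^t phi(s_t) ].

    `rollout_features()` is an assumed helper that samples one episode under the
    policy and returns the list of feature vectors [phi(s_0), phi(s_1), ...].
    """
    total = None
    for _ in range(n_rollouts):
        phis = rollout_features()
        discounted = sum(gamma**t * np.asarray(phi) for t, phi in enumerate(phis))
        total = discounted if total is None else total + discounted
    return total / n_rollouts
```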
Max Margin Formula
We try to find weights W so that the reward of the expert policy is better than the reward of any other policy by a margin, using slack variables for expert suboptimality:
w^T \mu(\pi_E) \ge w^T \mu(\pi) + m(\pi_E, \pi) - \xi \quad \forall \pi
Apprenticeship Learning
Try to match feature expectations:
\mu(\pi) \approx \mu(\pi_E)
Does not require the expert to be optimal. Reward ambiguity is not a problem because the feature expectations are matched directly.
Function Approximation
Function approximation is used in Reinforcement Learning to estimate value functions or policies when the state or action space is large or continuous. It allows generalization from seen states to unseen states.
Radial Basis Function Definition
\phi_i(s, a) = \exp\left( -\frac{1}{2\sigma^2} \|s - c_i\|^2 \right)
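A small sketch of computing these features for a state (the centers and bandwidth are made-up values; the action argument from the notation above is dropped since the formula only depends on s):

```python
import numpy as np

def rbf_features(s, centers, sigma=0.5):
    """Gaussian RBF features phi_i(s) = exp(-||s - c_i||^2 / (2 sigma^2))."""
    s = np.asarray(s, dtype=float)
    dists_sq = ((centers - s) ** 2).sum(axis=1)
    return np.exp(-dists_sq / (2.0 * sigma**2))

# e.g. three made-up centers on a 2D state space:
centers = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]])
print(rbf_features([0.2, 0.1], centers, sigma=0.5))
```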
Piecewise Linear RBF
\phi_i(s, a) = \max(0,\, 1 - b \|s - c_i\|)

Feel free to leave your opinion or questions in the comment section below.