Robot Learning

A formula sheet for the lecture

#Master's Studies #Robotics #Python #Machine Learning #Miscellaneous #University Studies

In the Robot Learning module, we explored how robots can use machine learning to tackle complex tasks. We began by working through Sutton and Barto’s book Reinforcement Learning: An Introduction, which provides a thorough introduction to the concepts and algorithms of reinforcement learning and is considered a standard reference in the field. I found it both engaging and accessible, even though I had little prior knowledge of the subject at the time.

The lectures also covered topics such as function approximation, transfer and meta-learning, inverse reinforcement learning, and kinematic models. For the exercises, we had to implement various algorithms in Python and solve tasks in the OpenAI Gym environment.

For the exam, however, the main focus was on understanding and applying the mathematical foundations and formulas from the book and lectures. To prepare, I created a formula sheet that summarized the key concepts and equations.

Robot Learning Formula Sheet

Fundamentals

Elements of a Reinforcement Learning System

In Reinforcement Learning, an agent interacts with an environment in discrete time steps. At each time step t, the agent receives a state s_t from the environment and selects an action a_t based on its policy π. The environment then transitions to a new state s_{t+1} and provides a reward r_{t+1} to the agent.

Markov Property

The Markov Property states that the future is independent of the past given the present. In Reinforcement Learning, this means the next state and reward depend only on the current state and action, not on the sequence of events that preceded it.

Pr(s_{t+1}=s, r_{t+1}=r \mid s_0, a_0, r_1, \ldots, r_t, s_t, a_t) = Pr(s_{t+1}=s, r_{t+1}=r \mid s_t, a_t)

Since the indices in the formula above can be confusing, we can visualize the sequence of states and actions in question like this:

(Figure: the sequence of states, actions, and rewards s_0, a_0, r_1, …, s_t, a_t, r_{t+1}, s_{t+1})

Return

The return at time step t of a trajectory is the sum of the discounted future rewards:

R_t = r_{\red{t+1}} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots = \sum_{i=0}^\infty \gamma^i \cdot r_{t+1+i} = r_{t+1} + \gamma R_{t+1}
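
To make the recursion concrete, here is a minimal Python sketch that computes the discounted returns of a recorded reward sequence by working backwards through it; the reward list and γ are made-up example values.

```python
# Minimal sketch: computing discounted returns R_t for a finite trajectory.
# The rewards list and gamma are made-up example values.
def discounted_returns(rewards, gamma=0.9):
    """Return [R_0, R_1, ...] using R_t = r_{t+1} + gamma * R_{t+1}."""
    returns = [0.0] * len(rewards)
    future = 0.0
    for t in reversed(range(len(rewards))):
        future = rewards[t] + gamma * future  # rewards[t] plays the role of r_{t+1}
        returns[t] = future
    return returns

print(discounted_returns([1.0, 0.0, 0.0, 10.0], gamma=0.9))
# R_0 = 1 + 0.9^3 * 10 = 8.29
```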

RL Agent goal

The goal of a Reinforcement Learning agent is to find a policy π that maximizes the expected return from the start state s_0.

\mathbb{E}_\pi(R_0)

Bellman Equations

The Bellman Equations are fundamental recursive relationships in Reinforcement Learning that relate the value of a state (or state-action pair) to the values of its successor states (or state-action pairs). They are used to compute value functions and are the basis for many RL algorithms.

The Bellman Equation

The Bellman Equation expresses that instead of calculating the discounted cumulative reward by summing up all future rewards, we can decompose it into the immediate reward plus the discounted value of the next state.

State-Value function via Bellman Equation

The value of a state is the expected return that can be acquired from that state onward when following a specific policy:

V^\pi(s) = \mathbb{E}_\pi(R_t \mid s_t=s) = \mathbb{E}_\pi(r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t=s) = \sum_a \pi(a|s) \sum_{s'} P_{ss'}^a (R_{ss'}^a + \gamma V^\pi(s'))

Action-value function via Bellman Equation

Q^\pi(s, a) = \mathbb{E}_\pi(R_t \mid s_t=s, a_t=a) = \mathbb{E}_\pi(r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t=s, a_t=a) = \sum_{s'} P_{ss'}^a (R_{ss'}^a + \gamma V^\pi(s'))

Bellman Principle of Optimality

The Bellman principle of optimality states that any part of an optimal policy must itself be optimal for the subproblem starting from that point.

Therefore, the value of a state under an optimal policy must equal the expected return for the best action from that state. Because subproblems of an optimal solution are themselves optimal, selecting the next action is a subproblem in which the optimal action must be chosen as well.

If a policy is optimal, the remaining decisions must also be optimal.

Bellman Optimality Equation For V

V^*(s) = \red{\max_a Q^*(s, a)} = \max_a \mathbb{E}(r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t=s, a_t=a) = \max_a \sum_{s'} P_{ss'}^a (R_{ss'}^a + \gamma \red{V^*(s')})

Bellman Optimality Equation For Q

Q^*(s, a) = \mathbb{E}(r_{t+1} + \gamma \red{\max_{a'} Q^*(s_{t+1}, a')} \mid s_t = s, a_t = a) = \sum_{s'} P_{ss'}^a (R_{ss'}^a + \gamma \red{\max_{a'} Q^*(s', a')})

Dynamic Programming Methods

Dynamic Programming methods are used to solve Markov Decision Processes (MDPs) when the model of the environment is known. They rely on the Bellman equations to iteratively compute value functions and improve policies.

Iterative Policy Evaluation

Use Bellman Equation for V to iteratively evaluate a policy.

V(s_t) = \sum_a \pi(a|s_t) \sum_{s'} P_{s_t s'}^a (R_{s_t s'}^a + \gamma V(s'))
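
As a rough illustration, here is a sketch of iterative policy evaluation for a small tabular MDP. The transition tensor `P`, reward tensor `R`, and stochastic `policy` array are assumed to be given in the shapes noted in the comments.

```python
import numpy as np

# Sketch of iterative policy evaluation for a small tabular MDP.
# P[s, a, s'] are transition probabilities, R[s, a, s'] expected rewards,
# policy[s, a] are action probabilities -- all assumed to be given.
def policy_evaluation(P, R, policy, gamma=0.9, theta=1e-8):
    n_states, n_actions = P.shape[0], P.shape[1]
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman expectation backup for state s
            v_new = sum(
                policy[s, a] * np.sum(P[s, a] * (R[s, a] + gamma * V))
                for a in range(n_actions)
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```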

Policy Improvement Theorem

If the one-step lookahead using policy π' is at least as good as the value of the current policy π in every state, then π' is at least as good as π.

Formally, if for two deterministic policies π and π' it holds that

Q^\pi(s, \pi'(s)) \geq V^\pi(s) \quad \forall s

then π' is at least as good as π.

Policy Iteration

Evaluation step:

V(s) = \sum_{s'} P_{ss'}^{\pi(s)} (R_{ss'}^{\pi(s)} + \gamma V(s'))

Improvement step:

\pi(s) = \text{argmax}_a Q^\pi(s, a) = \text{argmax}_a \sum_{s'} P_{ss'}^a (R_{ss'}^a + \gamma V(s'))

Value Iteration

V(s) = \max_a Q(s, a) = \max_a \sum_{s'} P_{ss'}^a (R_{ss'}^a + \gamma V(s'))
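
A corresponding sketch of value iteration, reusing the same assumed `P` and `R` arrays as above and extracting the greedy policy at the end:

```python
import numpy as np

# Sketch of value iteration with a known model (same P, R arrays as above).
def value_iteration(P, R, gamma=0.9, theta=1e-8):
    n_states, n_actions = P.shape[0], P.shape[1]
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            q = [np.sum(P[s, a] * (R[s, a] + gamma * V)) for a in range(n_actions)]
            v_new = max(q)                      # Bellman optimality backup
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Greedy policy extraction: pi*(s) = argmax_a sum_s' P (R + gamma V(s'))
    policy = np.array([
        np.argmax([np.sum(P[s, a] * (R[s, a] + gamma * V)) for a in range(n_actions)])
        for s in range(n_states)
    ])
    return V, policy
```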

Computing the optimal policy

Given the optimal value function, we can compute the optimal policy.

Optimal Policy given V

Take the action with the highest value for the current state.

\pi^*(s) = \text{argmax}_a \sum_{s'} P_{ss'}^a (R_{ss'}^a + \gamma V(s'))

Optimal Policy given Q

\pi^*(s) = \text{argmax}_a Q(s, a)

Temporal Difference Learning

TD(λ) update

V(s_t) = V(s_t) + \alpha (R_t^{(\lambda)} - V(s_t))

with

R_t^{(\lambda)} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}

where

R_t^{(n)} = r_{t+1} + \ldots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n})
  • The factor (1 − λ) makes all the weights sum to one
  • λ^(n−1) weights the n-step returns:
    • λ = 0: only the 1-step return survives (0^0 = 1, 0^n = 0 for all others)
    • λ → 1: the full-episode return dominates (Monte Carlo)

Eligibility Traces in backward view

Eligibility traces assign credit to recently visited states. In the backward view, all states are updated with the current TD error, weighted by their eligibility trace.

The eligibility trace decays and is updated like this:

e(s) = \gamma \lambda e(s) + \mathbb{1}(s = s_t)

In the backward view of TD(λ), the value function is updated for all states:

V(s) = V(s) + \alpha \, e(s) \, (r_{t+1} + \gamma V(s_{t+1}) - V(s))
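
A possible sketch of one episode of backward-view TD(λ) with accumulating traces; the `env` object is assumed to follow the classic OpenAI Gym `reset`/`step` interface, and `V` is assumed to be a defaultdict-style table of state values.

```python
from collections import defaultdict

# Sketch of one episode of backward-view TD(lambda) with accumulating traces.
# `env` is assumed to follow the classic Gym API: reset() returns the state,
# step() returns (state, reward, done, info). V is a defaultdict(float).
def td_lambda_episode(env, V, policy, alpha=0.1, gamma=0.99, lam=0.9):
    e = defaultdict(float)               # eligibility traces e(s)
    s = env.reset()
    done = False
    while not done:
        a = policy(s)
        s_next, r, done, _ = env.step(a)
        td_error = r + gamma * V[s_next] * (not done) - V[s]
        e[s] += 1.0                      # accumulate trace for the visited state
        for state in list(e.keys()):
            V[state] += alpha * e[state] * td_error   # update all traced states
            e[state] *= gamma * lam                   # decay the traces
        s = s_next
    return V
```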

TD(0) update

When λ = 0, we only consider the one-step return. This is called TD(0) or one-step TD.

V(s_t) = V(s_t) + \alpha (r_{t+1} + \gamma V(s_{t+1}) - V(s_t))

Monte Carlo Update rule

V(s_t) = V(s_t) + \alpha (R_t - V(s_t))

For α = 1/n(s), where n(s) is the number of times state s has been visited, this is the sample-average method (an action-value method mentioned at the beginning of the lecture).

TD Control Methods

Sarsa

Sarsa is a temporal difference learning method that uses action values. We update them using the one-step TD target, so updates can be performed incrementally and not only after the episode is finished.

Q(s_t, a_t) = Q(s_t, a_t) + \alpha (r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))
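
As an illustration, here is a sketch of one tabular Sarsa episode with an ε-greedy policy; the classic Gym-style `env` interface and a defaultdict-style `Q` table are assumptions.

```python
import random
from collections import defaultdict

# Sketch of one tabular Sarsa episode with an epsilon-greedy policy.
# `env` follows the classic Gym API; Q is a defaultdict(float) over (state, action).
def sarsa_episode(env, Q, n_actions, alpha=0.1, gamma=0.99, eps=0.1):
    def eps_greedy(s):
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(s, a)])

    s = env.reset()
    a = eps_greedy(s)
    done = False
    while not done:
        s_next, r, done, _ = env.step(a)
        a_next = eps_greedy(s_next)                      # on-policy: next action from the same policy
        target = r + gamma * Q[(s_next, a_next)] * (not done)
        Q[(s, a)] += alpha * (target - Q[(s, a)])        # Sarsa update
        s, a = s_next, a_next
    return Q
```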

In practice, eligibility traces are used to weight the influence of the current transition on previously visited states, with that influence decaying over time. With them, updates are applied incrementally not only to the preceding state but to all recently visited states.

Sarsa(λ)

Extends Sarsa by introducing eligibility traces to speed up learning.

Q-Learning

Q-Learning is an off-policy TD method that learns Q-values. It learns as if the optimal action will always be chosen (target policy), while it actually still explores, e.g. with an ε-greedy behaviour policy. Compared to Sarsa, it might therefore learn a policy that is unsafe to follow while exploring.

Q(s_t, a_t) = Q(s_t, a_t) + \alpha (r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t))
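
For comparison, a sketch of the same kind of episode loop with the Q-learning target; note the max over actions in the target, which can differ from the ε-greedy action actually taken. The Gym-style `env` and the defaultdict-style `Q` table are again assumptions.

```python
import random

# Sketch of the tabular Q-learning update inside an episode loop.
# Behaviour policy: epsilon-greedy; target policy: greedy (the max below).
# Q is assumed to be a defaultdict(float)-style mapping over (state, action).
def q_learning_episode(env, Q, n_actions, alpha=0.1, gamma=0.99, eps=0.1):
    s = env.reset()
    done = False
    while not done:
        if random.random() < eps:
            a = random.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda x: Q[(s, x)])
        s_next, r, done, _ = env.step(a)
        best_next = max(Q[(s_next, x)] for x in range(n_actions))
        target = r + gamma * best_next * (not done)
        Q[(s, a)] += alpha * (target - Q[(s, a)])   # off-policy: max, not the action taken
        s = s_next
    return Q
```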

Policy Gradient Methods

REINFORCE

Instead of learning value functions, we directly learn a parameterized policy that selects actions. Gradient ascent is used to optimize the policy parameters θ to maximize the expected return.

\nabla_\theta J(\theta) = \mathbb{E}_\pi \left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(s_t, a_t) \, G_t \right]
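
A minimal REINFORCE sketch with a linear softmax policy over state features; the feature vectors, parameter shapes, and the way episodes are collected are assumptions made for illustration.

```python
import numpy as np

# Minimal REINFORCE sketch with a linear softmax policy over state features.
# theta has shape (n_actions, n_features); x is a feature vector of the state.
def softmax_policy(theta, x):
    logits = theta @ x                      # one logit per action
    p = np.exp(logits - logits.max())
    return p / p.sum()

def reinforce_update(theta, episode, gamma=0.99, lr=0.01):
    """episode: list of (x_t, a_t, r_{t+1}) tuples collected with the current policy."""
    G = 0.0
    grad = np.zeros_like(theta)
    for x, a, r in reversed(episode):
        G = r + gamma * G                   # return G_t
        p = softmax_policy(theta, x)
        # grad log pi for a linear softmax policy: (1[a'=a] - pi(a'|s)) * x
        dlog = -np.outer(p, x)
        dlog[a] += x
        grad += dlog * G
    return theta + lr * grad                # gradient ascent on the expected return
```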

Example for Policy Gradient Policy

\pi_\theta(s) = \text{argmax}_a \hat Q(s,a) = \text{argmax}_a \left( \theta_1 f_1(s,a) + \ldots + \theta_n f_n(s,a) \right)

…or use a softmax over these scores instead, which yields a stochastic, differentiable policy.

Actor-Critic Methods

REINFORCE has high variance because it uses the full return G_t as an estimate of the action value. Actor-Critic methods reduce this variance by using a learned value function as a baseline.

  • The actor represents the policy and is updated via a gradient step, like in REINFORCE.
  • The critic represents the value or action-value function and provides feedback to the actor, e.g. the TD error or an advantage estimate. It is updated using TD learning (see the sketch below).
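
A rough sketch of a one-step actor-critic update for a single transition with linear function approximation; the feature vectors, parameter shapes, and learning rates are illustrative assumptions.

```python
import numpy as np

# Sketch of a one-step actor-critic update for a single transition (s, a, r, s'),
# using linear function approximation; x and x_next are feature vectors.
def actor_critic_step(theta, w, x, x_next, a, r, done,
                      gamma=0.99, lr_actor=0.01, lr_critic=0.1):
    # Critic: linear state values V(s) = w . x(s)
    v, v_next = w @ x, w @ x_next
    td_error = r + gamma * v_next * (not done) - v       # feedback signal for the actor
    w = w + lr_critic * td_error * x                     # TD(0) update of the critic

    # Actor: linear softmax policy, updated with the TD error instead of the full return
    logits = theta @ x
    p = np.exp(logits - logits.max()); p /= p.sum()
    dlog = -np.outer(p, x); dlog[a] += x                 # grad log pi(a|s)
    theta = theta + lr_actor * td_error * dlog
    return theta, w
```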

Improvements for Policy Gradient Methods

  • Introduce a baseline to reduce variance (e.g. the value function)
  • Proximal Policy Optimization (PPO): Limit the size of policy updates to ensure stable learning.
    • uses the probability ratio between the new policy and the old policy instead of ∇_θ log π

Linear Methods

LQR

LQR assumes linear system dynamics:

s_{t+1} = A_t s_t + B_t a_t

LQR reward function

Maximize

r(s_0, a_0) + \ldots + r(s_T, a_T)

where

r(s_t, a_t) = -(s_t^T Q_t s_t + a_t^T R_t a_t)
  • Q_t penalizes being far from the desired state
  • R_t penalizes large actions

Optimal LQR policy

a_t = -L_t s_t = \pi^*(s_t)

where

L_t = (B_t^T P_{t+1} B_t - R_t)^{-1} B_t^T P_{t+1} A_t
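
The sheet does not list the recursion that produces P_t: it is computed backwards in time from the final step via a Riccati recursion. As a sketch, here is a finite-horizon LQR solver written in the more common cost-minimization convention (minimize Σ s_t^T Q s_t + a_t^T R a_t); in that convention the P matrices have the opposite sign compared to the reward formulation above, but the resulting gains L_t, and hence the policy a_t = −L_t s_t, are the same. A, B, Q, R are assumed to be given, time-invariant matrices.

```python
import numpy as np

# Sketch: finite-horizon LQR via backward Riccati recursion, in the
# cost-minimization convention (minimize sum of s^T Q s + a^T R a).
# With the sheet's reward convention the P matrices flip sign, but the
# resulting gains L_t (and the policy a_t = -L_t s_t) are identical.
def lqr_gains(A, B, Q, R, T):
    P = Q                                             # value matrix at the final step
    gains = []
    for _ in range(T):
        # L = (R + B^T P B)^{-1} B^T P A
        L = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # Riccati step: P <- Q + A^T P (A - B L)
        P = Q + A.T @ P @ (A - B @ L)
        gains.append(L)
    return list(reversed(gains))                      # gains[t] gives a_t = -gains[t] @ s_t
```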

Linearizing a function

Given a (possibly nonlinear) dynamics function f(s_t, a_t) that we want to bring into the form

s_{t+1} = A_t s_t + B_t a_t + w_t

we can linearize it around a reference point (\bar s_t, \bar a_t) with a first-order Taylor expansion:

f(s_t, a_t) \approx f(\bar s_t, \bar a_t) + \nabla_s f(\bar s_t, \bar a_t)(s_t - \bar s_t) + \nabla_a f(\bar s_t, \bar a_t)(a_t - \bar a_t)

To recover the A_t and B_t matrices (plus a constant offset C_t):

A_t = \nabla_s f(\bar s_t, \bar a_t), \quad B_t = \nabla_a f(\bar s_t, \bar a_t), \quad C_t = f(\bar s_t, \bar a_t) - A_t \bar s_t - B_t \bar a_t

⇒ Pull C_t into the state by appending a constant 1 to the state vector (adding a dimension).
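
A small sketch of how such a linearization could be computed numerically with finite differences; the dynamics function `f`, the reference point, and the step size are assumptions.

```python
import numpy as np

# Sketch: numerically linearizing dynamics f(s, a) around a reference point
# (s_bar, a_bar) using forward finite differences; f is any user-supplied function.
def linearize(f, s_bar, a_bar, eps=1e-5):
    n, m = len(s_bar), len(a_bar)
    f0 = f(s_bar, a_bar)
    A = np.zeros((len(f0), n))
    B = np.zeros((len(f0), m))
    for i in range(n):                      # A_t = df/ds at the reference point
        ds = np.zeros(n); ds[i] = eps
        A[:, i] = (f(s_bar + ds, a_bar) - f0) / eps
    for j in range(m):                      # B_t = df/da at the reference point
        da = np.zeros(m); da[j] = eps
        B[:, j] = (f(s_bar, a_bar + da) - f0) / eps
    C = f0 - A @ s_bar - B @ a_bar          # constant offset, absorbed into the state
    return A, B, C
```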

Differential Dynamic Programming (DDP)

Similar to LQR, but can handle nonlinear dynamics and non-quadratic cost functions. Linearizes the dynamics and solves the LQR problem iteratively.

Backup Diagrams

Bellman Equation for V: (backup diagram)

Bellman Equation for Q: (backup diagram)

Bellman Optimality Equation for V: (backup diagram)

Bellman Optimality Equation for Q: (backup diagram)

Q-Learning: (backup diagram)

Bayesian Methods

Bayesian methods provide a probabilistic framework for modeling uncertainty in Reinforcement Learning. They can be used for model learning, state estimation, and decision-making under uncertainty.

Bayes' Theorem

P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}

Or with two states and a measurement z:

p(s_1 \mid z) = \frac{p(z \mid s_1) \cdot p(s_1)}{p(z)} = \frac{p(z \mid s_1) \cdot p(s_1)}{p(z \mid s_1) \cdot p(s_1) + p(z \mid s_2) \cdot p(s_2)}

You can calculate the updated p(s_2) after taking an action with the same kind of formula as p(z): a sum over the possible previous states, weighted by the corresponding probabilities.
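
A tiny numeric example of the two-state measurement update; all probabilities are made up.

```python
# Tiny numeric example of the two-state Bayes update; all numbers are made up.
p_s1, p_s2 = 0.5, 0.5          # prior over the two states
p_z_s1, p_z_s2 = 0.8, 0.3      # measurement likelihoods p(z|s1), p(z|s2)

p_z = p_z_s1 * p_s1 + p_z_s2 * p_s2          # normalizer p(z)
p_s1_given_z = p_z_s1 * p_s1 / p_z           # posterior p(s1|z)
print(p_s1_given_z)                          # 0.4 / 0.55 ≈ 0.727
```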

Bayesian Inference

Instead of producing just a single “best estimate” of parameters, Bayesian inference updates probability distributions over them.

Bayesian Model Inference

Bayesian Model Inference can be used to find the model that fits the data the best:

\hat M, \hat \theta = \argmax_{M, \theta} \; p(M, \theta \mid D_z)

It has two steps: model fitting and model comparison. In model fitting, we find, for each model, the parameters that explain the data best. In model comparison, we choose the model that fits the data best with its optimal parameters.

Fitting:

\hat \theta = \argmax_\theta \; p(\red{\theta \mid M, D_z})

Comparison:

\hat M = \argmax_M \; p(M \mid D_z) = \argmax_M \int p(\red{M, \theta \mid D_z}) \, d\theta

Bayesian Information Criterion (BIC)

The Bayesian Information Criterion (BIC) is a criterion for model selection among a finite set of models. It is based on the likelihood function and includes a penalty term for the number of parameters in the model.

BIC(M) = -2\log p(\red{D_z \mid M, \theta}) + k \log n

where k is the number of model parameters and n the number of data points.
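
A minimal sketch of computing the BIC from a fitted model; the log-likelihood, parameter count k, and sample size n below are placeholder values for whatever the model fit produced.

```python
import math

# Sketch: computing the BIC from a fitted model's maximized log-likelihood.
# log_likelihood, k (number of parameters) and n (number of data points)
# are placeholder values.
def bic(log_likelihood, k, n):
    return -2.0 * log_likelihood + k * math.log(n)

# Lower BIC is better when comparing models fitted to the same data.
print(bic(log_likelihood=-120.3, k=4, n=200))
```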

Inverse Reinforcement Learning

In inverse reinforcement learning (IRL), the goal is to infer the underlying reward function that an expert is optimizing, given observations of the expert's behavior.

Feature Expectations

The feature expectations of a policy π\pi are the expected discounted sum of feature vectors when following that policy:

\mu(\pi) = \mathbb{E} \left[ \sum_{t=0}^\infty \gamma^t \phi(s_t) \right]
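
A possible Monte Carlo estimate of the feature expectations from sampled trajectories; the `trajectories` list and the feature function `phi` are assumptions for illustration.

```python
import numpy as np

# Sketch: Monte Carlo estimate of feature expectations mu(pi) from sampled
# trajectories; `trajectories` is a list of state sequences generated by pi,
# and `phi` maps a state to its feature vector.
def feature_expectations(trajectories, phi, gamma=0.99):
    estimates = []
    for states in trajectories:
        mu = sum(gamma**t * np.asarray(phi(s)) for t, s in enumerate(states))
        estimates.append(mu)
    return np.mean(estimates, axis=0)     # average over the sampled trajectories
```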

Max Margin Formula

We try to find weights w so that the reward of the expert policy is better than the reward of any other policy by a margin, using slack variables ξ to allow for expert suboptimality:

w^T \mu(\pi_E) \geq w^T \mu(\pi) + m(\pi_E, \pi) - \xi \quad \forall \pi

Apprenticeship Learning

Try to match feature expectations:

\mu(\pi) \approx \mu(\pi_E)

Does not require the expert to be optimal. Reward ambiguity is not a problem because the feature expectations are matched directly.

Function Approximation

Function approximation is used in Reinforcement Learning to estimate value functions or policies when the state or action space is large or continuous. It allows generalization from seen states to unseen states.

Radial Basis Function Definition

\phi_i(s, a) = \exp\left(-\frac{1}{2\sigma^2} \| s - c_i \|^2 \right)

Piecewise Linear RBF

\phi_i(s, a) = \max\left(0, 1 - \frac{\| s - c_i \|}{b} \right)
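
A short sketch evaluating Gaussian RBF features of a state for a fixed set of centers; the centers and σ are made-up example values.

```python
import numpy as np

# Sketch: evaluating Gaussian RBF features of a state for a fixed set of
# centers; the centers and sigma are made-up example values.
def rbf_features(s, centers, sigma=0.5):
    s = np.asarray(s, dtype=float)
    dists_sq = np.sum((centers - s) ** 2, axis=1)      # ||s - c_i||^2 for each center
    return np.exp(-dists_sq / (2.0 * sigma ** 2))

centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(rbf_features([0.5, 0.5], centers))
```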

Comments

Feel free to leave your opinion or questions in the comment section below.