Blog posts

2023

Spectral Theory

1 minute read

Published:

In this post, the spectrum of a function and its spectral radius are defined. Then, some useful properties are stated.
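
For quick reference, here are the standard definitions, assuming the usual setting of a bounded linear operator (or square matrix) \(A\), which the excerpt does not state explicitly: the spectrum and the spectral radius are

\[
  \sigma(A) = \{\lambda \in \mathbb{C} : \lambda I - A \text{ is not invertible}\},
  \qquad
  \rho(A) = \sup\{\vert\lambda\vert : \lambda \in \sigma(A)\},
\]

and Gelfand's formula \(\rho(A) = \lim_{n\to\infty} \lVert A^{n} \rVert^{1/n}\) relates the spectral radius to operator norms.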

Normed and Inner product spaces

2 minute read

Published:

In this post, the definitions of normed and inner product spaces are given with illustrative examples. Then, the small gain theorem is stated.
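
As a reminder of the objects involved (the post's own examples are not reproduced here), a norm \(\lVert\cdot\rVert\) on a vector space satisfies

\[
  \lVert x \rVert = 0 \iff x = 0, \qquad
  \lVert \alpha x \rVert = \vert\alpha\vert\,\lVert x \rVert, \qquad
  \lVert x + y \rVert \le \lVert x \rVert + \lVert y \rVert,
\]

and an inner product \(\langle\cdot,\cdot\rangle\) induces the norm \(\lVert x \rVert = \sqrt{\langle x, x\rangle}\), with the two linked by the Cauchy–Schwarz inequality \(\vert\langle x, y\rangle\vert \le \lVert x\rVert\,\lVert y\rVert\).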

Some preliminaries on Nonlinear Control

9 minute read

Published:

In this post, we review some mathematical preliminaries that are important in understanding the fundamentals of Nonlinear Control theory.

Random Variables

3 minute read

Published:

In this post, we review some basic definitions to understand the fundamentals of random variables.

Random Processes

5 minute read

Published:

In this post, we review some preliminaries and axioms to understand the fundamentals of random processes.

Gradient Descent

4 minute read

Published:

In this post, we will briefly explain what Gradient Descent (GD) is, how it works, why it is useful, and where it is used.
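
As a small taste of the idea, here is a minimal Python sketch of the basic update \(x_{k+1} = x_k - \alpha \nabla f(x_k)\); the quadratic objective, step size, and iteration count below are illustrative choices, not taken from the post.

```python
import numpy as np

def gradient_descent(grad, x0, step_size=0.1, num_iters=100):
    """Repeatedly apply x <- x - step_size * grad(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - step_size * grad(x)
    return x

# Example: minimize f(x) = 0.5 * ||x - b||^2, whose gradient is x - b,
# so the iterates move toward the minimizer b.
b = np.array([1.0, -2.0])
x_min = gradient_descent(lambda x: x - b, x0=np.zeros(2))
print(x_min)  # close to [1.0, -2.0]
```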

Q-Learning: Convergence of the algorithm

1 minute read

Published:

As discussed in the previous post, in this post we will prove the convergence of the Q-learning algorithm using some useful norm tricks and the contraction mapping theorem.

2022

Q-Learning: Understanding the idea

2 minute read

Published:

As we have previously discussed in this post, TD learning uses the mean estimate method to update the mean estimate of the $Q$ value. We proved in this post the convergence of the $Q$ values following the algorithm:
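
To make the mean-estimate idea concrete, the standard incremental-mean update is

\[
  \bar{x}_{k+1} = \bar{x}_k + \tfrac{1}{k+1}\bigl(x_{k+1} - \bar{x}_k\bigr),
\]

and Q-learning applies the same idea with a bootstrapped target built from a sampled transition \((s, a, r, s')\) and step size \(\alpha_k\) (generic notation, not necessarily the post's):

\[
  Q_{k+1}(s,a) = Q_k(s,a) + \alpha_k \Bigl( r + \gamma \max_{a'} Q_k(s',a') - Q_k(s,a) \Bigr).
\]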

TD learning with Linear Function Approximation: i.i.d sampling

2 minute read

Published:

Just like the analysis that we did for the tabular TD learning algorithm in this post, here we will prove the convergence of TD learning with LFA for a wisely selected $\epsilon_k$ under the i.i.d. data sampling assumption. Formally,

Understanding TD learning with Linear Function Approximation

2 minute read

Published:

In the two previous posts (post1 and post2), we proved the convergence of the TD learning algorithm under the noise-free and i.i.d. sampling assumptions, respectively. This guaranteed convergence makes TD learning very powerful for solving reinforcement learning problems. However, one drawback of this method is its asynchronicity, which was briefly mentioned in the conclusion of this post. When we update \(Q_{k+1}(s,a)\), the entries \(\{(s',a')\in \mathcal{S}\times\mathcal{A} \vert (s',a') \neq (s,a)\}\) are not updated, so that \(Q_{k+1}(s',a') = Q_{k}(s',a')\).
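
To illustrate this asynchronicity, here is a minimal Python sketch of a single tabular update step; the array shapes, step size, and TD-error argument are illustrative assumptions rather than the post's code.

```python
import numpy as np

num_states, num_actions = 5, 3
Q = np.zeros((num_states, num_actions))
alpha = 0.1  # illustrative step size

def asynchronous_update(Q, s, a, td_error):
    """Update only the visited entry (s, a); every other entry is carried over
    unchanged, i.e. Q_{k+1}(s', a') = Q_k(s', a') whenever (s', a') != (s, a)."""
    Q_next = Q.copy()
    Q_next[s, a] += alpha * td_error
    return Q_next

Q = asynchronous_update(Q, s=2, a=1, td_error=0.5)
```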

TD learning: deeper analysis (2)

3 minute read

Published:

In the previous post, we proved the convergence of TD learning under the noise-free assumption. As a brief recap, the TD learning algorithm can be written in terms of $D^\pi$ and the resulting noise $n_k$ as

TD learning: deeper analysis (1)

5 minute read

Published:

In the previous post, we discussed the basic idea behind TD learning. In this post, we will go deeper into its analysis and prove the convergence of TD learning in the noise-free case, which we will describe shortly.

Temporal Difference (TD) Learning: Understanding the idea

4 minute read

Published:

The value iteration and policy iteration methods discussed in the previous post require knowledge of the transition matrix. To be precise, applying the Bellman operator requires computing the expected value of the value function (or Q-function) at the next possible state (and action) given the current state (and action). In other words, the model of the Markov Decision Process (MDP) is assumed to be known in advance. This is a feasible assumption when we have access to the system or to a simulator that can be used to collect data. However, this is not always the case, and therefore we need a clever method to find an optimal policy under such circumstances.
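
To see exactly where the transition model enters, the Bellman optimality operator for the Q-function, written in standard MDP notation (\(P\), \(r\), \(\gamma\)) rather than necessarily the post's, is

\[
  (\mathcal{T} Q)(s,a)
  = \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\Bigl[ r(s,a,s') + \gamma \max_{a'} Q(s',a') \Bigr]
  = \sum_{s'} P(s' \mid s,a)\Bigl[ r(s,a,s') + \gamma \max_{a'} Q(s',a') \Bigr],
\]

and the sum over \(s'\) is exactly where the transition probabilities are needed.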

Value Iteration and Policy Iteration: why it works

5 minute read

Published:

Value iteration and policy iteration are two algorithmic frameworks for solving reinforcement learning problems. Both frameworks involve iteratively improving the estimates of the value function (or the Q function) in order to find the optimal policy, which is the policy that maximizes the expected return.
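
As a rough illustration of the value-iteration half of this picture, here is a minimal Python sketch for a finite MDP; the array conventions for `P` and `R`, the discount factor, and the stopping tolerance are assumptions for illustration, not the post's implementation.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Minimal value-iteration sketch for a finite MDP.

    P[s, a, s'] -- probability of landing in s' after taking a in s.
    R[s, a]     -- expected immediate reward for taking a in s.
    Returns the value function V and a greedy deterministic policy.
    """
    num_states, num_actions, _ = P.shape
    V = np.zeros(num_states)
    while True:
        # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s,a,s') V(s')
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new
```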

Bellman equation and Contraction mapping theorem

5 minute read

Published:

The contraction mapping theorem is a fundamental result in mathematics which states that if a function is a contraction mapping on a complete metric space, then it has a unique fixed point. In the context of reinforcement learning, the fixed point of the Bellman optimality operator is the optimal value function, from which an optimal policy, a function that maps states to actions and maximizes the expected return, can be obtained.
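
For reference, in standard notation: if \(T\) is a \(\gamma\)-contraction on a complete metric space \((X, d)\), i.e. \(d(Tx, Ty) \le \gamma\, d(x, y)\) for all \(x, y\) with \(0 \le \gamma < 1\), then \(T\) has a unique fixed point \(x^{\ast}\) and the iterates \(T^{k} x_0\) converge to \(x^{\ast}\) from any starting point \(x_0\). The Bellman optimality operator is such a contraction with respect to the sup-norm,

\[
  \lVert \mathcal{T} V_1 - \mathcal{T} V_2 \rVert_{\infty} \le \gamma\, \lVert V_1 - V_2 \rVert_{\infty},
\]

with \(\gamma\) the discount factor.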