Spectral Theory
Published:
In this post, the definitions of the spectrum of a function and of its spectral radius are provided. Then, some useful properties are stated.
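For quick reference, here is a minimal sketch of the standard definitions, assuming the usual setting of a bounded linear operator (or matrix) $A$; the post's own setting and notation may differ:

$$
\sigma(A) = \{\lambda \in \mathbb{C} : \lambda I - A \text{ is not invertible}\}, \qquad
\rho(A) = \sup\{|\lambda| : \lambda \in \sigma(A)\},
$$

and, by Gelfand's formula, $\rho(A) = \lim_{k\to\infty} \|A^k\|^{1/k}$.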
Published:
In this post, some definitions of normed and inner product spaces are given with illustrative examples. Then, the small gain theorem is stated.
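As a quick reference, one common statement of the small gain theorem (an assumption about which version the post states) is: if two finite-gain stable systems $H_1$ and $H_2$ with gains $\gamma_1$ and $\gamma_2$ are connected in feedback, then, under suitable well-posedness assumptions, the closed loop is finite-gain stable provided

$$
\gamma_1 \gamma_2 < 1 .
$$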
Published:
In this post, we review some mathematical preliminaries that are important in understanding the fundamentals of Nonlinear Control theory.
Published:
In this post, we review some basic definitions to understand the fundamentals of random variables.
Published:
In this post, we review some preliminaries and axioms to understand the fundamentals of random processes.
Published:
In this post, we will briefly explain what Gradient Descent (GD) is, how it works, why it is useful, and where it is used.
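A minimal sketch of the GD update $x_{k+1} = x_k - \alpha \nabla f(x_k)$, using a hypothetical quadratic objective purely for illustration:

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.1, num_iters=100):
    """Plain gradient descent: repeatedly step against the gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - alpha * grad(x)   # x_{k+1} = x_k - alpha * grad f(x_k)
    return x

# Example: minimize f(x) = ||x - 3||^2, whose gradient is 2 * (x - 3).
x_star = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=[0.0, 0.0])
print(x_star)  # approaches [3, 3]
```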
Published:
As we discussed in the previous post, in this post we will prove the convergence of the Q-learning algorithm using some useful norm tricks and the contraction mapping theorem.
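The key ingredient, stated here as a standard fact (the post's own argument may take a different route), is that the Bellman optimality operator $\mathcal{T}$ is a $\gamma$-contraction in the sup norm, so the contraction mapping theorem gives a unique fixed point $Q^*$:

$$
\|\mathcal{T}Q_1 - \mathcal{T}Q_2\|_\infty \le \gamma \,\|Q_1 - Q_2\|_\infty, \qquad 0 \le \gamma < 1 .
$$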
Published:
As we have previously discussed in this post, TD learning uses the mean-estimate method to update the estimated mean of the $Q$ value. We proved in this post the convergence of the $Q$ values following the algorithm:
Published:
Just like the analysis we did for the tabular TD learning algorithm in this post, here we will prove the convergence of TD learning with LFA for a suitably selected $\epsilon_k$ under the i.i.d. data-sampling assumption. Formally,
Published:
From the last post we obtained the algorithm for TD learning with LFA:
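The excerpt cuts off before the update rule itself; for context only, the standard TD(0) update with linear function approximation, with features $\phi$, step size $\epsilon_k$, and weights $\theta_k$ (the post's notation and exact form may differ), reads

$$
\theta_{k+1} = \theta_k + \epsilon_k\,\phi(s_k)\left(r_k + \gamma\,\phi(s_{k+1})^\top\theta_k - \phi(s_k)^\top\theta_k\right).
$$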
Published:
In the two previous posts (post1 and post2), we proved the convergence of the TD learning algorithm under the noise-free and i.i.d. sampling assumptions, respectively. This guaranteed convergence makes TD learning very powerful for solving reinforcement learning problems. However, one drawback of this method is its asynchronicity, which was briefly mentioned in the conclusion of this post. When we update \(Q_{k+1}(s,a)\), the remaining entries \(\{(s',a')\in \mathcal{S}\times\mathcal{A} \mid (s',a') \neq (s,a)\}\) are not updated, that is, \(Q_{k+1}(s',a') = Q_{k}(s',a')\).
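A minimal sketch of this asynchronous behavior, assuming a standard tabular Q-learning update (hypothetical variable names, for illustration only):

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Asynchronous tabular update: only the visited pair (s, a) changes;
    every other entry keeps its previous value, Q_{k+1}(s', a') = Q_k(s', a')."""
    Q_next = Q.copy()
    td_target = r + gamma * np.max(Q[s_next])            # bootstrapped target
    Q_next[s, a] = Q[s, a] + alpha * (td_target - Q[s, a])
    return Q_next

# Tiny example: 3 states, 2 actions; a single transition touches one entry.
Q = np.zeros((3, 2))
Q = q_update(Q, s=0, a=1, r=1.0, s_next=2)
print(np.count_nonzero(Q))  # 1 -- all other (s', a') entries are unchanged
```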
Published:
In the previous post, we proved the convergence of TD learning under the noise-free assumption. For a brief recap, the TD learning algorithm can be written in terms of $D^\pi$ and the resulting noise $n_k$ as
Published:
In the previous post, we discussed the basic idea behind TD learning. In this post, we will go deeper into its analysis and prove the convergence of TD learning in the noise-free case, which we will describe shortly.
Published:
The value iteration and policy iteration methods discussed in the previous post require knowledge of the transition matrix. To be precise, evaluating the Bellman operator requires computing the expected value of the value function (or Q-function) at the next possible state (and action) given the current state (and action). In other words, the model of the Markov Decision Process (MDP) is assumed to be known in advance. This is a feasible assumption when we have access to the system or to a simulator used to collect the data. However, this is not always the case, and therefore we need a clever method to find an optimal policy under such circumstances.
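Concretely, for the Q-function the Bellman optimality operator involves an expectation over the next state, which is exactly where the transition model $P(s' \mid s, a)$ enters (standard form; the post's notation may differ):

$$
(\mathcal{T}Q)(s,a) = r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s,a)\, \max_{a' \in \mathcal{A}} Q(s',a').
$$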
Published:
Value iteration and policy iteration are two algorithmic frameworks for solving reinforcement learning problems. Both frameworks involve iteratively improving the estimates of the value function (or the Q function) in order to find the optimal policy, which is the policy that maximizes the expected return.
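A minimal value iteration sketch for a known tabular MDP, assuming a reward array R[s, a] and a transition tensor P[s, a, s'] (hypothetical names, for illustration only):

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    """Repeatedly apply the Bellman optimality operator until the value
    function stops changing (up to tol)."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
        Q = R + gamma * P @ V
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # optimal value and greedy policy
        V = V_new
```

Policy iteration differs mainly in that it alternates a full policy evaluation step with a greedy policy improvement step, rather than taking the single max in every sweep as above.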
Published:
The contraction mapping theorem is a fundamental result in mathematics stating that a contraction mapping on a complete metric space has a unique fixed point. In the context of reinforcement learning, the fixed point of the Bellman optimality operator is the optimal value function, from which we obtain the optimal policy, that is, the mapping from states to actions that maximizes the expected return.
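In symbols (the standard statement of the Banach fixed-point theorem): if an operator $\mathcal{T}$ maps a complete metric space $(X, d)$ into itself and satisfies

$$
d(\mathcal{T}x, \mathcal{T}y) \le \gamma\, d(x, y) \quad \text{for all } x, y \in X \text{ and some } 0 \le \gamma < 1,
$$

then $\mathcal{T}$ has a unique fixed point $x^*$ with $\mathcal{T}x^* = x^*$, and the iterates $x_{k+1} = \mathcal{T}x_k$ converge to $x^*$ from any starting point. The Bellman optimality operator is such a $\gamma$-contraction in the sup norm, which is why value iteration converges to the optimal value function.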