TD learning: deeper analysis (2)

In the previous post we proved the convergence of TD learning under the noise-free assumption. As a brief recap, the TD learning algorithm can be written in terms of $D^\pi$ and the resulting noise $n_k$ as

\[Q_{k+1} = Q_k -\epsilon_k D^\pi(Q_k-\bar{r}-\gamma P^\pi Q_k) + \epsilon_k n_k.\]

In the noise-free analysis we assumed \(n_k\equiv 0\), defined a Lyapunov function \(W(Q_k) = \frac{1}{2} \|Q_{k}-Q^\pi\|^2\), and proved that \(W(Q_{k+1})\leq \beta W(Q_k) \leq \cdots \leq \beta^{k+1} W(Q_0)\) for some \(\beta \in [0,1).\)
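As a concrete illustration of this recap, here is a minimal sketch of the noise-free iteration (\(n_k \equiv 0\)) on a small randomly generated MDP. The problem sizes, the matrices `P_pi` and `D_pi`, and the step size are illustrative assumptions, not taken from the post; the point is only that the matrix-form recursion drives $Q_k$ to $Q^\pi$.

```python
import numpy as np

# Minimal sketch of the matrix-form TD recursion with n_k = 0 on a tiny
# random MDP. All problem data below are illustrative assumptions.
rng = np.random.default_rng(0)
n, gamma, eps = 8, 0.8, 0.1            # n = |S||A| state-action pairs

P_pi = rng.random((n, n))              # state-action transition matrix P^pi
P_pi /= P_pi.sum(axis=1, keepdims=True)
r_bar = rng.random(n)                  # expected rewards
d = rng.random(n) + 1.0; d /= d.sum()  # stationary distribution d^pi
D_pi = np.diag(d)

# The fixed point Q^pi solves Q = r_bar + gamma * P_pi @ Q.
Q_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, r_bar)

Q = np.zeros(n)
for _ in range(20000):
    Q = Q - eps * D_pi @ (Q - r_bar - gamma * P_pi @ Q)

print(np.abs(Q - Q_pi).max())          # -> essentially zero
```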

Independent noise analysis

In this post, we generalize the noise-free analysis a little further. We assume that the data tuples used by the TD learning algorithm are drawn i.i.d. from the stationary distribution $d^\pi$. Formally,

\[(s_k,a_k)\sim d^\pi, \quad s_{k+1}\sim P(\cdot \vert s_k,a_k).\]

Consequently, the expected value of the noise becomes \(\mathbb{E}[n_k\vert Q_k] = 0.\)
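This zero-mean property is easy to check numerically. In the sketch below (the tiny MDP and the deterministic-reward model are assumptions made for illustration), the empirical average of the sampled TD directions approaches the expected direction $D^\pi(Q - \bar{r} - \gamma P^\pi Q)$, so the empirical mean of $n_k$ shrinks toward zero at the usual Monte Carlo rate.

```python
import numpy as np

# Monte Carlo check that the sampled TD direction is unbiased, i.e. that
# E[n_k | Q_k] = 0 under i.i.d. sampling. The MDP is an assumption; rewards
# are taken deterministic so the only noise comes from sampling.
rng = np.random.default_rng(1)
n, gamma = 8, 0.8
P_pi = rng.random((n, n)); P_pi /= P_pi.sum(axis=1, keepdims=True)
r_bar = rng.random(n)
d = rng.random(n) + 1.0; d /= d.sum()
Q = rng.random(n)                     # an arbitrary, fixed iterate Q_k

expected = np.diag(d) @ (Q - r_bar - gamma * P_pi @ Q)

N = 100_000
g = np.zeros(n)
for _ in range(N):
    i = rng.choice(n, p=d)            # (s_k, a_k) ~ d^pi
    j = rng.choice(n, p=P_pi[i])      # next pair ~ P^pi(. | s_k, a_k)
    g[i] += Q[i] - r_bar[i] - gamma * Q[j]
g /= N                                # empirical mean of sampled directions

print(np.abs(g - expected).max())     # small, shrinking like 1/sqrt(N)
```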

Then, instead of directly calculating \(W(Q_{k+1})-W(Q_{k})\) as was done in the noise-free case, we take the expectation given $Q_k$, writing \(T^\pi Q := \bar{r} + \gamma P^\pi Q\) for the Bellman operator so that the update reads \(Q_{k+1} = Q_k - \epsilon_k D^\pi (Q_k - T^\pi Q_k) + \epsilon_k n_k\).

\[\begin{aligned} \mathbb{E}\left[W(Q_{k+1})-W(Q_k)\vert Q_k \right] &= \nabla^\top W(Q_k) \mathbb{E}[Q_{k+1}-Q_k\vert Q_k] \\ &\quad \quad \: + \frac{1}{2}\mathbb{E}[\|Q_{k+1}-Q_k\|^2\vert Q_k]\\ &= \nabla^\top W(Q_k)\mathbb{E}[-\epsilon_k D^\pi(Q_k-T^\pi Q_k) + \epsilon_k n_k\vert Q_k]\\ &\quad \quad \: + \frac{1}{2}\epsilon_k^2\mathbb{E}[\|D^\pi(Q_k-T^\pi Q_k)-n_k\|^2\vert Q_k]\\ &\leq -\epsilon_k \nabla^\top W(Q_k) D^\pi(Q_k-T^\pi Q_k)\\ &\quad \quad \: + \epsilon_k^2(\|D^\pi(Q_k-T^\pi Q_k)\|^2+\mathbb{E}[\|n_k\|^2\vert Q_k]). \end{aligned}\]

In the inequality we used \(\mathbb{E}[n_k\vert Q_k] = 0\) for the first-order term, and \(\| x+y \|^2 \leq 2( \| x \|^2+ \|y\|^2)\) (a consequence of \(2x^\top y \leq \|x\|^2 + \|y\|^2\)) for the quadratic term.

Further, assume that the expected squared norm of $n_k$ is bounded; formally, \(\mathbb{E}[\|n_k\|^2] \leq \sigma^2\) for some $\sigma^2 < \infty$. We will come back to this point later and prove that this assumption indeed holds. Then,

\[\mathbb{E}\left[W(Q_{k+1})-W(Q_k)\vert Q_k \right] \leq -\epsilon_k \nabla^\top W(Q_k) D^\pi(Q_k-T^\pi Q_k) + \epsilon_k^2(\|D^\pi(Q_k-T^\pi Q_k)\|^2+\sigma^2).\]

This inequality is now similar to the one obtained in the noise-free analysis, the only difference being the $\sigma^2$ term. Following the same logic as in the noise-free analysis, we obtain

\[\begin{aligned} \mathbb{E}[W(Q_{k+1})-W(Q_k)\vert Q_k] \leq -2\epsilon_k d_{\min} \left((1-\gamma) - \epsilon_k(1+\gamma^2)\right)W(Q_k) + \epsilon_k^2 \sigma^2. \end{aligned}\]

Taking the total expectation and rewriting this in terms of \(\mathbb{E}[W(Q_{k+1})]\), we obtain

\[\mathbb{E}[W(Q_{k+1})] \leq \left(1-2\epsilon_k d_{\min} \left((1-\gamma) - \epsilon_k(1+\gamma^2)\right)\right)\mathbb{E}[W(Q_k)] + \epsilon_k^2 \sigma^2.\]

If $\epsilon_k \equiv \epsilon$ for all $k$, with $\epsilon$ small enough that the contraction factor \(\beta := 1-2\epsilon d_{\min}\left((1-\gamma) - \epsilon(1+\gamma^2)\right)\) lies in $[0,1)$ (in particular, $\epsilon < \frac{1-\gamma}{1+\gamma^2}$), then unrolling the recursion and letting $k\rightarrow \infty$ drives \(\mathbb{E}[W(Q_k)]\) toward the fixed point of \(w = \beta w + \epsilon^2 \sigma^2\):

\[\limsup_{k\rightarrow\infty} \mathbb{E}[W(Q_{k+1})] \leq \frac{\epsilon^2\sigma^2}{1-\beta} = \frac{\epsilon \sigma^2}{2d_{\min}\left((1-\gamma)-\epsilon(1+\gamma^2)\right)}.\]
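The following simulation sketch illustrates this noise-ball behavior. The toy MDP, the deterministic-reward model, and the step sizes are all assumptions made for illustration; the takeaway is that with a constant step size the mean squared error stops decreasing at a level roughly proportional to $\epsilon$.

```python
import numpy as np

# Sketch: with a constant step size, sampled TD settles into a "noise ball"
# whose size scales with eps. The MDP below is an illustrative assumption.
rng = np.random.default_rng(2)
n, gamma = 8, 0.8
P_pi = rng.random((n, n)); P_pi /= P_pi.sum(axis=1, keepdims=True)
r_bar = rng.random(n)                          # deterministic rewards (assumed)
d = rng.random(n) + 1.0; d /= d.sum()          # stationary distribution d^pi
Q_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, r_bar)

def run_td(eps, burn=100_000, measure=100_000):
    """Run sampled TD; return the time-averaged W(Q_k) after burn-in."""
    Q, W_sum = np.zeros(n), 0.0
    for t in range(burn + measure):
        i = rng.choice(n, p=d)                 # (s_k, a_k) ~ d^pi
        j = rng.choice(n, p=P_pi[i])           # next pair ~ P^pi
        Q[i] += eps * (r_bar[i] + gamma * Q[j] - Q[i])
        if t >= burn:
            W_sum += 0.5 * np.sum((Q - Q_pi) ** 2)
    return W_sum / measure

for eps in (0.2, 0.05):
    print(eps, run_td(eps))                    # smaller eps -> smaller W
```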

Now, we need to prove that \(\mathbb{E}[\|n_k\|^2] < \infty.\) To this end, rewrite the TD update entrywise as

\[Q_{k+1}(s,a) = (1-\epsilon_k) Q_k(s,a) +\epsilon_k (r_k + \gamma Q_k(s',a')).\]
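In code, this is the familiar tabular, per-sample rule: only the visited entry moves, as in the sketch below (the table shape and argument names are illustrative assumptions).

```python
import numpy as np

# The asynchronous, per-sample update from the display above: only the
# visited entry (s_k, a_k) changes. All names here are illustrative.
def td_update(Q, s, a, r, s_next, a_next, eps, gamma):
    target = r + gamma * Q[s_next, a_next]          # r_k + gamma * Q_k(s', a')
    Q[s, a] = (1.0 - eps) * Q[s, a] + eps * target  # convex combination
    return Q

Q = np.zeros((3, 2))                                # a small |S| x |A| table
Q = td_update(Q, s=0, a=1, r=1.0, s_next=2, a_next=0, eps=0.1, gamma=0.9)
print(Q)                                            # only Q[0, 1] moved
```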

From this update,

\[\begin{aligned} \vert Q_{k+1}(s,a)\vert \leq \left(1-\epsilon_k(1-\gamma)\right)\|Q_k\|_\infty + \epsilon_k r_{\max} \quad \text{ if } (s,a) = (s_k,a_k), \end{aligned}\]

where we used \(\vert r_k \vert \leq r_{\max}\) and \(\epsilon_k \leq 1\).

Since the update is asynchronous, all other entries remain unchanged:

\[\begin{aligned} \vert Q_{k+1}(s,a)\vert = \vert Q_k(s,a) \vert \leq \|Q_k\|_\infty \quad \forall (s,a) \neq (s_k,a_k). \end{aligned}\]

Compactly, we can write \(\begin{aligned} \|Q_{k+1}\|_\infty \leq \max \{\left(1-\epsilon_k(1-\gamma)\right)\|Q_k\|_\infty + \epsilon_k r_{\max}, \|Q_k\|_\infty \}. \end{aligned}\)

If $\epsilon_k\equiv \epsilon$,

\[\begin{aligned} \|Q_{1}\|_\infty &\leq \max \{\left(1-\epsilon(1-\gamma)\right)\|Q_0\|_\infty + \epsilon r_{\max}, \|Q_0\|_\infty \} \\ &\leq \max \{1-\epsilon(1-\gamma)+\epsilon,1\} \cdot \max \{\|Q_0\|_\infty,r_{\max}\}\\ &= (1+\epsilon \gamma) \max \{\|Q_0\|_\infty,r_{\max}\}. \end{aligned}\]

Denote \(M:= \max \{\|Q_0\|_\infty,r_{\max}\}\); then

\[\|Q_{1}\|_\infty \leq (1+\epsilon \gamma)M \leq (1+\gamma) M.\]

For $Q_2,$

\[\begin{aligned} \|Q_2\|_\infty &\leq \max\{(1-\epsilon(1-\gamma))(1+\gamma)M + \epsilon M, (1+\gamma)M\}\\ &\leq \max \{(1+\gamma+\gamma^2)M, (1+\gamma)M\}\\ & = (1+\gamma +\gamma ^2)M. \end{aligned}\]

Recursively,

\[\begin{aligned} \|Q_k\|_\infty &\leq (1+\gamma + \cdots + \gamma^k)M\\ &\leq \frac{M}{1-\gamma} =: Q_{\max}. \end{aligned}\]
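A quick numerical sanity check of this uniform bound: in the sketch below (the MDP size, reward range, step size, and sampling scheme are all illustrative assumptions), \(\|Q_k\|_\infty\) never exceeds \(M/(1-\gamma)\) along the run, matching the induction above.

```python
import numpy as np

# Numerical check of ||Q_k||_inf <= M / (1 - gamma) along a sampled run.
# The MDP size, reward range, and step size are illustrative assumptions.
rng = np.random.default_rng(3)
nS, nA, gamma, eps, r_max = 4, 2, 0.8, 0.3, 1.0
Q = rng.uniform(-1.0, 1.0, size=(nS, nA))
M = max(np.abs(Q).max(), r_max)
Q_max = M / (1.0 - gamma)

worst = 0.0
s, a = 0, 0
for _ in range(50_000):
    r = rng.uniform(-r_max, r_max)                   # |r_k| <= r_max
    s2, a2 = rng.integers(nS), rng.integers(nA)      # arbitrary next pair
    Q[s, a] = (1 - eps) * Q[s, a] + eps * (r + gamma * Q[s2, a2])
    worst = max(worst, np.abs(Q).max())
    s, a = s2, a2

print(worst, "<=", Q_max)                            # the bound always holds
```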

Now recall that

\[n_k = (D_k Q_k-D^\pi Q_k)-(D_kr_k-D^\pi\bar{r})-\gamma(E_kQ_k-D^\pi P^\pi Q_k).\]

Here, remark that

  1. $Q_k$ is bounded by $Q_{\max}$ in the $\ell_\infty$ norm.
  2. $D_k$ and $E_k$ are matrices whose entries are either $0$ or $1$.
  3. $D^\pi$ and $P^\pi$ are distribution matrices, so each entry is upper bounded by $1$.
  4. $r_k$ and $\bar{r}$ are bounded by $r_{\max}$.

Consequently, every entry of $n_k$ is bounded uniformly in $k$, so \(\|n_k\|_\infty\), and hence \(\|n_k\| \leq \sqrt{\vert S\vert \vert A\vert}\, \|n_k\|_\infty\), is bounded. This gives a finite $\sigma^2$ with \(\mathbb{E}[\|n_k\|^2] \leq \sigma^2\), as assumed above.
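To close the loop, here is a sketch that forms $n_k$ directly from its definition on the toy MDP used earlier (the MDP data and the deterministic-reward model, where $D_k r_k$ reduces to $D_k \bar{r}$, are assumptions) and confirms that its norm stays bounded across samples.

```python
import numpy as np

# Sketch: build n_k directly from its definition and confirm its norm stays
# bounded. D_k and E_k are 0/1 selector matrices for the sampled pair and
# sampled transition; the MDP and deterministic rewards are assumptions.
rng = np.random.default_rng(4)
n, gamma = 8, 0.8
P_pi = rng.random((n, n)); P_pi /= P_pi.sum(axis=1, keepdims=True)
r_bar = rng.random(n)
d = rng.random(n) + 1.0; d /= d.sum()
D_pi = np.diag(d)
Q = rng.random(n)                        # any iterate within the Q_max bound

worst = 0.0
for _ in range(10_000):
    i = rng.choice(n, p=d)               # sampled pair (s_k, a_k)
    j = rng.choice(n, p=P_pi[i])         # sampled next pair (s', a')
    D_k = np.zeros((n, n)); D_k[i, i] = 1.0
    E_k = np.zeros((n, n)); E_k[i, j] = 1.0
    n_k = ((D_k @ Q - D_pi @ Q)
           - (D_k @ r_bar - D_pi @ r_bar)
           - gamma * (E_k @ Q - D_pi @ P_pi @ Q))
    worst = max(worst, np.linalg.norm(n_k))

print(worst)                             # uniformly bounded over all samples
```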