TD learning: deeper analysis (2)

In the previous post we proved the convergence of TD learning under the noise-free assumption. As a brief recap, the TD learning algorithm can be written in terms of $D^\pi$ and the resulting noise $n_k$ as

\[Q_{k+1} = Q_k -\epsilon_k D^\pi(Q_k-\bar{r}-\gamma P^\pi Q_k) + \epsilon_k n_k.\]

In the noise-free analysis we assumed \(n_k\equiv 0\), defined a Lyapunov function \(W(Q_k) = \frac{1}{2} \|Q_{k}-Q^\pi\|^2\), and proved that \(W(Q_{k+1})\leq \beta W(Q_k) \leq \cdots \leq \beta^{k+1} W(Q_0)\) for some \(\beta \in [0,1).\)
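As a concrete illustration of this recap, here is a minimal sketch of the noise-free iteration (\(n_k \equiv 0\)) on a small randomly generated MDP. The problem sizes, the matrices `P_pi` and `D_pi`, and the step size are illustrative assumptions, not taken from the post; the point is only that the matrix-form recursion drives $Q_k$ to $Q^\pi$.

```python
import numpy as np

# Minimal sketch of the matrix-form TD recursion with n_k = 0 on a tiny
# random MDP. All problem data below are illustrative assumptions.
rng = np.random.default_rng(0)
n, gamma, eps = 8, 0.8, 0.1            # n = |S||A| state-action pairs

P_pi = rng.random((n, n))              # state-action transition matrix P^pi
P_pi /= P_pi.sum(axis=1, keepdims=True)
r_bar = rng.random(n)                  # expected rewards
d = rng.random(n) + 1.0; d /= d.sum()  # stationary distribution d^pi
D_pi = np.diag(d)

# The fixed point Q^pi solves Q = r_bar + gamma * P_pi @ Q.
Q_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, r_bar)

Q = np.zeros(n)
for _ in range(20000):
    Q = Q - eps * D_pi @ (Q - r_bar - gamma * P_pi @ Q)

print(np.abs(Q - Q_pi).max())          # -> essentially zero
```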

Independent noise analysis

In this post, we generalize the noise-free analysis a little further. We assume that the data tuples used by the TD learning algorithm are drawn i.i.d. from the stationary distribution $d^\pi$. Formally,

\[(s_k,a_k)\sim d^\pi, \quad s_{k+1}\sim P(\cdot \vert s_k,a_k).\]

Consequently, the expected value of the noise becomes \(\mathbb{E}[n_k\vert Q_k] = 0.\)
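This zero-mean property is easy to check numerically. In the sketch below (the tiny MDP and the deterministic-reward model are assumptions made for illustration), the empirical average of the sampled TD directions approaches the expected direction $D^\pi(Q - \bar{r} - \gamma P^\pi Q)$, so the empirical mean of $n_k$ shrinks toward zero at the usual Monte Carlo rate.

```python
import numpy as np

# Monte Carlo check that the sampled TD direction is unbiased, i.e. that
# E[n_k | Q_k] = 0 under i.i.d. sampling. The MDP is an assumption; rewards
# are taken deterministic so the only noise comes from sampling.
rng = np.random.default_rng(1)
n, gamma = 8, 0.8
P_pi = rng.random((n, n)); P_pi /= P_pi.sum(axis=1, keepdims=True)
r_bar = rng.random(n)
d = rng.random(n) + 1.0; d /= d.sum()
Q = rng.random(n)                     # an arbitrary, fixed iterate Q_k

expected = np.diag(d) @ (Q - r_bar - gamma * P_pi @ Q)

N = 100_000
g = np.zeros(n)
for _ in range(N):
    i = rng.choice(n, p=d)            # (s_k, a_k) ~ d^pi
    j = rng.choice(n, p=P_pi[i])      # next pair ~ P^pi(. | s_k, a_k)
    g[i] += Q[i] - r_bar[i] - gamma * Q[j]
g /= N                                # empirical mean of sampled directions

print(np.abs(g - expected).max())     # small, shrinking like 1/sqrt(N)
```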

Then, instead of directly calculating \(W(Q_{k+1})-W(Q_{k})\) as was done in the noise-free case, we take the expectation given $Q_k$, writing \(T^\pi Q := \bar{r} + \gamma P^\pi Q\) for the Bellman operator so that the update reads \(Q_{k+1} = Q_k - \epsilon_k D^\pi (Q_k - T^\pi Q_k) + \epsilon_k n_k\).

\[\begin{aligned} \mathbb{E}\left[W(Q_{k+1})-W(Q_k)\vert Q_k \right] &= \nabla^\top W(Q_k) \mathbb{E}[Q_{k+1}-Q_k\vert Q_k] \\ &\quad \quad \: + \frac{1}{2}\mathbb{E}[\|Q_{k+1}-Q_k\|^2\vert Q_k]\\ &= \nabla^\top W(Q_k)\mathbb{E}[-\epsilon_k D^\pi(Q_k-T^\pi Q_k) + \epsilon_k n_k\vert Q_k]\\ &\quad \quad \: + \frac{1}{2}\epsilon_k^2\mathbb{E}[\|D^\pi(Q_k-T^\pi Q_k)-n_k\|^2\vert Q_k]\\ &\leq -\epsilon_k \nabla^\top W(Q_k) D^\pi(Q_k-T^\pi Q_k)\\ &\quad \quad \: + \epsilon_k^2(\|D^\pi(Q_k-T^\pi Q_k)\|^2+\mathbb{E}[\|n_k\|^2\vert Q_k]). \end{aligned}\]

In the inequality we used \(\mathbb{E}[n_k\vert Q_k] = 0\) for the first-order term, and \(\| x+y \|^2 \leq 2( \| x \|^2+ \|y\|^2)\) (a consequence of \(2x^\top y \leq \|x\|^2 + \|y\|^2\)) for the quadratic term.

Further, assume that the expected squared norm of $n_k$ is bounded; formally, \(\mathbb{E}[\|n_k\|^2] \leq \sigma^2\) for some $\sigma^2 < \infty$. We will come back to this point later and prove that this assumption indeed holds. Then,

\[\mathbb{E}\left[W(Q_{k+1})-W(Q_k)\vert Q_k \right] \leq -\epsilon_k \nabla^\top W(Q_k) D^\pi(Q_k-T^\pi Q_k) + \epsilon_k^2(\|D^\pi(Q_k-T^\pi Q_k)\|^2+\sigma^2).\]

This inequality is now similar to the one obtained in the noise-free analysis, the only difference being the $\sigma^2$ term. Following the same logic as in the noise-free analysis, we obtain

\[\begin{aligned} \mathbb{E}[W(Q_{k+1})-W(Q_k)\vert Q_k] \leq -2\epsilon_k d_{\min} \left((1-\gamma) - \epsilon_k(1+\gamma^2)\right)W(Q_k) + \epsilon_k^2 \sigma^2. \end{aligned}\]

Taking the total expectation and rewriting this in terms of \(\mathbb{E}[W(Q_{k+1})]\), we obtain

\[\mathbb{E}[W(Q_{k+1})] \leq \left(1-2\epsilon_k d_{\min} \left((1-\gamma) - \epsilon_k(1+\gamma^2)\right)\right)\mathbb{E}[W(Q_k)] + \epsilon_k^2 \sigma^2.\]

If $\epsilon_k \equiv \epsilon$ for all $k$, with $\epsilon$ small enough that the contraction factor \(\beta := 1-2\epsilon d_{\min}\left((1-\gamma) - \epsilon(1+\gamma^2)\right)\) lies in $[0,1)$ (in particular, $\epsilon < \frac{1-\gamma}{1+\gamma^2}$), then unrolling the recursion and letting $k\rightarrow \infty$ drives \(\mathbb{E}[W(Q_k)]\) toward the fixed point of \(w = \beta w + \epsilon^2 \sigma^2\):

\[\limsup_{k\rightarrow\infty} \mathbb{E}[W(Q_{k+1})] \leq \frac{\epsilon^2\sigma^2}{1-\beta} = \frac{\epsilon \sigma^2}{2d_{\min}\left((1-\gamma)-\epsilon(1+\gamma^2)\right)}.\]
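The following simulation sketch illustrates this noise-ball behavior. The toy MDP, the deterministic-reward model, and the step sizes are all assumptions made for illustration; the takeaway is that with a constant step size the mean squared error stops decreasing at a level roughly proportional to $\epsilon$.

```python
import numpy as np

# Sketch: with a constant step size, sampled TD settles into a "noise ball"
# whose size scales with eps. The MDP below is an illustrative assumption.
rng = np.random.default_rng(2)
n, gamma = 8, 0.8
P_pi = rng.random((n, n)); P_pi /= P_pi.sum(axis=1, keepdims=True)
r_bar = rng.random(n)                          # deterministic rewards (assumed)
d = rng.random(n) + 1.0; d /= d.sum()          # stationary distribution d^pi
Q_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, r_bar)

def run_td(eps, burn=100_000, measure=100_000):
    """Run sampled TD; return the time-averaged W(Q_k) after burn-in."""
    Q, W_sum = np.zeros(n), 0.0
    for t in range(burn + measure):
        i = rng.choice(n, p=d)                 # (s_k, a_k) ~ d^pi
        j = rng.choice(n, p=P_pi[i])           # next pair ~ P^pi
        Q[i] += eps * (r_bar[i] + gamma * Q[j] - Q[i])
        if t >= burn:
            W_sum += 0.5 * np.sum((Q - Q_pi) ** 2)
    return W_sum / measure

for eps in (0.2, 0.05):
    print(eps, run_td(eps))                    # smaller eps -> smaller W
```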

Now, we need to prove that \(\mathbb{E}[\|n_k\|^2] < \infty.\) To this end, rewrite the TD update entrywise as

\[Q_{k+1}(s,a) = (1-\epsilon_k) Q_k(s,a) +\epsilon_k (r_k + \gamma Q_k(s',a')).\]
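In code, this is the familiar tabular, per-sample rule: only the visited entry moves, as in the sketch below (the table shape and argument names are illustrative assumptions).

```python
import numpy as np

# The asynchronous, per-sample update from the display above: only the
# visited entry (s_k, a_k) changes. All names here are illustrative.
def td_update(Q, s, a, r, s_next, a_next, eps, gamma):
    target = r + gamma * Q[s_next, a_next]          # r_k + gamma * Q_k(s', a')
    Q[s, a] = (1.0 - eps) * Q[s, a] + eps * target  # convex combination
    return Q

Q = np.zeros((3, 2))                                # a small |S| x |A| table
Q = td_update(Q, s=0, a=1, r=1.0, s_next=2, a_next=0, eps=0.1, gamma=0.9)
print(Q)                                            # only Q[0, 1] moved
```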

From this update,

\[\begin{aligned} \vert Q_{k+1}(s,a)\vert \leq \left(1-\epsilon_k(1-\gamma)\right)\|Q_k\|_\infty + \epsilon_k r_{\max} \quad \text{ if } (s,a) = (s_k,a_k), \end{aligned}\]

where we used \(\vert r_k \vert \leq r_{\max}\) and \(\epsilon_k \leq 1\).

Since the update is asynchronous, all other entries remain unchanged:

\[\begin{aligned} \vert Q_{k+1}(s,a)\vert = \vert Q_k(s,a) \vert \leq \|Q_k\|_\infty \quad \forall (s,a) \neq (s_k,a_k). \end{aligned}\]

Compactly, we can write \(\begin{aligned} \|Q_{k+1}\|_\infty \leq \max \{\left(1-\epsilon_k(1-\gamma)\right)\|Q_k\|_\infty + \epsilon_k r_{\max}, \|Q_k\|_\infty \}. \end{aligned}\)

If $\epsilon_k\equiv \epsilon$,

\[\begin{aligned} \|Q_{1}\|_\infty &\leq \max \{\left(1-\epsilon(1-\gamma)\right)\|Q_0\|_\infty + \epsilon r_{\max}, \|Q_0\|_\infty \} \\ &\leq \max \{1-\epsilon(1-\gamma)+\epsilon,1\} \cdot \max \{\|Q_0\|_\infty,r_{\max}\}\\ &= (1+\epsilon \gamma) \max \{\|Q_0\|_\infty,r_{\max}\}. \end{aligned}\]

Denote \(M:= \max \{\|Q_0\|_\infty,r_{\max}\}\); then

\[\|Q_{1}\|_\infty \leq (1+\epsilon \gamma)M \leq (1+\gamma) M.\]

For $Q_2,$

\[\begin{aligned} \|Q_2\|_\infty &\leq \max\{(1-\epsilon(1-\gamma))(1+\gamma)M + \epsilon M, (1+\gamma)M\}\\ &\leq \max \{(1+\gamma+\gamma^2)M, (1+\gamma)M\}\\ & = (1+\gamma +\gamma ^2)M. \end{aligned}\]

Recursively,

\[\begin{aligned} \|Q_k\|_\infty &\leq (1+\gamma + \cdots + \gamma^k)M\\ &\leq \frac{M}{1-\gamma} =: Q_{\max}. \end{aligned}\]
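A quick numerical sanity check of this uniform bound: in the sketch below (the MDP size, reward range, step size, and sampling scheme are all illustrative assumptions), \(\|Q_k\|_\infty\) never exceeds \(M/(1-\gamma)\) along the run, matching the induction above.

```python
import numpy as np

# Numerical check of ||Q_k||_inf <= M / (1 - gamma) along a sampled run.
# The MDP size, reward range, and step size are illustrative assumptions.
rng = np.random.default_rng(3)
nS, nA, gamma, eps, r_max = 4, 2, 0.8, 0.3, 1.0
Q = rng.uniform(-1.0, 1.0, size=(nS, nA))
M = max(np.abs(Q).max(), r_max)
Q_max = M / (1.0 - gamma)

worst = 0.0
s, a = 0, 0
for _ in range(50_000):
    r = rng.uniform(-r_max, r_max)                   # |r_k| <= r_max
    s2, a2 = rng.integers(nS), rng.integers(nA)      # arbitrary next pair
    Q[s, a] = (1 - eps) * Q[s, a] + eps * (r + gamma * Q[s2, a2])
    worst = max(worst, np.abs(Q).max())
    s, a = s2, a2

print(worst, "<=", Q_max)                            # the bound always holds
```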

Now recall that

\[n_k = (D_k Q_k-D^\pi Q_k)-(D_kr_k-D^\pi\bar{r})-\gamma(E_kQ_k-D^\pi P^\pi Q_k).\]

Here, remark that

  1. $Q_k$ is bounded by $Q_{\max}$ in the $\ell_\infty$ norm.
  2. $D_k$ and $E_k$ are matrices whose entries are either $0$ or $1$.
  3. $D^\pi$ and $P^\pi$ are distribution matrices, so each entry is upper bounded by $1$.
  4. $r_k$ and $\bar{r}$ are bounded by $r_{\max}$.

Consequently, every entry of $n_k$ is bounded uniformly in $k$, so \(\|n_k\|_\infty\), and hence \(\|n_k\| \leq \sqrt{\vert S\vert \vert A\vert}\, \|n_k\|_\infty\), is bounded. This gives a finite $\sigma^2$ with \(\mathbb{E}[\|n_k\|^2] \leq \sigma^2\), as assumed above.
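To close the loop, here is a sketch that forms $n_k$ directly from its definition on the toy MDP used earlier (the MDP data and the deterministic-reward model, where $D_k r_k$ reduces to $D_k \bar{r}$, are assumptions) and confirms that its norm stays bounded across samples.

```python
import numpy as np

# Sketch: build n_k directly from its definition and confirm its norm stays
# bounded. D_k and E_k are 0/1 selector matrices for the sampled pair and
# sampled transition; the MDP and deterministic rewards are assumptions.
rng = np.random.default_rng(4)
n, gamma = 8, 0.8
P_pi = rng.random((n, n)); P_pi /= P_pi.sum(axis=1, keepdims=True)
r_bar = rng.random(n)
d = rng.random(n) + 1.0; d /= d.sum()
D_pi = np.diag(d)
Q = rng.random(n)                        # any iterate within the Q_max bound

worst = 0.0
for _ in range(10_000):
    i = rng.choice(n, p=d)               # sampled pair (s_k, a_k)
    j = rng.choice(n, p=P_pi[i])         # sampled next pair (s', a')
    D_k = np.zeros((n, n)); D_k[i, i] = 1.0
    E_k = np.zeros((n, n)); E_k[i, j] = 1.0
    n_k = ((D_k @ Q - D_pi @ Q)
           - (D_k @ r_bar - D_pi @ r_bar)
           - gamma * (E_k @ Q - D_pi @ P_pi @ Q))
    worst = max(worst, np.linalg.norm(n_k))

print(worst)                             # uniformly bounded over all samples
```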