TD learning with Linear Function Approximation: i.i.d. sampling
Just like the analysis of the tabular TD learning algorithm in this post, here we prove the convergence of TD learning with LFA for a suitably chosen step size $\epsilon_k$, under the i.i.d. data sampling assumption. Formally,
\[(s_k,a_k)\sim d^\pi, \quad s_k'\sim P(\cdot\vert s_k,a_k), \quad a_k'\sim \pi(\cdot\vert s_k'),\]where \(d^\pi\) is the stationary distribution over state-action pairs under policy $\pi$. We further assume that the Markov chain induced by $\pi$ is irreducible and aperiodic, so that $d^\pi$ exists and is unique.
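To make the sampling model concrete, here is a minimal NumPy sketch of drawing such i.i.d. transitions on a toy MDP; the transition kernel, policy, and dimensions below are arbitrary placeholders, not quantities from this post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP (placeholders): 3 states, 2 actions.
n_states, n_actions = 3, 2
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # pi[s, a]

# State-action transition matrix P^pi[(s,a), (s',a')] = P(s'|s,a) * pi(a'|s').
P_pi = np.einsum('sak,kl->sakl', P, pi).reshape(n_states * n_actions, -1)

# Stationary distribution d^pi over state-action pairs (left eigenvector of P^pi).
evals, evecs = np.linalg.eig(P_pi.T)
d_pi = np.abs(np.real(evecs[:, np.argmin(np.abs(evals - 1.0))]))
d_pi /= d_pi.sum()

def sample_iid_transition():
    """One i.i.d. sample: (s,a) ~ d^pi, s' ~ P(.|s,a), a' ~ pi(.|s')."""
    sa = rng.choice(n_states * n_actions, p=d_pi)
    s, a = divmod(sa, n_actions)
    s_next = rng.choice(n_states, p=P[s, a])
    a_next = rng.choice(n_actions, p=pi[s_next])
    return s, a, s_next, a_next

print(sample_iid_transition())
```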
Independent noise analysis
We start from the TD learning with LFA update that we derived in this post:
\[\theta_{k+1} = \theta_k-\epsilon_k\left(\Phi^\top D^\pi \Phi \theta_k - \Phi^\top D^\pi r-\gamma \Phi^\top D^\pi P^\pi\Phi \theta_k\right) - \epsilon_k n_k.\]Unlike in the noise-free analysis, the data are now sampled from the stationary distribution $d^\pi$, so the noise term $n_k$ vanishes only in expectation and cannot simply be dropped from the calculation. As such, instead of directly bounding the Lyapunov drift $W(\theta_{k+1})-W(\theta_k)$ as we did in the previous post, we bound its conditional expectation.
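In sample form, this is just the per-transition TD(0) update with linear features. The sketch below is a hypothetical implementation (the feature matrix `Phi`, rewards `r`, and dimensions are made-up placeholders): taking the conditional expectation of this stochastic step recovers the deterministic part of the update above, and the deviation from that mean is exactly $n_k$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dimensions: N state-action pairs, d features (placeholders).
N, d, gamma, eps = 6, 3, 0.9, 0.1
Phi = rng.normal(size=(N, d))   # feature matrix, rows are phi(s,a)^T
r = rng.normal(size=N)          # reward vector
theta = np.zeros(d)

def td_lfa_step(theta, sa, sa_next, r_sa, eps):
    """One stochastic TD step with linear features.

    Its conditional mean over (sa) ~ d^pi and (sa_next) ~ P^pi(.|sa) equals
    theta - eps * (Phi^T D^pi Phi theta - Phi^T D^pi r - gamma Phi^T D^pi P^pi Phi theta),
    and the deviation from that mean is the noise term n_k.
    """
    # phi(s,a)^T theta - (r + gamma * phi(s',a')^T theta), i.e. the negative TD error.
    residual = Phi[sa] @ theta - r_sa - gamma * Phi[sa_next] @ theta
    return theta - eps * residual * Phi[sa]

# Example: one update using a (dummy) sampled transition.
theta = td_lfa_step(theta, sa=2, sa_next=5, r_sa=r[2], eps=eps)
print(theta)
```

With this in mind, the conditional expectation of the Lyapunov drift expands as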
\[\begin{aligned} \mathbb{E}[W(\theta_{k+1})- W(\theta_k)\vert \theta_k] = (\theta_k-\theta^\star)^\top \mathbb{E}[(\theta_{k+1}-\theta_k)\vert \theta_k] + \mathbb{E}[\frac{1}{2}\|\theta_{k+1}-\theta_k\|_2^2\vert \theta_k]. \end{aligned}\]Using \(\|a+b\|_2^2 \leq 2\|a\|_2^2 + 2\|b\|_2^2\), we have that
\[\begin{aligned} \mathbb{E}[\frac{1}{2}\|\theta_{k+1}-\theta_k\|_2^2\vert \theta_k] &\leq \epsilon_k^2 \|\Phi^\top D^\pi (\Phi \theta_k - T^\pi(\Phi \theta_k))\|^2_2 + \epsilon_k^2\, \mathbb{E}[\|n_k\|_2^2\vert \theta_k]\\ &\leq \epsilon_k^2 \|\Phi^\top D^\pi (\Phi \theta_k - T^\pi(\Phi \theta_k))\|^2_2 + \epsilon_k^2 \sigma^2, \end{aligned}\]where the second inequality uses the assumption that the conditional second moment of the noise is bounded, $\mathbb{E}[\|n_k\|_2^2\vert\theta_k]\leq\sigma^2$. This indeed holds, because
\[n_k := \Phi^\top (D_k-D^\pi)\Phi \theta_k - \Phi^\top(D_kr-D^\pi r) - \gamma \Phi^\top(E_k-D^\pi P^\pi)\Phi \theta_k\]and $D_k$ and $E_k$ are indicator matrices whose entries are all $0$ except for a single $1$ (at the sampled state-action pair and at the sampled transition, respectively), whereas the entries of $D^\pi$ and $P^\pi$ are upper bounded by $1$ since each entry represents a probability. Every term in $n_k$ is therefore bounded (assuming bounded rewards and iterates), and so is its conditional second moment.
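Concretely, reading $D_k$ as the diagonal indicator of the sampled pair $(s_k,a_k)$ and $E_k$ as the indicator of the sampled transition, we have $\mathbb{E}[D_k]=D^\pi$ and $\mathbb{E}[E_k]=D^\pi P^\pi$, which is exactly why $\mathbb{E}[n_k\vert\theta_k]=0$. The following sketch checks these two identities by Monte Carlo on a toy chain (all numerical values are placeholders, not from the post).

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical toy chain over N state-action pairs (placeholders):
# d_pi is the stationary distribution, P_pi the state-action transition matrix.
N = 4
P_pi = rng.dirichlet(np.ones(N), size=N)            # rows sum to 1
evals, evecs = np.linalg.eig(P_pi.T)
d_pi = np.abs(np.real(evecs[:, np.argmin(np.abs(evals - 1.0))]))
d_pi /= d_pi.sum()
D_pi = np.diag(d_pi)

# Monte Carlo check: E[D_k] = D^pi and E[E_k] = D^pi P^pi.
D_sum, E_sum = np.zeros((N, N)), np.zeros((N, N))
n_samples = 100_000
for _ in range(n_samples):
    sa = rng.choice(N, p=d_pi)                       # (s_k, a_k) ~ d^pi
    sa_next = rng.choice(N, p=P_pi[sa])              # (s_k', a_k') ~ P^pi(. | s_k, a_k)
    D_k = np.zeros((N, N)); D_k[sa, sa] = 1.0        # one-hot diagonal indicator
    E_k = np.zeros((N, N)); E_k[sa, sa_next] = 1.0   # one-hot transition indicator
    D_sum += D_k
    E_sum += E_k

print(np.max(np.abs(D_sum / n_samples - D_pi)))      # both should be close to 0
print(np.max(np.abs(E_sum / n_samples - D_pi @ P_pi)))
```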
Now we have
\[\begin{aligned} \mathbb{E}[W(\theta_{k+1})- W(\theta_k)\vert \theta_k] &\leq -\epsilon_k (\theta_k-\theta^\star)^\top\Phi^\top D^\pi (\Phi\theta_k - T^\pi(\Phi \theta_k)) \\ &\quad \quad + \epsilon_k^2\|\Phi^\top D^\pi(\Phi \theta_k-T^\pi(\Phi\theta_k))\|^2_2 + \epsilon_k^2\sigma^2, \end{aligned}\]where the first term uses $\mathbb{E}[n_k\vert\theta_k]=0$. Bounding the first two terms as in the noise-free analysis and taking total expectation, we obtain
\[\begin{aligned} \mathbb{E}[W(\theta_{k+1})] &\leq (1-\epsilon_k(1-\gamma)d_{\min}\lambda_{\min}+ \epsilon_k^2\lambda_{\max}d_{\min}(1+\gamma^2))\mathbb{E}[W(\theta_k)] + \epsilon_k^2\sigma^2. \end{aligned}\]Treating the step size as constant and setting $\mathbb{E}[W(\theta_{k+1})]=\mathbb{E}[W(\theta_k)]$ at the fixed point of this recursion gives, asymptotically,
\[\begin{aligned} \mathbb{E}[W(\theta_k)] &\leq \frac{\epsilon_k \sigma^2}{(1-\gamma)d_{\min}\lambda_{\min} - \epsilon_k \lambda_{\max}d_{\min}(1+\gamma^2)} . \end{aligned}\]With the right choice of \(\epsilon_k\), we can drive this bound to zero. Take, for instance, \(\epsilon_k = \frac{1}{k+1}\). Then, we have
\[\begin{aligned} \mathbb{E}[W(\theta_k)] &\leq \frac{\sigma^2}{(k+1)(1-\gamma)d_{\min}\lambda_{\min} - \lambda_{\max}d_{\min}(1+\gamma^2)}, \end{aligned}\]which is valid once the denominator is positive and converges to $0$ at rate $O(1/k)$ as $k\rightarrow \infty$. This completes the proof of convergence of TD learning with LFA under the i.i.d. sampling assumption.
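As a quick numerical sanity check, the sketch below iterates the scalar recursion above with $\epsilon_k = 1/(k+1)$; the constants standing in for $(1-\gamma)d_{\min}\lambda_{\min}$, $\lambda_{\max}d_{\min}(1+\gamma^2)$, and $\sigma^2$ are arbitrary placeholders chosen purely for illustration.

```python
# Placeholder constants (not from the post):
#   a stands for (1 - gamma) * d_min * lambda_min,
#   b stands for lambda_max * d_min * (1 + gamma**2).
a, b, sigma2 = 2.0, 1.0, 1.0

W = 1.0                     # stands in for E[W(theta_0)]
K = 100_000
for k in range(K):
    eps = 1.0 / (k + 1)     # step size eps_k = 1/(k+1)
    W = (1 - eps * a + eps**2 * b) * W + eps**2 * sigma2

# For these constants W_K is on the order of 1/K, so K * W_K stays bounded.
print(W, K * W)
```

The printed product $K \cdot W_K$ stays bounded as $K$ grows, consistent with the $O(1/k)$ rate above.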