Value Iteration and Policy Iteration: why it works
Value iteration and policy iteration are two algorithmic frameworks for solving reinforcement learning problems. Both frameworks involve iteratively improving the estimates of the value function (or the Q function) in order to find the optimal policy, which is the policy that maximizes the expected return.
Value iteration is an algorithm for finding the optimal value function, which is the value function that maximizes the expected return for all states. It starts with an initial estimate of the value function and iteratively improves it using the Bellman equation until convergence. The optimal policy can then be derived from the optimal value function.
On the other hand, policy iteration is an algorithm for finding the optimal policy directly. It starts with an initial policy and iteratively improves it until convergence, alternating between evaluating the value function of the current policy and greedily improving the policy.
Value iteration
We begin with the algorithm.
1. Initialize $V_0(s)$ for all $s$ (e.g. $V_0(s) = 0 \quad \forall s\in \mathcal{S}$).
2. $V_{k+1}(s) = TV_k(s) := \max_a \{\bar{r}(s,a)+\gamma\,\mathbb{E}[V_k(s')\vert s_0 = s, a_0 = a]\}$
3. Repeat step 2 until $V$ converges.
4. Obtain the optimal policy $\pi^\star(s) = \arg \max_a \{\bar{r}(s,a)+ \gamma\, \mathbb{E}[V^\star(s')\vert s_0 = s, a_0 = a]\}$.
Note that the policy used here is simply the greedy policy with respect to the current value function at each time step. Writing $\pi_k$ for this greedy policy, the update can be expressed compactly as $TV_k = T^{\pi_k}V_k$.
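To make the procedure concrete, here is a minimal sketch of value iteration for a finite, tabular MDP. It assumes, purely for illustration, that the dynamics are given as NumPy arrays `P[s, a, s']` (transition probabilities) and `R[s, a]` (expected rewards $\bar{r}(s,a)$); the function name `value_iteration` and the tolerance parameter are likewise illustrative choices, not something defined in this post.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Sketch of value iteration on a tabular MDP.

    P : array of shape (S, A, S'), P[s, a, s'] = transition probability
    R : array of shape (S, A),     R[s, a]     = expected reward r(s, a)
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)                    # step 1: V_0(s) = 0 for all s
    while True:
        # step 2: Bellman optimality update (TV)(s) = max_a [ r(s,a) + gamma * E[V(s') | s, a] ]
        Q = R + gamma * (P @ V)               # Q[s, a], shape (S, A)
        V_next = Q.max(axis=1)
        if np.max(np.abs(V_next - V)) < tol:  # step 3: stop when the sup-norm change is small
            V = V_next
            break
        V = V_next
    # step 4: greedy policy with respect to the (approximately) optimal value function
    policy = (R + gamma * (P @ V)).argmax(axis=1)
    return V, policy
```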
In the previous post it was proven that the Bellman operator $T^\pi$ is a contraction mapping, so the convergence of the value function is guaranteed for any fixed policy $\pi$. Leveraging this fact, we can show convergence and even derive the convergence rate of the value iteration algorithm.
Theorem (Rate of convergence).
\[\|V_k - V^\star\|_\infty \leq \frac{\gamma^k}{1-\gamma}\|V_1-V_0\|_\infty\]
Proof. Since $T$ is a contraction mapping,
\[\begin{aligned} \|V_k-V_{k-1}\| &= \|TV_{k-1} - TV_{k-2}\|\\ &\leq \gamma \|V_{k-1} - V_{k-2}\|\\ & \quad \quad \quad \vdots\\ &\leq \gamma^{k-1}\|V_1-V_0\| \end{aligned}\]where all norms are the infinity norm.
Using this fact and the triangle inequality,
\[\begin{aligned} \|V_k-V^\star\| &= \|(V_k-V_{k+1}) + (V_{k+1} - V_{k+2}) + (V_{k+2} - V_{k+3}) + \cdots \|\\ &\leq \|V_k-V_{k+1}\| + \|V_{k+1}- V_{k+2}\| + \cdots\\ &\leq \gamma^k\|V_1-V_0\| + \gamma^{k+1}\|V_1-V_0\|+\cdots\\ &= \lim_{n\rightarrow \infty} \|V_1-V_0\|\gamma^k(1+\gamma+\cdots+\gamma^n)\\ &= \|V_1-V_0\| \frac{\gamma^k}{1-\gamma} \end{aligned}\]where the first equality holds because $V_j \rightarrow V^\star$, and again all norms are the infinity norm $\blacksquare$
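As a quick illustration of how this bound is used (with purely illustrative numbers, not taken from the post), suppose $\gamma = 0.9$ and $\|V_1-V_0\|_\infty = 1$. To guarantee $\|V_k - V^\star\|_\infty \leq 10^{-2}$ it suffices that

\[\frac{0.9^k}{1-0.9} \leq 10^{-2} \iff 0.9^k \leq 10^{-3} \iff k \geq \frac{\log 10^{-3}}{\log 0.9} \approx 66,\]

so roughly 66 iterations are enough for that accuracy.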
One might notice that this proof relies on the Bellman operator $T$ being a contraction mapping. However, that fact was proven in the previous post under the assumption that a policy $\pi$ is given. Read this way, the convergence of the value function $V$ to $V^\star$ would only be guaranteed if we already knew the optimal policy $\pi^\star$, which is hardly the case in practice. Therefore, we need a more practical evaluation of the algorithm's convergence. Below, we evaluate the convergence with respect to the greedy policy at each time step.
Theorem (VI Rate of convergence w.r.t. policy at each time step). \(\|V^{\pi_k}-V^\star\|_\infty \leq \frac{2\gamma^k}{1-\gamma} \|V_1-V_0\|_\infty,\) which states how close the value function of the greedy policy at time step $k$ is to the value function of the optimal policy.
Proof.
All the norms are \(\|\cdot\|_\infty\).
\[\begin{aligned} \|V^{\pi_k}-V^\star\| &\leq \|V^{\pi_k}-V_k\| + \|V_k - V^\star\|\\ &\leq \|V^{\pi_k}-V_k\| + \frac{\gamma^k}{1-\gamma}\|V_1-V_0\|, \end{aligned}\]where the second term is bounded using the previous theorem. For the first term, recall that $V^{\pi_k} = T^{\pi_k}V^{\pi_k}$ and $T^{\pi_k}V_k = TV_k = V_{k+1}$, so
\[\begin{aligned} \|V^{\pi_k}-V_k\| &\leq \|V^{\pi_k}-V_{k+1}\| + \|V_{k+1}-V_k\|\\ &= \|T^{\pi_k}V^{\pi_k}-T^{\pi_k}V_k\| + \|V_{k+1}-V_k\|\\ &\leq \gamma\|V^{\pi_k}-V_k\| + \gamma^k\|V_1-V_0\|, \end{aligned}\]which gives $\|V^{\pi_k}-V_k\| \leq \frac{\gamma^k}{1-\gamma}\|V_1-V_0\|$. Combining the two bounds yields
\[\|V^{\pi_k}-V^\star\| \leq \frac{2\gamma^k}{1-\gamma}\|V_1-V_0\| \quad \blacksquare\]
Policy iteration
Policy iteration proceeds as follows:
1. Initialize a policy $\pi_0$.
2. Evaluate $V^{\pi_k}$, the value function under the current policy $\pi_k$ (policy evaluation).
3. Find $\pi_{k+1}$ greedy with respect to $V^{\pi_k}$ (policy improvement).
4. Set $k\leftarrow k+1$ and repeat steps 2 and 3 until $\pi$ converges.
As in value iteration, step 3 can be written compactly as $TV^{\pi_k} = T^{\pi_{k+1}}V^{\pi_k}$.
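As with value iteration, a minimal sketch may help. The code below assumes the same illustrative tabular representation `P[s, a, s']`, `R[s, a]` as before; it performs exact policy evaluation by solving the linear system $V^\pi = r_\pi + \gamma P_\pi V^\pi$, which is one common choice (an iterative evaluation would work as well).

```python
import numpy as np

def policy_evaluation(P, R, gamma, policy):
    """Solve V^pi = r_pi + gamma * P_pi V^pi exactly (step 2)."""
    n_states = P.shape[0]
    idx = np.arange(n_states)
    P_pi = P[idx, policy]                 # P_pi[s, s'] = P(s' | s, policy(s))
    r_pi = R[idx, policy]                 # r_pi[s]     = r(s, policy(s))
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

def policy_iteration(P, R, gamma):
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)            # step 1: arbitrary initial policy pi_0
    while True:
        V = policy_evaluation(P, R, gamma, policy)    # step 2: policy evaluation
        greedy = (R + gamma * (P @ V)).argmax(axis=1) # step 3: greedy improvement w.r.t. V^pi_k
        if np.array_equal(greedy, policy):            # step 4: stop once pi has converged
            return V, policy
        policy = greedy
```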
As with value iteration, we can analyze the rate of convergence of policy iteration.
Theorem (PI Rate of convergence w.r.t. policy at each time step).
\[\|V^{\pi_k}-V^\star\|_\infty \leq \gamma^k \|V^{\pi_0}-V^\star \|_\infty.\]
Proof.
All the norms are \(\| \cdot \|_\infty\).
First we prove that $V^{\pi_{k+1}} \geq TV^{\pi_k} \geq V^{\pi_k}$. By the definition of the greedy policy, $TV^{\pi_k} = T^{\pi_{k+1}}V^{\pi_k}$. Moreover, since $T^{\pi_k}V^{\pi_k} = V^{\pi_k}$ and the maximum below is taken over all policies, including $\pi_k$,
\[TV^{\pi_k}:= \max_{\pi} T^\pi V^{\pi_k} = T^{\pi_{k+1}}V^{\pi_k} \geq V^{\pi_k}.\]Applying the monotone operator $T^{\pi_{k+1}}$ repeatedly to both sides gives $(T^{\pi_{k+1}})^n V^{\pi_k} \geq T^{\pi_{k+1}}V^{\pi_k}$ for every $n$, and since $(T^{\pi_{k+1}})^n V^{\pi_k} \rightarrow V^{\pi_{k+1}}$ (equivalently, $T^{\pi_{k+1}}V^{\pi_{k+1}} = V^{\pi_{k+1}}$), we obtain
\[V^{\pi_{k+1}} \geq TV^{\pi_k} \geq V^{\pi_k}.\]Now, since $V^\star := \max_\pi V^\pi$, we have
\(0\leq V^\star -V^{\pi_{k+1}} \leq V^\star -TV^{\pi_k}.\) Therefore,
\[\begin{aligned} \|V^\star - V^{\pi_{k+1}} \|_\infty &\leq \|V^\star -TV^{\pi_{k}}\|_\infty \\ &\leq \gamma\|V^\star - V^{\pi_k}\|_\infty\\ &\quad \quad \quad \vdots\\ &\leq \gamma^{k+1}\|V^\star - V^{\pi_0}\|_\infty, \end{aligned}\]where the second inequality uses $V^\star = TV^\star$ and the contraction property of $T$. Relabeling the index $k+1 \rightarrow k$ yields the statement of the theorem $\blacksquare$.
Note the difference from the VI result, which gives a convergence rate as a function of \(\|V_1-V_0\|_\infty\). That is a very useful bound: as soon as we obtain $V_1$, we know an upper bound on the error of the value function relative to the optimal one at each step $k$. In contrast, although we know that policy iteration converges geometrically, we do not know $V^\star$ in advance, so this analysis may not seem as practical. However, with some further analysis we can bound the runtime of policy iteration; such a theorem and its proof can be found in this post. The runtime upper bound of value iteration is analyzed here.
As a rule of thumb, policy iteration converges faster than value iteration, and the runtime analysis also indicates that the upper bound is higher for value iteration. However, we should not draw hasty conclusions from this comparison, because the upper bounds simply might not be tight. Moreover, value iteration has the advantage of being more robust to changes in the environment. This is because value iteration only involves updating the value function, which is based on the immediate reward and the value of the next state. As a result, value iteration is less sensitive to changes in the transition probabilities or the reward function than policy iteration, which involves updating the policy based on the current value function.
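For intuition only, the two sketches above can be run side by side on a small random MDP (a hypothetical setup, not an experiment from this post); both should recover essentially the same optimal value function, differing only in how they get there.

```python
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 20, 4, 0.9

P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)         # normalize rows into valid transition probabilities
R = rng.random((n_states, n_actions))

V_vi, pi_vi = value_iteration(P, R, gamma)
V_pi, pi_pi = policy_iteration(P, R, gamma)
print(np.max(np.abs(V_vi - V_pi)))        # should be tiny: both approximate V*
```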