Exercise 3.26 - Optimal action-value in terms of optimal state-value and dynamics

Problem Statement

Give an equation for $q_*$ in terms of $v_*$ and the four-argument $p$.

Solution

The four-argument $p$ refers to the environment dynamics $p(s', r | s, a)$. As shown in Exercise 3.19, the action-value can be expressed in terms of the state-value as follows:

$$q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s, A_t = a]$$

Under the optimal policy $\pi_*$ this relationship becomes

$$q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) | S_t = s, A_t = a]$$

The expectation is taken over the next state and reward, which are distributed according to the four-argument $p$, hence

$$\therefore \boxed{q_*(s,a) = \sum\limits_{s', r} p(s', r | s,a)[r + \gamma v_*(s')]}$$

The optimal action-value is the expected value of the immediate reward plus the discounted optimal state-value of the successor state, where each successor state and reward pair $(s', r)$ is weighted by its probability $p(s', r | s, a)$.
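
The boxed equation is a one-step backup over $(s', r)$ pairs, which can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout for $p$, the function name `q_from_v`, the state and action labels, and all numbers are hypothetical, and the values in `v` are placeholders standing in for $v_*$ rather than the solution of any particular MDP.

```python
from typing import Dict, Hashable, List, Tuple

# Hypothetical representation of the four-argument dynamics p(s', r | s, a):
# each (s, a) pair maps to a list of (s_next, reward, probability) triples.
Dynamics = Dict[Tuple[Hashable, Hashable], List[Tuple[Hashable, float, float]]]


def q_from_v(p: Dynamics, v: Dict[Hashable, float],
             s: Hashable, a: Hashable, gamma: float) -> float:
    """Return sum over (s', r) of p(s', r | s, a) * (r + gamma * v(s'))."""
    return sum(prob * (r + gamma * v[s_next]) for s_next, r, prob in p[(s, a)])


# Toy dynamics (illustrative only): from s0, action "a" stays in s0 with
# reward 1 half the time and moves to s1 with reward 0 otherwise.
p = {("s0", "a"): [("s0", 1.0, 0.5), ("s1", 0.0, 0.5)]}

# Placeholder state values, assumed here to play the role of v_*.
v = {"s0": 10.0, "s1": 0.0}

q = q_from_v(p, v, "s0", "a", gamma=0.9)
print(q)
```

With these numbers the sum is $0.5\,(1 + 0.9 \cdot 10) + 0.5\,(0 + 0.9 \cdot 0) = 5$. When `v` really is $v_*$, the result is $q_*(s, a)$.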