Problem Statement
Give an equation for $\pi_*$ in terms of $v_*$ and the four-argument $p$.
As in Exercise 3.27, the optimal policy is greedy with respect to the optimal state-value function as well. However, because $v_*$ is defined over states rather than state-action pairs, it must use the environment dynamics to do a one-step look-ahead search over state transitions. Intuitively, the optimal policy selects the actions that maximize the expected immediate reward plus the discounted optimal value of the next state, where the expectation is taken over all possible successor states and rewards.
$\pi_*(a \mid s) = \begin{cases} 1 \quad \text{if } a \in \argmax\limits_{a' \in \mathcal{A}(s)} \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a'] \\ 0 \quad \text{otherwise} \end{cases}$
By the definition of this expectation in terms of the four-argument dynamics function $p$, this is equivalent to

$\pi_*(a \mid s) = \begin{cases} 1 \quad \text{if } a \in \argmax\limits_{a' \in \mathcal{A}(s)} \sum\limits_{s', r} p(s', r \mid s, a') \left[ r + \gamma v_*(s') \right] \\ 0 \quad \text{otherwise} \end{cases}$
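To make the one-step look-ahead concrete, here is a minimal Python sketch of extracting a greedy optimal policy from $v_*$ and the four-argument $p$. It assumes (hypothetically) that the dynamics are available as a dictionary `p` mapping each `(state, action)` pair to a list of `(next_state, reward, probability)` triples, that `v_star` maps states to their optimal values, and that `actions` lists $\mathcal{A}(s)$ for each state; these names are illustrative, not from the text.

```python
def greedy_policy_from_v(p, v_star, actions, gamma):
    """Extract an optimal policy from v* by a one-step look-ahead over p.

    p       : dict mapping (state, action) -> list of (next_state, reward, probability)
    v_star  : dict mapping state -> optimal state value v*(s)
    actions : dict mapping state -> iterable of available actions A(s)
    gamma   : discount factor
    Returns a dict mapping each state to the set of actions attaining the
    maximum of sum over (s', r) of p(s', r | s, a) * [r + gamma * v*(s')].
    """
    policy = {}
    for s, available in actions.items():
        # Expected one-step return of each action under the optimal values.
        lookahead = {
            a: sum(prob * (r + gamma * v_star[s_next])
                   for s_next, r, prob in p[(s, a)])
            for a in available
        }
        best = max(lookahead.values())
        # Every tied action is optimal; probability 1 can be split among them.
        policy[s] = {a for a, value in lookahead.items() if value == best}
    return policy
```

Any distribution that places all of its probability on the returned action set is an optimal policy; actions outside the set receive probability 0.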
Another way to see this is to use the fact that

$q_*(s, a) = \sum\limits_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_*(s') \right],$

as shown in Exercise 3.26. By Exercise 3.27, the optimal policy places probability 1 on the action(s) that maximize $q_*(s, a)$, so substituting this expression for $q_*(s, a)$ yields the same equation for $\pi_*$ in terms of $v_*$ and $p$.