A common choice is to sample from the environment with the latest policy. Before collecting more data, you can take K update steps on the data you already have.

![[q_iter.png]]
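The loop described above can be sketched roughly as follows. This is a minimal illustration under several assumptions that are not in the note itself: a tabular Q-function, a hypothetical 4-state chain environment (`env_step`), epsilon-greedy exploration on top of the latest greedy policy, and "K steps" read as K passes of Q-updates over the data collected so far. All names and hyperparameters are made up for the example.

```python
import random

N_STATES, N_ACTIONS, GOAL = 4, 2, 3  # hypothetical toy chain: states 0..3


def env_step(state, action):
    """Toy environment (an assumption): action 1 moves right, action 0 moves left."""
    nxt = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done


def q_iteration(n_rounds=50, n_collect=20, K=5, gamma=0.9, lr=0.5,
                epsilon=0.3, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    buffer = []  # all transitions collected so far
    state = 0
    for _ in range(n_rounds):
        # 1. Collect data with the latest policy (greedy w.r.t. the current q,
        #    plus epsilon exploration; ties are broken at random).
        for _ in range(n_collect):
            if rng.random() < epsilon:
                action = rng.randrange(N_ACTIONS)
            else:
                best = max(q[state])
                action = rng.choice(
                    [a for a in range(N_ACTIONS) if q[state][a] == best]
                )
            nxt, reward, done = env_step(state, action)
            buffer.append((state, action, reward, nxt, done))
            state = 0 if done else nxt  # reset to the start after reaching the goal
        # 2. Take K update passes before collecting more data.
        for _ in range(K):
            for s, a, r, s2, d in buffer:
                target = r if d else r + gamma * max(q[s2])
                q[s][a] += lr * (target - q[s][a])
    return q


if __name__ == "__main__":
    for s, row in enumerate(q_iteration()):
        print(s, [round(v, 3) for v in row])
```

Setting `K=1` makes the loop alternate strictly between collecting and updating, while a larger `K` reuses the same data more before the policy used for sampling changes.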