We implement a deep learning model that learns control policies for the game of Breakout directly from frames, based on the DeepMind DQN paper. The model is a convolutional neural network trained with a variant of Q-learning, whose input consists of raw pixels and whose output is a value function estimating future rewards. We apply this method to Breakout and achieve a human level of performance.
The games of Atari serve as a benchmark for evaluating reinforcement learning algorithms due to their challenging dynamics and the necessity for effective control policies. In this study, we leverage the principles outlined in the DeepMind DQN paper, which introduced a novel approach to combining deep learning with reinforcement learning, to play the game of Breakout. By utilizing convolutional neural networks, we aim to learn optimal control strategies directly from the raw pixel data of the game, thereby eliminating the need for handcrafted features.
The DQN architecture employs a deep neural network to approximate the Q-value function, which predicts the expected future rewards for each action given a specific state. This allows the model to make informed decisions based on the visual input it receives. Our implementation focuses on the unique challenges posed by Breakout, where the agent must learn to navigate the environment, track the ball, and strategically position the paddle to maximize its score.
Through a series of experiments, we demonstrate that our model not only learns to play Breakout effectively but also achieves performance levels comparable to those of human players. This showcases the potential of deep learning techniques in mastering complex tasks that require real-time decision-making based on visual inputs.
Q-learning is one of the most popular algorithms in RL, since it can be used in model-free setups (both for evaluation and improvement), and it is also off-policy, meaning that we can make use of the experiences from a given policy (namely the behaviour policy) to improve a desired target policy. The update rule for the algorithm is as follows:
$ Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right) \quad \text{(eq. 1)} $
(where $ a $ is chosen by the behaviour policy and $ a' $ is chosen by the target policy. Also note that $ r + \gamma \max_{a'} Q(s', a') $ is usually called the target value.)
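To make eq. 1 concrete, here is a minimal tabular sketch of the update (illustrative only; our actual agent replaces the table with a neural network, as described next):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """Apply one Q-learning update (eq. 1) to a Q-table of shape (n_states, n_actions)."""
    # Target value: just r at terminal steps, bootstrapped otherwise
    target = r + (0.0 if done else gamma * np.max(Q[s_next]))
    # Move Q(s, a) toward the target by a step of size alpha
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```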
Having the Q-learning algorithm alone does us no good unless we are working with a very simple environment. Since in most cases we are dealing with an environment with a large state space and action space, we can represent the action-value function (sometimes called the Q-function) with a function approximator such as a neural network.
The neural network that we have used in our work is explained in detail in the Image Processing Network section.
Since our work is on an Atari game, we need a CNN to process the frames. The specification of the CNN we used is as follows (a code sketch follows the list):
- First layer: a convolutional layer with 4 input channels, 32 filters, and a kernel size of (8, 8), with a stride of 4, followed by a ReLU.
- Second layer: a convolutional layer with 32 input channels, 64 filters, and a kernel size of (4, 4), with a stride of 2, followed by a ReLU.
- Third layer: a convolutional layer with 64 input channels, 64 filters, and a kernel size of (3, 3), with a stride of 1, followed by a ReLU and then a flattening.
- Fourth layer: a fully connected layer with 3136 input features and 512 output features, followed by a ReLU.
- Fifth and last layer: a fully connected layer with 512 input features and 4 output features (one per action).
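A minimal PyTorch sketch of this architecture (class and variable names are ours). The input is a stack of 4 preprocessed 84x84 grayscale frames; the output is one Q-value per action:

```python
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),   # -> (32, 20, 20)
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),  # -> (64, 9, 9)
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),  # -> (64, 7, 7)
            nn.ReLU(),
            nn.Flatten(),                                # -> 64 * 7 * 7 = 3136 features
            nn.Linear(3136, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, x):
        return self.net(x)
```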
Before diving into the DQN algorithm, we need to define a few things first:
As we said earlier, the Q-learning algorithm can be used in an off-policy setting, meaning we can utilize experiences generated by other policies to improve our current policy. In other words, we can make use of our previous experiences in the game (produced by our previous policies) to train our agent again and again. This technique is implemented using a Replay Buffer, which is essentially a storage of our previous experiences (each of the form $ e_t = (s_t, a_t, r_t, s_{t+1}) $). More specifically, while interacting with the environment, we sample mini-batches of previous experiences from the Replay Buffer and use them to improve our policy.
Our Replay Buffer holds 50,000 experiences, and we sample mini-batches of size 32 from it. We do not sample mini-batches during the agent's first interactions with the environment (the so-called warm-up phase).
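A minimal sketch of such a buffer (class and method names are ours); training would only begin once the buffer holds more transitions than the warm-up size:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=50_000):
        # deque evicts the oldest experiences once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        """Store one transition e_t = (s_t, a_t, r_t, s_{t+1})."""
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        """Sample a random mini-batch of stored transitions."""
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```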
In order to stabilize learning, we use two different networks (the two networks are identical except for their weights; the target network's weights are updated less frequently). A short code sketch of their relationship follows the list:
- Q-Network: the primary network, used to pick the best actions while the agent interacts with the environment. We denote this network by $ Q $ and its weights by $ \theta $.
- Target Network: used to approximate the target value (defined in the Q-Learning section, eq. 1). Its weights are updated less frequently in order to stabilize the algorithm. We denote this network by $ \hat{Q} $ and its weights by $ \theta^- $.
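A minimal sketch of how the two networks relate, reusing the QNetwork class from the Image Processing Network sketch:

```python
import copy

# The target network starts as an exact copy of the Q-network; it is never
# updated by gradient descent, only refreshed from the Q-network's weights.
q_net = QNetwork()                 # Q, with weights theta
target_net = copy.deepcopy(q_net)  # Q-hat, with weights theta-minus

# ... later, every C training steps:
target_net.load_state_dict(q_net.state_dict())
```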
Since the exploration-exploitation trade-off is one of the main issues for value-based reinforcement learning algorithms, we use a decaying variable (usually called $ \epsilon $) that controls the probability of taking a random action while interacting with the environment. This helps the agent explore the environment well enough to support a successful exploitation phase later on.
We will explain exactly how the decaying $ \epsilon $ is used in the DQN Algorithm section; here we provide the specifications of how we initialize and decay $ \epsilon $:
Initial value: $ 1.0 $; final value: $ 0.1 $; number of frames over which to decay: $ 10^6 $ (we decay $ \epsilon $ linearly).
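A minimal sketch of this linear schedule:

```python
EPS_START, EPS_END, EPS_DECAY_FRAMES = 1.0, 0.1, 1_000_000

def epsilon_at(frame_idx):
    """Linearly anneal epsilon from 1.0 to 0.1 over the first 10^6 frames."""
    fraction = min(frame_idx / EPS_DECAY_FRAMES, 1.0)
    return EPS_START + fraction * (EPS_END - EPS_START)
```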
We use RMSprop as the optimizer for this work; for details on hyperparameters, please refer to the Appendix.
The following is the pseudocode of the DQN algorithm:
$$
\begin{aligned}
&\text{Initialize replay memory } D \text{ to capacity } N \\
&\text{Initialize action-value function } Q \text{ with random weights } \theta \\
&\text{Initialize target action-value function } \hat{Q} \text{ with weights } \theta^- = \theta \\
&\text{For each episode:} \\
&\quad \text{Initialize sequence } s_1 = \{x_1\} \text{ and preprocessed sequence } \phi_1 = \phi(s_1) \\
&\quad \text{For } t = 1 \text{ to } T: \\
&\qquad \text{With probability } \epsilon \text{ select a random action } a_t \\
&\qquad \text{Otherwise select } a_t = \arg\max_a Q(\phi_t, a; \theta) \\
&\qquad \text{Execute action } a_t \text{ and observe reward } r_t \text{ and new image } x_{t+1} \\
&\qquad \text{Set } s_{t+1} = s_t, a_t, x_{t+1} \text{ and preprocess } \phi_{t+1} = \phi(s_{t+1}) \\
&\qquad \text{Store transition } (\phi_t, a_t, r_t, \phi_{t+1}) \text{ in } D \\
&\qquad \text{Sample random minibatch of transitions } (\phi_j, a_j, r_j, \phi_{j+1}) \text{ from } D \\
&\qquad \text{Set } y_j = \begin{cases} r_j & \text{if episode terminates at step } j+1 \\ r_j + \gamma \max_{a'} \hat{Q}(\phi_{j+1}, a'; \theta^-) & \text{otherwise} \end{cases} \\
&\qquad \text{Perform a gradient descent step on } (y_j - Q(\phi_j, a_j; \theta))^2 \\
&\qquad \text{Every } C \text{ steps reset } \hat{Q} = Q \\
&\text{End For}
\end{aligned}
$$
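To make the inner-loop update concrete, here is a condensed PyTorch sketch of the target computation and gradient step, reusing the QNetwork, ReplayBuffer, and optimizer sketched earlier (names and tensor handling are our own; this is illustrative, not our full training loop):

```python
import numpy as np
import torch
import torch.nn.functional as F

GAMMA = 0.99

def train_step(q_net, target_net, buffer, optimizer, batch_size=32):
    """One gradient step of the pseudocode above."""
    states, actions, rewards, next_states, dones = zip(*buffer.sample(batch_size))
    states = torch.as_tensor(np.stack(states), dtype=torch.float32)
    next_states = torch.as_tensor(np.stack(next_states), dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    # y_j: reward only at terminal steps, bootstrapped from the target net otherwise
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
        targets = rewards + GAMMA * (1.0 - dones) * max_next_q

    # Q(phi_j, a_j; theta) for the actions actually taken
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    loss = F.mse_loss(q_values, targets)  # (y_j - Q(phi_j, a_j; theta))^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```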
Working directly with raw Atari frames, which are 210 x 160 RGB pixel images, can be computationally demanding, so we apply a basic preprocessing step aimed at reducing the input dimensionality. First, to encode a single frame we take the maximum value for each pixel color value over the frame being encoded and the previous frame. The frame is then converted to grayscale and downsampled to an 84 x 84 image. We also stack the four most recent preprocessed frames to form the network input, and repeat each selected action over four consecutive frames, which lets the network infer motion.
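A minimal sketch of this per-frame preprocessing, assuming OpenCV (cv2) for the grayscale conversion and resize; scaling to [0, 1] is our own choice for the illustration:

```python
import cv2
import numpy as np

def preprocess(frame, prev_frame):
    """Encode one frame: per-pixel max over two frames, grayscale, 84x84."""
    merged = np.maximum(frame, prev_frame)           # max over consecutive frames
    gray = cv2.cvtColor(merged, cv2.COLOR_RGB2GRAY)  # drop color channels
    small = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)
    return small.astype(np.float32) / 255.0          # scale to [0, 1] (our choice)
```

Four such frames are then stacked along the channel axis to form the 4 x 84 x 84 network input.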
The training loss and reward over 3 million iterations and about 3 thousand episodes are shown in the figure below:

As shown in the figure, the mean reward per episode reached a maximum of about 6 around episode 17,500 and began to diverge from there onwards.
However, the best model in our evaluations was obtained at episode 21,565 (iteration 1,880,000), with a mean reward of 7.9 per episode.
This paper implements the deep learning model introduced in the DeepMind paper on DQN for reinforcement learning and demonstrates its ability to master difficult control policies for Breakout, using only raw pixels as input. The online Q-learning method we implemented combines stochastic mini-batch updates with experience replay memory to facilitate the training of deep networks for reinforcement learning. This approach succeeded in achieving human-level performance in the game of Breakout.
- Minibatch size: 32
- Replay memory size: 50,000
- Target network update frequency: 10,000
- Discount factor: 0.99
- RMSprop learning rate: 0.00025
- Gradient momentum: 0.95
- Squared gradient momentum: 0.95
- Min squared gradient: 0.01
- Initial exploration: 1.0
- Final exploration: 0.1
- Final exploration frame: 1,000,000
- Warm-up size: 1,000
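For reference, a sketch of how these values could map onto PyTorch's torch.optim.RMSprop, reusing the QNetwork sketched earlier (interpreting alpha as the squared gradient momentum and eps as the min squared gradient is our own mapping):

```python
import torch

optimizer = torch.optim.RMSprop(
    q_net.parameters(),
    lr=0.00025,     # learning rate
    alpha=0.95,     # squared gradient momentum (our interpretation)
    eps=0.01,       # min squared gradient (our interpretation)
    momentum=0.95,  # gradient momentum
)
```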
