We implement a deep learning model that learns control policies for the game of Breakout directly from frames, based on the DeepMind DQN paper. The model is a convolutional neural network trained with a variant of Q-learning, whose input consists of raw pixels and whose output is a value function estimating future rewards. We apply this method to Breakout and achieve a human level of performance.
The games of Atari serve as a benchmark for evaluating reinforcement learning algorithms due to their challenging dynamics and the necessity for effective control policies. In this study, we leverage the principles outlined in the DeepMind DQN paper, which introduced a novel approach to combining deep learning with reinforcement learning, to play the game of Breakout. By utilizing convolutional neural networks, we aim to learn optimal control strategies directly from the raw pixel data of the game, thereby eliminating the need for handcrafted features.
The DQN architecture employs a deep neural network to approximate the Q-value function, which predicts the expected future rewards for each action given a specific state. This allows the model to make informed decisions based on the visual input it receives. Our implementation focuses on the unique challenges posed by Breakout, where the agent must learn to navigate the environment, track the ball, and strategically position the paddle to maximize its score.
Through a series of experiments, we demonstrate that our model not only learns to play Breakout effectively but also achieves performance levels comparable to those of human players. This showcases the potential of deep learning techniques in mastering complex tasks that require real-time decision-making based on visual inputs.
Q-learning is one of the most popular algorithms in RL, since it can be used in model-free setups (both for evaluation and improvement), and it is also off-policy, meaning that we can make use of the experiences from a given policy (namely the behaviour policy) to improve a desired target policy. The update rule for the algorithm is as follows:
$ Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right) \quad \text{(eq. 1)} $
(where $ a $ is chosen by the behaviour policy and $ a' $ is chosen by the target policy. Also note that $ r + \gamma \max_{a'} Q(s', a') $ is usually called the target value.)
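To make eq. 1 concrete, here is a minimal tabular sketch of the update (illustrative only; our actual agent replaces the table with a neural network, as described next):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """Apply one Q-learning update (eq. 1) to a Q-table of shape (n_states, n_actions)."""
    # Target value: just r at terminal steps, bootstrapped otherwise
    target = r + (0.0 if done else gamma * np.max(Q[s_next]))
    # Move Q(s, a) toward the target by a step of size alpha
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```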
Having the Q-learning algorithm alone does us no good unless we are working with a very simple environment. Since in most cases we are dealing with an environment with a large state space and action space, we can represent the action-value function (sometimes called the Q-function) with a function approximator such as a neural network.
The neural network that we have used in our work is explained in detail in the Image Processing Network section.
Since our work is on an Atari game, we need a CNN to process the frames. The specification of the CNN we used is as follows (a code sketch follows the list):
- First layer: a convolutional layer with 4 input channels, 32 filters, and a kernel size of (8, 8), with a stride of 4, followed by a ReLU.
- Second layer: a convolutional layer with 32 input channels, 64 filters, and a kernel size of (4, 4), with a stride of 2, followed by a ReLU.
- Third layer: a convolutional layer with 64 input channels, 64 filters, and a kernel size of (3, 3), with a stride of 1, followed by a ReLU and then a flattening.
- Fourth layer: a fully connected layer with 3136 input features and 512 output features, followed by a ReLU.
- Fifth and last layer: a fully connected layer with 512 input features and 4 output features (one per action).
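A minimal PyTorch sketch of this architecture (class and variable names are ours). The input is a stack of 4 preprocessed 84x84 grayscale frames; the output is one Q-value per action:

```python
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),   # -> (32, 20, 20)
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),  # -> (64, 9, 9)
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),  # -> (64, 7, 7)
            nn.ReLU(),
            nn.Flatten(),                                # -> 64 * 7 * 7 = 3136 features
            nn.Linear(3136, 512),
            nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, x):
        return self.net(x)
```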
Before diving into the DQN algorithm, we need to define a few things first:
As we said earlier, the Q-learning algorithm can be used in an off-policy setting, meaning we can utilize experiences generated by other policies to improve our current policy. In other words, we can make use of our previous experiences in the game (produced by our previous policies) to train our agent again and again. This technique is implemented using a Replay Buffer, which is essentially a storage of our previous experiences (each of the form $ e_t = (s_t, a_t, r_t, s_{t+1}) $). More specifically, while interacting with the environment, we sample mini-batches of previous experiences from the Replay Buffer and use them to improve our policy.
Our Replay Buffer holds 50,000 experiences, and we sample mini-batches of size 32 from it. We do not sample mini-batches during the agent's first interactions with the environment (the so-called warm-up phase).
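A minimal sketch of such a buffer (class and method names are ours); training would only begin once the buffer holds more transitions than the warm-up size:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=50_000):
        # deque evicts the oldest experiences once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        """Store one transition e_t = (s_t, a_t, r_t, s_{t+1})."""
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        """Sample a random mini-batch of stored transitions."""
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```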
In order to stabilize learning, we use two different networks (the two networks are identical except for their weights; the target network's weights are updated less frequently). A short code sketch of their relationship follows the list:
- Q-Network: the primary network, used to pick the best actions while the agent interacts with the environment. We denote this network by $ Q $ and its weights by $ \theta $.
- Target Network: used to approximate the target value (defined in the Q-Learning section, eq. 1). Its weights are updated less frequently in order to stabilize the algorithm. We denote this network by $ \hat{Q} $ and its weights by $ \theta^- $.
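A minimal sketch of how the two networks relate, reusing the QNetwork class from the Image Processing Network sketch:

```python
import copy

# The target network starts as an exact copy of the Q-network; it is never
# updated by gradient descent, only refreshed from the Q-network's weights.
q_net = QNetwork()                 # Q, with weights theta
target_net = copy.deepcopy(q_net)  # Q-hat, with weights theta-minus

# ... later, every C training steps:
target_net.load_state_dict(q_net.state_dict())
```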
Since the exploration-exploitation trade-off is one of the main issues for value-based reinforcement learning algorithms, we use a decaying variable (usually called $ \epsilon $) that controls the probability of taking a random action while interacting with the environment. This helps the agent explore the environment well enough to support a successful exploitation phase later on.
We will explain exactly how the decaying $ \epsilon $ is used in the DQN Algorithm section; here we provide the specifications of how we initialize and decay $ \epsilon $:
Initial value: $ 1.0 $; final value: $ 0.1 $; number of frames over which to decay: $ 10^6 $ (we decay $ \epsilon $ linearly).
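A minimal sketch of this linear schedule:

```python
EPS_START, EPS_END, EPS_DECAY_FRAMES = 1.0, 0.1, 1_000_000

def epsilon_at(frame_idx):
    """Linearly anneal epsilon from 1.0 to 0.1 over the first 10^6 frames."""
    fraction = min(frame_idx / EPS_DECAY_FRAMES, 1.0)
    return EPS_START + fraction * (EPS_END - EPS_START)
```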
We use RMSprop as the optimizer for this work; for details on hyperparameters, please refer to the Appendix.
The following is the pseudocode of the DQN algorithm:
$$
\begin{aligned}
&\text{Initialize replay memory } D \text{ to capacity } N \\
&\text{Initialize action-value function } Q \text{ with random weights } \theta \\
&\text{Initialize target action-value function } \hat{Q} \text{ with weights } \theta^- = \theta \\
&\text{For each episode:} \\
&\quad \text{Initialize sequence } s_1 = \{x_1\} \text{ and preprocessed sequence } \phi_1 = \phi(s_1) \\
&\quad \text{For } t = 1 \text{ to } T: \\
&\qquad \text{With probability } \epsilon \text{ select a random action } a_t \\
&\qquad \text{Otherwise select } a_t = \arg\max_a Q(\phi_t, a; \theta) \\
&\qquad \text{Execute action } a_t \text{ and observe reward } r_t \text{ and new image } x_{t+1} \\
&\qquad \text{Set } s_{t+1} = s_t, a_t, x_{t+1} \text{ and preprocess } \phi_{t+1} = \phi(s_{t+1}) \\
&\qquad \text{Store transition } (\phi_t, a_t, r_t, \phi_{t+1}) \text{ in } D \\
&\qquad \text{Sample random minibatch of transitions } (\phi_j, a_j, r_j, \phi_{j+1}) \text{ from } D \\
&\qquad \text{Set } y_j = \begin{cases} r_j & \text{if episode terminates at step } j+1 \\ r_j + \gamma \max_{a'} \hat{Q}(\phi_{j+1}, a'; \theta^-) & \text{otherwise} \end{cases} \\
&\qquad \text{Perform a gradient descent step on } (y_j - Q(\phi_j, a_j; \theta))^2 \\
&\qquad \text{Every } C \text{ steps reset } \hat{Q} = Q \\
&\text{End For}
\end{aligned}
$$
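To make the inner-loop update concrete, here is a condensed PyTorch sketch of the target computation and gradient step, reusing the QNetwork, ReplayBuffer, and optimizer sketched earlier (names and tensor handling are our own; this is illustrative, not our full training loop):

```python
import numpy as np
import torch
import torch.nn.functional as F

GAMMA = 0.99

def train_step(q_net, target_net, buffer, optimizer, batch_size=32):
    """One gradient step of the pseudocode above."""
    states, actions, rewards, next_states, dones = zip(*buffer.sample(batch_size))
    states = torch.as_tensor(np.stack(states), dtype=torch.float32)
    next_states = torch.as_tensor(np.stack(next_states), dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    # y_j: reward only at terminal steps, bootstrapped from the target net otherwise
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
        targets = rewards + GAMMA * (1.0 - dones) * max_next_q

    # Q(phi_j, a_j; theta) for the actions actually taken
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    loss = F.mse_loss(q_values, targets)  # (y_j - Q(phi_j, a_j; theta))^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```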
Working directly with raw Atari frames, which are 210 x 160 RGB pixel images, can be computationally demanding, so we apply a basic preprocessing step aimed at reducing the input dimensionality. First, to encode a single frame we take the maximum value for each pixel color value over the frame being encoded and the previous frame. The frame is then converted to grayscale and downsampled to an 84 x 84 image. We also stack the four most recent preprocessed frames to form the network input, and repeat each selected action over four consecutive frames, which lets the network infer motion.
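A minimal sketch of this per-frame preprocessing, assuming OpenCV (cv2) for the grayscale conversion and resize; scaling to [0, 1] is our own choice for the illustration:

```python
import cv2
import numpy as np

def preprocess(frame, prev_frame):
    """Encode one frame: per-pixel max over two frames, grayscale, 84x84."""
    merged = np.maximum(frame, prev_frame)           # max over consecutive frames
    gray = cv2.cvtColor(merged, cv2.COLOR_RGB2GRAY)  # drop color channels
    small = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)
    return small.astype(np.float32) / 255.0          # scale to [0, 1] (our choice)
```

Four such frames are then stacked along the channel axis to form the 4 x 84 x 84 network input.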
The training loss and reward over 3 million iterations and about 3 thousand episodes are shown in the figure below:

As shown in the figure, the mean reward per episode reached a maximum of about 6 around episode 17,500 and began to diverge from there onwards.
However, the best model in our evaluations was obtained at episode 21,565 (iteration 1,880,000), with a mean reward of 7.9 per episode.
This paper implements the deep learning model introduced in the DeepMind paper on DQN for reinforcement learning and demonstrates its ability to master difficult control policies for Breakout, using only raw pixels as input. The online Q-learning method we implemented combines stochastic mini-batch updates with experience replay memory to facilitate the training of deep networks for reinforcement learning. This approach succeeded in achieving human-level performance in the game of Breakout.
- Minibatch size: 32
- Replay memory size: 50,000
- Target network update frequency: 10,000
- Discount factor: 0.99
- RMSprop learning rate: 0.00025
- Gradient momentum: 0.95
- Squared gradient momentum: 0.95
- Min squared gradient: 0.01
- Initial exploration: 1.0
- Final exploration: 0.1
- Final exploration frame: 1,000,000
- Warm-up size: 1,000
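For reference, a sketch of how these values could map onto PyTorch's torch.optim.RMSprop, reusing the QNetwork sketched earlier (interpreting alpha as the squared gradient momentum and eps as the min squared gradient is our own mapping):

```python
import torch

optimizer = torch.optim.RMSprop(
    q_net.parameters(),
    lr=0.00025,     # learning rate
    alpha=0.95,     # squared gradient momentum (our interpretation)
    eps=0.01,       # min squared gradient (our interpretation)
    momentum=0.95,  # gradient momentum
)
```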
