Breaking Down Reinforcement Learning

A big name floating around in the machine learning world right now is Reinforcement Learning (RL). But what is it? How is it different from traditional deep learning? Why is it useful? By the time you finish reading this, I hope you have an understanding of at least that much as it relates to reinforcement learning. Each section of the article will have a more beginner-friendly bit alongside a more technical bit, with possibly an equation or two and some references to further reading. This way you can pick and choose which concepts to investigate.

Because there are so many different methods to model a reinforcement learning problem and they vary wildly depending on the problem you’re trying to solve, we’re going to talk mostly about how models have been trained to play video games. It may not be the most applicable in the real world, but it does present some more interesting problems in terms of the environment the model interacts with and the massive number of actions the model has available to it. It’s also a lot more fun to watch a model play classic Atari games than it is to watch it move a robotic arm and pick up a cylinder. I will briefly discuss the other methods of training RL models, and leave some references for further reading.

What is reinforcement learning and how is it different from deep learning?

Where traditional deep learning methods usually revolve around a one-to-one map between input and output data, reinforcement learning is a form of machine learning that focuses on a cycle of state, action, and reward.

Cycle of reinforcement learning

A state, in this case, is a frozen moment in time that contains all the information the model needs to choose which action it will take next. A state can be a representation of the current moment only or, in many cases, a set of past states and actions leading up to the current moment. An action is what the model has decided is the best way to proceed given the current state. An action could be very complex, or it could be to sit still and do nothing. A reward is what the model receives when it takes an action in the current state, and the rewards accumulated over time are what the model is trying to maximize. The reward can be immediate, like picking up food, or it can be the culmination of a set of unrewarded actions the model previously took, like getting out of a maze.

In traditional deep learning, you have a set of inputs and (generally) a labeled set of outputs. Your input can be pictures, words, or just numbers that relate to some phenomenon you’ve observed. The goal is usually to take these inputs and use different layers or formulas to transform them such that the output of the model matches the target labeled data you have. For example, you could use an LSTM-based model to translate text between languages (Neural Machine Translation), or you could use a convolutional neural network to segment images based on the objects they contain (Object Segmentation). In all of these cases, we have a model that is given one piece of data, outputs one piece of data, and doesn’t care at all about what data came before or what data comes after the current example.

Reinforcement learning, on the other hand, works to maximize the potential future reward it will receive. It’s common not to have a training dataset to begin with, so the model will often have to collect the data itself. The data consists of (state, action, reward, new state) tuples, which the model collects by choosing actions and interacting with the environment. At first, its actions will be random and it won’t get much of a reward. But as it accumulates more data, it will better understand the environment and reward structure. Eventually, it will be able to make choices that efficiently maximize its long-term rewards.

In short: deep learning focuses on maximizing accuracy (minimizing a loss function) on the current example, and will generally have a set of data to train on. Reinforcement learning, by contrast, maximizes the total reward for the whole task rather than for a single iteration, which means it will generally have to collect the data it learns from on its own.

A simple example could be the game Snake. A deep learning model looking at the state in Figure 1 would think “Okay, the red square is in front of me, so I’m going to go straight for it,” because it has no concept of what’s going to happen after it gets that red square. A reinforcement learning model, on the other hand, says “Sure, the square is there and the sooner I get to it the better. But if I go for it right now, my tail will extend, and I’ll probably die.”

Figure 1: A state in the game Snake.

How does Reinforcement Learning actually work?

As I said in the introduction, I’m going to focus on the variant of Reinforcement Learning models that play video games (because they’re fun to look at). These models generally optimize their behaviour through methods called Q-Learning and Policy Gradients.

Imagine the model, hereafter called an agent, is playing a video game. The game can only last for a certain amount of time, but you can still die before the time is up. This game environment could have an element of randomness to it, though the more random it is the harder it will be for the agent to learn how to play. At each moment in time, the agent can observe the current state of the game s -- just like you or I would look at a TV screen -- and then it chooses any action a from the set of legal actions available to it in that state -- like pressing a certain button on a controller. The agent then expects to receive a reward r that depends only on the action it took in that specific state (meaning if we encounter the same state and take the same action in a different ‘playthrough’, the reward should be the same). After the model completes this State, Action, Reward cycle, the state updates (think of it as the next frame of a video game) based on the action the agent just took.
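To make the State, Action, Reward cycle a little more tangible, here is a minimal sketch of it in Python. None of this comes from a real game or library: the Corridor class, its legal_actions and step methods, and play_episode are hypothetical names I made up for illustration. The agent here just picks random legal actions, which is roughly what an untrained agent does, and records each (state, action, reward, new state) tuple as it goes.

```python
import random
from typing import List, Tuple


class Corridor:
    """Hypothetical toy game: walk along squares 0..4; reaching square 4 wins."""

    def __init__(self, length: int = 5, max_steps: int = 20):
        self.length = length
        self.max_steps = max_steps  # the game only lasts a limited time
        self.reset()

    def reset(self) -> int:
        self.position = 0
        self.steps = 0
        return self.position

    def legal_actions(self, state: int) -> List[int]:
        # -1 = step left, +1 = step right; you cannot step left off the start.
        return [+1] if state == 0 else [-1, +1]

    def step(self, action: int) -> Tuple[int, float, bool]:
        """Apply an action and return (new_state, reward, done)."""
        self.position += action
        self.steps += 1
        won = self.position == self.length - 1
        reward = 1.0 if won else 0.0  # same state and action, same reward
        done = won or self.steps >= self.max_steps
        return self.position, reward, done


def play_episode(env: Corridor, rng: random.Random):
    """Run one playthrough with random actions and record every transition."""
    transitions = []
    state = env.reset()
    done = False
    while not done:
        action = rng.choice(env.legal_actions(state))    # choose a legal action a
        next_state, reward, done = env.step(action)      # receive reward r
        transitions.append((state, action, reward, next_state))
        state = next_state                               # the state updates
    return transitions


episode = play_episode(Corridor(), random.Random(0))
```

Each entry in episode is one pass through the cycle. A learning agent would use these recorded tuples as its training data instead of throwing them away.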


Now, the challenge for our agent is to find a way of making decisions, called a policy π, that chooses the action that will maximize the future reward. The policy in the case of Q-Learning is to choose, in each state, the action with the highest value. The value of an action in any state is equal to its immediate reward plus the value of the states (and the actions available in them) that you reach by taking that action.

However, there’s a catch! Anyone who’s taken an economics class will probably be familiar with the time value of money. If you haven’t taken an econ class (I don’t blame you), it means that money now, or an immediate reward in the case of our game-playing agent, is worth more than the same amount received in the future. A reward’s value decreases the further in the future it arrives: a reward received s steps from now is multiplied by γ^s, where γ (gamma, not the letter y) is a predetermined discount factor between 0 and 1 (and s here counts steps, not the state s from earlier). So we can say that the value of each action in a state the agent can find itself in is equal to the reward it receives immediately, plus the discounted value of the possible states it can reach next.

This works perfectly when you know the reward you’d get for each state-action pair, but in most cases you won’t know ahead of time. Even if we did know all the possible future rewards for one state, we would have no idea what they are for any of the other states, because we have no way to generalize. This is where machine learning comes in. We can use it to estimate the future rewards of each next state given an action and a current state. Over time, we update this value estimator so that the agent eventually learns to predict the best action for each state, without having to try every state-action combination.
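If you’d like to see the idea in code, here is a minimal sketch of tabular Q-Learning in Python. It is not taken from any particular library: the five-square corridor game, the function names (step, greedy_action, train) and the values of GAMMA, ALPHA and EPSILON are all assumptions I picked for illustration. A plain lookup table stands in for the value estimator, which only works because this toy game has a handful of states; a game-playing agent would typically use a neural network in its place.

```python
import random
from collections import defaultdict

GAMMA = 0.9     # discount factor (assumed value)
ALPHA = 0.5     # learning rate (assumed value)
EPSILON = 0.2   # chance of trying a random action (assumed value)
N_STATES = 5    # hypothetical corridor: squares 0..4, square 4 is the goal
ACTIONS = (-1, +1)  # step left or step right


def step(state, action):
    """Toy environment: returns (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def greedy_action(q, state, rng):
    """Pick the highest-valued action, breaking ties at random."""
    best = max(q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if q[(state, a)] == best])


def train(episodes=500, max_steps=100, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)  # q[(state, action)] -> estimated value, starts at 0
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            # Mostly follow the current estimates, sometimes explore.
            if rng.random() < EPSILON:
                action = rng.choice(ACTIONS)
            else:
                action = greedy_action(q, state, rng)
            next_state, reward, done = step(state, action)
            # Value = immediate reward + discounted value of the best next action.
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            target = reward + GAMMA * best_next
            # Nudge the old estimate toward the new target.
            q[(state, action)] += ALPHA * (target - q[(state, action)])
            state = next_state
            if done:
                break
    return q


q = train()
policy = {s: greedy_action(q, s, random.Random(0)) for s in range(N_STATES - 1)}
```

After training, policy maps every non-goal square to the action the agent currently rates highest. In this corridor, the learned values shrink by a factor of GAMMA for every extra step between a square and the goal, which is the discounting idea at work.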

Policy Gradient

In this approach, rather than trying to estimate the value of taking a certain action, the agent tries to learn the policy itself. So instead of training a model to estimate values from which we decide which action to take, we directly use the model to output the probability of taking each possible action given a certain state. Whenever the agent takes an action that produces a positive reward, the likelihood that it makes that decision again in that state is increased. Likewise, whenever the agent performs an action that leads to a negative reward, the likelihood that it will produce that action again the next time it’s in that state is reduced.
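Here is a minimal sketch of that idea, using the simplest policy gradient update I know of (often called REINFORCE) on a made-up one-state game. Everything specific is an assumption for illustration: the two actions, the REWARDS list, the LEARNING_RATE value, and the use of a bare list of logits in place of a neural network.

```python
import math
import random

REWARDS = [-1.0, 1.0]   # hypothetical game: action 0 is punished, action 1 is rewarded
LEARNING_RATE = 0.1     # assumed value


def softmax(logits):
    """Turn raw preferences into action probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=500, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0]  # the "policy": one preference per action
    for _ in range(steps):
        probs = softmax(logits)
        # Sample an action from the policy's own probabilities.
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = REWARDS[action]
        for a in range(len(logits)):
            # Gradient of log(probability of the chosen action) w.r.t. logit a.
            grad_log_prob = (1.0 if a == action else 0.0) - probs[a]
            # Positive reward raises the chosen action's probability,
            # negative reward lowers it.
            logits[a] += LEARNING_RATE * reward * grad_log_prob
    return softmax(logits)


final_probs = train()
```

Notice that no values are estimated anywhere. The rewards push the action probabilities up or down directly, and final_probs ends up heavily favouring the rewarded action.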

Comparing the two

By nature, Q-Learning will generally converge to an optimal solution faster than policy gradient will for problems with a small number of less complex states. So it’s well suited to small games like Atari titles and Snake. But once you need to train an agent for a game like StarCraft, where the number of states and actions is far too large to accurately learn a value for each, it’s generally thought that a policy gradient approach is best.

The challenges of Reinforcement learning

If your environment has an element of randomness to it, it could be a lot harder for the agent to learn an effective way to generalize what actions to take in different states. This effect is compounded if you are trying to model an environment that presents the agent with some kind of incomplete information - such as trying to predict stock market changes looking only at the ticker itself (disregarding news stories, etc.).

One of the biggest problems is the sparsity of rewards: the reward for a certain chain of actions can come long after you’ve completed those actions, so it can take much longer for the agent to realize that it’s a specific chain of actions that leads to a better reward. It can take a long time for an agent to even explore the path that leads to the greatest reward.
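One common way of passing a late reward back to the actions that led up to it is to compute a discounted return for every step of an episode. The sketch below is only an illustration of that bookkeeping: the function name discounted_returns, the five-step episode and the gamma value of 0.5 are assumptions I chose so the numbers are easy to follow.

```python
def discounted_returns(rewards, gamma=0.99):
    """For each step, sum the rewards from that step onward, discounted by gamma."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A sparse episode: four unrewarded actions, then a single reward at the end.
rewards = [0.0, 0.0, 0.0, 0.0, 1.0]
returns = discounted_returns(rewards, gamma=0.5)
# returns == [0.0625, 0.125, 0.25, 0.5, 1.0]
```

Every action in the chain now gets some credit for the final reward, but the earliest actions get very little. That fading signal is part of why long chains with sparse rewards are slow to learn.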

When a reward structure is too sparse for the model to learn properly, or a mistake made by the model during training is too costly, you sometimes need to have an element of human intervention. This is called human-in-the-loop reinforcement learning, and it means that the model in question has its action outputs critiqued by a human as part of the reward function. Often when learning, a model will be unsure of its actions, making it very prone to mistakes. In these cases, it may be prudent to have a human present to give direct feedback and provide the correct course of action. For example, autonomous vehicles are one area where any mistake can be very costly, and a human is already present to give feedback when a vehicle is unsure of the action it should take. If you’re interested in learning more about human-in-the-loop for autonomous vehicles, check out MIT’s human-centered autonomous vehicles.


If you want a quick example of how to get up and running with reinforcement learning, look no further than Andrej Karpathy’s (Tesla’s Head of AI) blog post about it. In it, he gives a great example of how to teach a fully connected network to play Pong against a hardcoded AI in only 130 lines of code. It has a great explanation of each mechanic involved.

If you’re looking for something more complex, I’d recommend looking at AlphaStar, DeepMind’s StarCraft II AI. They tackle some of the hardest problems for a reinforcement learning model, such as working with missing information, long-term planning, and continually discovering new strategies.

Screenshot of AlphaStar, DeepMind’s StarCraft II AI


Reinforcement learning is an amazing tool for automatically discovering a good way of traversing an environment to accomplish a goal. In any space where you have a model that needs to continuously interact with an environment, and you can devise a concrete reward function for it to learn from, reinforcement learning is a very good option. I hope you learned something new about Reinforcement Learning from this article and maybe found a few links that you thought might be worth a deeper look!