Introduction to Reinforcement Learning

Hey, what’s up guys?

Today is the beginning of a series about an interesting field of science that is new for me: Reinforcement Learning.

I started learning this topic because I knew about supervised and unsupervised learning from Andrew Ng's Machine Learning course, and I thought these two techniques were the only available approaches to machine learning (leaving aside Evolutionary and Swarm methods). Later I found that there are other approaches, such as Semi-supervised and Reinforcement Learning, so I decided to study Reinforcement Learning to learn more about it and its algorithms.

In this series I explain what I have learnt about Reinforcement Learning and mention the resources I use during my study, along with my code implementations. The main reasons for this blogging are that I didn't find many blogs talking about this field, that it will motivate me to finish what I want to learn, and that it helps me remember what I studied. You can consider the posts as documentation for each chapter.

Now, let’s get started.

Chapter 1: Introduction to Reinforcement Learning

Before we get into our topic, we need to talk about Supervised and Unsupervised Learning and the differences between these two techniques, as they are important for defining what Reinforcement Learning means.

In supervised learning, we have training data (it could be images or numerical values) that consists of labeled training examples obtained beforehand. The training data is used to train a classifier, which could be any known classifier such as SVM or KNN, or a Neural Net. The classifier uses this labeled data to help identify and classify new data, known as test data. The main factor in classification is extracting the object's features, known as the feature vector, which are considered to be the descriptors of the object.

In unsupervised learning, you don't have labeled training data. You are given a set of unlabeled data, and the task is to cluster it into separate groups based on the features of the given items. The best-known clustering algorithm is K-means and its variants, and the famous Hidden Markov Models (HMM) are also used in unsupervised learning.

Reinforcement learning, on the other hand, is simply learning what to do to get the best benefit in the current situation: the agent learns from its environment over several trials. In supervised learning, training data is a must for classification, and that isn't practical in some interactive problems in which the agent must decide what to do at run time in the current environment. The solution is for the agent to learn from its own experience instead of from training data, and that is what reinforcement learning introduces.

As with supervised and unsupervised learning, reinforcement learning has some basic characteristics that define and describe its process. We can summarize them in 3 points:

  • Trials: The agent makes many trials in the same environment to learn and predict the opponent's behavior (as in a chess game)
  • Error: After each situation or state in the environment, the agent calculates the error in its search so that it can minimize it in a similar situation
  • Reward: The benefit that the agent obtains after performing an action

Also, an agent that uses reinforcement learning has two important options when choosing its action or prediction:

  • Exploitation: The ability to use a previous successful experience in the current situation
  • Exploration: The ability to explore other solutions or paths that could lead to a better reward
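
To make these two choices concrete, here is a minimal sketch of one commonly used rule, usually called epsilon-greedy: with a small probability the agent explores, otherwise it exploits. The function name `choose_action`, the `epsilon` parameter, and the reward estimates are my own assumptions for illustration, not something taken from a specific resource.

```python
import random

def choose_action(estimates, epsilon, rng=random):
    """Pick an action index: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        # Exploration: try any action, even one that currently looks worse.
        return rng.randrange(len(estimates))
    # Exploitation: reuse the action with the best estimate so far.
    return max(range(len(estimates)), key=lambda a: estimates[a])

# Hypothetical reward estimates for three possible actions.
estimates = [0.2, 0.8, 0.5]

# With epsilon = 0 the agent never explores, so it always picks action 1.
print(choose_action(estimates, epsilon=0.0))  # prints 1
```

A larger `epsilon` makes the agent try unfamiliar actions more often, and a smaller one makes it stick to what has worked before.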

When we build an agent that uses reinforcement learning to perform some actions, we first need to define some elements that the agent uses to learn in the environment. Scientists have identified four main elements of the learning system:

  • Policy
  • Reward Function
  • Value Function
  • Model (optional)

We will talk about each of these elements, using Tic-Tac-Toe as an example to understand them fully.

Policy defines the agent's behavior in the environment in a specific state. In Tic-Tac-Toe, the agent plays 'X' on each of its turns, and the policy is the rule that decides where to place it for a given board. To apply this policy, the agent needs an estimate of the winning probability for each move.

Reward Function defines the benefit or desirability of the next state only, right after the agent's move: the higher the reward, the better the move. In Tic-Tac-Toe, when the agent plays 'X', it estimates the reward that will be obtained in the resulting state. In other words, a higher reward means a good move, and a lower reward indicates a bad move.

Value Function defines the total reward over the long run across all the states that follow, which differs from the Reward Function, which focuses on the next state only. From that we can conclude that the Value Function is more important than the Reward Function.

Model is an optional element when we build an agent, and it only appears in planning, which is used in games. The model predicts the next state of the environment given the current state and the action that the agent will perform. This simulation or prediction affects the agent's decision-making process.
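
As a rough sketch of what such a model could look like for Tic-Tac-Toe, the function below takes the current board and a move and returns the predicted next board. The board representation (a tuple of 9 cells) and the names `predict_next_state` and `legal_moves` are my own assumptions for illustration.

```python
def predict_next_state(board, move, mark):
    """Model: return the board that results from placing `mark` at index `move`."""
    if board[move] != " ":
        raise ValueError("cell already taken")
    # Tuples are immutable, so the current board is left untouched.
    return board[:move] + (mark,) + board[move + 1:]

def legal_moves(board):
    """Return the indices of the cells that are still empty."""
    return [i for i, cell in enumerate(board) if cell == " "]

empty_board = (" ",) * 9

# The agent can "imagine" every next state before it actually plays.
for move in legal_moves(empty_board):
    predicted = predict_next_state(empty_board, move, "X")
    print(move, predicted)
```

Because the model only predicts and never changes the real board, the agent can simulate as many moves as it wants before deciding.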

Note: A good agent may accept a lower reward in a specific state if that state leads to another state with a higher reward, which could give a better value overall.
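
The difference between looking at the next reward only and looking at the total in the long run can be shown with a small calculation. This is only a sketch: the two reward sequences are made-up numbers, and the `discount` parameter (which shrinks the weight of later rewards) is an assumption of mine that is left at 1.0 here, so the result is a plain sum.

```python
def total_reward(rewards, discount=1.0):
    """Add up a sequence of rewards, optionally shrinking later ones by `discount`."""
    total = 0.0
    for step, reward in enumerate(rewards):
        total += (discount ** step) * reward
    return total

# Hypothetical rewards collected along two different paths.
greedy_path = [5, 0, 0]    # best immediate reward, nothing afterwards
patient_path = [1, 2, 10]  # lower immediate reward, better states later

print(total_reward(greedy_path))   # prints 5.0
print(total_reward(patient_path))  # prints 13.0
```

Judged by the first reward alone, the first path looks better (5 against 1), but judged by the total, the second path wins, which is the situation the note describes.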

The following diagram shows an agent in Tic-Tac-Toe that uses reinforcement learning to choose its next moves:


  • The opponent started the game at point 'a', moving the game to state 'b'
  • In state 'b', the agent checked all possible next states, which are shown as dashed lines, then selected one of them according to its policy, shown as state 'c'
  • The opponent played its move, turning the game to state ‘d’
  • In state 'd', the agent made an exploratory move that led to state 'e*', but it found that this was not a good state, so the move did not result in any learning
  • The agent backtracked and chose another state ‘e’
  • The curved lines denote a backup for the state. The backup is there to get a better winning estimate after each move, so it is an option for the agent
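
One common way to perform such a backup, and this is my assumption rather than something the list above states, is to move the winning estimate of the earlier state a small step toward the estimate of the later state. In the sketch below the name `backup`, the `step_size` parameter, the default estimate of 0.5, and the state labels are all hypothetical.

```python
def backup(values, earlier_state, later_state, step_size=0.1, default=0.5):
    """Move the estimate of the earlier state a small step toward the later one."""
    v_earlier = values.get(earlier_state, default)
    v_later = values.get(later_state, default)
    values[earlier_state] = v_earlier + step_size * (v_later - v_earlier)
    return values[earlier_state]

# Hypothetical winning estimates: the later state looks more promising.
values = {"earlier": 0.5, "later": 0.9}

updated = backup(values, "earlier", "later")
print(updated)  # slightly above 0.5, because the later state looked better
```

Repeating this after each move lets good or bad outcomes gradually flow back to the states that led to them.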


In the next chapter, we will talk about one application of reinforcement learning.