The figures are taken from this great blog post by Christopher Olah
Recurrent Neural Networks
Recurrent Neural Networks (RNNs) are a type of Neural Network (NN) commonly used for problems that depend on sequential data. In sequential data, we assume that the data points depend on each other. For example, to predict any word in a sentence, we need to remember the previous words, because the words of a sentence are naturally related to each other in grammar and part of speech (POS).
Traditionally, in regular Multi-Layer Perceptron (MLP) neural networks, we assume that the data points are independent of each other, which is wrong for data like text or sound.
RNNs have the capability to “memorize” previous data because they contain self-loops. An RNN saves its state and uses it together with the new data. This helps the network capture the dependencies between the data points and take them into consideration when predicting the output.
The following is a single unit of an RNN, A. It has an input x and an output h.
As you may have noticed, it is very similar to a traditional neuron in an MLP, but it differs in the self-loop at unit A. The self-loop holds the previous state of the neuron and feeds it back along with the new input.
We can unroll (unfold) the model shown in the figure. You can see that it isn't much different from the traditional MLP model.
Through time, the self-loop values store the experience gained from the previous data and use it with the new input to obtain the predicted output. This is what allows the network to memorize the dependencies between the data points.
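To make the unrolled picture concrete, here is a minimal NumPy sketch of a single RNN step; the names (rnn_step, W_xh, W_hh, b_h) and the toy dimensions are illustrative assumptions, not part of the original post:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One unrolled RNN step: mix the new input with the previous state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Toy dimensions: 4-dimensional input, 3-dimensional hidden state.
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(3, 4))
W_hh = rng.normal(size=(3, 3))
b_h = np.zeros(3)

h = np.zeros(3)                       # initial state
for x_t in rng.normal(size=(5, 4)):   # a sequence of 5 inputs
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)  # the same state is reused at every step
```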
Long-Term Dependency Problems with Regular RNNs
As you may have noticed, there are no restrictions on saving the memory of previous data: the network keeps storing previous data without identifying whether this memory will be helpful in later iterations or not. In practice, however, this isn't always the right behavior for real sequential data.
Assume that we have a sentence like this: “I live in France, I like playing football with my friends and going to school, I speak French.”
Assume that we want to predict the word “French”. We don't need to look at the previous two clauses, “I like playing football with my friends” and “going to school”; we only need to know that “I live in France”, and ignore the unnecessary context that may confuse the network while training it.
In practice, a regular RNN can't connect related information and dependencies together, especially when the information contains noise that distracts from the actual target.
This problem is the main reason that pushed researchers to develop a new variant of the RNN model, called Long Short-Term Memory (LSTM). LSTM can solve this problem because it controls the “memorizing” process within its units using gates.
What is LSTM?
LSTM is a kind of RNN that has had a revolutionary impact on many fields of computer science, such as Natural Language Processing and Speech Recognition, because of its ability to avoid the problem of “long-term dependencies”.
Like any other RNN, LSTM is built around the same idea of self-loops. But LSTM stands out from other RNNs in that each unit (neuron) contains several layers and gates, which can be configured by the user. Each of these layers and gates controls the output and the state of the neuron.
Normally, when humans read a paragraph or a topic, they can easily extract the dependencies between the sentences that make up the text. In stories and novels, you can connect the events that happen in the story, extract the most important ones, and link them together in order to understand the ending.
Human brains can identify and memorize the importance of, and dependencies between, the words of a sentence, and determine their POS tags. If you see a subject at the beginning of a sentence, your brain will most likely predict that the next word has a good chance of being a verb (VP) or a noun phrase (NP) that describes the subject, because you memorize that the previous word is a subject, and you don't need to look at the context before the subject: you have determined that the subject is the beginning of a new context for predicting the POS of the next word.
This is how LSTMs work: they simulate this ability of our brains to connect the important or related objects together and forget the unnecessary objects in the context.
LSTM Unit Structure
This is the standard structure of an LSTM unit. It contains:
- 2 inputs (the previous state C_t-1 and the previous output h_t-1)
- 4 layers (3 sigmoid activations and 1 tanh activation)
- 5 pointwise operators (3 multiplications, 1 addition and 1 tanh)
- 2 outputs (the current state C_t and the current output h_t)
The most important thing in an LSTM is the state. The state represents the information stored in the unit since training began. We control the memorizing process by updating this state: if we want to forget the previous state, we set it to 0; if we want to control the amount of memorized information, we update its values during the training process. Next, we will discuss how the output and the state update are computed.
The user can change this structure according to the problem they want to solve. This is just an example of a standard LSTM unit.
Detailed Explanation of LSTM Processing
We can divide the internal processing within the unit into 3 groups. Each of these groups applies layers and pointwise operators to produce the current state and the output.
Group 1.1: Input with the previous output for forgetting
This sigmoid activation layer is called the “forget gate layer”, because it decides whether to forget the previous state, in which case the activation output is 0 for each element of the state vector, or to keep the previous state, in which case the element values are greater than 0 and less than or equal to 1.
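In the standard LSTM formulation (the symbols W_f, b_f and x_t follow the usual convention and are not defined in the figures above), the forget gate is computed from the concatenation of the previous output and the current input:

$$ f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) $$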
Group 1.2: Forget gate layer with the previous state
To decide what we need to forget from the previous state, we multiply the output of the “forget gate layer” with the previous state (element-by-element multiplication). If this produces a vector full of zeros, it means we want to forget all of the previous memories and start a new memory from the current input. This goes as follows.
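Written as a formula, this step keeps only the parts of the previous state that the forget gate lets through (⊙ denotes element-wise multiplication):

$$ f_t \odot C_{t-1} $$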
Group 2.1: Input with previous output for the new information
First, we need to add the new input that will be used to update the state. This sigmoid activation layer is called the “input gate layer”, and it decides which values of the state vector will be updated.
Second, we need to generate a new state, called the candidate state, that will be added to the previous state (which could be full of 0s, or of other values, depending on the output of group 1.1).
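In the same standard notation, the input gate and the candidate state are computed as:

$$ i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) $$
$$ \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) $$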
Group 2.2: Scaling the candidate state with the input gate
We multiply the newly generated candidate state by the output of the input gate layer. This is like a scaling process: we adjust the values of the candidate state by the required update factors to get the new information, which will then be added to the previous state to form the whole state.
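As a formula, the scaled new information is the element-wise product of the input gate output and the candidate state:

$$ i_t \odot \tilde{C}_t $$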
Adding groups 1 and 2 to get the current new state
To generate the new current state, we add the newly generated (scaled) state to the previous state. This state will be fed to the next step (the current unit processing the next input).
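Combining groups 1 and 2, the current state is:

$$ C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t $$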
Group 3.1: Getting the unit output
When we produce the output, we need to use the state and the input to help the next step decide what to use when producing its own state.
First, we take the weighted sum of the concatenation of the previous output and the input, then apply the sigmoid function. This decides which parts of the state we need to output.
Second, we apply the tanh operator to make sure that the values of the current state lie between -1 and 1.
Finally, we get the unit output, which carries some parts of the state and the input.
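In the same notation, the output gate and the unit output are:

$$ o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) $$
$$ h_t = o_t \odot \tanh(C_t) $$

Putting the three groups together, here is a minimal NumPy sketch of one LSTM step. The weight layout (a single stacked matrix W for the four layers) and the toy dimensions are illustrative assumptions, not part of the original post:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps the concatenated [h_prev, x_t] to the stacked
    pre-activations of the four layers: forget, input, candidate, output."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    H = h_prev.shape[0]
    f_t = sigmoid(z[0:H])              # forget gate (group 1.1)
    i_t = sigmoid(z[H:2 * H])          # input gate (group 2.1)
    c_tilde = np.tanh(z[2 * H:3 * H])  # candidate state (group 2.1)
    o_t = sigmoid(z[3 * H:4 * H])      # output gate (group 3.1)
    c_t = f_t * c_prev + i_t * c_tilde # groups 1.2 + 2.2: current state
    h_t = o_t * np.tanh(c_t)           # group 3.1: unit output
    return h_t, c_t

# Toy dimensions: 4-dimensional input, 3-dimensional state/output.
rng = np.random.default_rng(0)
H, X = 3, 4
W = rng.normal(size=(4 * H, H + X))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(5, X)):   # a sequence of 5 inputs
    h, c = lstm_step(x_t, h, c, W, b)
```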
Nowadays, LSTMs are used in a wide range of fields in computer science, and in Machine Learning specifically. They have proved themselves in hot research topics such as Language Modeling, Sentiment Analysis, Speech Recognition, Text Summarization and Question Answering.