Feedback Sequence-to-Sequence Model – Gonna Reverse Them All!

This tutorial assumes that you have a pretty good understanding about the basics of Recursive Neural Networks and Backpropagation Through Time (BPTT) and how these models actually work


One-to-one: Problems that are concerned into getting a direct relation between an input word and output word. For example, the relation (like, love) is considered to be a semantic-level relation which represents that the 2 words have somehow the same meaning. The solution to this problem is building a model that takes an input word and produce an output word so that the model learns the nature of the relationship between the 2 words.

One-to-many:  Problems that are concerned into getting a direct relation between an input word and several words (context). The relation (vehicle, [car, bike, boat]) is a hypernym relation.

Many-to-one: Problems that are concerned into getting a relation between several words (context) and  a word (entity, class, ..etc). The relation([bike, car, boat], vehicle) is a hyponym relation. Also, Sentiment Analysis is considered to be a many-to-one problem because given a sentence which consists of several words, the output shall be one word that describes the polarity (negative, positive, neutral) of this sentence.

Many-to-many: Problems that are concerned into getting a relation between several words (context) and other several words (context). The relation (My name is Ahmed, Je m’appelle Ahmed) is a English-to-French translation relation.

Word Embedding: A technique to represent the words by fixed-size vectors, so that the words which have similar or close meaning have close vectors (i.e vectors that their euclidean distance is small).

One-hot Encoding: A naive word representation technique. Non-semantic representation approach. For each word in the corpus, we represent it with a sparse vector. The size of the vector is the vocabulary size. Each word is uniquely identified by setting “1” in a unique index, so that the words have different vectors.


At first, let’s define our dataset that we will go through on this tutorial. As mentioned, we want to apply a model to learn string reversing problem. So, our dataset should be consists of the characters that would be used to make the strings. For simplicity, we will assume that the characters size are only 3, which are ‘a’, ‘b’ and ‘c’. All strings that the model can handle should be only contains these characters.

We need to define the word representation for our corpus. For simplicity too, we will use One-hot encoding technique for the characters representation. Actually, we don’t have to use Embedding for such task as we don’t care about the semantic level of the generated words. We only care about reversing the characters, and the reversing doesn’t need to understand the words semantics.

So, our representation would be

[1, 0, 0] = ‘a’
[0, 1, 0] = ‘b’
[0, 0, 1] = ‘c’

So, if we have the string “abc”, the vector representation of would be [1, 0, 0, 0, 1, 0, 0, 0, 1].

Encoder-Decoder Architecture 

Any sequence-to-sequence (seq2seq for shorten) model consists of 2 components; an encoder and a decoder. Both of these components are regularly a type of Recursive Neural Network, it could be a simple Recurrent Neural Network (RNN) or Long short-term Memory (LSTM). They do the same work, but the number of trainable parameters differs from a network to the other.

The encoder tries to get a vector representation for the given context, for each input, it calculates the hidden layer state, and this state is changed for every new input word. So, theoretically, the final vector should contain all the information of the input encoded as a fixed-size vector. This vector is called “Thought Vector”.

After finishing the encoding phase, the decoder takes the thought vector and use it as the initial state. The decoder continues to produce output words using the Softmax layer, till whether it produces a specific token that is determined by the user, something like “END” which tells the decoder to stop produce output or by setting the maximum length of the output words.

We will show how the encoding and decoding components work in the string reversing problem.

Note that, There are multiple different styles of Sequence-to-Sequence models. As far as I know, there exist 4 different styles. The difference always in the decoder component and its input criteria. On this post, I illustrate the Feedback Encoder-Decoder model, in which the output of the decoder in a time step becomes an input to it in the next time step..

The Encoder

(Click on it to see in a better resolution)

At first, lets define our model parameters. From the above figure, we have the following

Batch size = 1
Input shape = 1 x 3
Hidden layer size (S(t)) = 5
Input-to-hidden weights (Wh) shape = 3 x 5
Hidden-to-hidden weights (Ws) shape = 5 x 5
Hidden layer activation function: Rectified Linear Unit (ReLU): f(x) = max(0, x);

We need to initialize the weights of our model. Regularly, the weights are initialized whether in Gaussian or Uniform Distributions. But for simplicity, we will just initialize it by the following

Wh =

Ws =

As we mentioned, the target of the Encoder is getting a vector representation for the given context, so, for each time step, we feed the network with 1 character after transforming it as a vector of real-number values by the One-hot encoding. We can conclude from that is that for each character we will update the hidden layer state will we reach the end of characters

To calculate the values of the hidden layer we use S(t) = x(t) * Wh + S(t – 1) * Ws

Now, lets do the math for each time step of the string ‘abc’

S(1) = x(1) * Wh + S(0) * Ws

S(1) = [0.1, 0.2, 0.3, 0.4, 0.5]

Note that as there is no input before the first character, we initialize the hidden layer state by a vector of zeros. You can think of it as the network still has no memory about what it learns because there was no previous input that is fed to the network before the first character. So, Ws * S(0) would be a vector of zeros.


S(2) = x(2) * Wh + S(1) * Ws

S(2) = [0.98, 1.28, 1.65, 1.74, 0.58]


S(3) = x(3) * Wh + S(2) * Ws

= [1.74, 2.16, 4.56, 4.62, 3.29]

Now, we have successfully finished the encoding process. The encoder’s thought vector is [1.74, 2.16, 4.56, 4.62, 3.29] which represents the string ‘abc’ that we want to reverse.

Note that, as the weights values are all positive values, using ReLU won’t have an effect on the values as they are all positive numbers greater than 0.

This is considered to be a mapping problem, so, we could just use Multi-layer Perceptron (MLP) to solve it as long as we define the maximum length of characters per string. What I want to say is that this problem doesn’t show the real power of seq2seq models. The target was only show how this model works and how the encoder and decoder cooperate to produce the output.

The Decoder

(Click on it to see in a better resolution)

We have successfully encoded the input sequence to a fixed-size vector representation. Now, we shall begin the decoding process. In the above figure, you see an RNN unit which has a similar structure of the encoder. The decoder has 2 different properties of the encoder. The first one is the initial state. The initial state of the decoder is considered to be the final state of the encoder. In other words, we take the vector representation of the input context and use it as a start state of the decoder. This seems logical, because we want to tell the decoder the last state of the encoder to be able to produce a character for each time step.

The second important difference is that the input of the decoder at time (t + 1) is the output at time (t). The decoder uses its previous output to produce the output. This also seems logical. Recalling Recurrent Neural Network Language Model (RNN-LM), we use the previous words to be able to produce the current words. This is the same case here.

As we did before, we have the following

Input shape = 1 x 3
Hidden layer size (_S(t)) = 5
Input-to-hidden weights (_Wh) shape = 3 x 5
Hidden-to-hidden weights (_Ws) shape = 5 x 5
Hidden layer activation function: Rectified Linear Unit (ReLU): f(x) = max(0, x);
Hidden-to-output weights (_Wo) shape = 5 x 3
Output layer activation function: Softmax: 

For simplicity, we will initialize the model weights by the same encoder’s values., in addition to the _Wo values

_Wh =

_Ws =

_W0 =

Now, lets do some math here too.

First, we start by empty character. Because there is no previous output. So, we only will depend on the encoder last hidden state and its weight.

_S(1) = x(1) * _Wh + S(0) * _Ws

_S(1) = [3.83, 6.79, 9.66, 9.58, 6.31]

Now, we have the decoder’s hidden layer values. To get the output values, we multiply the hidden layer values with the output weights.

To know the output character, we transform the output vector to probabilities using the Softmax activation function.

After applying the Softmax, the vector will be approx. O = [0.03, 0.15, 0.81] 

To know the actual character, we take the index of the maximum value, which is 0.81 in this case, which corresponds to the vector [0, 0, 1] which is the representation of character ‘c’.

We use this character as the new input to the decoder.

_S(2) = x(2) * _Wh + S(1) * _Ws

_S(1) = [9.44, 16.05, 22.68, 22.78, 14.41]

Multiplying the hidden values with the output weights

Applying Softmax will get the character ‘c’

_S(3) = x(3) * _Wh + S(2) * _Ws

_S(3) =
[21.91, 37.61, 52.35, 52.62, 32.80]

Multiplying the hidden values with the output weights

Applying Softmax will get the character ‘c’.

So, the output of the decoder is the sequence ‘ccc’ which is not the reverse of the  original string ‘abc’. This is expected as we haven’t trained the model yet and the weights haven’t updated yet.

After obtaining the sequence, we get the error and backpropagate that error to update the weights using Gradient Descent optimization algorithm.


– The number of trainable parameters of the above model are 108 calculated by the following

Encoder (45):
1) Input-to-hidden: 5 * 3 + 5(bias)
2) Hidden-t0-hidden: 5 * 5

Decoder (63):
1) Input-to-hidden: 5 * 3 + 5(bias)
2) Hidden-t0-hidden: 5 * 5
3) Hidden-to-output: 5 * 3 + 3(bias)

– The length of the input doesn’t necessary be the same of the output length. In that case, the decoder will continue produce output tokens till it fires a specific token that is determined by the model creator

– The above model could be replaced by Auto-encoder to get the input sequence representation as long as we know the number of input words or tokens. And the bottle-neck features would be the thought vector of the input. Then we use it in the decoder as its initial state

You know what? I am in deep love with sequence-to-sequence models. At the past, I always felt stupid while working on any sequence to sequence problem. Why? BECAUSE I NEVER FULLY UNDERSTOOD HOW IT WORKS. This is so complicated. I loved something that I was bad with, not only bad, but I felt stupid the most when working on any related problem to it.

As we are humans, we always try to correct and fix our relationships with things that we aren’t good with as long as we love or has an interest in it. So, I tried to make a fix with these models so that I can understand how they work, and also that would help me in different problems in Natural Language Processing.

That’s why I took the decision to make an intensive study and research to fully understand it and the other seq2seq different model styles.

On this post, I illustrated one of the four models of seq2seq. I will try to make a post for each of the remaining models.




Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s