Explain it to me like a 5-year-old: Deep Sequence Modeling

Introduction to Recurrent Neural Networks: Part 1/2

Ameya Shanbhag
10 min read · Mar 9, 2021

I am a Product Manager with some background in Deep Learning and Data Science. Deep Learning is daunting and I am trying to make it as intuitive as possible. Feel free to reach out to me or comment if you think any of the content below needs correction; I am open to feedback.

My intention here is to make understanding Deep Learning easy and fun instead of mathematical and comprehensive. I haven’t explained everything in detail, but I am sure by the end of the blog, you will be able to understand how a Neural Network works and you will NEVER FORGET IT!!! ;)

Keep an eye out for the intuition sections, which provide real-world analogies for deep learning.

Feel free to connect with me on LinkedIn!

Why do we need Deep Sequence Modeling?

What we studied earlier was how a neural network makes predictions using the feed-forward and backpropagation mechanisms. If you notice, we had taken an example of using multiple factors like location, # of bedrooms, total area, pet-friendly, distance from work etc. to predict the price of a house. In this example, we are calculating the price of the house AS OF ONE PARTICULAR TIMESTAMP — namely, our input would be something like (New York, 3, 1200, Yes, 20mins) and our output would be ($4000000). As you can observe, the inputs are as of one particular timestamp.

Another example: suppose I give you a picture of a ball and ask where it will go next. It could go in any direction. But what if I give you the trajectory of the ball and ask where it will go next? Your brain will use the ball’s past trajectory and give you the intuition that the ball will go to the right (Fig 1.1)

Fig 1.1 MIT 6.S191

Deep Sequence models are used when your model has to remember information across different timestamps, understand the relationships between them, and then predict the next best move.
A few applications of Deep Sequence Models:

  • Stock Prediction
  • Medical diagnostics
  • Climate change
  • Autonomous driving etc.

Types of Deep Sequence Models:

Fig 1.2 MIT 6.S191
  1. Many to One:
    A model where your network is fed multiple inputs and is expected to output just one value, e.g. sentiment classification (you feed your model a continuous string of words and expect it to output one sentiment)
  2. One to Many:
    A model where your network is fed only one input and is expected to output values of variable lengths e.g. image captioning (you give your model one image and your model will output a description of what the image is about)
  3. Many to Many:
    A model where your network is fed multiple inputs and is expected to output values of variable length e.g. machine translation (you give your model an English sentence and your model will translate it to French)
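The three patterns above differ only in the shapes of their inputs and outputs. Here is a minimal sketch with made-up dimensions (10-word sentences, 64-dimensional word vectors, and so on are arbitrary choices for illustration):

```python
import numpy as np

# Hypothetical shapes illustrating the three patterns (batch dimension omitted).

# Many to One: a 10-word sentence (each word a 64-dim vector) -> 1 sentiment score
many_to_one_in = np.zeros((10, 64))   # sequence of 10 inputs
many_to_one_out = np.zeros((1,))      # single output

# One to Many: 1 image vector -> a caption of, say, 7 words
one_to_many_in = np.zeros((64,))
one_to_many_out = np.zeros((7, 64))

# Many to Many: an 8-word English sentence -> a 9-word French sentence
many_to_many_in = np.zeros((8, 64))
many_to_many_out = np.zeros((9, 64))
```

Notice that in the last two cases the output length need not match the input length; that is what "variable length" means here.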

Designing Sequence Models:

To design a sequence model, we need to:

  1. Handle variable-length sequences
  2. Track long-term dependencies
  3. Maintain information about the order
  4. Share parameters across the sequence

Recurrent Neural Networks meet these sequence modeling design criteria

Recurrent Neural Networks:

Fig 2.1 MIT 6.S191

As daunting as the above diagram might look, it is actually quite simple, and it is all you need to understand how RNNs work. Let me break it down..

Step 1: Basic Neural Network: We designed a basic neural network looking exactly like the one shown in step 1 in our previous post here — you can consider the inputs (x1,x2,x3) as (location, sq. ft and house type) and outputs (y1, y2, y3, y4) as (price, safety status, appraisal value, # of people it can accommodate)

Step 2: Simplifying basic NN: For simplicity purposes, let's combine all the inputs (x1,x2,x3..) into one variable x(t), we can call it input at a timestamp t. Let's remove the complexity of drawing the hidden layers by substituting them with one green box and let's combine all the outputs to one variable y(t), we can call it output at the timestamp t. All we are doing here is just combining inputs and outputs so that it looks simple.

Step 3: Turn it upside down: Again, we still have our simplified NN (neural network) from Step 2; here we are only turning it upside down.

Step 4: Multiple NNs: This is the fun part. Consider multiple NNs from Step 3 stacked side-by-side and joined by one small string — that’s it, you have designed an RNN. As the name suggests, there is a recurrent flow of NNs that are joined by a string, what we call a “state” (h). In the diagram shown above, consider y0 and y1 as intermediate outputs at a particular timestamp t and y2 as the final output. RNNs can also be depicted by the figure to the left of the vertical line in Step 4.
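The "NNs joined by a string" picture can be sketched in a few lines of NumPy. This is a toy illustration, not a trained model: the dimensions and weights below are made up, and the key point is that the same weights are reused at every timestamp while only the state h changes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (arbitrary, for illustration)
input_dim, hidden_dim = 3, 4
W_xh = rng.normal(size=(hidden_dim, input_dim)) * 0.1   # input -> hidden weights
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1  # hidden -> hidden weights

xs = [rng.normal(size=input_dim) for _ in range(5)]  # inputs at 5 timestamps
h = np.zeros(hidden_dim)                             # the "baton": starts empty

for x_t in xs:
    # The same weights are reused at every step; only the state h is updated
    h = np.tanh(W_xh @ x_t + W_hh @ h)

print(h.shape)  # (4,)
```

After the loop, h carries information from all five timestamps, just like the baton carried through every leg of the relay.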

Intuition: Consider a relay race where there is usually a team of 4–5 people, each standing at a different checkpoint. The first person starts the race carrying a baton, passes it to the next team member when they meet at a checkpoint, and the team ultimately completes the race with the baton.
In the case of RNN, you can consider

  • The participants as the neural networks
  • Baton that is being passed as the state (h)
  • Stopping of one player and starting of another player after baton is passed as intermediate output(y0,y1)
  • Completing the race as predicting output (y2)

Easy Mathematics behind RNN:

I know that I emphasized not using equations but the only reason I am including equations here is that they are easy and we have already developed the intuition that we need to come up with these equations — I promise all the equations are intuitive. Let me explain!

Fig 2.2 MIT 6.S191

This is the same figure as Fig 2.1 but just with more notations:

  • Input vector x(t)
    It's a normal vector that consists of your inputs — nothing fancy here
  • Weights
    W(xh) -> Weights that are used to transform input in a way that is consumable by the hidden state
    W(hh) -> Weights that define the relationship between the previous hidden state and the current hidden state
    W(hy) -> Weights that are used to transform the hidden state output to a prediction-based output.
  • Hidden State h(t)
    You remember the example of the baton being passed from player to player in a relay race — that is what the hidden state does. As you move from one NN to another NN, the hidden state captures all the information from the current NN and passes it along so that the next NN is aware of what the context is when predicting something. Equation #2 in the diagram says that the hidden state (h(t)) is a parameterized function of the input to that NN (x(t)) and previous hidden state (h(t-1)) — here the input helps in understanding what to predict and the previous hidden state gives your NN some context regarding whatever is happening in your model. Parameterized just means that they are defined using the weight matrices (explained above).
    The function can be any NON-LINEAR function — this is the same concept we talked about here on the makeup applied to your decision. In the above diagram, the non-linear function used is tanh, also called the hyperbolic tangent function.
    Note: The same function and set of parameters are used at every time step. RNNs have a state h(t), that is updated at each time step as a sequence is processed
  • Output Vector y(t)
    The output vector simply uses the current hidden state (h(t)) and W(hy), which transforms the hidden state into predicted values, as shown in equation #1.
  • Loss function (L, L0, L1, L2 etc..)
    As with any model, you need to minimize the loss in order for your model to perform better which basically means learning from your mistakes. The L0, L1, L2, etc. that you see in the diagram are the losses calculated at each stage and summed up to calculate L and then our task is to minimize that final loss function using backpropagation
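Putting equations #1 and #2 and the summed loss together, here is a NumPy sketch of one forward pass. All dimensions and targets are made up, and squared error stands in for whatever loss a real task would use:

```python
import numpy as np

rng = np.random.default_rng(42)

input_dim, hidden_dim, output_dim, T = 3, 4, 2, 5

# Parameters: the SAME weights are shared across every time step
W_xh = rng.normal(size=(hidden_dim, input_dim)) * 0.1
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
W_hy = rng.normal(size=(output_dim, hidden_dim)) * 0.1

xs = rng.normal(size=(T, input_dim))        # inputs x(0)..x(4)
targets = rng.normal(size=(T, output_dim))  # made-up targets for illustration

h = np.zeros(hidden_dim)
total_loss = 0.0
for t in range(T):
    h = np.tanh(W_xh @ xs[t] + W_hh @ h)        # equation #2: hidden state update
    y = W_hy @ h                                # equation #1: output
    total_loss += np.sum((y - targets[t])**2)   # per-step loss L_t (squared error)

print(total_loss)  # L = L0 + L1 + ... + L4
```

Backpropagation would then push this summed loss back through every time step to update W_xh, W_hh, and W_hy.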

Why do I keep hearing about encoding my input?

Till now we discussed what happens AFTER we feed input to the models, but have you thought about how a model ingests the input? We all know that machines do not understand language; they understand numbers. For example, even if I feed my model the string “This morning I took my cat for a” and the model outputs “walk”, beneath that lovely sentence is a mixture of numbers.

Fig 2.3 MIT 6.S191

That’s exactly what encoding is. Encoding is the process of converting a string into a sequence of numbers understandable by the machine, and decoding is the process of converting the numbers back into a human-readable string.

How do we do that? It is easy:

  • Step 1: We first take a set of unique words
  • Step 2: Allocate each word with a unique number
  • Step 3: Represent our input in vector form using Steps 1 and 2 (see figure below) → This step is basically called embedding
Fig 2.4 MIT 6.S191

As you can see, the diagram above shows two types of embedding: One-hot encoding and Learned embedding (just different ways to represent your inputs in a vectorized form)
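The three encoding steps above can be sketched in a few lines of Python. The one-hot version looks like this (using the cat sentence from earlier):

```python
import numpy as np

sentence = "this morning i took my cat for a".split()

# Steps 1 & 2: take the set of unique words and map each to a unique index
vocab = {word: i for i, word in enumerate(sorted(set(sentence)))}

# Step 3: one-hot embedding -- a vector of zeros with a 1 at the word's index
def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

encoded = np.stack([one_hot(w) for w in sentence])
print(encoded.shape)  # (8, 8): 8 words, vocabulary of 8 unique words
```

A learned embedding replaces the hand-built one-hot vectors with dense vectors that the network itself learns during training, but the role is the same: turning words into numbers.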

Backpropagation again?

Yes, but the process is exactly the same as the previous one here. One easy way to remember what backpropagation does is to remember why we retrospect as humans. For example, after an interview, we sit back and retrospect on how things might have gone differently if we had answered in a different way, understanding where we fell short and learning from it. Just apply the same concept here: backpropagation is the process by which a model learns from its mistakes (by updating weights) and tries to minimize the loss. The only difference, in this case, is that here the backpropagation is THROUGH TIME, meaning the loss is backpropagated to each model and also across the models because it is a SEQUENCE model.

Fig 2.5 MIT 6.S191

Problems with RNN? Yay!!

Holistically, the backpropagation algorithm works by calculating the gradient (i.e. the derivative of the final loss function w.r.t. each parameter) and then shifting the parameters in order to minimize the loss.

Fig 2.6 MIT 6.S191

Below is the simplified version of Fig 2.6:

Fig 2.7 MIT 6.S191

Imagine calculating the gradient (the derivative of the loss function) w.r.t. h0 -> this will involve many factors of W(hh) and repeated gradient computation at each neural network
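You can see the effect of those repeated W(hh) factors with a tiny experiment: multiplying a vector by the same matrix many times makes its norm grow or shrink exponentially, depending on whether the matrix’s eigenvalues are above or below 1. The matrix and step count below are made up for illustration:

```python
import numpy as np

def norm_after_steps(scale, steps=50):
    """Repeatedly multiply by the same matrix, as backprop through time
    does with W(hh), and return the resulting vector's norm."""
    W = np.eye(4) * scale   # toy W(hh) whose eigenvalues all equal `scale`
    g = np.ones(4)          # stand-in for a gradient vector
    for _ in range(steps):
        g = W @ g
    return np.linalg.norm(g)

print(norm_after_steps(1.1))  # eigenvalues > 1: the norm explodes
print(norm_after_steps(0.9))  # eigenvalues < 1: the norm vanishes toward zero
```

Fifty factors of 1.1 or 0.9 are enough to blow the gradient up by two orders of magnitude or shrink it to nearly nothing, which is exactly the exploding/vanishing gradient problem described below.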

Intuition: Remember playing a party game in which one person whispers a message to the person next to them, and the message is then passed progressively to several others, with inaccuracies accumulating as the game goes on. The point of the game is the amusement obtained from the last player’s announcement of the message they heard, typically nothing like the original. That is exactly what happens with RNNs: calculating gradients while propagating backward becomes difficult because there is a chance of losing information as we go backward

Challenges faced by RNNs:

  1. Exploding gradients: When many gradient values are >1
    Occurs when large error gradients accumulate and result in very large updates to neural network model weights during training. Gradients are used during training to update the network weights and it works best when these updates are small and controlled. When the magnitudes of the gradients accumulate, an unstable network is likely to occur, which can cause poor prediction results or even a model that reports nothing useful whatsoever. There are methods to fix exploding gradients, which include gradient clipping and weight regularization, among others.
  2. Vanishing gradients: When many gradient values are <1
    Since the gradients control how much the network learns during training, if the gradients are very small or zero, then little to no training can take place, leading to poor predictive performance. This also leads to capturing short-term dependencies instead of long-term dependencies.
    Potential Solutions:
    1. Activation Functions: Using ReLU prevents gradients from shrinking when x>0
    2. Parameter Initialization: Initialize weights to identity matrix and biases to zero — prevents weights from shrinking to zero
    3. Gated Cells: In the green box that we have encountered in the previous diagrams, use some logic inside them (i.e. gated cells) which will control what information is passed through. Based on the logic used in the gated cells we classify them as LSTMs, GRUs etc.
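As a taste of the gradient clipping mentioned under exploding gradients, here is a minimal sketch of clipping by norm (the max_norm value of 5.0 is an arbitrary choice for illustration):

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Gradient clipping: rescale the gradient if its norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])  # norm 50: would cause a huge weight update
print(clip_gradient(g))     # rescaled to norm 5: [3. 4.]
```

The direction of the update is preserved; only its magnitude is capped, which keeps training stable without throwing the gradient information away.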

That’s all folks! In the next part, we will go over LSTMs and Attention Models!

Feel free to jump to https://www.youtube.com/watch?v=qjrad0V0uJE&ab_channel=AlexanderAmini lecture by Ava :)

Do check out my other posts to gain more knowledge about finance and technology.

Please do let me know if there is any other concept in deep learning you want me to write an article on, I will try my best to explain it in simpler terms.

Also, feel free to ask questions in the comment section. Will be happy to help you out :)

PS: The analogy I have used might not be 100% correct but it’s easy to understand things with a simpler analogy.

Credits: MIT Open Courseware
