
Very Deep Learning

Lecture 09

Dr. Muhammad Zeshan Afzal, Prof. Didier Stricker


MindGarage, University of Kaiserslautern
[email protected]



Recap



Other Computer Vision Tasks
▪ Semantic Segmentation: GRASS, CAT, TREE, SKY (no objects, just pixels)
▪ Classification + Localization: CAT (single object)
▪ Object Detection: DOG, DOG, CAT (multiple objects)
▪ Instance Segmentation: DOG, DOG, CAT (multiple objects)

(Image: CC0 public domain)



Agenda

◼ Recurrent Neural Networks (RNN)


▪ Mapping Types
▪ Basic RNNs
▪ Gated RNNs
▪ RNN Applications



Mapping Types and Application Scenarios



Mapping Types



Mapping types

◼ One to One



Mapping types

◼ One to Many



Mapping types

◼ Many to One

Image Source: Chen-Wen et al., Outpatient Text Classification Using Attention-Based Bidirectional LSTM for Robot-Assisted Servicing in Hospital



Mapping types

◼ Many to Many

Image Source: https://arxiv.org/abs/1607.05781



Mapping types

◼ Many to Many

Image Source: https://medium.com/@gautam.karmakar/attention-for-neural-connectionist-machine-translation-b833d1e085a3



Recurrent Neural Networks



Recap: Computational Graph



Feedforward vs Recurrent Neural Networks

◼ Recurrent Neural Networks (RNNs)


▪ Core idea: update the hidden state h based on the input and the previous hidden state, using the same update rule (same/shared parameters) at each time step
▪ This allows processing sequences of variable length, not only fixed-size vectors
▪ Infinite memory: h is a function of all previous inputs (long-term dependencies); a minimal sketch follows below
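As a minimal, hedged sketch in Python (the cell function f and its parameters are placeholders, not the lecture's exact definitions), the shared update rule is just a loop:

```python
def rnn_forward(xs, h0, f, params):
    """Apply the same update rule f at every time step (shared parameters)."""
    h, hs = h0, []
    for x in xs:              # works for sequences of any length
        h = f(h, x, params)   # new state depends on current input and previous state
        hs.append(h)          # h_t is a function of x_1, ..., x_t
    return hs
```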



Basic Recurrent Neural Networks

◼ Basic RNNs
▪ We use t as the time index (in feedforward networks we used i as the layer index)
▪ Important: fh and fy do not change over time, unlike the layers of a feedforward net
▪ The general form does not specify the form of the hidden and output mappings



Basic Recurrent Neural Networks

◼ Vanilla Single-Layer RNN


▪ Hidden state: ht = tanh(Whh ht−1 + Wxh xt + b), i.e., a tanh of a linear combination of the input xt and the previous hidden state ht−1
▪ Output: ŷt = Why ht, a linear prediction based on the current hidden state ht
▪ tanh(·) is commonly used as the activation function (it keeps the hidden state in the range [−1, 1])
▪ Parameters Whh, Wxh, Why, b are constant over time (sequences may vary in length); a sketch follows below
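A hedged NumPy sketch of this cell (the vector and matrix shapes are assumptions; the notation follows the bullet points above):

```python
import numpy as np

def vanilla_rnn_step(x_t, h_prev, Wxh, Whh, Why, b):
    """One time step of a vanilla single-layer RNN."""
    h_t = np.tanh(Wxh @ x_t + Whh @ h_prev + b)  # tanh of a linear combination
    y_t = Why @ h_t                              # linear prediction from the hidden state
    return h_t, y_t

def vanilla_rnn_forward(xs, h0, params):
    """The same parameters are reused at every step, so sequences may vary in length."""
    h, ys = h0, []
    for x_t in xs:
        h, y_t = vanilla_rnn_step(x_t, h, *params)
        ys.append(y_t)
    return ys, h
```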



Mapping Types

◼ RNNs allow for processing variable-length inputs and outputs:


▪ One to Many: image captioning (image to sentence)
▪ Many to One: action recognition (video to action)
▪ Many to Many: machine translation (sentence to sentence)
▪ Many to Many: object tracking (video to object location per frame)
▪ To determine the length of the output sequence, a stop symbol can be predicted (see the sketch below)
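To illustrate the stop-symbol idea, here is a hedged sketch of greedy decoding; step_fn, START, and STOP are hypothetical placeholders, not names from the lecture:

```python
def greedy_decode(step_fn, h0, START, STOP, max_len=100):
    """Generate output tokens one at a time until a stop symbol is predicted."""
    h, y, outputs = h0, START, []
    for _ in range(max_len):      # hard upper bound as a safety net
        h, y = step_fn(h, y)      # feed the previous output back as the next input
        if y == STOP:             # the predicted stop symbol ends the sequence
            break
        outputs.append(y)
    return outputs
```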



Backpropagation through Time

◼ Backpropagation through Time


▪ To train RNNs, we backpropagate gradients through time
▪ As all hidden RNN cells share their parameters, gradients are accumulated across time steps
▪ However, this quickly becomes intractable (in terms of memory) for long sequences (e.g., Wikipedia)



Truncated Backpropagation through Time

◼ Truncated Backpropagation through Time


▪ Thus, one typically uses truncated backpropagation through time in practice (see the sketch below)
▪ Hidden states are carried forward in time forever, but backpropagation is stopped after a fixed number of steps
▪ The total loss is the sum of the individual loss functions (= negative log-likelihood)
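A self-contained PyTorch sketch of this scheme (the toy data, model, and chunk length of 35 are illustrative assumptions): the hidden state is detached at every chunk boundary, so it is carried forward while gradients are not:

```python
import torch
import torch.nn as nn

# Toy setup (assumed): predict the next value of a long 1-D sequence.
seq = torch.randn(1000, 1, 1)                        # (time, batch, features)
rnn, head = nn.RNN(1, 32), nn.Linear(32, 1)
optimizer = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=1e-3)
criterion = nn.MSELoss()

chunk_len = 35                                       # truncation length
h = torch.zeros(1, 1, 32)                            # hidden state is carried forward forever...
for t in range(0, seq.size(0) - chunk_len - 1, chunk_len):
    x = seq[t:t + chunk_len]
    y = seq[t + 1:t + chunk_len + 1]
    h = h.detach()                                   # ...but gradients stop at the chunk boundary
    optimizer.zero_grad()
    out, h = rnn(x, h)
    loss = criterion(head(out), y)                   # total loss = sum/mean of per-step losses
    loss.backward()                                  # backprop through at most chunk_len steps
    optimizer.step()
```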



Multilayer RNNs

◼ Multilayer RNNs
▪ Deeper multi-layer RNNs can be constructed by stacking RNN layers (see the sketch below)
▪ An alternative is to make each individual computation (= RNN cell) deeper
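A hedged PyTorch sketch of both options (the layer sizes are arbitrary examples):

```python
import torch
import torch.nn as nn

x = torch.randn(20, 4, 16)            # (time, batch, features)

# Option 1: built-in stacking of RNN layers
deep_rnn = nn.RNN(input_size=16, hidden_size=32, num_layers=3)
out, h_n = deep_rnn(x)                # h_n has shape (num_layers, batch, hidden)

# Option 2: explicit stacking - each layer consumes the hidden-state sequence of the layer below
layers = [nn.RNN(16, 32), nn.RNN(32, 32), nn.RNN(32, 32)]
seq = x
for layer in layers:
    seq, _ = layer(seq)               # hidden states of this layer feed the next layer
```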



Bidirectional RNNs



Gated Recurrent Neural Networks



RNNs

▪ The state update ht is modeled using a zero-centered tanh(·)
▪ tanh(·) assumes that the processed data is in the range [−1, 1]
▪ Remark: we omit the affine transformations and the output layer for clarity



Vanishing / Exploding Gradients

◼ What is the problem with vanilla RNNs?


▪ Let us consider an RNN with a one-dimensional hidden state, i.e., ht is a real number
▪ For ht = tanh(whh ht−1 + wxh xt + b) we have ∂ht/∂ht−1 = tanh′(whh ht−1 + wxh xt + b) · whh
▪ Thus, the gradient vanishes if tanh(·) saturates (tanh′ ≈ 0), as in feedforward networks
▪ RNNs require careful initialization to avoid saturating activation functions
Vanishing / Exploding Gradients

◼ However, gradients might still misbehave:


▪ Let us now assume that the RNN has been initialized well, so that the activation functions are not saturated and tanh′(·) ≈ 1
▪ We then have ∂ht/∂ht−k ≈ whh^k, i.e., the gradient scales with the k-th power of the recurrent weight



Vanishing / Exploding Gradients

◼ For whh > 1 gradients will explode (become very large and cause divergence)
▪ Example: for whh = 1.1 and k = 100 we have whh^k ≈ 13781
▪ This problem is often addressed in practice using gradient clipping (see the sketch below)
▪ Forward values do not explode, due to the bounded tanh(·) activation function
◼ For whh < 1 gradients will vanish (no learning in earlier time steps):
▪ Example: for whh = 0.9 and k = 100 we have whh^k ≈ 0.0000266
▪ Avoiding this problem requires an architectural change
▪ Residual connections do not work here, as the parameters are shared across time and the input and desired output at each time step are different
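A minimal sketch that reproduces the numbers above and shows gradient clipping with PyTorch's built-in utility (the toy model, data, and clipping threshold are placeholder assumptions):

```python
import torch
import torch.nn as nn

print(1.1 ** 100, 0.9 ** 100)     # ~13780.6 (explodes) vs. ~2.66e-05 (vanishes)

# Gradient clipping: rescale gradients whose global norm exceeds a threshold.
model = nn.RNN(8, 32)                                     # placeholder model
x, target = torch.randn(50, 1, 8), torch.randn(50, 1, 32)
out, _ = model(x)
loss = nn.functional.mse_loss(out, target)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# ...then optimizer.step() as usual
```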



Vanishing / Exploding Gradients

◼ Similar situation for vector-valued hidden states


▪ Let Whh = Q Λ Q⁻¹ be the eigendecomposition of the square matrix Whh
▪ We then have Whh^k = Q Λ^k Q⁻¹ with diagonal eigenvalue matrix Λ
▪ Components with eigenvalue magnitude < 1 ⇒ vanishing gradient
▪ Components with eigenvalue magnitude > 1 ⇒ exploding gradient
▪ Note that Whh is shared across time (unlike the layers of a feedforward network); see the numerical sketch below
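The same effect can be checked numerically; in this small NumPy sketch (the matrix and the two spectral radii are arbitrary examples), powers of a matrix shrink or blow up depending on its largest eigenvalue magnitude:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16))
for radius in (0.9, 1.1):
    # rescale W so that its largest eigenvalue magnitude (spectral radius) equals `radius`
    W_scaled = W * (radius / np.max(np.abs(np.linalg.eigvals(W))))
    print(radius, np.linalg.norm(np.linalg.matrix_power(W_scaled, 100)))
    # radius 0.9 -> norm near zero (vanishing), radius 1.1 -> very large norm (exploding)
```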



Gradient Clipping



Gated Recurrent Unit

◼ UGRNN: Update Gate Recurrent Neural Network


◼ GRU: Gated Recurrent Unit
◼ LSTM: Long Short-Term Memory
◼ LSTM was the first and most transformative of these models (it revolutionized NLP in 2015, e.g., at Google), but it is also the most complex. UGRNN and GRU work similarly well.
◼ Common to all architectures: gates for filtering information



Update Gate RNN

◼ ut is called the update gate, as it determines whether the hidden state h is updated or not
◼ st is the next target state, which is combined with ht−1 using element-wise weights ut
◼ Remark: gates use a sigmoid (∈ [0, 1]), the state computation uses tanh (∈ [−1, 1])
◼ ⊙ denotes the Hadamard product (element-wise product); a sketch of the cell follows below
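A hedged NumPy sketch of such an update-gate cell; the exact parameterization (concatenated inputs, and which of st and ht−1 the gate weights) is one common convention and may differ from the slide's formulas:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ugrnn_step(x_t, h_prev, Wu, Ws, bu, bs):
    """Update-gate RNN cell: the gate decides how much of the state is overwritten."""
    z = np.concatenate([h_prev, x_t])
    u_t = sigmoid(Wu @ z + bu)               # update gate, element-wise in [0, 1]
    s_t = np.tanh(Ws @ z + bs)               # next target state, in [-1, 1]
    h_t = u_t * s_t + (1.0 - u_t) * h_prev   # Hadamard-weighted blend of new and old state
    return h_t
```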



Gradient Flow



Gated Recurrent Unit

◼ The reset gate controls which parts of the state are used to compute the next target state
◼ The update gate controls how much information is passed on from the previous time step (see the sketch below)
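A hedged NumPy sketch of a GRU cell in the same style; the gate placement follows one common formulation and may differ slightly from the slide's notation (e.g., in exactly where the reset gate is applied):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wr, Wz, Ws, br, bz, bs):
    """GRU cell with a reset gate r_t and an update gate z_t."""
    z_in = np.concatenate([h_prev, x_t])
    r_t = sigmoid(Wr @ z_in + br)        # reset gate: which parts of the state to use
    z_t = sigmoid(Wz @ z_in + bz)        # update gate: how much of the old state to keep
    s_t = np.tanh(Ws @ np.concatenate([r_t * h_prev, x_t]) + bs)  # next target state
    h_t = (1.0 - z_t) * s_t + z_t * h_prev                        # blend new and old state
    return h_t
```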



Long Short-Term Memory

◼ The LSTM passes along an additional cell state c in addition to the hidden state h and has 3 gates:
◼ The forget gate determines which information to erase from the cell state
◼ The input gate determines which values of the cell state to update
◼ The output gate determines which elements of the cell state to reveal at time t
◼ Remark: the cell update tanh(·) creates new target values st for the cell state; a sketch of the cell follows below
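A hedged NumPy sketch of an LSTM cell matching the description above (stacking all four pre-activations into one weight matrix W is an assumption made for brevity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b, hidden):
    """One LSTM step with forget, input and output gates plus a cell state c."""
    z = W @ np.concatenate([h_prev, x_t]) + b    # all four pre-activations at once
    f_t = sigmoid(z[0 * hidden:1 * hidden])      # forget gate: what to erase from c
    i_t = sigmoid(z[1 * hidden:2 * hidden])      # input gate: which values to update
    o_t = sigmoid(z[2 * hidden:3 * hidden])      # output gate: what to reveal at time t
    s_t = np.tanh(z[3 * hidden:4 * hidden])      # new target values for the cell state
    c_t = f_t * c_prev + i_t * s_t               # update the cell state
    h_t = o_t * np.tanh(c_t)                     # expose the gated cell state
    return h_t, c_t
```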



UGRNN vs. GRU vs. LSTM

                 UGRNN                 GRU                   LSTM
Gates            one gate              two gates             three gates
State exposure   expose entire state   expose entire state   control exposure
Update           single update gate    single update gate    input/forget gates
Parameters       few                   medium                many

A systematic study [Collins et al., 2017] states:


“Our results point to the GRU as being the most learnable of gated RNNs for shallow architectures,
followed by the UGRNN.”



Thanks a lot for your Attention

