
Very Deep Learning

Lecture 09

Dr. Muhammad Zeshan Afzal, Prof. Didier Stricker


MindGarage, University of Kaiserslautern
[email protected]



Recap



Other Computer Vision Tasks
▪ Semantic Segmentation: GRASS, CAT, TREE, SKY (no objects, just pixels)
▪ Classification + Localization: CAT (single object)
▪ Object Detection: DOG, DOG, CAT (multiple objects)
▪ Instance Segmentation: DOG, DOG, CAT (multiple objects)

(Image: CC0 public domain)



Agenda

◼ Recurrent Neural Networks (RNN)


▪ Mapping Types
▪ Basic RNNs
▪ Gated RNNs
▪ RNN Applications



Mapping Types and Application Scenarios



Mapping Types



Mapping types

◼ One to One



Mapping types

◼ One to Many



Mapping types

◼ Many to One

Image Source: Chen-Wen et al., Outpatient Text Classification Using Attention-Based Bidirectional LSTM for Robot-Assisted Servicing in Hospital



Mapping types

◼ Many to Many

Image Source: https://arxiv.org/abs/1607.05781



Mapping types

◼ Many to Many

Image Source: https://medium.com/@gautam.karmakar/attention-for-neural-connectionist-machine-translation-b833d1e085a3



Recurrent Neural Networks



Recap: Computational Graph



Feedforward vs Recurrent Neural Networks

◼ Recurrent Neural Networks (RNNs)


▪ Core idea: update the hidden state h based on the input and the previous hidden state, using the same update rule (same/shared parameters) at each time step
▪ This allows processing sequences of variable length, not only fixed-size vectors
▪ Infinite memory: h is a function of all previous inputs (long-term dependencies); a minimal sketch follows below
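As a minimal, hedged sketch in Python (the cell function f and its parameters are placeholders, not the lecture's exact definitions), the shared update rule is just a loop:

```python
def rnn_forward(xs, h0, f, params):
    """Apply the same update rule f at every time step (shared parameters)."""
    h, hs = h0, []
    for x in xs:              # works for sequences of any length
        h = f(h, x, params)   # new state depends on current input and previous state
        hs.append(h)          # h_t is a function of x_1, ..., x_t
    return hs
```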



Basic Recurrent Neural Networks

◼ Basic RNNs
▪ We use t as the time index (in feedforward networks we used i as the layer index)
▪ Important: fh and fy do not change over time, unlike the layers of a feedforward net
▪ The general form does not specify the form of the hidden and output mappings



Basic Recurrent Neural Networks

◼ Vanilla Single-Layer RNN


▪ Hidden state: ht = tanh(Whh ht−1 + Wxh xt + b), i.e., a tanh of a linear combination of the input xt and the previous hidden state ht−1
▪ Output: ŷt = Why ht, a linear prediction based on the current hidden state ht
▪ tanh(·) is commonly used as the activation function (it keeps the hidden state in the range [−1, 1])
▪ Parameters Whh, Wxh, Why, b are constant over time (sequences may vary in length); a sketch follows below
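A hedged NumPy sketch of this cell (the vector and matrix shapes are assumptions; the notation follows the bullet points above):

```python
import numpy as np

def vanilla_rnn_step(x_t, h_prev, Wxh, Whh, Why, b):
    """One time step of a vanilla single-layer RNN."""
    h_t = np.tanh(Wxh @ x_t + Whh @ h_prev + b)  # tanh of a linear combination
    y_t = Why @ h_t                              # linear prediction from the hidden state
    return h_t, y_t

def vanilla_rnn_forward(xs, h0, params):
    """The same parameters are reused at every step, so sequences may vary in length."""
    h, ys = h0, []
    for x_t in xs:
        h, y_t = vanilla_rnn_step(x_t, h, *params)
        ys.append(y_t)
    return ys, h
```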



Mapping Types

◼ RNNs allow for processing variable-length inputs and outputs:


▪ One to Many: image captioning (image to sentence)
▪ Many to One: action recognition (video to action)
▪ Many to Many: machine translation (sentence to sentence)
▪ Many to Many: object tracking (video to object location per frame)
▪ To determine the length of the output sequence, a stop symbol can be predicted (see the sketch below)
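To illustrate the stop-symbol idea, here is a hedged sketch of greedy decoding; step_fn, START, and STOP are hypothetical placeholders, not names from the lecture:

```python
def greedy_decode(step_fn, h0, START, STOP, max_len=100):
    """Generate output tokens one at a time until a stop symbol is predicted."""
    h, y, outputs = h0, START, []
    for _ in range(max_len):      # hard upper bound as a safety net
        h, y = step_fn(h, y)      # feed the previous output back as the next input
        if y == STOP:             # the predicted stop symbol ends the sequence
            break
        outputs.append(y)
    return outputs
```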



Backpropagation through Time

◼ Backpropagation through Time


▪ To train RNNs, we backpropagate gradients through time
▪ As all hidden RNN cells share their parameters, gradients are accumulated across time steps
▪ However, this quickly becomes intractable (in terms of memory) for long sequences (e.g., Wikipedia)



Truncated Backpropagation through Time

◼ Truncated Backpropagation through Time


▪ Thus, one typically uses truncated backpropagation through time in practice (see the sketch below)
▪ Hidden states are carried forward in time forever, but backpropagation is stopped after a fixed number of steps
▪ The total loss is the sum of the individual loss functions (= negative log-likelihood)
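A self-contained PyTorch sketch of this scheme (the toy data, model, and chunk length of 35 are illustrative assumptions): the hidden state is detached at every chunk boundary, so it is carried forward while gradients are not:

```python
import torch
import torch.nn as nn

# Toy setup (assumed): predict the next value of a long 1-D sequence.
seq = torch.randn(1000, 1, 1)                        # (time, batch, features)
rnn, head = nn.RNN(1, 32), nn.Linear(32, 1)
optimizer = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=1e-3)
criterion = nn.MSELoss()

chunk_len = 35                                       # truncation length
h = torch.zeros(1, 1, 32)                            # hidden state is carried forward forever...
for t in range(0, seq.size(0) - chunk_len - 1, chunk_len):
    x = seq[t:t + chunk_len]
    y = seq[t + 1:t + chunk_len + 1]
    h = h.detach()                                   # ...but gradients stop at the chunk boundary
    optimizer.zero_grad()
    out, h = rnn(x, h)
    loss = criterion(head(out), y)                   # total loss = sum/mean of per-step losses
    loss.backward()                                  # backprop through at most chunk_len steps
    optimizer.step()
```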



Multilayer RNNs

◼ Multilayer RNNs
▪ Deeper multi-layer RNNs can be constructed by stacking RNN layers (see the sketch below)
▪ An alternative is to make each individual computation (= RNN cell) deeper
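A hedged PyTorch sketch of both options (the layer sizes are arbitrary examples):

```python
import torch
import torch.nn as nn

x = torch.randn(20, 4, 16)            # (time, batch, features)

# Option 1: built-in stacking of RNN layers
deep_rnn = nn.RNN(input_size=16, hidden_size=32, num_layers=3)
out, h_n = deep_rnn(x)                # h_n has shape (num_layers, batch, hidden)

# Option 2: explicit stacking - each layer consumes the hidden-state sequence of the layer below
layers = [nn.RNN(16, 32), nn.RNN(32, 32), nn.RNN(32, 32)]
seq = x
for layer in layers:
    seq, _ = layer(seq)               # hidden states of this layer feed the next layer
```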



Bidirectional RNNs



Gated Recurrent Neural Networks



RNNs

▪ The state update ht is modeled using a zero-centered tanh(·)
▪ tanh(·) assumes that the processed data is in the range [−1, 1]
▪ Remark: we omit the affine transformations and the output layer for clarity



Vanishing / Exploding Gradients

◼ What is the problem with vanilla RNNs?


▪ Let us consider an RNN with a one-dimensional hidden state, i.e., ht is a real number
▪ For ht = tanh(whh ht−1 + wxh xt + b) we have ∂ht/∂ht−1 = tanh′(whh ht−1 + wxh xt + b) · whh
▪ Thus, the gradient vanishes if tanh(·) saturates (tanh′ ≈ 0), as in feedforward networks
▪ RNNs require careful initialization to avoid saturating activation functions
Vanishing / Exploding Gradients

◼ However, gradients might still misbehave:


▪ Let us now assume that the RNN has been initialized well, so that the activation functions are not saturated and tanh′(·) ≈ 1
▪ We then have ∂ht/∂ht−k ≈ whh^k, i.e., the gradient scales with the k-th power of the recurrent weight



Vanishing / Exploding Gradients

◼ For whh > 1 gradients will explode (become very large and cause divergence)
▪ Example: for whh = 1.1 and k = 100 we have whh^k ≈ 13781
▪ This problem is often addressed in practice using gradient clipping (see the sketch below)
▪ Forward values do not explode, due to the bounded tanh(·) activation function
◼ For whh < 1 gradients will vanish (no learning in earlier time steps):
▪ Example: for whh = 0.9 and k = 100 we have whh^k ≈ 0.0000266
▪ Avoiding this problem requires an architectural change
▪ Residual connections do not work here, as the parameters are shared across time and the input and desired output at each time step are different
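A minimal sketch that reproduces the numbers above and shows gradient clipping with PyTorch's built-in utility (the toy model, data, and clipping threshold are placeholder assumptions):

```python
import torch
import torch.nn as nn

print(1.1 ** 100, 0.9 ** 100)     # ~13780.6 (explodes) vs. ~2.66e-05 (vanishes)

# Gradient clipping: rescale gradients whose global norm exceeds a threshold.
model = nn.RNN(8, 32)                                     # placeholder model
x, target = torch.randn(50, 1, 8), torch.randn(50, 1, 32)
out, _ = model(x)
loss = nn.functional.mse_loss(out, target)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# ...then optimizer.step() as usual
```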



Vanishing / Exploding Gradients

◼ Similar situation for vector-valued hidden states


▪ Let Whh = Q Λ Q⁻¹ be the eigendecomposition of the square matrix Whh
▪ We then have Whh^k = Q Λ^k Q⁻¹ with diagonal eigenvalue matrix Λ
▪ Components with eigenvalue magnitude < 1 ⇒ vanishing gradient
▪ Components with eigenvalue magnitude > 1 ⇒ exploding gradient
▪ Note that Whh is shared across time (unlike the layers of a feedforward network); see the numerical sketch below
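The same effect can be checked numerically; in this small NumPy sketch (the matrix and the two spectral radii are arbitrary examples), powers of a matrix shrink or blow up depending on its largest eigenvalue magnitude:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16))
for radius in (0.9, 1.1):
    # rescale W so that its largest eigenvalue magnitude (spectral radius) equals `radius`
    W_scaled = W * (radius / np.max(np.abs(np.linalg.eigvals(W))))
    print(radius, np.linalg.norm(np.linalg.matrix_power(W_scaled, 100)))
    # radius 0.9 -> norm near zero (vanishing), radius 1.1 -> very large norm (exploding)
```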



Gradient Clipping



Gated Recurrent Unit

◼ UGRNN: Update Gate Recurrent Neural Network


◼ GRU: Gated Recurrent Unit
◼ LSTM: Long Short-Term Memory
◼ LSTM was the first and most transformative of these models (it revolutionized NLP in 2015, e.g., at Google), but it is also the most complex. UGRNN and GRU work similarly well.
◼ Common to all architectures: gates for filtering information



Update Gate RNN

◼ ut is called the update gate, as it determines whether the hidden state h is updated or not
◼ st is the next target state, which is combined with ht−1 using element-wise weights ut
◼ Remark: gates use a sigmoid (∈ [0, 1]), the state computation uses tanh (∈ [−1, 1])
◼ ⊙ denotes the Hadamard product (element-wise product); a sketch of the cell follows below
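A hedged NumPy sketch of such an update-gate cell; the exact parameterization (concatenated inputs, and which of st and ht−1 the gate weights) is one common convention and may differ from the slide's formulas:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ugrnn_step(x_t, h_prev, Wu, Ws, bu, bs):
    """Update-gate RNN cell: the gate decides how much of the state is overwritten."""
    z = np.concatenate([h_prev, x_t])
    u_t = sigmoid(Wu @ z + bu)               # update gate, element-wise in [0, 1]
    s_t = np.tanh(Ws @ z + bs)               # next target state, in [-1, 1]
    h_t = u_t * s_t + (1.0 - u_t) * h_prev   # Hadamard-weighted blend of new and old state
    return h_t
```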



Gradient Flow



Gated Recurrent Unit

◼ The reset gate controls which parts of the state are used to compute the next target state
◼ The update gate controls how much information is passed on from the previous time step (see the sketch below)
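A hedged NumPy sketch of a GRU cell in the same style; the gate placement follows one common formulation and may differ slightly from the slide's notation (e.g., in exactly where the reset gate is applied):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wr, Wz, Ws, br, bz, bs):
    """GRU cell with a reset gate r_t and an update gate z_t."""
    z_in = np.concatenate([h_prev, x_t])
    r_t = sigmoid(Wr @ z_in + br)        # reset gate: which parts of the state to use
    z_t = sigmoid(Wz @ z_in + bz)        # update gate: how much of the old state to keep
    s_t = np.tanh(Ws @ np.concatenate([r_t * h_prev, x_t]) + bs)  # next target state
    h_t = (1.0 - z_t) * s_t + z_t * h_prev                        # blend new and old state
    return h_t
```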



Long Short-Term Memory

◼ The LSTM passes along an additional cell state c in addition to the hidden state h and has 3 gates:
◼ The forget gate determines which information to erase from the cell state
◼ The input gate determines which values of the cell state to update
◼ The output gate determines which elements of the cell state to reveal at time t
◼ Remark: the cell update tanh(·) creates new target values st for the cell state; a sketch of the cell follows below
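A hedged NumPy sketch of an LSTM cell matching the description above (stacking all four pre-activations into one weight matrix W is an assumption made for brevity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b, hidden):
    """One LSTM step with forget, input and output gates plus a cell state c."""
    z = W @ np.concatenate([h_prev, x_t]) + b    # all four pre-activations at once
    f_t = sigmoid(z[0 * hidden:1 * hidden])      # forget gate: what to erase from c
    i_t = sigmoid(z[1 * hidden:2 * hidden])      # input gate: which values to update
    o_t = sigmoid(z[2 * hidden:3 * hidden])      # output gate: what to reveal at time t
    s_t = np.tanh(z[3 * hidden:4 * hidden])      # new target values for the cell state
    c_t = f_t * c_prev + i_t * s_t               # update the cell state
    h_t = o_t * np.tanh(c_t)                     # expose the gated cell state
    return h_t, c_t
```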



UGRNN vs. GRU vs. LSTM

                 UGRNN                 GRU                   LSTM
Gates            one gate              two gates             three gates
State exposure   expose entire state   expose entire state   control exposure
Update           single update gate    single update gate    input/forget gates
Parameters       few                   medium                many

A systematic study [Collins et al., 2017] states:


“Our results point to the GRU as being the most learnable of gated RNNs for shallow architectures,
followed by the UGRNN.”



Thanks a lot for your Attention

