A Tutorial of Transformers (2021-06), Xipeng Qiu


VALSE Tutorial

A Tutorial of Transformers

Xipeng Qiu
Fudan University
June 20, 2021
https://xpqiu.github.io
Transformer?

A Neural Network!

Xipeng Qiu (Fudan University) A Tutorial of Transformers 2


Transformer

Xipeng Qiu (Fudan University) A Tutorial of Transformers 3


The Vanilla Transformer

Vaswani, Ashish, et al. "Attention is All you Need." NIPS. 2017.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 4


Transformer Variants (X-formers)

Xipeng Qiu (Fudan University) A Tutorial of Transformers 5


More Details

Tianyang Lin, Yuxin Wang, Xiangyang Liu, Xipeng Qiu. A Survey of Transformers. https://arxiv.org/abs/2106.04554

Xipeng Qiu. Neural Networks and Deep Learning (神经网络与深度学习, in Chinese). China Machine Press, 2020.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 6


Overview

③arch-level variants

④pre-trained Transformers

②module-level variants ⑤task-specific variants

①Transformer background

Xipeng Qiu (Fudan University) A Tutorial of Transformers 7


Self-attention and Transformer

Xipeng Qiu (Fudan University) A Tutorial of Transformers 8


Language Representation Learning
 How to represent meaning in a machine?
 • Dense vectors
 • a.k.a. embeddings

KG + rules vs. distributed representation

Xipeng Qiu (Fudan University) A Tutorial of Transformers 9


General Arch. in NLP

Model Driven
+
Data Driven

Xipeng Qiu (Fudan University) A Tutorial of Transformers 10


Sequence-to-Sequence (Seq2Seq)

Auto-Regressive Model

(Figure: an encoder-decoder model for machine translation. The encoder reads the source "Machine Learning"; the decoder produces the outputs "机 器 学 习 $" auto-regressively, with the shifted outputs fed back as token embeddings.)

Xipeng Qiu (Fudan University) A Tutorial of Transformers 11


How to design a model?
 Linguistic Properties
 Sequential
 Hierarchical
 Recursive

 Semantic Composition
 Long-term Dependency
 Polysemy
 Context Matters

请 给 听课 的 同学 买 个 苹果 ("Please buy an apple / an Apple for the students attending the lecture": 苹果 can mean either the fruit or the brand, so context matters)

Xipeng Qiu (Fudan University) A Tutorial of Transformers 12


Mainstream Neural Nets
 When processing variable-length sequences of vectors with neural nets, we usually turn to the following models.

 Implicit prior: local compositionality
 they model local dependencies within the input sequence

Xipeng Qiu (Fudan University) A Tutorial of Transformers 13


How to model long-term dependencies directly?
 Full connection

 The weights 𝛼𝑖𝑗 are generated dynamically with the attention mechanism

Xipeng Qiu (Fudan University) A Tutorial of Transformers 14


Attention Mechanism
 The attention mechanism consists of two steps (see the sketch after this slide)
 Calculate the attention distribution 𝛼: 𝛼_𝑖 = softmax(𝑠(𝑥_𝑖, 𝑞)), where 𝑠(𝑥_𝑖, 𝑞) is a scoring function
 Compute the weighted average of the inputs with 𝛼: att(𝑋, 𝑞) = Σ_𝑖 𝛼_𝑖 𝑥_𝑖

Xipeng Qiu (Fudan University) A Tutorial of Transformers 15
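A minimal NumPy sketch of the two steps (scoring, softmax, weighted average); the scaled dot-product scoring function and all names are illustrative assumptions, not the slide's exact formulation.

import numpy as np

def attention(X, q):
    """X: (T, D) input vectors; q: (D,) query vector."""
    scores = X @ q / np.sqrt(X.shape[-1])   # s(x_i, q): scaled dot-product scoring
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()             # attention distribution (softmax)
    return alpha @ X                        # weighted average of the inputs

X = np.random.randn(5, 8)   # T=5 tokens, D=8
q = np.random.randn(8)
context = attention(X, q)   # shape (8,)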


Apply attention mechanism to model language

Self-Attention

pic source: http://fuyw.top/NLP_02_QANet/
Xipeng Qiu (Fudan University) A Tutorial of Transformers 16
Query-Key-Value (QKV) Model

Xipeng Qiu (Fudan University) A Tutorial of Transformers 17


Multi-head Self-Attention

Xipeng Qiu (Fudan University) A Tutorial of Transformers 18
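A compact PyTorch sketch of multi-head self-attention with QKV projections (an illustration, not the slide's exact formulation; masking and dropout are omitted, and the dimensions are assumptions).

import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_k = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x):                       # x: (B, T, d_model)
        B, T, _ = x.shape
        def split(t):                           # (B, T, d_model) -> (B, h, T, d_k)
            return t.view(B, T, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5   # (B, h, T, T)
        attn = scores.softmax(dim=-1)
        out = attn @ v                          # (B, h, T, d_k)
        out = out.transpose(1, 2).reshape(B, T, -1)          # concatenate the heads
        return self.w_o(out)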


Multi-Layer Self-Attention

Xipeng Qiu (Fudan University) A Tutorial of Transformers 19


Transformer (Vaswani et al., 2017)
 Broadly speaking, Transformer is a
model built with self-attention.
 Core module
 Self-Attention

 Besides self-attention:
 Position representations
 Layer Normalization
 Skip connection
 Position-wise FFN

 Model usage:
 Encoder only
 Encoder-decoder
 Decoder only

Xipeng Qiu (Fudan University) A Tutorial of Transformers 20


Model Analysis

 When the input sequence length 𝑇 is short, the model dimension 𝐷 dominates the complexity of both self-attention (𝑂(𝑇²·𝐷)) and the FFN (𝑂(𝑇·𝐷²)).
 The bottleneck of the network lies in the FFN for short inputs.
 For long input sequences, the sequence length dominates the complexity.
 Self-attention is inefficient at handling long inputs.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 21


Comparing to other network types

 Self-Attention has constant max path length (like fully connected layer), which is
suitable for long-range dependency modeling.
 It is more parallelizable than recurrent layers.
 It has a global receptive field, so it does not require stacking layers to model global dependencies (as convolutional layers do).

Xipeng Qiu (Fudan University) A Tutorial of Transformers 22


Comparing to other network types (inductive bias)
 Convolutional networks
 Translation invariance (shared kernel across spatial positions)
 Locality (restricted window)

 Recurrent networks
 Temporal invariance (shared function across timesteps)
 Locality (Markovian structure)

 Transformer
 No structural prior (prone to overfitting in small-scale data)
 Permutation equivariance (requires position representations to encode sequences)

 Transformer vs. Graph Neural Network


 Transformer can be understood as a GNN defined over complete directed graphs (w/ self-loop)
 No explicit structure. Message passing depends solely on similarity measures over contents.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 23


Improvement Methods
 Model Efficiency
 Lightweight attention (e.g., sparse attention variants) and divide-and-conquer methods (e.g., recurrent and hierarchical mechanisms).
 Model Generalization
 Since the transformer is a flexible architecture and makes few assumptions on the
structural bias of input data, it is hard to train on small-scale data.
 introducing structural bias or regularization, pre-training on large-scale unlabeled data,
etc.
 Model Adaptation
 adapting the Transformer to specific downstream tasks and applications.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 24


A taxonomy of X-formers (Transformer variants)

Xipeng Qiu (Fudan University) A Tutorial of Transformers 25


History of X-formers (Attention)
Timeline of attention-module variants, 2017–2021:

2017: Transformer
2018: Image Transformer, Memory Compressed Attention, Local Transformer, Average Attention, Li et al. 2018
2019: Star-Transformer, Sparse Transformer, BP-Transformer, Axial Transformer, Set Transformer, low-rank and locality constrained attention, Gaussian Transformer, Adaptive Attention Span, Dynamic Routing
2020: ETC, Longformer, BigBird, Routing Transformer, Reformer, SAC, Sparse Sinkhorn Attention, Linear Transformer, Performer, Clustered Attention, Linformer, CSALR, Predictive Attention Transformer, RealFormer, Hard-Coded Gaussian Attention, Synthesizer, Deshpande and Narasimhan 2020, Talking-head Attention, Collaborative MHA, Multi-Scale Transformer
2021: RFA, DPFP, Informer, Poolingformer, Luna, Nyströmformer, LazyFormer, CAMTL

Legend (categories): sparse attention (position based), sparse attention (content based), linearized attention, query prototyping, memory compression, low-rank, prior attention, improved multi-head mechanism
Xipeng Qiu (Fudan University) A Tutorial of Transformers 26


Overview

③arch-level variants

④pre-trained Transformers

②module-level variants ⑤task-specific variants

①Transformer background

Xipeng Qiu (Fudan University) A Tutorial of Transformers 27


Overview

③arch-level variants

④pre-trained Transformers

②module-level variants ⑤task-specific variants

①Transformer background

Xipeng Qiu (Fudan University) A Tutorial of Transformers 28


QKV Attention (Recall)
How to reduce
the complexity?

Xipeng Qiu (Fudan University) A Tutorial of Transformers 29


Position-based Sparse Attention

 Atomic sparse patterns

Xipeng Qiu (Fudan University) A Tutorial of Transformers 30
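As an illustration of the atomic patterns, a small NumPy sketch that builds an allowed-attention mask combining a band (local window) pattern with a few global tokens; the window size and the number of global tokens are arbitrary assumptions.

import numpy as np

def band_mask(T, window=3, n_global=1):
    """Allowed-attention mask (True = may attend) for a band pattern plus a few global tokens."""
    idx = np.arange(T)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window   # band (local window) pattern
    mask[:, :n_global] = True   # every position may attend to the global tokens
    mask[:n_global, :] = True   # global tokens attend everywhere
    return mask

print(band_mask(8, window=1, n_global=1).astype(int))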


Position-based Sparse Attention

 Compound sparse patterns

Xipeng Qiu (Fudan University) A Tutorial of Transformers 31


Star-Transformer
 Reduce the number of connections while keeping the path length between each pair of nodes short
 introduces a global memory node that serves as the hub

 Complexity: 𝑂(2𝐿𝐷)
 Prior of locality
 Free from position representations
 More suitable for small- or mid-scale data

 BERT shows that such a global memory is also beneficial to the Transformer.
Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, Zheng Zhang. Star-Transformer, NAACL 2019,
https://arxiv.org/pdf/1902.09113.pdf

Xipeng Qiu (Fudan University) A Tutorial of Transformers 32


Star-Transformer

Xipeng Qiu (Fudan University) A Tutorial of Transformers 33


Position-based Sparse Attention

 Extended sparse patterns

Xipeng Qiu (Fudan University) A Tutorial of Transformers 34


BP-Transformer (BPT)
Binary Partitioning

BP-Transformer can be viewed as introducing hierarchical global external nodes; any two sequence tokens are connected to each other through a path in the binary tree.

Zihao Ye, Qipeng Guo, Quan Gan, Xipeng Qiu, Zheng Zhang. BP-Transformer: Modelling Long-Range Context via Binary Partitioning,
https://arxiv.org/abs/1911.04070

Xipeng Qiu (Fudan University) A Tutorial of Transformers 35


Content-based Sparse Attention
 One can also condition the sparse connection on the inputs.
 e.g., use some low-complexity method to filter out the key-value pairs of high similarity to each query.

 Reformer (Kitaev et al., ICLR 2020)
 restricts the set P_i of positions that a query position i can attend to via locality-sensitive hashing (LSH), by only allowing attention within a single hash bucket: P_i = {j : h(q_i) = h(k_j)}

 Routing Transformer (Roy et al., TACL 2020)

(Figure from Kitaev et al.: a simplified depiction of LSH attention, showing the hash-bucketing, sorting, and chunking steps and the resulting causal attention matrices.)

Xipeng Qiu (Fudan University) A Tutorial of Transformers 36
Linearized Attention

 Disentangle the attention into 𝐷^(-1) 𝜙(𝑄) (𝜙(𝐾)^⊤ 𝑉), so that the "memory matrix" 𝜙(𝐾)^⊤ 𝑉 is computed first and the computation is done in reversed order.
 Complexity of 𝑂(𝑇)

 Design choices (see the sketch after this slide):
 How to choose/design the feature map 𝜙(·)?
 How to aggregate the key-value associations into the memory matrix?

(Figure: standard attention vs. linearized attention, with the memory matrix 𝜙(𝐾)^⊤ 𝑉.)

Xipeng Qiu (Fudan University) A Tutorial of Transformers 37
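A minimal PyTorch sketch of linearized attention with the elu(x)+1 feature map used by the Linear Transformer (Katharopoulos et al., 2020); the shapes and the einsum formulation are illustrative.

import torch
import torch.nn.functional as F

def linearized_attention(Q, K, V, eps=1e-6):
    """Q, K: (B, T, d_k); V: (B, T, d_v). Feature map phi(x) = elu(x) + 1."""
    phi_q = F.elu(Q) + 1
    phi_k = F.elu(K) + 1
    S = torch.einsum('btd,bte->bde', phi_k, V)        # memory matrix: sum_t phi(k_t) v_t^T, (B, d_k, d_v)
    z = phi_k.sum(dim=1)                              # normalizer: sum_t phi(k_t), (B, d_k)
    num = torch.einsum('btd,bde->bte', phi_q, S)      # phi(q_t)^T S
    den = torch.einsum('btd,bd->bt', phi_q, z) + eps  # phi(q_t)^T z
    return num / den.unsqueeze(-1)                    # linear in the sequence length T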


Performer

Performer approximates the softmax kernel with positive random feature maps (FAVOR+).

Choromanski K, Likhosherstov V, Dohan D, et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 38


Query Prototyping
Derive a small set of prototype queries from all the queries; the prototypes serve as the main source of attention distributions. The remaining (represented) positions either reuse the attention scores of the prototypes or fall back to uniform distributions.

 Clustered Attention (Vyas et al., 2020)


 Informer (Zhou et al., AAAI 2021)

Xipeng Qiu (Fudan University) A Tutorial of Transformers 39


Informer
Intuition: If a query generates an attention distribution that is close to uniform, the attention result for this query is a trivial average of the values and is redundant for the attention mechanism.
 We only need to compute the queries that generate non-trivial attention distributions. This can be done by defining a sparsity measure (in the paper, a max-minus-mean approximation of the KL divergence from the uniform distribution).
 The attention distributions are only computed for the top-u queries under this measure (see the sketch after this slide).

Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond Efficient Transformer for
Long Sequence Time-Series Forecasting. AAAI 2021.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 40
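A simplified sketch of the query-selection idea; the sampling trick Informer uses to estimate the measure cheaply is omitted (so this version is still quadratic), and it only illustrates the max-minus-mean measure and the top-u selection.

import torch

def prob_sparse_attention(Q, K, V, u):
    """Q, K: (T, d); V: (T, d_v); u: number of active queries."""
    d = Q.shape[-1]
    scores = Q @ K.t() / d ** 0.5                                  # (T, T)
    sparsity = scores.max(dim=-1).values - scores.mean(dim=-1)     # max - mean measure per query
    top_u = sparsity.topk(u).indices                               # queries with non-trivial distributions
    # the remaining queries get the trivial average of the values
    out = V.mean(dim=0, keepdim=True).expand(Q.shape[0], V.shape[-1]).clone()
    attn = scores[top_u].softmax(dim=-1)
    out[top_u] = attn @ V
    return out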


Memory Compression

Reduce complexity by compressing the number of key-value pairs

 Memory Compressed Attention (Liu et al., ICLR 2018)


 Set Transformer
 Linformer
 Poolingformer (Zhang et al., ICML 2021)

Xipeng Qiu (Fudan University) A Tutorial of Transformers 41


Memory Compressed Attention (MCA)

 A strided convolution is used to compress the number of key-value pairs.
 MCA is used along with block local attention in the experiments.

Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating Wikipedia by Summarizing
Long Sequences, ICLR 2018

Xipeng Qiu (Fudan University) A Tutorial of Transformers 42
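A small PyTorch sketch of the key/value compression step with a strided 1-D convolution; the stride, kernel size, and module name are assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class CompressedKV(nn.Module):
    def __init__(self, d_model=512, stride=3):
        super().__init__()
        # strided convolutions shorten the sequence axis by roughly a factor of `stride`
        self.conv_k = nn.Conv1d(d_model, d_model, kernel_size=stride, stride=stride)
        self.conv_v = nn.Conv1d(d_model, d_model, kernel_size=stride, stride=stride)

    def forward(self, K, V):                 # (B, T, d_model) -> (B, ~T/stride, d_model)
        K = self.conv_k(K.transpose(1, 2)).transpose(1, 2)
        V = self.conv_v(V.transpose(1, 2)).transpose(1, 2)
        return K, V   # attention over the compressed memory costs O(T * T/stride)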


Low-Rank Self-Attention

Some empirical and theoretical analyses report that the self-attention matrix is often low-rank: the rank of the attention matrix is far lower than the input length 𝑇.

 For short inputs, using a key dimension larger than the sequence length results in over-parameterization, making the Transformer prone to overfitting. → low-rank parameterization
 For long inputs, the attention matrix can be replaced with a low-rank approximation to reduce the complexity.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 43
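As one concrete instance of the long-input case, a Linformer-style sketch that projects the length dimension of K and V down to a small k before attention; T, k, the class name, and the dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class LowRankAttention(nn.Module):
    def __init__(self, T=1024, k=64):
        super().__init__()
        self.E = nn.Linear(T, k, bias=False)   # projects keys along the sequence axis
        self.F = nn.Linear(T, k, bias=False)   # projects values along the sequence axis

    def forward(self, Q, K, V):                # Q, K, V: (B, T, d)
        K = self.E(K.transpose(1, 2)).transpose(1, 2)   # (B, k, d)
        V = self.F(V.transpose(1, 2)).transpose(1, 2)   # (B, k, d)
        attn = (Q @ K.transpose(-2, -1) / Q.shape[-1] ** 0.5).softmax(dim=-1)  # (B, T, k)
        return attn @ V                        # complexity O(T * k) instead of O(T^2)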


Three Properties From Linguistic Viewpoint
 Sparsity
 Intuitively, the number of dependency relations between tokens should be far lower than 𝐿².
 Locality
 Most dependency relations occur between adjacent tokens.
 Low-Rank
 After detaching the local relations, the remaining long-range relations are usually many-to-one relations.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 44


Model Analysis

Statistics of the learned attention matrices of a Transformer trained on SNLI and of the BERT model.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 45


Low-rank Transformer

Qipeng Guo, Xipeng Qiu, Xiangyang Xue, Zheng Zhang. Low-Rank and Locality Constrained Self-Attention for Sequence Modeling, IEEE/ACM
Transactions on Audio, Speech, and Language Processing, 2019.12. https://ieeexplore.ieee.org/document/8894858

Xipeng Qiu (Fudan University) A Tutorial of Transformers 46


Attention with Prior

 Modeling locality
 Local Transformer (Yang et al., 2018)
 Gaussian Transformer (Guo et al., 2019)
 Prior from lower modules
 Predictive Attention Transformer (Wang et al., 2020)
 RealFormer (He et al., 2020)
 Task related prior
 Conditionally Adaptive Multi-Task Learning (Pilault et al., ICLR2021)
 Attention with only prior
 Uniform:Average Attention Network (Zhang et al., ACL 2018)
 Gaussian:Hard-Coded Gaussian Attention (You et al., ACL 2020)
 Learnable:Random Synthesizer (Tay et al., ICML 2021)

Xipeng Qiu (Fudan University) A Tutorial of Transformers 47


Local Transformer (Yang et al., 2018)

 Predict a central position 𝑝𝑖 for each query 𝑞𝑖.
 A Gaussian bias term is calculated and added to the unnormalized attention score (see the sketch after this slide).
 The deviation 𝜎 can be a hyperparameter or predicted from the input.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 48
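A minimal sketch of adding the Gaussian bias to the unnormalized scores before the softmax; the tensor shapes and names are assumptions.

import torch

def gaussian_biased_attention(scores, p, sigma):
    """scores: (T, T) unnormalized attention scores; p: (T,) predicted central positions;
    sigma: (T,) deviations (hyperparameter or predicted). Returns the biased attention weights."""
    T = scores.shape[-1]
    j = torch.arange(T, dtype=scores.dtype)
    bias = -((j[None, :] - p[:, None]) ** 2) / (2 * sigma[:, None] ** 2)  # Gaussian bias term
    return (scores + bias).softmax(dim=-1)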


Improved Multi-head Mechanism
 Head Behavior Modeling
 Li et al., Multi-Head Attention with Disagreement Regularization, EMNLP 2018
 Talking-head Attention (Sukhbaatar et al., 2020)
 Collaborative multi-head Attention (Cordonnier et al., 2020)
 Restricted Span
 Adaptive Attention Span (Sukhbaatar, ACL 2019)
 Multi-scale Transformer (Guo et al., AAAI 2020)
 Information Aggregation with Dynamic Routing
 Li et al., Information Aggregation for Multi-Head Attention with Routing-by-Agreement, NAACL 2019
 Gu and Feng, Improving Multi-head Attention with Capsule Networks, NLPCC 2019
 Other variants
 Multi-query Attention (Shazeer, 2019): shares key-value pairs between heads for computational efficiency;
 Bhojanapalli et al., Low-Rank Bottleneck in Multi-head Attention Models, ICML 2020: establishes that the head dimension should be decoupled from the number of heads.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 49


Which scale should be used?
 The scale measures the distance between two endpoints of attention edges on
average.
 Small-Scale: Local patterns, N-gram
 Large-Scale: Non-Local patterns, long-term dependencies

Qipeng Guo, Xipeng Qiu, Pengfei Liu, Xiangyang Xue, Zheng Zhang. Multi-Scale Self-Attention for Text Classification, AAAI 2020,
https://arxiv.org/abs/1912.00544

Xipeng Qiu (Fudan University) A Tutorial of Transformers 50


Observations of Pre-trained models
 It is hard to make the trade-off.
 Take a look at what the model learned from data.

Results from BERT

Xipeng Qiu (Fudan University) A Tutorial of Transformers 51


Multi-Scale Self-Attention
 Different attention heads may work on different scales.
 There is a trend of scale over multiple layers.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 52


Overview

③arch-level variants

④pre-trained Transformers

②module-level variants ⑤task-specific variants

①Transformer background

Xipeng Qiu (Fudan University) A Tutorial of Transformers 53


Position Representations

 Absolute position
 Fixed sinusoidal encoding (vanilla)
 Learnable embeddings (BERT)
 Learnable sinusoidal encodings
 Relative position
 Shaw et al., 2018
 Transformer-XL
 T5
 Other representations
 TUPE
 Roformer
 Implicit representations
 Complex Embedding
 R-Transformer
 CPE

Xipeng Qiu (Fudan University) A Tutorial of Transformers 54


Relative Position Encodings: Transformer-XL

 Transformer-XL redesigns the computation of the attention score to capture both content and position interactions, combining relative position encodings with trainable bias terms (see the decomposition after this slide).

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer- XL: Attentive Language Models beyond a
Fixed-Length Context. ACL 2019

Xipeng Qiu (Fudan University) A Tutorial of Transformers 55
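For reference, the decomposition of the unnormalized attention score in Transformer-XL (notation lightly simplified from the paper; W_{k,E} and W_{k,R} project the content keys and the relative position encodings, and u, v are the trainable bias terms):

A^{\mathrm{rel}}_{i,j} =
  \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,E}\, E_{x_j}}_{\text{content-content}}
+ \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,R}\, R_{i-j}}_{\text{content-position}}
+ \underbrace{u^{\top} W_{k,E}\, E_{x_j}}_{\text{global content bias}}
+ \underbrace{v^{\top} W_{k,R}\, R_{i-j}}_{\text{global position bias}}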


Roformer

 Roformer uses rotary position embeddings (RoPE) that are multiplied with the queries and keys.
 It encodes absolute positional information, yet is translation invariant: the attention score depends only on the relative positional offset.
 It is compatible with linearized attention.

Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced Transformer with Rotary Position Embedding. 2021. arXiv:2104.09864

Xipeng Qiu (Fudan University) A Tutorial of Transformers 56
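A small sketch of applying a rotary embedding to a matrix of queries or keys; the base 10000 follows the usual convention, and the function name is illustrative.

import torch

def rotary_embed(x):
    """x: (T, d) with d even. Each 2-D feature slice is rotated by an angle proportional to its position."""
    T, d = x.shape
    pos = torch.arange(T, dtype=torch.float32)[:, None]                     # (T, 1)
    theta = 10000 ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)      # (d/2,)
    angles = pos * theta                                                    # (T, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# q and k are rotated before the dot product, so q_m . k_n depends only on the offset (m - n).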


Overview

③arch-level variants

④pre-trained Transformers

②module-level variants ⑤task-specific variants

①Transformer background

Xipeng Qiu (Fudan University) A Tutorial of Transformers 57


Layer Normalization (LN)

 Placement of LN
 Pre-LN: More stable training;
 Post-LN: Training could diverge – requires learning rate
warm-up, but could lead to better performance when
the model converges.
 Substitutes of LN
 AdaNorm
 scaled ℓ2 normalization
 PowerNorm (PN)
 …
 Norm-free Transformer
 ReZero-Transformer

Post-LN Pre-LN

Xipeng Qiu (Fudan University) A Tutorial of Transformers 58
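The placement difference in a nutshell (a sketch; `sublayer` stands for either self-attention or the FFN and `norm` for a LayerNorm instance, both supplied by the caller):

def post_ln_block(x, sublayer, norm):
    # Post-LN (vanilla Transformer): normalize after the residual addition
    return norm(x + sublayer(x))

def pre_ln_block(x, sublayer, norm):
    # Pre-LN: normalize the sublayer input; the residual path stays an identity
    return x + sublayer(norm(x))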


ReZero

 Inserts a learnable, zero-initialized scalar 𝛼 into each residual block: 𝑥' = 𝑥 + 𝛼·𝐹(𝑥)
 induces better dynamical isometry for input signals and leads to faster convergence.
 enables training of very deep (128-layer) Transformers.

Thomas Bachlechner, Bodhisattwa Prasad Majumder, Huanru Henry Mao, Garrison W. Cottrell, and Julian J. McAuley. ReZero is All You Need: Fast
Convergence at Large Depth. 2020

Xipeng Qiu (Fudan University) A Tutorial of Transformers 59
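A minimal sketch of the ReZero residual block (the wrapped sublayer is supplied by the caller):

import torch
import torch.nn as nn

class ReZeroBlock(nn.Module):
    def __init__(self, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.alpha = nn.Parameter(torch.zeros(1))   # zero-initialized, so the block starts as the identity

    def forward(self, x):
        return x + self.alpha * self.sublayer(x)    # no LayerNorm in the block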


Overview

③arch-level variants

④pre-trained Transformers

②module-level variants ⑤task-specific variants

①Transformer background

Xipeng Qiu (Fudan University) A Tutorial of Transformers 60


Position-wise FFN
 Activation
 ReLU, GELU (Gaussian Error Linear Unit), GLU (Gated Linear Unit)
 Using FFN to enlarge capacity
 Product-key memory layer (Lample et al., NeurIPS 2019)
 Mixture-of-Experts (Gshard, Switch Transformer, Hash Layer)
 Can we drop FFN?
 all-Attention layer (Sukhbaatar et al., 2019): merges FFN into attention module;
 Yang et al., On the Sub-layer Functionalities of Transformer Decoder, Findings of EMNLP 2020: points out that the decoder FFN (in the encoder-decoder Transformer) can be dropped without affecting performance.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 61


Product-Key Memory
One can replace (some of the) FFN layers with a key-value memory module holding a large number of key-value pairs to increase the network capacity.

 The top-k operation takes 𝑂(|𝒦| × 𝐷) complexity, where |𝒦| denotes the number of key-value pairs.

Guillaume Lample, Alexandre Sablayrolles, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. Large Memory Layers with Product Keys.
NeurIPS 2019

Xipeng Qiu (Fudan University) A Tutorial of Transformers 62


Product-Key Memory

 Product-key memory uses two sets of sub-keys, each of size √|𝒦|.
 The top-k operation now takes 𝑂(√|𝒦| × 𝐷) complexity (see the sketch after this slide).

Xipeng Qiu (Fudan University) A Tutorial of Transformers 63
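A simplified sketch of the product-key candidate search (the value lookup and multi-head memories are ignored); splitting the query in half and the row-major index mapping are assumptions consistent with the idea of taking the Cartesian product of two sub-key sets.

import torch

def product_key_topk(q, subkeys1, subkeys2, k=4):
    """q: (d,) query; subkeys1, subkeys2: (n, d/2) each, so the full key set has n*n product keys."""
    d = q.shape[0]
    q1, q2 = q[: d // 2], q[d // 2:]
    s1, i1 = (subkeys1 @ q1).topk(k)                 # best sub-keys per half: O(sqrt(|K|) * d)
    s2, i2 = (subkeys2 @ q2).topk(k)
    cand = (s1[:, None] + s2[None, :]).flatten()     # scores of the k*k candidate product keys
    best = cand.topk(k).indices
    rows, cols = best // k, best % k
    full_idx = i1[rows] * subkeys2.shape[0] + i2[cols]   # indices into the full n*n memory (row-major)
    return full_idx, cand[best]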


Overview

③arch-level variants

④pre-trained Transformers

②module-level variants ⑤task-specific variants

①Transformer background

Xipeng Qiu (Fudan University) A Tutorial of Transformers 64


Lightweight variants
 Some studies are dedicated to making the Transformer lightweight (in terms of computation / parameters)
 Lite Transformer (Wu et al., ICLR 2020): replaces self-attention with a two-branch module consisting of a convolution module and an attention module.
 Funnel Transformer: compresses the number of intermediate representations, resulting in lower FLOPs and memory footprint.
 DeLighT: uses the DeLighT transformation to learn wider representations with low computation, so the attention and FFN can be simplified.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 65


DeLighT

Sachin Mehta, Marjan Ghazvininejad, Srinivasan Iyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2020. DeLighT: Very Deep and Light-weight
Transformer. arXiv:2008.00623

Xipeng Qiu (Fudan University) A Tutorial of Transformers 66


Inter-Block Connectivity
 Some existing studies modify the architecture of the Transformer by adding paths along which input signals run through the network.
 RealFormer, Predictive Attention Transformer: create an additional path from the previous attention module to the current one.
 Transparent Attention: creates paths from each encoder layer to the decoder layers.
 Feedback Transformer: adds paths from upper layers to lower layers.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 67


Transparent Attention
 Each cross-attention module attends to a weighted average of encoder representations from all layers (the mixing weights are trainable parameters).
 The modification creates more paths for the backward error signal, and thus eases optimization of deep Transformers.

Ankur Bapna, Mia Chen, Orhan Firat, Yuan Cao, and Yonghui Wu. Training Deeper Neural Machine Translation Models with Transparent Attention.
EMNLP 2018.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 68


Adaptive Computation Time

Xipeng Qiu (Fudan University) A Tutorial of Transformers 69


Universal Transformer

 The Universal Transformer with


dynamic halting determines the
number of steps T for each position.
 The parameters are tied across
positions and time steps.
 Once the per-symbol recurrent block
halts, its state is simply copied to the
next step until all blocks halt, or we
reach a maximum number of steps.

Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal Transformers. ICLR 2019

Xipeng Qiu (Fudan University) A Tutorial of Transformers 70


Recurrent Transformers
 The recurrent mechanism splits the input sequence into shorter segments and processes them one at a time. For each segment, the model reads a cache memory from the previous segment and writes representations of the current segment back to the cache.

 Dai et al., Transformer-XL: Attentive Language Models beyond a Fixed-Length Context, 2019
 Rae et al., Compressive Transformers for Long-Range Sequence Modelling, 2020
 Wu et al., Memformer: The Memory-Augmented Transformer, 2020
 Yoshida et al., Adding Recurrence to Pretrained Transformers for Improved Efficiency and Context Size, 2020
 …

Xipeng Qiu (Fudan University) A Tutorial of Transformers 71


Transformer-XL

(Figure 2 from Dai et al.: illustration of the Transformer-XL model with a segment length of 4; (a) training phase, (b) evaluation phase with extended context.)

 The hidden representations at each layer (including the embeddings) are cached and used to produce additional key-value memory for the next segment.
 Extends the effective context length by 𝐿 × 𝑁_mem, where 𝑁_mem is the length of the cache memory.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. ACL 2019

Xipeng Qiu (Fudan University) A Tutorial of Transformers 72
Compressive Transformer

 Similar to Transformer-XL, except that the old cache memory is not discarded but pushed into a compressed memory using some compression function.
 Extends the effective context length by 𝐿 × (𝑁_mem + 𝑐·𝑁_cm), where 𝑐 is the compression rate and 𝑁_cm is the size of the compressed memory.

Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. Compressive Transformers for Long-Range Sequence
Modelling. ICLR 2020

Xipeng Qiu (Fudan University) A Tutorial of Transformers 73


Hierarchical Transformers
 The hierarchical mechanism organizes input sequences into a hierarchy. A low-level Transformer module processes low-level features, and the extracted representations are fed to a higher-level Transformer module.

 HIBERT (Zhang et al., ACL 2019)


 TENER (Yan et al., 2019)
 Transformer in Transformer (Han et al., 2021)
 …

Xipeng Qiu (Fudan University) A Tutorial of Transformers 74


HIBERT

(Figure panels: pre-training; extractive summarization.)


Xingxing Zhang, Furu Wei, and Ming Zhou. HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document
Summarization. ACL 2019

Xipeng Qiu (Fudan University) A Tutorial of Transformers 75


Exploring Alternative Architectures
 Is the commonly used Transformer architecture (i.e., SAN + FFN) optimal?
 Several studies have suggested that better architectures exist (for specific tasks).

 Neural ODE view (Macaron Transformer)
 Neural architecture search (Evolved Transformer; DARTSformer)
 Layer re-ordering (Sandwich Transformer)
 …

Xipeng Qiu (Fudan University) A Tutorial of Transformers 76


Evolved Transformer

 The Evolved Transformer (ET) employs evolution-based architecture search, with the standard Transformer seeding the initial population.
 The searched architecture consistently outperforms the standard Transformer on several machine translation datasets and the LM1B language modeling benchmark.

David R. So, Quoc V. Le, and Chen Liang. The Evolved Transformer. ICML 2019

Xipeng Qiu (Fudan University) A Tutorial of Transformers 77


Overview

③arch-level variants

④pre-trained Transformers

②module-level variants ⑤task-specific variants

①Transformer background

Xipeng Qiu (Fudan University) A Tutorial of Transformers 78


Pre-trained Transformers
 The Transformer has little structural bias, making it prone to overfitting on small-scale data, which limits its applications. One way to avoid this limitation is to pre-train the Transformer on a large-scale corpus.

 Encoder only: BERT, RoBERTa, BigBird
 Decoder only: GPT series
 Encoder-Decoder: BART, T5, Switch Transformer

Pre-trained Models for Natural Language Processing: A Survey, https://arxiv.org/abs/2003.08271

Xipeng Qiu (Fudan University) A Tutorial of Transformers 79


Pre-trained Models for Transformers

Pre-trained Models for Natural Language Processing: A Survey, https://arxiv.org/abs/2003.08271

Xipeng Qiu (Fudan University) A Tutorial of Transformers 80


BERT
BERT uses a Transformer encoder to encode representations for each token in the sequence.

 The model is pre-trained with two self-supervised training objectives
 Masked Language Modeling (MLM)
 Next Sentence Prediction (NSP)

 One of the milestone models among pre-trained models (PTMs)

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding. HLT-NAACL 2019

Xipeng Qiu (Fudan University) A Tutorial of Transformers 81


GPT
GPT uses only the Transformer decoder for representation learning.

 The model is pre-trained on a language modeling task: maximizing the likelihood of the texts in the corpus.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018

Xipeng Qiu (Fudan University) A Tutorial of Transformers 82


T5
T5 uses an encoder-decoder architecture. Several downstream tasks (including machine translation, question answering, and classification) are cast as seq2seq tasks.

 The unsupervised training objective is a denoising one: the model needs to predict corrupted spans from the input sequences.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring
the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv:1910.10683

Xipeng Qiu (Fudan University) A Tutorial of Transformers 83


Overview

③arch-level variants

④pre-trained Transformers

②module-level variants ⑤task-specific variants

①Transformer background

Xipeng Qiu (Fudan University) A Tutorial of Transformers 84


Applications
 NLP
 Transformer, BERT, Compressive Transformer, TENER, FLAT
 CV
 Image Transformer, DETR, ViT, Swin Transformer, ViViT
 Audio
 Speech Transformer, Streaming Transformer, Reformer-TTS, Music Transformer
 Multimodal
 VisualBERT, VLBERT, VideoBERT, M6, Chimera, DALL-E, CogView

Xipeng Qiu (Fudan University) A Tutorial of Transformers 85


Vision Transformer (ViT)

Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).

Xipeng Qiu (Fudan University) A Tutorial of Transformers 86


MusicBERT

Zeng, Mingliang, et al. "MusicBERT: Symbolic Music Understanding with Large-Scale Pre-Training." arXiv preprint arXiv:2106.05630 (2021).

Xipeng Qiu (Fudan University) A Tutorial of Transformers 87


TENER: Adapting Transformer Encoder for NER
 The relative direction is important in the NER task.
 Words before "Inc." are most likely part of an organization; words after "in" are more likely to be a time or location.
 The distance between words is also important.
 Only contiguous words can form an entity; a distant "Louis Vuitton" cannot form an entity with "Inc.".

Hang Yan, Bocao Deng, Xiaonan Li, Xipeng Qiu. TENER: Adapting Transformer Encoder for Named Entity Recognition,
https://arxiv.org/abs/1911.04474

Xipeng Qiu (Fudan University) A Tutorial of Transformers 88


Observations
 Observation 1: Position Encoding in Transformer
 The sinusoidal position embedding used in the vanilla Transformer is aware of distance
but unaware of the directionality.

 Observation 2: Smooth Attention in Transformer
 The attention distribution of the vanilla Transformer is scaled and smooth.
 But for NER, sparse attention is more suitable, since not all words need to be attended to.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 89


Two Improvements
 Direction- and Distance-Aware Attention
 Improve the Transformer with direction- and distance-aware attention

 Un-scaled Dot-Product Attention
 The model performs better without the scaling factor 1/√𝐷_k.
 The attention is sharper without the scaling factor.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 90


TENER

Xipeng Qiu (Fudan University) A Tutorial of Transformers 91


FLAT: Chinese NER Using Flat-Lattice Transformer

Xiaonan Li, Hang Yan, Xipeng Qiu, Xuanjing Huang. FLAT: Chinese NER Using Flat-Lattice Transformer, ACL 2020,
https://arxiv.org/abs/1911.04474

Xipeng Qiu (Fudan University) A Tutorial of Transformers 92


CoLAKE: Contextualized Language and Knowledge Embedding
 Word-knowledge Graph

Tianxiang Sun, Yunfan Shao, Xipeng Qiu, Qipeng Guo, Yaru Hu, Xuanjing Huang, Zheng Zhang, CoLAKE: Contextualized Language and Knowledge
Embedding, COLING 2020, https://arxiv.org/abs/2010.00309

Xipeng Qiu (Fudan University) A Tutorial of Transformers 93


Transformer Arch. of CoLAKE
 Modify the Transformer for the word-knowledge graph
 The word-knowledge graph is a positional heterogeneous graph.
 Position embedding
 Type embedding
 Masked self-attention

Tianxiang Sun, Yunfan Shao, Xipeng Qiu, Qipeng Guo, Yaru Hu, Xuanjing Huang, Zheng Zhang, CoLAKE: Contextualized Language and Knowledge
Embedding, COLING 2020, https://arxiv.org/abs/2010.00309

Xipeng Qiu (Fudan University) A Tutorial of Transformers 94


VL-BERT

Su W, Zhu X, Cao Y, et al. VL-BERT: Pre-training of generic visual-linguistic representations. 2019. https://arxiv.org/abs/1908.08530

Xipeng Qiu (Fudan University) A Tutorial of Transformers 95


Summary

Xipeng Qiu (Fudan University) A Tutorial of Transformers 96


Future Directions

CNN → RNN → Transformer → ?

✓ Efficiency
✓ Length limit
✓ Overfitting
✓ Pre-training
✓ Theoretic analysis
✓ Alternative Architecture
✓ Unified multi-modal structure

Xipeng Qiu (Fudan University) A Tutorial of Transformers 97


An Introduction to fastNLP

https://github.com/fastnlp/fastNLP
https://gitee.com/fastnlp/fastNLP

Xipeng Qiu (Fudan University) A Tutorial of Transformers 98


An Open-Source Framework for Natural Language Understanding

Xipeng Qiu (Fudan University) A Tutorial of Transformers 99


An Open-Source Framework for Natural Language Understanding

◼ Related projects have 5K+ stars in total on GitHub/Gitee
◼ According to pypistats.org, 100+ downloads per day on average

Toolkits built on fastNLP: FastHan (Chinese processing), FastMatch (text matching), FastSum (text summarization), FastRE (sequence labeling), and fitlog (the experiment-tracking tool introduced below)

fastNLP: a computation framework targeted at natural language understanding, built on a general-purpose computation layer and running on CPU, GPU, Ascend, and Phytium hardware.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 100
An Open-Source Framework for Natural Language Understanding

 Chinese comments, Chinese documentation, Chinese tasks
 Unified data containers and unified processing pipelines
 Out-of-the-box pre-trained models
 A variety of neural network components
 Efficient training and testing pipelines
 Downstream applications (20+ NLP tasks)

Xipeng Qiu (Fudan University) A Tutorial of Transformers 101


A tabular data structure: fastNLP.core.DataSet

(Figure: a DataSet holds instances as rows, e.g. instance 1 = (raw_words1, target1), instance 2 = (raw_words2, target2), ..., while batches are drawn from its columns.)

People usually make sense of data one sample at a time, i.e., row by row; batch processing, on the other hand, prefers to read and process data column by column.
Xipeng Qiu (Fudan University) A Tutorial of Transformers 102


Efficiently switch between GloVe, ELMo, BERT, etc.

Code example:
# use two kinds of embeddings at the same time
embed = StackEmbedding([glove_embed, word2vec_embed])

# using a character embedding is as easy as a word embedding
char_embed = CNNCharEmbedding(vocab)
embed = StackEmbedding([char_embed, glove_embed])

These embedding classes behave like PyTorch's nn.Embedding, so there is essentially no difference in usage, which greatly lowers the barrier to entry.

Contextual embeddings: switch between ELMo, BERT, and RoBERTa with one line of code
elmo_embed = ElmoEmbedding(vocab, model_dir_or_name='en-original')
bert_embed = BertEmbedding(vocab, model_dir_or_name='en')
roberta_embed = RoBERTaEmbedding(vocab, model_dir_or_name='en-base')

embed = StackEmbedding([glove_embed, elmo_embed, bert_embed])

Xipeng Qiu (Fudan University) A Tutorial of Transformers 103


fastHan: a Chinese NLP toolkit
https://github.com/fastnlp/fasthan
https://gitee.com/fastnlp/fasthan

Achieves current SOTA results on more than ten datasets.
Xipeng Qiu (Fudan University) A Tutorial of Transformers 104
fitlog = fast + git + log: a visual, interactive hyperparameter-tuning tool

(Screenshot annotations: each row is one experiment record; statistics over multiple rows are computed automatically; complex search syntax is supported; drop-down filtering; visualization of convergence curves; directly editable notes.)
Xipeng Qiu (Fudan University) A Tutorial of Transformers 105


fitlog: a visual, interactive hyperparameter-tuning tool
https://github.com/fastnlp/fitlog
https://gitee.com/fastnlp/fitlog

A tool that automatically manages code versions, hyperparameters, and experiment results. To use it, simply add the following to your Python code.

Code example (manual logging):

import fitlog

# set the log directory
fitlog.set_log_dir('logs')
# record hyperparameters
fitlog.add_hyper(args)

for epoch in range(n_epochs):
    for step in range(num_step_per_epoch):
        # record the loss
        fitlog.add_loss(loss, name='loss', step=step, epoch=epoch)
        # record a metric
        fitlog.add_metric(f1, name='f1', epoch=epoch, step=step)
        if better_result:
            # record the best result
            fitlog.add_best_metric(f1, name='f1')

Code example (with fastNLP's Trainer):

import fitlog

# set the log directory
fitlog.set_log_dir('logs')
# record hyperparameters
fitlog.add_hyper(args)

# use FitlogCallback to record results directly
Trainer(data_bundle.get_dataset('train'), model,
        dev_data=data_bundle.get_dataset('dev'),
        metrics=AccuracyMetric(),
        callback=FitlogCallback()).train()
Xipeng Qiu (Fudan University) A Tutorial of Transformers 106
fastSum: an open-source text summarization toolkit  https://github.com/fastnlp/fastSum

Datasets
fastSum provides 12 classic summarization datasets such as CNN/DailyMail and XSum, each with its own dataloader (e.g., CNNDMLoader, XsumLoader) so that preprocessed datasets can be downloaded automatically from the server.

Classic models
fastSum provides classic extractive and abstractive models:
(1) Extractive models: basic LSTM-based and Transformer-based sequence labeling models, plus models such as BERTSUMEXT and MatchSum that perform well on extractive summarization.
(2) Abstractive models: an LSTM model with a pointer network and coverage mechanism, as well as BERTSUMABS.

Evaluation
fastSum provides two ROUGE metrics based on two different ROUGE packages: FastRougeMetric and PyRougeMetric.
Xipeng Qiu (Fudan University) A Tutorial of Transformers 107
