A Tutorial of Transformers (2021-06), Xipeng Qiu


VALSE Tutorial

A Tutorial of Transformers

Xipeng Qiu
Fudan University
June 20, 2021
https://xpqiu.github.io
Transformer?

A Neural Network!

Xipeng Qiu (Fudan University) A Tutorial of Transformers 2


Transformer

Xipeng Qiu (Fudan University) A Tutorial of Transformers 3


The Vanilla Transformer

Vaswani, Ashish, et al. "Attention is All you Need." NIPS. 2017.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 4


Transformer Variants (X-formers)

Xipeng Qiu (Fudan University) A Tutorial of Transformers 5


More Details

Tianyang Lin, Yuxin Wang, Xiangyang Liu, Xipeng Qiu. A Survey of Transformers. https://arxiv.org/abs/2106.04554

Xipeng Qiu. Neural Networks and Deep Learning (神经网络与深度学习, in Chinese). China Machine Press, 2020.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 6


Overview

③arch-level variants

④pre-trained Transformers

②module-level variants ⑤task-specific variants

①Transformer background

Xipeng Qiu (Fudan University) A Tutorial of Transformers 7


Self-attention and Transformer

Xipeng Qiu (Fudan University) A Tutorial of Transformers 8


Language Representation Learning
 How to represent meaning in a machine?
 • Dense vectors
 • a.k.a. embeddings

KG + rules vs. distributed representation

Xipeng Qiu (Fudan University) A Tutorial of Transformers 9


General Arch. in NLP

Model Driven
+
Data Driven

Xipeng Qiu (Fudan University) A Tutorial of Transformers 10


Sequence-to-Sequence (Seq2Seq)

Auto-Regressive Model

(Figure: an encoder-decoder model for machine translation. The encoder reads the source "Machine Learning"; the decoder produces the outputs "机 器 学 习 $" auto-regressively, with the shifted outputs fed back as token embeddings.)

Xipeng Qiu (Fudan University) A Tutorial of Transformers 11


How to design a model?
 Linguistic Properties
 Sequential
 Hierarchical
 Recursive

 Semantic Composition
 Long-term Dependency
 Polysemy
 Context Matters

请 给 听课 的 同学 买 个 苹果 ("Please buy an apple / an Apple for the students attending the lecture": 苹果 can mean either the fruit or the brand, so context matters)

Xipeng Qiu (Fudan University) A Tutorial of Transformers 12


Mainstream Neural Nets
 When processing variable-length sequences of vectors with neural nets, we usually turn to the following models.

 Implicit prior: local compositionality
 they model local dependencies within the input sequence

Xipeng Qiu (Fudan University) A Tutorial of Transformers 13


How to model long-term dependencies directly?
 Full connection

 The weights 𝛼𝑖𝑗 are generated dynamically with the attention mechanism

Xipeng Qiu (Fudan University) A Tutorial of Transformers 14


Attention Mechanism
 The attention mechanism consists of two steps (see the sketch after this slide)
 Calculate the attention distribution 𝛼: 𝛼_𝑖 = softmax(𝑠(𝑥_𝑖, 𝑞)), where 𝑠(𝑥_𝑖, 𝑞) is a scoring function
 Compute the weighted average of the inputs with 𝛼: att(𝑋, 𝑞) = Σ_𝑖 𝛼_𝑖 𝑥_𝑖

Xipeng Qiu (Fudan University) A Tutorial of Transformers 15
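A minimal NumPy sketch of the two steps (scoring, softmax, weighted average); the scaled dot-product scoring function and all names are illustrative assumptions, not the slide's exact formulation.

import numpy as np

def attention(X, q):
    """X: (T, D) input vectors; q: (D,) query vector."""
    scores = X @ q / np.sqrt(X.shape[-1])   # s(x_i, q): scaled dot-product scoring
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()             # attention distribution (softmax)
    return alpha @ X                        # weighted average of the inputs

X = np.random.randn(5, 8)   # T=5 tokens, D=8
q = np.random.randn(8)
context = attention(X, q)   # shape (8,)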


Apply attention mechanism to model language

Self-Attention

pic source: http://fuyw.top/NLP_02_QANet/
Xipeng Qiu (Fudan University) A Tutorial of Transformers 16
Query-Key-Value (QKV) Model

Xipeng Qiu (Fudan University) A Tutorial of Transformers 17


Multi-head Self-Attention

Xipeng Qiu (Fudan University) A Tutorial of Transformers 18
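A compact PyTorch sketch of multi-head self-attention with QKV projections (an illustration, not the slide's exact formulation; masking and dropout are omitted, and the dimensions are assumptions).

import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_k = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x):                       # x: (B, T, d_model)
        B, T, _ = x.shape
        def split(t):                           # (B, T, d_model) -> (B, h, T, d_k)
            return t.view(B, T, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5   # (B, h, T, T)
        attn = scores.softmax(dim=-1)
        out = attn @ v                          # (B, h, T, d_k)
        out = out.transpose(1, 2).reshape(B, T, -1)          # concatenate the heads
        return self.w_o(out)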


Multi-Layer Self-Attention

Xipeng Qiu (Fudan University) A Tutorial of Transformers 19


Transformer (Vaswani et al., 2017)
 Broadly speaking, Transformer is a
model built with self-attention.
 Core module
 Self-Attention

 Besides self-attention:
 Position representations
 Layer Normalization
 Skip connection
 Position-wise FFN

 Model usage:
 Encoder only
 Encoder-decoder
 Decoder only

Xipeng Qiu (Fudan University) A Tutorial of Transformers 20


Model Analysis

 When the input sequence length 𝑇 is short, the model dimension 𝐷 dominates the complexity of both self-attention (𝑂(𝑇²·𝐷)) and the FFN (𝑂(𝑇·𝐷²)).
 The bottleneck of the network lies in the FFN for short inputs.
 For long input sequences, the sequence length dominates the complexity.
 Self-attention is inefficient at handling long inputs.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 21


Comparing to other network types

 Self-Attention has constant max path length (like fully connected layer), which is
suitable for long-range dependency modeling.
 It is more parallelizable than recurrent layers.
 It has a global receptive field, so it does not require stacking layers to model global dependencies (as convolutional layers do).

Xipeng Qiu (Fudan University) A Tutorial of Transformers 22


Comparing to other network types (inductive bias)
 Convolutional networks
 Translation invariance (shared kernel across spatial positions)
 Locality (restricted window)

 Recurrent networks
 Temporal invariance (shared function across timesteps)
 Locality (Markovian structure)

 Transformer
 No structural prior (prone to overfitting in small-scale data)
 Permutation equivariance (requires position representations to encode sequences)

 Transformer vs. Graph Neural Network


 Transformer can be understood as a GNN defined over complete directed graphs (w/ self-loop)
 No explicit structure. Message passing depends solely on similarity measures over contents.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 23


Improvement Methods
 Model Efficiency
 Lightweight attention (e.g., sparse attention variants) and divide-and-conquer methods (e.g., recurrent and hierarchical mechanisms).
 Model Generalization
 Since the transformer is a flexible architecture and makes few assumptions on the
structural bias of input data, it is hard to train on small-scale data.
 introducing structural bias or regularization, pre-training on large-scale unlabeled data,
etc.
 Model Adaptation
 adapting the Transformer to specific downstream tasks and applications.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 24


A taxonomy of X-formers (Transformer variants)

Xipeng Qiu (Fudan University) A Tutorial of Transformers 25


History of X-formers (Attention)
Timeline of attention-module variants, 2017–2021:

2017: Transformer
2018: Image Transformer, Memory Compressed Attention, Local Transformer, Average Attention, Li et al. 2018
2019: Star-Transformer, Sparse Transformer, BP-Transformer, Axial Transformer, Set Transformer, low-rank and locality constrained attention, Gaussian Transformer, Adaptive Attention Span, Dynamic Routing
2020: ETC, Longformer, BigBird, Routing Transformer, Reformer, SAC, Sparse Sinkhorn Attention, Linear Transformer, Performer, Clustered Attention, Linformer, CSALR, Predictive Attention Transformer, RealFormer, Hard-Coded Gaussian Attention, Synthesizer, Deshpande and Narasimhan 2020, Talking-head Attention, Collaborative MHA, Multi-Scale Transformer
2021: RFA, DPFP, Informer, Poolingformer, Luna, Nyströmformer, LazyFormer, CAMTL

Legend (categories): sparse attention (position based), sparse attention (content based), linearized attention, query prototyping, memory compression, low-rank, prior attention, improved multi-head mechanism
Xipeng Qiu (Fudan University) A Tutorial of Transformers 26


Overview

③arch-level variants

④pre-trained Transformers

②module-level variants ⑤task-specific variants

①Transformer background

Xipeng Qiu (Fudan University) A Tutorial of Transformers 27


Overview

③arch-level variants

④pre-trained Transformers

②module-level variants ⑤task-specific variants

①Transformer background

Xipeng Qiu (Fudan University) A Tutorial of Transformers 28


QKV Attention (Recall)
How to reduce
the complexity?

Xipeng Qiu (Fudan University) A Tutorial of Transformers 29


Position-based Sparse Attention

 Atomic sparse patterns

Xipeng Qiu (Fudan University) A Tutorial of Transformers 30
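As an illustration of the atomic patterns, a small NumPy sketch that builds an allowed-attention mask combining a band (local window) pattern with a few global tokens; the window size and the number of global tokens are arbitrary assumptions.

import numpy as np

def band_mask(T, window=3, n_global=1):
    """Allowed-attention mask (True = may attend) for a band pattern plus a few global tokens."""
    idx = np.arange(T)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window   # band (local window) pattern
    mask[:, :n_global] = True   # every position may attend to the global tokens
    mask[:n_global, :] = True   # global tokens attend everywhere
    return mask

print(band_mask(8, window=1, n_global=1).astype(int))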


Position-based Sparse Attention

 Compound sparse patterns

Xipeng Qiu (Fudan University) A Tutorial of Transformers 31


Star-Transformer
 Reduce the number of connections while keeping the path length between each pair of nodes short
 introduces a global memory node that serves as the hub

 Complexity: 𝑂(2𝐿𝐷)
 Prior of locality
 Free from position representations
 More suitable for small- or mid-scale data

 BERT shows that such a global memory is also beneficial to the Transformer.
Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, Zheng Zhang. Star-Transformer, NAACL 2019,
https://arxiv.org/pdf/1902.09113.pdf

Xipeng Qiu (Fudan University) A Tutorial of Transformers 32


Star-Transformer

Xipeng Qiu (Fudan University) A Tutorial of Transformers 33


Position-based Sparse Attention

 Extended sparse patterns

Xipeng Qiu (Fudan University) A Tutorial of Transformers 34


BP-Transformer (BPT)
Binary Partitioning

BP-Transformer can be viewed as introducing hierarchical global external nodes; any two sequence tokens are connected to each other through a path in the binary tree.

Zihao Ye, Qipeng Guo, Quan Gan, Xipeng Qiu, Zheng Zhang. BP-Transformer: Modelling Long-Range Context via Binary Partitioning,
https://arxiv.org/abs/1911.04070

Xipeng Qiu (Fudan University) A Tutorial of Transformers 35


Content-based Sparse Attention
 One can also condition the sparse connection on the inputs.
 e.g., use some low-complexity method to filter out the key-value pairs of high similarity to each query.

 Reformer (Kitaev et al., ICLR 2020)
 restricts the set P_i of positions that a query position i can attend to via locality-sensitive hashing (LSH), by only allowing attention within a single hash bucket: P_i = {j : h(q_i) = h(k_j)}

 Routing Transformer (Roy et al., TACL 2020)

(Figure from Kitaev et al.: a simplified depiction of LSH attention, showing the hash-bucketing, sorting, and chunking steps and the resulting causal attention matrices.)

Xipeng Qiu (Fudan University) A Tutorial of Transformers 36
Linearized Attention

 Disentangle the attention into 𝐷^(-1) 𝜙(𝑄) (𝜙(𝐾)^⊤ 𝑉), so that the "memory matrix" 𝜙(𝐾)^⊤ 𝑉 is computed first and the computation is done in reversed order.
 Complexity of 𝑂(𝑇)

 Design choices (see the sketch after this slide):
 How to choose/design the feature map 𝜙(·)?
 How to aggregate the key-value associations into the memory matrix?

(Figure: standard attention vs. linearized attention, with the memory matrix 𝜙(𝐾)^⊤ 𝑉.)

Xipeng Qiu (Fudan University) A Tutorial of Transformers 37
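A minimal PyTorch sketch of linearized attention with the elu(x)+1 feature map used by the Linear Transformer (Katharopoulos et al., 2020); the shapes and the einsum formulation are illustrative.

import torch
import torch.nn.functional as F

def linearized_attention(Q, K, V, eps=1e-6):
    """Q, K: (B, T, d_k); V: (B, T, d_v). Feature map phi(x) = elu(x) + 1."""
    phi_q = F.elu(Q) + 1
    phi_k = F.elu(K) + 1
    S = torch.einsum('btd,bte->bde', phi_k, V)        # memory matrix: sum_t phi(k_t) v_t^T, (B, d_k, d_v)
    z = phi_k.sum(dim=1)                              # normalizer: sum_t phi(k_t), (B, d_k)
    num = torch.einsum('btd,bde->bte', phi_q, S)      # phi(q_t)^T S
    den = torch.einsum('btd,bd->bt', phi_q, z) + eps  # phi(q_t)^T z
    return num / den.unsqueeze(-1)                    # linear in the sequence length T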


Performer

Performer approximates the softmax kernel with positive random feature maps (FAVOR+).

Choromanski K, Likhosherstov V, Dohan D, et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 38


Query Prototyping
Derive a small set of prototype queries from all the queries; the prototypes serve as the main source of attention distributions. The remaining (represented) positions either reuse the attention scores of the prototypes or fall back to uniform distributions.

 Clustered Attention (Vyas et al., 2020)


 Informer (Zhou et al., AAAI 2021)

Xipeng Qiu (Fudan University) A Tutorial of Transformers 39


Informer
Intuition: If a query generates an attention distribution that is close to uniform, the attention result for this query is a trivial average of the values and is redundant for the attention mechanism.
 We only need to compute the queries that generate non-trivial attention distributions. This can be done by defining a sparsity measure (in the paper, a max-minus-mean approximation of the KL divergence from the uniform distribution).
 The attention distributions are only computed for the top-u queries under this measure (see the sketch after this slide).

Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond Efficient Transformer for
Long Sequence Time-Series Forecasting. AAAI 2021.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 40
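A simplified sketch of the query-selection idea; the sampling trick Informer uses to estimate the measure cheaply is omitted (so this version is still quadratic), and it only illustrates the max-minus-mean measure and the top-u selection.

import torch

def prob_sparse_attention(Q, K, V, u):
    """Q, K: (T, d); V: (T, d_v); u: number of active queries."""
    d = Q.shape[-1]
    scores = Q @ K.t() / d ** 0.5                                  # (T, T)
    sparsity = scores.max(dim=-1).values - scores.mean(dim=-1)     # max - mean measure per query
    top_u = sparsity.topk(u).indices                               # queries with non-trivial distributions
    # the remaining queries get the trivial average of the values
    out = V.mean(dim=0, keepdim=True).expand(Q.shape[0], V.shape[-1]).clone()
    attn = scores[top_u].softmax(dim=-1)
    out[top_u] = attn @ V
    return out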


Memory Compression

Reduce complexity by compressing the number of key-value pairs

 Memory Compressed Attention (Liu et al., ICLR 2018)


 Set Transformer
 Linformer
 Poolingformer (Zhang et al., ICML 2021)

Xipeng Qiu (Fudan University) A Tutorial of Transformers 41


Memory Compressed Attention (MCA)

 A strided convolution is used to compress the number of key-value pairs.
 MCA is used along with block local attention in the experiments.

Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating Wikipedia by Summarizing
Long Sequences, ICLR 2018

Xipeng Qiu (Fudan University) A Tutorial of Transformers 42
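A small PyTorch sketch of the key/value compression step with a strided 1-D convolution; the stride, kernel size, and module name are assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class CompressedKV(nn.Module):
    def __init__(self, d_model=512, stride=3):
        super().__init__()
        # strided convolutions shorten the sequence axis by roughly a factor of `stride`
        self.conv_k = nn.Conv1d(d_model, d_model, kernel_size=stride, stride=stride)
        self.conv_v = nn.Conv1d(d_model, d_model, kernel_size=stride, stride=stride)

    def forward(self, K, V):                 # (B, T, d_model) -> (B, ~T/stride, d_model)
        K = self.conv_k(K.transpose(1, 2)).transpose(1, 2)
        V = self.conv_v(V.transpose(1, 2)).transpose(1, 2)
        return K, V   # attention over the compressed memory costs O(T * T/stride)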


Low-Rank Self-Attention

Some empirical and theoretical analyses report that the self-attention matrix is often low-rank: the rank of the attention matrix is far lower than the input length 𝑇.

 For short inputs, using a key dimension larger than the sequence length results in over-parameterization, making the Transformer prone to overfitting. → low-rank parameterization
 For long inputs, the attention matrix can be replaced with a low-rank approximation to reduce the complexity.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 43
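As one concrete instance of the long-input case, a Linformer-style sketch that projects the length dimension of K and V down to a small k before attention; T, k, the class name, and the dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class LowRankAttention(nn.Module):
    def __init__(self, T=1024, k=64):
        super().__init__()
        self.E = nn.Linear(T, k, bias=False)   # projects keys along the sequence axis
        self.F = nn.Linear(T, k, bias=False)   # projects values along the sequence axis

    def forward(self, Q, K, V):                # Q, K, V: (B, T, d)
        K = self.E(K.transpose(1, 2)).transpose(1, 2)   # (B, k, d)
        V = self.F(V.transpose(1, 2)).transpose(1, 2)   # (B, k, d)
        attn = (Q @ K.transpose(-2, -1) / Q.shape[-1] ** 0.5).softmax(dim=-1)  # (B, T, k)
        return attn @ V                        # complexity O(T * k) instead of O(T^2)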


Three Properties From Linguistic Viewpoint
 Sparsity
 Intuitively, the number of dependency relations between tokens should be far lower than 𝐿².
 Locality
 Most dependency relations occur between adjacent tokens.
 Low-Rank
 After detaching the local relations, the remaining long-range relations are usually many-to-one relations.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 44


Model Analysis

Statistics of the learned attention matrices of a Transformer trained on SNLI and of the BERT model.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 45


Low-rank Transformer

Qipeng Guo, Xipeng Qiu, Xiangyang Xue, Zheng Zhang. Low-Rank and Locality Constrained Self-Attention for Sequence Modeling, IEEE/ACM
Transactions on Audio, Speech, and Language Processing, 2019.12. https://ieeexplore.ieee.org/document/8894858

Xipeng Qiu (Fudan University) A Tutorial of Transformers 46


Attention with Prior

 Modeling locality
 Local Transformer (Yang et al., 2018)
 Gaussian Transformer (Guo et al., 2019)
 Prior from lower modules
 Predictive Attention Transformer (Wang et al., 2020)
 RealFormer (He et al., 2020)
 Task related prior
 Conditionally Adaptive Multi-Task Learning (Pilault et al., ICLR2021)
 Attention with only prior
 Uniform:Average Attention Network (Zhang et al., ACL 2018)
 Gaussian:Hard-Coded Gaussian Attention (You et al., ACL 2020)
 Learnable:Random Synthesizer (Tay et al., ICML 2021)

Xipeng Qiu (Fudan University) A Tutorial of Transformers 47


Local Transformer (Yang et al., 2018)

 Predict a central position 𝑝𝑖 for each query 𝑞𝑖.
 A Gaussian bias term is calculated and added to the unnormalized attention score (see the sketch after this slide).
 The deviation 𝜎 can be a hyperparameter or predicted from the input.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 48
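A minimal sketch of adding the Gaussian bias to the unnormalized scores before the softmax; the tensor shapes and names are assumptions.

import torch

def gaussian_biased_attention(scores, p, sigma):
    """scores: (T, T) unnormalized attention scores; p: (T,) predicted central positions;
    sigma: (T,) deviations (hyperparameter or predicted). Returns the biased attention weights."""
    T = scores.shape[-1]
    j = torch.arange(T, dtype=scores.dtype)
    bias = -((j[None, :] - p[:, None]) ** 2) / (2 * sigma[:, None] ** 2)  # Gaussian bias term
    return (scores + bias).softmax(dim=-1)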


Improved Multi-head Mechanism
 Head Behavior Modeling
 Li et al., Multi-Head Attention with Disagreement Regularization, EMNLP 2018
 Talking-head Attention (Sukhbaatar et al., 2020)
 Collaborative multi-head Attention (Cordonnier et al., 2020)
 Restricted Span
 Adaptive Attention Span (Sukhbaatar, ACL 2019)
 Multi-scale Transformer (Guo et al., AAAI 2020)
 Information Aggregation with Dynamic Routing
 Li et al., Information Aggregation for Multi-Head Attention with Routing-by-Agreement, NAACL 2019
 Gu and Feng, Improving Multi-head Attention with Capsule Networks, NLPCC 2019
 Other variants
 Multi-query Attention (Shazeer, 2019): shares key-value pairs between heads for computational efficiency;
 Bhojanapalli et al., Low-Rank Bottleneck in Multi-head Attention Models, ICML 2020: establishes that the head dimension should be decoupled from the number of heads.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 49


Which scale should be used?
 The scale measures the distance between two endpoints of attention edges on
average.
 Small-Scale: Local patterns, N-gram
 Large-Scale: Non-Local patterns, long-term dependencies

Qipeng Guo, Xipeng Qiu, Pengfei Liu, Xiangyang Xue, Zheng Zhang. Multi-Scale Self-Attention for Text Classification, AAAI 2020,
https://arxiv.org/abs/1912.00544

Xipeng Qiu (Fudan University) A Tutorial of Transformers 50


Observations of Pre-trained models
 It is hard to make the trade-off.
 Take a look at what the model learned from data.

Results from BERT

Xipeng Qiu (Fudan University) A Tutorial of Transformers 51


Multi-Scale Self-Attention
 Different attention heads may work on different scales.
 There is a trend of scale over multiple layers.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 52


Overview

③arch-level variants

④pre-trained Transformers

②module-level variants ⑤task-specific variants

①Transformer background

Xipeng Qiu (Fudan University) A Tutorial of Transformers 53


Position Representations

 Absolute position
 Fixed sinusoidal encoding (vanilla)
 Learnable embeddings (BERT)
 Learnable sinusoidal encodings
 Relative position
 Shaw et al., 2018
 Transformer-XL
 T5
 Other representations
 TUPE
 Roformer
 Implicit representations
 Complex Embedding
 R-Transformer
 CPE

Xipeng Qiu (Fudan University) A Tutorial of Transformers 54


Relative Position Encodings: Transformer-XL

 Transformer-XL redesigns the computation of the attention score to capture both content and position interactions, combining relative position encodings with trainable bias terms (see the decomposition after this slide).

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer- XL: Attentive Language Models beyond a
Fixed-Length Context. ACL 2019

Xipeng Qiu (Fudan University) A Tutorial of Transformers 55
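For reference, the decomposition of the unnormalized attention score in Transformer-XL (notation lightly simplified from the paper; W_{k,E} and W_{k,R} project the content keys and the relative position encodings, and u, v are the trainable bias terms):

A^{\mathrm{rel}}_{i,j} =
  \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,E}\, E_{x_j}}_{\text{content-content}}
+ \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,R}\, R_{i-j}}_{\text{content-position}}
+ \underbrace{u^{\top} W_{k,E}\, E_{x_j}}_{\text{global content bias}}
+ \underbrace{v^{\top} W_{k,R}\, R_{i-j}}_{\text{global position bias}}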


Roformer

 Roformer uses rotary position embeddings (RoPE) that are multiplied with the queries and keys.
 It encodes absolute positional information, yet is translation invariant: the attention score depends only on the relative positional offset.
 It is compatible with linearized attention.

Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced Transformer with Rotary Position Embedding. 2021. arXiv:2104.09864

Xipeng Qiu (Fudan University) A Tutorial of Transformers 56
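A small sketch of applying a rotary embedding to a matrix of queries or keys; the base 10000 follows the usual convention, and the function name is illustrative.

import torch

def rotary_embed(x):
    """x: (T, d) with d even. Each 2-D feature slice is rotated by an angle proportional to its position."""
    T, d = x.shape
    pos = torch.arange(T, dtype=torch.float32)[:, None]                     # (T, 1)
    theta = 10000 ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)      # (d/2,)
    angles = pos * theta                                                    # (T, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# q and k are rotated before the dot product, so q_m . k_n depends only on the offset (m - n).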


Overview

③arch-level variants

④pre-trained Transformers

②module-level variants ⑤task-specific variants

①Transformer background

Xipeng Qiu (Fudan University) A Tutorial of Transformers 57


Layer Normalization (LN)

 Placement of LN
 Pre-LN: More stable training;
 Post-LN: Training could diverge – requires learning rate
warm-up, but could lead to better performance when
the model converges.
 Substitutes of LN
 AdaNorm
 scaled ℓ2 normalization
 PowerNorm (PN)
 …
 Norm-free Transformer
 ReZero-Transformer

Post-LN Pre-LN

Xipeng Qiu (Fudan University) A Tutorial of Transformers 58
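The placement difference in a nutshell (a sketch; `sublayer` stands for either self-attention or the FFN and `norm` for a LayerNorm instance, both supplied by the caller):

def post_ln_block(x, sublayer, norm):
    # Post-LN (vanilla Transformer): normalize after the residual addition
    return norm(x + sublayer(x))

def pre_ln_block(x, sublayer, norm):
    # Pre-LN: normalize the sublayer input; the residual path stays an identity
    return x + sublayer(norm(x))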


ReZero

 Inserts a learnable, zero-initialized scalar 𝛼 into each residual block: 𝑥' = 𝑥 + 𝛼·𝐹(𝑥)
 induces better dynamical isometry for input signals and leads to faster convergence.
 enables training of very deep (128-layer) Transformers.

Thomas Bachlechner, Bodhisattwa Prasad Majumder, Huanru Henry Mao, Garrison W. Cottrell, and Julian J. McAuley. ReZero is All You Need: Fast
Convergence at Large Depth. 2020

Xipeng Qiu (Fudan University) A Tutorial of Transformers 59
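A minimal sketch of the ReZero residual block (the wrapped sublayer is supplied by the caller):

import torch
import torch.nn as nn

class ReZeroBlock(nn.Module):
    def __init__(self, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.alpha = nn.Parameter(torch.zeros(1))   # zero-initialized, so the block starts as the identity

    def forward(self, x):
        return x + self.alpha * self.sublayer(x)    # no LayerNorm in the block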


Overview

③arch-level variants

④pre-trained Transformers

②module-level variants ⑤task-specific variants

①Transformer background

Xipeng Qiu (Fudan University) A Tutorial of Transformers 60


Position-wise FFN
 Activation
 ReLU, GELU (Gaussian Error Linear Unit), GLU (Gated Linear Unit)
 Using FFN to enlarge capacity
 Product-key memory layer (Lample et al., NeurIPS 2019)
 Mixture-of-Experts (Gshard, Switch Transformer, Hash Layer)
 Can we drop FFN?
 all-Attention layer (Sukhbaatar et al., 2019): merges FFN into attention module;
 Yang et al., On the Sub-layer Functionalities of Transformer Decoder, Findings of EMNLP 2020: points out that the decoder FFN (in the encoder-decoder Transformer) can be dropped without affecting performance.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 61


Product-Key Memory
One can replace (some of the) FFN layers with a key-value memory module holding a large number of key-value pairs to increase the network capacity.

 The top-k operation takes 𝑂(|𝒦| × 𝐷) complexity, where |𝒦| denotes the number of key-value pairs.

Guillaume Lample, Alexandre Sablayrolles, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. Large Memory Layers with Product Keys.
NeurIPS 2019

Xipeng Qiu (Fudan University) A Tutorial of Transformers 62


Product-Key Memory

 Product-key memory uses two sets of sub-keys, each of size √|𝒦|.
 The top-k operation now takes 𝑂(√|𝒦| × 𝐷) complexity (see the sketch after this slide).

Xipeng Qiu (Fudan University) A Tutorial of Transformers 63
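A simplified sketch of the product-key candidate search (the value lookup and multi-head memories are ignored); splitting the query in half and the row-major index mapping are assumptions consistent with the idea of taking the Cartesian product of two sub-key sets.

import torch

def product_key_topk(q, subkeys1, subkeys2, k=4):
    """q: (d,) query; subkeys1, subkeys2: (n, d/2) each, so the full key set has n*n product keys."""
    d = q.shape[0]
    q1, q2 = q[: d // 2], q[d // 2:]
    s1, i1 = (subkeys1 @ q1).topk(k)                 # best sub-keys per half: O(sqrt(|K|) * d)
    s2, i2 = (subkeys2 @ q2).topk(k)
    cand = (s1[:, None] + s2[None, :]).flatten()     # scores of the k*k candidate product keys
    best = cand.topk(k).indices
    rows, cols = best // k, best % k
    full_idx = i1[rows] * subkeys2.shape[0] + i2[cols]   # indices into the full n*n memory (row-major)
    return full_idx, cand[best]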


Overview

③arch-level variants

④pre-trained Transformers

②module-level variants ⑤task-specific variants

①Transformer background

Xipeng Qiu (Fudan University) A Tutorial of Transformers 64


Lightweight variants
 Some studies are dedicated to making the Transformer lightweight (in terms of computation / parameters)
 Lite Transformer (Wu et al., ICLR 2020): replaces self-attention with a two-branch module consisting of a convolution module and an attention module.
 Funnel Transformer: compresses the number of intermediate representations, resulting in lower FLOPs and memory footprint.
 DeLighT: uses the DeLighT transformation to learn wider representations with low computation, so the attention and FFN can be simplified.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 65


DeLighT

Sachin Mehta, Marjan Ghazvininejad, Srinivasan Iyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2020. DeLighT: Very Deep and Light-weight
Transformer. arXiv:2008.00623

Xipeng Qiu (Fudan University) A Tutorial of Transformers 66


Inter-Block Connectivity
 Some existing studies modify the architecture of the Transformer by adding paths along which input signals run through the network.
 RealFormer, Predictive Attention Transformer: create an additional path from the previous attention module to the current one.
 Transparent Attention: creates paths from each encoder layer to the decoder layers.
 Feedback Transformer: adds paths from upper layers to lower layers.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 67


Transparent Attention
 Each cross-attention module attends to a weighted average of encoder representations from all layers (the mixing weights are trainable parameters).
 The modification creates more paths for the backward error signal, and thus eases optimization of deep Transformers.

Ankur Bapna, Mia Chen, Orhan Firat, Yuan Cao, and Yonghui Wu. Training Deeper Neural Machine Translation Models with Transparent Attention.
EMNLP 2018.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 68


Adaptive Computation Time

Xipeng Qiu (Fudan University) A Tutorial of Transformers 69


Universal Transformer

 The Universal Transformer with


dynamic halting determines the
number of steps T for each position.
 The parameters are tied across
positions and time steps.
 Once the per-symbol recurrent block
halts, its state is simply copied to the
next step until all blocks halt, or we
reach a maximum number of steps.

Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal Transformers. ICLR 2019

Xipeng Qiu (Fudan University) A Tutorial of Transformers 70


Recurrent Transformers
 The recurrent mechanism splits the input sequence into shorter segments and processes them one at a time. For each segment, the model reads a cache memory from the previous segment and writes representations of the current segment back to the cache.

 Dai et al., Transformer-XL: Attentive Language Models beyond a Fixed-Length Context, 2019
 Rae et al., Compressive Transformers for Long-Range Sequence Modelling, 2020
 Wu et al., Memformer: The Memory-Augmented Transformer, 2020
 Yoshida et al., Adding Recurrence to Pretrained Transformers for Improved Efficiency and Context Size, 2020
 …

Xipeng Qiu (Fudan University) A Tutorial of Transformers 71


Transformer-XL

(Figure 2 from Dai et al.: illustration of the Transformer-XL model with a segment length of 4; (a) training phase, (b) evaluation phase with extended context.)

 The hidden representations at each layer (including the embeddings) are cached and used to produce additional key-value memory for the next segment.
 Extends the effective context length by 𝐿 × 𝑁_mem, where 𝑁_mem is the length of the cache memory.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. ACL 2019

Xipeng Qiu (Fudan University) A Tutorial of Transformers 72
Compressive Transformer

 Similar to Transformer-XL, except that the old cache memory is not discarded but pushed into a compressed memory using some compression function.
 Extends the effective context length by 𝐿 × (𝑁_mem + 𝑐·𝑁_cm), where 𝑐 is the compression rate and 𝑁_cm is the size of the compressed memory.

Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. Compressive Transformers for Long-Range Sequence
Modelling. ICLR 2020

Xipeng Qiu (Fudan University) A Tutorial of Transformers 73


Hierarchical Transformers
 The hierarchical mechanism organizes input sequences into a hierarchy. A low-level Transformer module processes low-level features, and the extracted representations are fed to a higher-level Transformer module.

 HIBERT (Zhang et al., ACL 2019)


 TENER (Yan et al., 2019)
 Transformer in Transformer (Han et al., 2021)
 …

Xipeng Qiu (Fudan University) A Tutorial of Transformers 74


HIBERT

(Figure panels: pre-training; extractive summarization.)


Xingxing Zhang, Furu Wei, and Ming Zhou. HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document
Summarization. ACL 2019

Xipeng Qiu (Fudan University) A Tutorial of Transformers 75


Exploring Alternative Architectures
 Is the commonly used Transformer architecture (i.e., SAN + FFN) optimal?
 Several studies have suggested that better architectures exist (for specific tasks).

 Neural ODE view (Macaron Transformer)
 Neural architecture search (Evolved Transformer; DARTSformer)
 Layer re-ordering (Sandwich Transformer)
 …

Xipeng Qiu (Fudan University) A Tutorial of Transformers 76


Evolved Transformer

 The Evolved Transformer (ET) employs evolution-based architecture search, with the standard Transformer seeding the initial population.
 The searched architecture consistently outperforms the standard Transformer on several machine translation datasets and the LM1B language modeling benchmark.

David R. So, Quoc V. Le, and Chen Liang. The Evolved Transformer. ICML 2019

Xipeng Qiu (Fudan University) A Tutorial of Transformers 77


Overview

③arch-level variants

④pre-trained Transformers

②module-level variants ⑤task-specific variants

①Transformer background

Xipeng Qiu (Fudan University) A Tutorial of Transformers 78


Pre-trained Transformers
 The Transformer has little structural bias, making it prone to overfitting on small-scale data, which limits its applications. One way to avoid this limitation is to pre-train the Transformer on a large-scale corpus.

 Encoder only: BERT, RoBERTa, BigBird
 Decoder only: GPT series
 Encoder-Decoder: BART, T5, Switch Transformer

Pre-trained Models for Natural Language Processing: A Survey, https://arxiv.org/abs/2003.08271

Xipeng Qiu (Fudan University) A Tutorial of Transformers 79


Pre-trained Models for Transformers

Pre-trained Models for Natural Language Processing: A Survey, https://arxiv.org/abs/2003.08271

Xipeng Qiu (Fudan University) A Tutorial of Transformers 80


BERT
BERT uses a Transformer encoder to encode representations for each token in the sequence.

 The model is pre-trained with two self-supervised training objectives
 Masked Language Modeling (MLM)
 Next Sentence Prediction (NSP)

 One of the milestone models among pre-trained models (PTMs)

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding. HLT-NAACL 2019

Xipeng Qiu (Fudan University) A Tutorial of Transformers 81


GPT
GPT uses only the Transformer decoder for representation learning.

 The model is pre-trained on a language modeling task: maximizing the likelihood of the texts in the corpus.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018

Xipeng Qiu (Fudan University) A Tutorial of Transformers 82


T5
T5 uses an encoder-decoder architecture. Several downstream tasks (including machine translation, question answering, and classification) are cast as seq2seq tasks.

 The unsupervised training objective is a denoising one: the model needs to predict corrupted spans from the input sequences.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring
the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv:1910.10683

Xipeng Qiu (Fudan University) A Tutorial of Transformers 83


Overview

③arch-level variants

④pre-trained Transformers

②module-level variants ⑤task-specific variants

①Transformer background

Xipeng Qiu (Fudan University) A Tutorial of Transformers 84


Applications
 NLP
 Transformer, BERT, Compressive Transformer, TENER, FLAT
 CV
 Image Transformer, DETR, ViT, Swin Transformer, ViViT
 Audio
 Speech Transformer, Streaming Transformer, Reformer-TTS, Music Transformer
 Multimodal
 VisualBERT, VLBERT, VideoBERT, M6, Chimera, DALL-E, CogView

Xipeng Qiu (Fudan University) A Tutorial of Transformers 85


Vision Transformer (ViT)

Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).

Xipeng Qiu (Fudan University) A Tutorial of Transformers 86


MusicBERT

Zeng, Mingliang, et al. "MusicBERT: Symbolic Music Understanding with Large-Scale Pre-Training." arXiv preprint arXiv:2106.05630 (2021).

Xipeng Qiu (Fudan University) A Tutorial of Transformers 87


TENER: Adapting Transformer Encoder for NER
 The relative direction is important in the NER task.
 Words before "Inc." are most likely part of an organization; words after "in" are more likely to be a time or location.
 The distance between words is also important.
 Only contiguous words can form an entity; a distant "Louis Vuitton" cannot form an entity with "Inc.".

Hang Yan, Bocao Deng, Xiaonan Li, Xipeng Qiu. TENER: Adapting Transformer Encoder for Named Entity Recognition,
https://arxiv.org/abs/1911.04474

Xipeng Qiu (Fudan University) A Tutorial of Transformers 88


Observations
 Observation 1: Position Encoding in Transformer
 The sinusoidal position embedding used in the vanilla Transformer is aware of distance
but unaware of the directionality.

 Observation 2: Smooth Attention in Transformer
 The attention distribution of the vanilla Transformer is scaled and smooth.
 But for NER, sparse attention is more suitable, since not all words need to be attended to.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 89


Two Improvements
 Direction- and Distance-Aware Attention
 Improve the Transformer with direction- and distance-aware attention

 Un-scaled Dot-Product Attention
 The model performs better without the scaling factor 1/√𝐷_k.
 The attention is sharper without the scaling factor.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 90


TENER

Xipeng Qiu (Fudan University) A Tutorial of Transformers 91


FLAT: Chinese NER Using Flat-Lattice Transformer

Xiaonan Li, Hang Yan, Xipeng Qiu, Xuanjing Huang. FLAT: Chinese NER Using Flat-Lattice Transformer, ACL 2020,
https://arxiv.org/abs/1911.04474

Xipeng Qiu (Fudan University) A Tutorial of Transformers 92


CoLAKE: Contextualized Language and Knowledge Embedding
 Word-knowledge Graph

Tianxiang Sun, Yunfan Shao, Xipeng Qiu, Qipeng Guo, Yaru Hu, Xuanjing Huang, Zheng Zhang, CoLAKE: Contextualized Language and Knowledge
Embedding, COLING 2020, https://arxiv.org/abs/2010.00309

Xipeng Qiu (Fudan University) A Tutorial of Transformers 93


Transformer Arch. of CoLAKE
 Modify the Transformer for the word-knowledge graph
 The word-knowledge graph is a positional heterogeneous graph.
 Position embedding
 Type embedding
 Masked self-attention

Tianxiang Sun, Yunfan Shao, Xipeng Qiu, Qipeng Guo, Yaru Hu, Xuanjing Huang, Zheng Zhang, CoLAKE: Contextualized Language and Knowledge
Embedding, COLING 2020, https://arxiv.org/abs/2010.00309

Xipeng Qiu (Fudan University) A Tutorial of Transformers 94


VL-BERT

Su W, Zhu X, Cao Y, et al. VL-BERT: Pre-training of generic visual-linguistic representations. 2019. https://arxiv.org/abs/1908.08530

Xipeng Qiu (Fudan University) A Tutorial of Transformers 95


Summary

Xipeng Qiu (Fudan University) A Tutorial of Transformers 96


Future Directions

CNN → RNN → Transformer → ?

✓ Efficiency
✓ Length limit
✓ Overfitting
✓ Pre-training
✓ Theoretic analysis
✓ Alternative Architecture
✓ Unified multi-modal structure

Xipeng Qiu (Fudan University) A Tutorial of Transformers 97


An Introduction to fastNLP

https://github.com/fastnlp/fastNLP
https://gitee.com/fastnlp/fastNLP

Xipeng Qiu (Fudan University) A Tutorial of Transformers 98


An Open-Source Framework for Natural Language Understanding

Xipeng Qiu (Fudan University) A Tutorial of Transformers 99


An Open-Source Framework for Natural Language Understanding

◼ Related projects have 5K+ stars in total on GitHub/Gitee
◼ According to pypistats.org, 100+ downloads per day on average

Toolkits built on fastNLP: FastHan (Chinese processing), FastMatch (text matching), FastSum (text summarization), FastRE (sequence labeling), and fitlog (the experiment-tracking tool introduced below)

fastNLP: a computation framework targeted at natural language understanding, built on a general-purpose computation layer and running on CPU, GPU, Ascend, and Phytium hardware.

Xipeng Qiu (Fudan University) A Tutorial of Transformers 100
An Open-Source Framework for Natural Language Understanding

 Chinese comments, Chinese documentation, Chinese tasks
 Unified data containers and unified processing pipelines
 Out-of-the-box pre-trained models
 A variety of neural network components
 Efficient training and testing pipelines
 Downstream applications (20+ NLP tasks)

Xipeng Qiu (Fudan University) A Tutorial of Transformers 101


A tabular data structure: fastNLP.core.DataSet

(Figure: a DataSet holds instances as rows, e.g. instance 1 = (raw_words1, target1), instance 2 = (raw_words2, target2), ..., while batches are drawn from its columns.)

People usually make sense of data one sample at a time, i.e., row by row; batch processing, on the other hand, prefers to read and process data column by column.
Xipeng Qiu (Fudan University) A Tutorial of Transformers 102


Efficiently switch between GloVe, ELMo, BERT, etc.

Code example:
# use two kinds of embeddings at the same time
embed = StackEmbedding([glove_embed, word2vec_embed])

# using a character embedding is as easy as a word embedding
char_embed = CNNCharEmbedding(vocab)
embed = StackEmbedding([char_embed, glove_embed])

These embedding classes behave like PyTorch's nn.Embedding, so there is essentially no difference in usage, which greatly lowers the barrier to entry.

Contextual embeddings: switch between ELMo, BERT, and RoBERTa with one line of code
elmo_embed = ElmoEmbedding(vocab, model_dir_or_name='en-original')
bert_embed = BertEmbedding(vocab, model_dir_or_name='en')
roberta_embed = RoBERTaEmbedding(vocab, model_dir_or_name='en-base')

embed = StackEmbedding([glove_embed, elmo_embed, bert_embed])

Xipeng Qiu (Fudan University) A Tutorial of Transformers 103


fastHan: a Chinese NLP toolkit
https://github.com/fastnlp/fasthan
https://gitee.com/fastnlp/fasthan

Achieves current SOTA results on more than ten datasets.
Xipeng Qiu (Fudan University) A Tutorial of Transformers 104
fitlog = fast + git + log: a visual, interactive hyperparameter-tuning tool

(Screenshot annotations: each row is one experiment record; statistics over multiple rows are computed automatically; complex search syntax is supported; drop-down filtering; visualization of convergence curves; directly editable notes.)
Xipeng Qiu (Fudan University) A Tutorial of Transformers 105


fitlog: a visual, interactive hyperparameter-tuning tool
https://github.com/fastnlp/fitlog
https://gitee.com/fastnlp/fitlog

A tool that automatically manages code versions, hyperparameters, and experiment results. To use it, simply add the following to your Python code.

Code example (manual logging):

import fitlog

# set the log directory
fitlog.set_log_dir('logs')
# record hyperparameters
fitlog.add_hyper(args)

for epoch in range(n_epochs):
    for step in range(num_step_per_epoch):
        # record the loss
        fitlog.add_loss(loss, name='loss', step=step, epoch=epoch)
        # record a metric
        fitlog.add_metric(f1, name='f1', epoch=epoch, step=step)
        if better_result:
            # record the best result
            fitlog.add_best_metric(f1, name='f1')

Code example (with fastNLP's Trainer):

import fitlog

# set the log directory
fitlog.set_log_dir('logs')
# record hyperparameters
fitlog.add_hyper(args)

# use FitlogCallback to record results directly
Trainer(data_bundle.get_dataset('train'), model,
        dev_data=data_bundle.get_dataset('dev'),
        metrics=AccuracyMetric(),
        callback=FitlogCallback()).train()
Xipeng Qiu (Fudan University) A Tutorial of Transformers 106
fastSum: an open-source text summarization toolkit  https://github.com/fastnlp/fastSum

Datasets
fastSum provides 12 classic summarization datasets such as CNN/DailyMail and XSum, each with its own dataloader (e.g., CNNDMLoader, XsumLoader) so that preprocessed datasets can be downloaded automatically from the server.

Classic models
fastSum provides classic extractive and abstractive models:
(1) Extractive models: basic LSTM-based and Transformer-based sequence labeling models, plus models such as BERTSUMEXT and MatchSum that perform well on extractive summarization.
(2) Abstractive models: an LSTM model with a pointer network and coverage mechanism, as well as BERTSUMABS.

Evaluation
fastSum provides two ROUGE metrics based on two different ROUGE packages: FastRougeMetric and PyRougeMetric.
Xipeng Qiu (Fudan University) A Tutorial of Transformers 107
