A Tutorial of Transformers
Xipeng Qiu (邱锡鹏)
Fudan University
June 20, 2021
https://xpqiu.github.io
Transformer?
A Neural Network!
① Transformer background
③ arch-level variants
④ pre-trained Transformers
Model Driven + Data Driven
Auto-Regressive Model
[Figure: encoder-decoder machine translation example: the encoder reads the input tokens "Machine Learning"; the decoder produces the outputs "机 器 学 习" from shifted output token embeddings.]
Machine Translation
Semantic Composition
Long-term Dependency
Polysemy
Context Matters
请 给 听课 的 同学 买 个 苹果 ("Please buy an apple / an Apple device for the students attending the lecture": 苹果 is ambiguous without context)
s(x_i, q): scoring function
Self-Attention
pic source:http://fuyw.top/NLP_02_QANet/
Query-Key-Value (QKV) Model
Besides self-attention:
Position representations
Layer Normalization
Skip connection
Position-wise FFN
Model usage:
Encoder only
Encoder-decoder
Decoder only
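A minimal NumPy sketch of the scaled dot-product (QKV) attention described above; the function and variable names are illustrative, not from the tutorial:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (T_q, D_k), K: (T_k, D_k), V: (T_k, D_v)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # (T_q, T_k) compatibility scores
    A = softmax(scores, axis=-1)             # each row is an attention distribution
    return A @ V                             # (T_q, D_v) weighted sum of values

# Self-attention: queries, keys and values are projections of the same input X
T, D = 5, 8
X = np.random.randn(T, D)
Wq, Wk, Wv = (np.random.randn(D, D) for _ in range(3))
out = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)  # (T, D)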
When the input sequence length 𝑇 is short, the model dimension 𝐷 dominates the complexity of both self-attention and the FFN.
For short inputs, the bottleneck of the network lies in the FFN.
For long input sequences, the sequence length dominates the complexity.
Self-attention is inefficient at handling long inputs.
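For concreteness (a rough estimate that counts only the dominant terms):
Self-Attention: $O(T^2 \cdot D)$ per layer; Position-wise FFN: $O(T \cdot D^2)$ per layer.
Hence the FFN dominates when $T \ll D$, while self-attention dominates when $T \gg D$.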
Self-attention has a constant maximum path length (like a fully connected layer), which makes it suitable for modeling long-range dependencies.
It is more parallelizable than recurrent layers.
It has a global receptive field, so it does not require stacking layers to model global dependencies (unlike convolutional layers).
Recurrent networks
Temporal invariance (shared function across timesteps)
Locality (Markovian structure)
Transformer
No structural prior (prone to overfitting on small-scale data)
Permutation equivariance (requires position representations to encode sequences)
Star-Transformer
Complexity: O(2LD)
Locality prior plus a global memory (relay node)
Free from position representations
BERT shows that a global memory is also beneficial for Transformers.
More suitable for small- or mid-scale data
Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, Zheng Zhang. Star-Transformer, NAACL 2019,
https://arxiv.org/pdf/1902.09113.pdf
BP-Transformer effectively introduces hierarchical global external nodes: any two sequence nodes are connected to each other via a path in the binary tree.
Zihao Ye, Qipeng Guo, Quan Gan, Xipeng Qiu, Zheng Zhang. BP-Transformer: Modelling Long-Range Context via Binary Partitioning,
https://arxiv.org/abs/1911.04070
Now we turn to LSH attention, which we can think of in terms of restricting the set P_i of target items a query position i can attend to, by only allowing attention within a single hash bucket:
P_i = { j : h(q_i) = h(k_j) }
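A toy sketch of the bucketing step using random-projection (angular) hashing, in the spirit of LSH attention; chunking, multi-round hashing and the shared query/key details are omitted, and all names are illustrative:

import numpy as np

def lsh_buckets(x, n_buckets, seed=0):
    # Angular LSH: project onto random directions and take the argmax over
    # [xR, -xR]; nearby vectors tend to land in the same bucket.
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((x.shape[-1], n_buckets // 2))
    proj = x @ R                                             # (T, n_buckets // 2)
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

# Each query i may only attend to positions j in the same bucket:
# P_i = {j : h(q_i) = h(k_j)}  (with shared query/key hashing)
T, D, n_buckets = 16, 8, 4
q = np.random.randn(T, D)
buckets = lsh_buckets(q, n_buckets)
P = {i: np.flatnonzero(buckets == buckets[i]) for i in range(T)}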
Linearized Attention
[Figure: standard attention vs. linearized attention computed through a memory matrix, with a softmax-kernel approximation.]
Choromanski K, Likhosherstov V, Dohan D, et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020.
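A minimal sketch of linearized attention: softmax(QK^T)V is replaced by φ(Q)(φ(K)^T V). Here φ is the simple elu+1 feature map (as in linear-transformer-style models) rather than the Performer's random-feature softmax-kernel, and all names are illustrative:

import numpy as np

def elu_plus_one(x):
    # A simple positive feature map phi(x) = elu(x) + 1
    return np.where(x > 0, x + 1.0, np.exp(x))

def linearized_attention(Q, K, V, eps=1e-6):
    # O(T * D_k * D_v) instead of O(T^2 * D): the "memory matrix" phi(K)^T V
    # is computed once and reused for every query.
    Qf, Kf = elu_plus_one(Q), elu_plus_one(K)  # (T, D_k)
    KV = Kf.T @ V                              # (D_k, D_v) memory matrix
    Z = Qf @ Kf.sum(axis=0)                    # (T,) normalizer
    return (Qf @ KV) / (Z[:, None] + eps)      # (T, D_v)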
The attention distributions are only computed for the top-u queries under a query sparsity measure (Informer).
Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond Efficient Transformer for
Long Sequence Time-Series Forecasting. AAAI 2021.
Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating Wikipedia by Summarizing
Long Sequences, ICLR 2018
Some empirical and theoretical analyses report that the self-attention matrix is often low-rank: the rank of the attention matrix is far lower than the input length 𝑇.
For short inputs, using a key dimension larger than the sequence length results in over-parameterization, making the Transformer prone to overfitting. → low-rank parameterization
For long inputs, the attention matrix can be replaced with some low-rank approximation to reduce the complexity (see the sketch below).
Statistics of the learned attention matrices of a Transformer trained on SNLI and of the BERT model.
Qipeng Guo, Xipeng Qiu, Xiangyang Xue, Zheng Zhang. Low-Rank and Locality Constrained Self-Attention for Sequence Modeling, IEEE/ACM
Transactions on Audio, Speech, and Language Processing, 2019,12. https://ieeexplore.ieee.org/document/8894858
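To illustrate the second point above, a hedged sketch of a Linformer-style low-rank approximation (not the method of the paper cited here): the length dimension of K and V is projected down to k ≪ T before attention, so the attention matrix is (T, k) instead of (T, T); all names are illustrative:

import numpy as np

def low_rank_attention(Q, K, V, E, F):
    # E, F: (k, T) projections that compress the length dimension of K and V
    D = Q.shape[-1]
    K_proj, V_proj = E @ K, F @ V                      # (k, D)
    scores = Q @ K_proj.T / np.sqrt(D)                 # (T, k)
    A = np.exp(scores - scores.max(-1, keepdims=True))
    A /= A.sum(-1, keepdims=True)                      # row-wise softmax
    return A @ V_proj                                  # (T, D)

T, k, D = 512, 64, 64
Q, K, V = (np.random.randn(T, D) for _ in range(3))
E = F = np.random.randn(k, T) / np.sqrt(T)
out = low_rank_attention(Q, K, V, E, F)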
Modeling locality
Local Transformer (Yang et al., 2018)
Gaussian Transformer (Guo et al., 2019)
Prior from lower modules
Predictive Attention Transformer (Wang et al., 2020)
RealFormer (He et al., 2020)
Task related prior
Conditionally Adaptive Multi-Task Learning (Pilault et al., ICLR2021)
Attention with only prior
Uniform: Average Attention Network (Zhang et al., ACL 2018)
Gaussian: Hard-Coded Gaussian Attention (You et al., ACL 2020)
Learnable: Random Synthesizer (Tay et al., ICML 2021)
Qipeng Guo, Xipeng Qiu, Pengfei Liu, Xiangyang Xue, Zheng Zhang. Multi-Scale Self-Attention for Text Classification, AAAI 2020,
https://arxiv.org/abs/1912.00544
Absolute position
Fixed sinusoidal encoding (vanilla)
Learnable embeddings (BERT)
Learnable sinusoidal encodings
Relative position
Shaw et al., 2018
Transformer-XL
T5
Other representations
TUPE
Roformer
Implicit representations
Complex Embedding
R-Transformer
CPE
Transformer-XL redesigns the computation of the attention score to capture both content and position interactions.
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer- XL: Attentive Language Models beyond a
Fixed-Length Context. ACL 2019
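The redesigned score decomposes into four terms (notation as in the Transformer-XL paper): (a) content-content, (b) content-position, (c) a global content bias, and (d) a global position bias:

A^{\mathrm{rel}}_{i,j} = \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,E}\, E_{x_j}}_{(a)} + \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,R}\, R_{i-j}}_{(b)} + \underbrace{u^{\top} W_{k,E}\, E_{x_j}}_{(c)} + \underbrace{v^{\top} W_{k,R}\, R_{i-j}}_{(d)}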
RoFormer uses rotary position embeddings that are multiplied with the queries and keys.
It encodes absolute positional information, yet the attention score depends only on the relative positional offset (translation invariance).
It is compatible with linearized attention.
Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced Transformer with Rotary Position Embedding. 2021. arXiv:2104.09864
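A minimal NumPy sketch of rotary position embeddings applied to queries or keys; the pairing of dimensions and all names are illustrative:

import numpy as np

def apply_rope(x, pos, base=10000.0):
    # x: (T, D) queries or keys (D even); pos: (T,) absolute positions.
    # Each dimension pair (2i, 2i+1) is rotated by the angle pos * theta_i,
    # where theta_i = base^(-2i / D).
    T, D = x.shape
    theta = base ** (-np.arange(0, D, 2) / D)  # (D/2,)
    ang = pos[:, None] * theta[None, :]        # (T, D/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Rotations are unitary, so the dot product of rotated q at position m and
# rotated k at position n depends only on the offset m - n.
T, D = 6, 8
q = np.random.randn(T, D)
q_rot = apply_rope(q, np.arange(T, dtype=float))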
Placement of LN
Pre-LN: More stable training;
Post-LN: Training could diverge – requires learning rate
warm-up, but could lead to better performance when
the model converges.
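A PyTorch-style sketch of the two placements, assuming a generic sublayer module (the class names are illustrative):

import torch.nn as nn

class PostLNBlock(nn.Module):
    # Post-LN (vanilla Transformer): sublayer -> residual add -> LayerNorm
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)
    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class PreLNBlock(nn.Module):
    # Pre-LN: LayerNorm -> sublayer -> residual add (more stable to train)
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)
    def forward(self, x):
        return x + self.sublayer(self.norm(x))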
Substitutes of LN
AdaNorm
scaled ℓ2 normalization
PowerNorm (PN)
…
Norm-free Transformer
ReZero-Transformer
[Figure: Post-LN vs. Pre-LN sublayer arrangements]
ReZero induces better dynamic isometry for input signals and leads to faster convergence.
It enables training of very deep (128-layer) Transformers.
Thomas Bachlechner, Bodhisattwa Prasad Majumder, Huanru Henry Mao, Garrison W. Cottrell, and Julian J. McAuley. ReZero is All You Need: Fast
Convergence at Large Depth. 2020
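A PyTorch-style sketch of a ReZero residual block (the class name is illustrative):

import torch
import torch.nn as nn

class ReZeroBlock(nn.Module):
    # The residual branch is scaled by a trainable scalar initialized to zero,
    # so every layer starts as the identity map; no LayerNorm is used.
    def __init__(self, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.alpha = nn.Parameter(torch.zeros(1))
    def forward(self, x):
        return x + self.alpha * self.sublayer(x)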
Guillaume Lample, Alexandre Sablayrolles, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. Large Memory Layers with Product Keys.
NeurIPS 2019
Sachin Mehta, Marjan Ghazvininejad, Srinivasan Iyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2020. DeLighT: Very Deep and Light-weight
Transformer. arXiv:2008.00623
Transparent attention introduces trainable parameters that weight the outputs of all encoder layers (see the sketch below).
The modification creates more paths for the backward error signal, and thus eases optimization of deep Transformers.
Ankur Bapna, Mia Chen, Orhan Firat, Yuan Cao, and Yonghui Wu. Training Deeper Neural Machine Translation Models with Transparent Attention.
EMNLP 2018.
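A hedged PyTorch-style sketch of the transparent-attention idea: each decoder layer reads its own learned convex combination of the encoder's embedding output and all encoder layer outputs (the module name and shapes are illustrative):

import torch
import torch.nn as nn

class TransparentCombination(nn.Module):
    def __init__(self, n_enc_layers, n_dec_layers):
        super().__init__()
        # One trainable weight per (decoder layer, encoder layer) pair
        self.w = nn.Parameter(torch.zeros(n_dec_layers, n_enc_layers + 1))
    def forward(self, enc_states):
        # enc_states: (n_enc_layers + 1, T, D) -- embeddings plus every layer output
        s = torch.softmax(self.w, dim=-1)                   # (n_dec, n_enc + 1)
        return torch.einsum('ij,jtd->itd', s, enc_states)   # one context per decoder layer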
Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal Transformers. ICLR 2019
Dai et al., Transformer-XL: Attentive Language Models beyond a Fixed-Length Context, 2019
Rae et al., Compressive Transformers for Long-Range Sequence Modelling, 2020
Wu et al., Memformer: The Memory-Augmented Transformer, 2020
Yoshida et al., Adding Recurrence to Pretrained Transformers for Improved Efficiency and Context Size, 2020
…
[Figure: Transformer-XL segment-level recurrence: the previous segment is cached and fixed (no gradient) while the new segment attends to it, giving an extended context. The recurrence is analogous to truncated BPTT (Mikolov et al.).]
Compressive Transformer
Similar to Transformer-XL, except that the old cache memory is not discarded but
pushed into a compressed memory, using some compression function.
Extends the effective context length to L × (N_mem + c·N_cm), where c is the compression rate and N_cm is the size of the compressed memory.
Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, and Timothy P. Lillicrap. Compressive Transformers for Long-Range Sequence
Modelling. ICLR 2020
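A minimal sketch of one possible compression function for the compressed memory (mean pooling; the function name and the choice of pooling are illustrative):

import numpy as np

def compress_memories(old_mems, c=3):
    # Mean-pool every c adjacent old memory slots into one compressed slot;
    # the paper also considers max pooling, (dilated) convolutions and
    # learned compression functions.
    T, D = old_mems.shape
    usable = (T // c) * c
    return old_mems[:usable].reshape(-1, c, D).mean(axis=1)  # (T // c, D)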
Layer re-ordering …
(Sandwich Transformer)
David R. So, Quoc V. Le, and Chen Liang. The Evolved Transformer. ICML 2019
Encoder only: BERT, RoBERTa, BigBird
Decoder only: GPT
Encoder-Decoder: BART, T5, Switch Transformer
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding. HLT-NAACL 2019
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring
the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv:1910.10683
Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).
Zeng, Mingliang, et al. "MusicBERT: Symbolic Music Understanding with Large-Scale Pre-Training." arXiv preprint arXiv:2106.05630 (2021).
Hang Yan, Bocao Deng, Xiaonan Li, Xipeng Qiu. TENER: Adapting Transformer Encoder for Named Entity Recognition,
https://arxiv.org/abs/1911.04474
Xiaonan Li, Hang Yan, Xipeng Qiu , Xuanjing Huang, FLAT: Chinese NER Using Flat-Lattice Transformer, ACL 2020,
https://arxiv.org/abs/1911.04474
Tianxiang Sun, Yunfan Shao, Xipeng Qiu, Qipeng Guo, Yaru Hu, Xuanjing Huang, Zheng Zhang, CoLAKE: Contextualized Language and Knowledge
Embedding, COLING 2020, https://arxiv.org/abs/2010.00309
Position Embedding
Type Embedding
Masked Self-Attention
Su W, Zhu X, Cao Y, et al. VL-BERT: Pre-training of generic visual-linguistic representations. 2019. https://arxiv.org/abs/1908.08530
✓ Efficiency
✓ Length limit
✓ Overfitting
✓ Pre-training
✓ Theoretic analysis
✓ Alternative Architecture
✓ Unified multi-modal structure
https://github.com/fastnlp/fastNLP
https://gitee.com/fastnlp/fastNLP
FastNLP
General-purpose computing frameworks
CPU, GPU, Ascend (昇腾), Phytium (飞腾)
A computing framework tailored to natural language understanding
An open-source framework for natural language understanding
Chinese code comments, Chinese documentation, Chinese tasks
Unified data containers and a unified processing pipeline
Out-of-the-box pre-trained models
A variety of neural network components
Efficient training and evaluation pipelines
Downstream applications (20+ NLP tasks)
[Table: a DataSet stores instances row by row (instance 1, instance 2, ...), each with fields such as raw_words and target; batches read the same fields column-wise.]
People usually make sense of data one sample at a time, i.e., row-wise.
Batch processing prefers to read and process data column-wise.
Code example:
# Use two kinds of embeddings at the same time
from fastNLP.embeddings import StackEmbedding
embed = StackEmbedding([glove_embed, word2vec_embed])
Achieves current SOTA results on more than ten datasets.
fitlog = fast + git + log: a visual, interactive tool for experiment logging and hyperparameter tuning
Each row below is one experiment record
Automatically computes statistics over multiple records
Supports complex search syntax
Drop-down filtering, visualization of convergence curves, directly editable notes
Classic models
fastSum provides classic extractive and abstractive models:
(1) Extractive models: basic LSTM-based and Transformer-based sequence labeling models, as well as models such as BERTSUMEXT and MatchSum that perform very well on extractive summarization
(2) Abstractive models: an LSTM model with a pointer network and coverage mechanism, and BERTSUMABS
Evaluation
fastSum provides two ROUGE evaluation metrics based on two different ROUGE packages: FastRougeMetric and PyRougeMetric.