Timeline of Natural Language Processing Models
A transformer is a deep learning architecture based on the multi-head attention mechanism, proposed in the 2017 paper "Attention Is All You Need".[1] It has no
recurrent units, and thus requires less training time than previous recurrent neural
architectures, such as long short-term memory (LSTM),[2] and its later variants have been widely adopted for training large language models (LLMs) on large (language) datasets,
such as the Wikipedia corpus and Common Crawl.[3] Text is converted to numerical
representations called tokens, and each token is converted into a vector by looking it up in a word embedding table.[1] At each layer, each token is then contextualized within the scope of
the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, which allows the signal for key tokens to be amplified and less important tokens to be diminished. The transformer paper, published in 2017, is based on the softmax-based attention mechanism proposed by Bahdanau et al. in 2014 for machine translation,[4][5] and on the Fast Weight Controller, similar to a transformer, proposed in 1992.[6][7][8]
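As an illustrative sketch of the attention step described above, the following Python code (assuming PyTorch; the function name and toy tensor shapes are illustrative, not the paper's reference code) computes scaled dot-product self-attention over a batch of token vectors:

    import torch
    import torch.nn.functional as F

    def scaled_dot_product_attention(q, k, v, mask=None):
        # q, k, v: (batch, seq_len, d_k); self-attention uses the same tensor for all three
        d_k = q.size(-1)
        scores = q @ k.transpose(-2, -1) / d_k ** 0.5              # token-to-token similarity scores
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))  # hide masked tokens
        weights = F.softmax(scores, dim=-1)                        # attention distribution per token
        return weights @ v                                         # weighted sum of value vectors

    # Toy usage: one sequence of 4 token vectors of size 8
    x = torch.randn(1, 4, 8)
    print(scaled_dot_product_attention(x, x, x).shape)             # torch.Size([1, 4, 8])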
This architecture is now used not only in natural language processing and computer
vision,[9] but also in audio[10] and multi-modal processing. It has also led to the development
of pre-trained systems, such as generative pre-trained
transformers (GPTs)[11] and BERT[12] (Bidirectional Encoder Representations from
Transformers).
Training
Methods for stabilizing training
The plain transformer architecture had difficulty converging. In the original paper,[1] the authors recommended using learning rate warmup. That is, the learning rate should linearly scale up from 0 to its maximal value over the first part of training (usually recommended to be 2% of the total number of training steps), before decaying again.
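A minimal sketch of such a schedule, assuming the inverse-square-root decay and default values (d_model = 512, 4,000 warmup steps) used in the original paper; the function name is illustrative:

    def transformer_lr(step, d_model=512, warmup_steps=4000):
        # Linear warmup for the first warmup_steps, then inverse-square-root decay.
        # The "2% of total steps" guideline above would set warmup_steps accordingly.
        step = max(step, 1)                            # avoid division by zero at step 0
        return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

    print(transformer_lr(4000))                        # peak learning rate, roughly 7e-4 here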
A 2020 paper found that using layer normalization before (instead of after) the multi-head attention and feedforward layers stabilizes training, removing the need for learning rate warmup.[30]
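A hedged sketch contrasting the two orderings, with `sublayer` standing in for either the multi-head attention or the feed-forward sub-layer (names and sizes are assumptions, not code from the cited papers):

    import torch.nn as nn

    d_model = 512
    norm = nn.LayerNorm(d_model)

    def post_ln_block(x, sublayer):
        return norm(x + sublayer(x))                   # original ordering: add, then normalize

    def pre_ln_block(x, sublayer):
        return x + sublayer(norm(x))                   # normalize first; reported to train without warmup[30]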
The GT3 model integrates CWTE, SWTE, and TTE using a self-adaptive gate layer, enabling
efficient and effective fusion of three types of features for end-to-end text-driven stock market
prediction.[34]
Pretrain-finetune
Transformers typically undergo self-supervised learning involving unsupervised pretraining followed by supervised fine-tuning. Pretraining is typically done on a larger dataset than fine-tuning, due to the limited availability of labeled training data. Tasks for pretraining and fine-tuning commonly include the following (a minimal fine-tuning sketch follows the list):
• language modeling[12]
• next-sentence prediction[12]
• question answering[3]
• reading comprehension
• sentiment analysis[1]
• paraphrasing[1]
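As a sketch of the fine-tuning half of this pattern, the following uses the Hugging Face Transformers library discussed under Implementations for a sentiment-analysis head on a pretrained encoder; the checkpoint name, label mapping, and single training example are illustrative assumptions:

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")      # pretrained encoder
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)                              # fresh classification head

    batch = tokenizer("A thoroughly enjoyable film.", return_tensors="pt")
    labels = torch.tensor([1])                                          # 1 = positive (assumed label map)

    outputs = model(**batch, labels=labels)                             # one supervised fine-tuning step
    outputs.loss.backward()                                             # gradients for an optimizer update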
The T5 transformer paper[35] documents a large number of pretraining tasks. Some examples
are:
• restoring corrupted text: Thank you <X> me to your party <Y> week. -> <X> for inviting <Y> last <Z>, where <Z> means "end of output".
• translation: translate English to German: That is good. -> Das ist gut.
• judging the grammatical acceptability of a sentence: cola sentence: The course is jumping well. -> not acceptable.
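A hedged sketch of this text-to-text format in code, assuming a small publicly available T5 checkpoint and the Hugging Face Transformers library; the generated output is not guaranteed to match the example above exactly:

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained("t5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

    inputs = tokenizer("translate English to German: That is good.", return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=20)            # text in, text out
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))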
Applications
The transformer has had great success in natural language processing (NLP), for example in the tasks of machine translation and time series prediction. Many large language models such
as GPT-2, GPT-3, GPT-4, Claude, BERT, XLNet, RoBERTa and ChatGPT demonstrate the
ability of transformers to perform a wide variety of such NLP-related tasks, and have the
potential to find real-world applications. These may include:
• machine translation
• document summarization
• document generation
• named entity recognition (NER)[36]
• biological sequence analysis
• writing computer code based on requirements expressed in natural language
• video understanding
In addition to NLP applications, the transformer has also been successful in other fields, such as computer vision and protein folding applications (such as AlphaFold).
As an illustrative example, Ithaca is an encoder-only transformer with three output heads. It takes as input ancient Greek inscriptions as sequences of characters, with illegible characters replaced by "-". Its three output heads respectively output probability distributions over Greek characters, the location of the inscription, and the date of the inscription.[37]
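An illustrative sketch (not Ithaca's actual architecture or code) of an encoder-only model with three output heads of this kind; all layer sizes, head counts, class counts, and the pooling choice are assumptions:

    import torch
    import torch.nn as nn

    class ThreeHeadEncoder(nn.Module):
        def __init__(self, vocab_size=128, d_model=256, n_regions=84, n_dates=100):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)     # "-" is just another character id
            layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=4)
            self.char_head = nn.Linear(d_model, vocab_size)    # distribution over characters
            self.region_head = nn.Linear(d_model, n_regions)   # location of the inscription
            self.date_head = nn.Linear(d_model, n_dates)       # date of the inscription

        def forward(self, token_ids):
            h = self.encoder(self.embed(token_ids))            # (batch, seq_len, d_model)
            pooled = h.mean(dim=1)                             # sequence-level summary
            return self.char_head(h), self.region_head(pooled), self.date_head(pooled)

    model = ThreeHeadEncoder()
    chars, region, date = model(torch.randint(0, 128, (1, 50)))  # one 50-character inscription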
Implementations
The transformer model has been implemented in standard deep learning frameworks such
as TensorFlow and PyTorch.
Transformers is a library produced by Hugging Face that supplies transformer-based
architectures and pretrained models.[11]
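A minimal usage sketch of the library; the pipeline downloads a default pretrained model for the chosen task, so network access and the choice of default checkpoint are assumptions:

    from transformers import pipeline

    classifier = pipeline("sentiment-analysis")                 # loads a default pretrained model
    print(classifier("Transformers made this task easy."))
    # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]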
Architecture