BioGPT

Overview

The BioGPT model was proposed in BioGPT: generative pre-trained transformer for biomedical text generation and mining by Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon and Tie-Yan Liu. BioGPT is a domain-specific generative pre-trained Transformer language model for biomedical text generation and mining. It follows the Transformer language model backbone and is pre-trained from scratch on 15M PubMed abstracts.

The abstract from the paper is the following:

Pre-trained language models have attracted increasing attention in the biomedical domain, inspired by their great success in the general natural language domain. Among the two main branches of pre-trained language models in the general language domain, i.e. BERT (and its variants) and GPT (and its variants), the first one has been extensively studied in the biomedical domain, such as BioBERT and PubMedBERT. While they have achieved great success on a variety of discriminative downstream biomedical tasks, the lack of generation ability constrains their application scope. In this paper, we propose BioGPT, a domain-specific generative Transformer language model pre-trained on large-scale biomedical literature. We evaluate BioGPT on six biomedical natural language processing tasks and demonstrate that our model outperforms previous models on most tasks. Especially, we get 44.98%, 38.42% and 40.76% F1 score on BC5CDR, KD-DTI and DDI end-to-end relation extraction tasks, respectively, and 78.2% accuracy on PubMedQA, creating a new record. Our case study on text generation further demonstrates the advantage of BioGPT on biomedical literature to generate fluent descriptions for biomedical terms.

This model was contributed by kamalkraj. The original code can be found here.

Usage tips

BioGPT is a model with absolute position embeddings so it's usually advised to pad the inputs on the right rather than the left.
BioGPT was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next token in a sequence. This allows BioGPT to generate syntactically coherent text, as can be observed in the run_generation.py example script.
The model can take past_key_values as input, which holds the key/value attention pairs computed on previous steps. Passing it prevents the model from re-computing those values during text generation. See the past_key_values argument of the BioGptForCausalLM.forward() method for more information on its usage.
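To make the caching idea concrete, here is a toy sketch in plain Python. It is not the library's implementation: the `project`, `attend`, `step_with_cache` and `step_without_cache` helpers are hypothetical, and the "projections" are simple scalings standing in for learned weight matrices. It only illustrates why reusing stored keys and values gives the same attention output as recomputing them at every step.

```python
import math

def project(x, scale):
    # Hypothetical stand-in for a learned key/value projection.
    return [scale * xi for xi in x]

def attend(query, keys, values):
    # Single-head scaled dot-product attention for one query vector.
    d = len(query)
    scores = [sum(a * b for a, b in zip(query, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [
        sum(e / total * v[c] for e, v in zip(exps, values))
        for c in range(len(values[0]))
    ]

def step_without_cache(tokens):
    # Recomputes keys and values for the whole prefix on every step.
    keys = [project(t, 0.5) for t in tokens]
    values = [project(t, 2.0) for t in tokens]
    return attend(tokens[-1], keys, values)

def step_with_cache(new_token, cache):
    # Computes keys and values only for the new token and reuses the rest.
    cache["keys"].append(project(new_token, 0.5))
    cache["values"].append(project(new_token, 2.0))
    return attend(new_token, cache["keys"], cache["values"])

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = {"keys": [], "values": []}
cached_outputs = [step_with_cache(t, cache) for t in tokens]
full_outputs = [step_without_cache(tokens[: i + 1]) for i in range(len(tokens))]
```

Both lists hold the same per-step outputs, but the cached path projects each token once, while the uncached path projects the full prefix again at every step.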

Using Scaled Dot Product Attention (SDPA)

PyTorch includes a native scaled dot-product attention (SDPA) operator as part of torch.nn.functional. This function encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the official documentation or the GPU Inference page for more information.
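As a reference for what the operator computes, the following is a minimal plain-Python sketch of scaled dot-product attention, softmax(QK^T / sqrt(d)) V, with an optional causal mask. It is an illustration only: the function names are chosen here for readability, and the fused PyTorch kernels reach the same result through much more efficient implementations.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(q, k, v, causal=False):
    # q, k, v are lists of equal-length vectors (one vector per position).
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        scores = []
        for j, kj in enumerate(k):
            if causal and j > i:
                # A causal mask hides positions after the current one.
                scores.append(float("-inf"))
            else:
                scores.append(sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d))
        weights = softmax(scores)
        out.append(
            [sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(len(v[0]))]
        )
    return out

q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
causal_out = scaled_dot_product_attention(q, k, v, causal=True)
```

With the causal mask, the first position can only attend to itself, so its output is the first value vector; the second position receives a weighted mix of both value vectors.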

SDPA is used by default for torch>=2.1.1 when an implementation is available, but you may also set attn_implementation="sdpa" in from_pretrained() to explicitly request SDPA to be used.

import torch
from transformers import BioGptForCausalLM
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt", attn_implementation="sdpa", torch_dtype=torch.float16)

On a local benchmark (NVIDIA GeForce RTX 2060-8GB, PyTorch 2.3.1, OS Ubuntu 20.04) with float16 and the microsoft/biogpt model with a CausalLM head, we saw the following speedups during training.

For the best speedups, we recommend loading the model in half-precision (e.g. torch.float16 or torch.bfloat16).

num_training_steps	batch_size	seq_len	is cuda	Time per batch (eager - s)	Time per batch (sdpa - s)	Speedup (%)	Eager peak mem (MB)	sdpa peak mem (MB)	Mem saving (%)
100	1	128	False	0.038	0.031	21.301	1601.862	1601.497	0.023
100	1	256	False	0.039	0.034	15.084	1624.944	1625.296	-0.022
100	2	128	False	0.039	0.033	16.820	1624.567	1625.296	-0.045
100	2	256	False	0.065	0.059	10.255	1672.164	1672.164	0.000
100	4	128	False	0.062	0.058	6.998	1671.435	1672.164	-0.044
100	4	256	False	0.113	0.100	13.316	2350.179	1848.435	27.144
100	8	128	False	0.107	0.098	9.883	2098.521	1848.435	13.530
100	8	256	False	0.222	0.196	13.413	3989.980	2986.492	33.601
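The "Mem saving (%)" column can be reproduced from the two peak-memory columns as (eager / sdpa - 1) * 100. This formula is inferred from the table values rather than stated in the text, and the `percent_gain` helper below is a hypothetical name. The "Speedup (%)" column appears to follow the same formula on unrounded timings, which is why recomputing it from the rounded times shown only gives approximate agreement.

```python
def percent_gain(eager, sdpa):
    # Relative gain of SDPA over eager, in percent (inferred formula).
    return (eager / sdpa - 1.0) * 100.0

# (batch_size, seq_len, eager peak mem in MB, sdpa peak mem in MB),
# copied from the last three rows of the training table.
rows = [
    (4, 256, 2350.179, 1848.435),
    (8, 128, 2098.521, 1848.435),
    (8, 256, 3989.980, 2986.492),
]

savings = [round(percent_gain(eager, sdpa), 3) for _, _, eager, sdpa in rows]

for (batch_size, seq_len, _, _), saving in zip(rows, savings):
    print(f"batch_size={batch_size} seq_len={seq_len} mem saving={saving:.3f}%")
```

The computed values match the table's 27.144, 13.530 and 33.601 for these rows.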

On a local benchmark (NVIDIA GeForce RTX 2060-8GB, PyTorch 2.3.1, OS Ubuntu 20.04) with float16 and the microsoft/biogpt model with a simple AutoModel head, we saw the following speedups during inference.

num_batches	batch_size	seq_len	is cuda	is half	use mask	Per token latency eager (ms)	Per token latency SDPA (ms)	Speedup (%)	Mem eager (MB)	Mem SDPA (MB)
50	1	64	True	True	True	0.115	0.098	17.392	716.998	716.998
50	1	128	True	True	True	0.115	0.093	24.640	730.916	730.916
50	2	64	True	True	True	0.114	0.096	19.204	730.900	730.900
50	2	128	True	True	True	0.117	0.095	23.529	759.262	759.262
50	4	64	True	True	True	0.113	0.096	18.325	759.229	759.229
50	4	128	True	True	True	0.186	0.178	4.289	816.478	816.478

Resources

Causal language modeling task guide

BioGptConfig

[[autodoc]] BioGptConfig

BioGptTokenizer

[[autodoc]] BioGptTokenizer - save_vocabulary

BioGptModel

[[autodoc]] BioGptModel - forward

BioGptForCausalLM

[[autodoc]] BioGptForCausalLM - forward

BioGptForTokenClassification

[[autodoc]] BioGptForTokenClassification - forward

BioGptForSequenceClassification

[[autodoc]] BioGptForSequenceClassification - forward
