The Vision Transformer (ViT) model was proposed in An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. It's the first paper that successfully trains a Transformer encoder on ImageNet, attaining very good results compared to familiar convolutional architectures.
The abstract from the paper is the following:
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
ViT architecture. Taken from the original paper.
Following the original Vision Transformer, some follow-up works have been made:
-
DeiT (Data-efficient Image Transformers) by Facebook AI. DeiT models are distilled vision transformers. The authors of DeiT also released more efficiently trained ViT models, which you can directly plug into [
ViTModel
] or [ViTForImageClassification
]. There are 4 variants available (in 3 different sizes): facebook/deit-tiny-patch16-224, facebook/deit-small-patch16-224, facebook/deit-base-patch16-224 and facebook/deit-base-patch16-384. Note that one should use [DeiTImageProcessor
] in order to prepare images for the model. -
BEiT (BERT pre-training of Image Transformers) by Microsoft Research. BEiT models outperform supervised pre-trained vision transformers using a self-supervised method inspired by BERT (masked image modeling) and based on a VQ-VAE.
-
DINO (a method for self-supervised training of Vision Transformers) by Facebook AI. Vision Transformers trained using the DINO method show very interesting properties not seen with convolutional models. They are capable of segmenting objects, without having ever been trained to do so. DINO checkpoints can be found on the hub.
-
MAE (Masked Autoencoders) by Facebook AI. By pre-training Vision Transformers to reconstruct pixel values for a high portion (75%) of masked patches (using an asymmetric encoder-decoder architecture), the authors show that this simple method outperforms supervised pre-training after fine-tuning.
This model was contributed by nielsr. The original code (written in JAX) can be found here.
Note that we converted the weights from Ross Wightman's timm library, who already converted the weights from JAX to PyTorch. Credits go to him!
- To feed images to the Transformer encoder, each image is split into a sequence of fixed-size non-overlapping patches, which are then linearly embedded. A [CLS] token is added to serve as representation of an entire image, which can be used for classification. The authors also add absolute position embeddings, and feed the resulting sequence of vectors to a standard Transformer encoder.
- As the Vision Transformer expects each image to be of the same size (resolution), one can use
[
ViTImageProcessor
] to resize (or rescale) and normalize images for the model. - Both the patch resolution and image resolution used during pre-training or fine-tuning are reflected in the name of
each checkpoint. For example,
google/vit-base-patch16-224
refers to a base-sized architecture with patch resolution of 16x16 and fine-tuning resolution of 224x224. All checkpoints can be found on the hub. - The available checkpoints are either (1) pre-trained on ImageNet-21k (a collection of 14 million images and 21k classes) only, or (2) also fine-tuned on ImageNet (also referred to as ILSVRC 2012, a collection of 1.3 million images and 1,000 classes).
- The Vision Transformer was pre-trained using a resolution of 224x224. During fine-tuning, it is often beneficial to use a higher resolution than pre-training (Touvron et al., 2019), (Kolesnikov et al., 2020). In order to fine-tune at higher resolution, the authors perform 2D interpolation of the pre-trained position embeddings, according to their location in the original image.
- The best results are obtained with supervised pre-training, which is not the case in NLP. The authors also performed an experiment with a self-supervised pre-training objective, namely masked patched prediction (inspired by masked language modeling). With this approach, the smaller ViT-B/16 model achieves 79.9% accuracy on ImageNet, a significant improvement of 2% to training from scratch, but still 4% behind supervised pre-training.
PyTorch includes a native scaled dot-product attention (SDPA) operator as part of torch.nn.functional
. This function
encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the
official documentation
or the GPU Inference
page for more information.
SDPA is used by default for torch>=2.1.1
when an implementation is available, but you may also set
attn_implementation="sdpa"
in from_pretrained()
to explicitly request SDPA to be used.
from transformers import ViTForImageClassification
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224", attn_implementation="sdpa", torch_dtype=torch.float16)
...
For the best speedups, we recommend loading the model in half-precision (e.g. torch.float16
or torch.bfloat16
).
On a local benchmark (A100-40GB, PyTorch 2.3.0, OS Ubuntu 22.04) with float32
and google/vit-base-patch16-224
model, we saw the following speedups during inference.
Batch size | Average inference time (ms), eager mode | Average inference time (ms), sdpa model | Speed up, Sdpa / Eager (x) |
---|---|---|---|
1 | 7 | 6 | 1.17 |
2 | 8 | 6 | 1.33 |
4 | 8 | 6 | 1.33 |
8 | 8 | 6 | 1.33 |
Demo notebooks regarding inference as well as fine-tuning ViT on custom data can be found here. A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ViT. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
ViTForImageClassification
is supported by:
- A blog post on how to Fine-Tune ViT for Image Classification with Hugging Face Transformers
- A blog post on Image Classification with Hugging Face Transformers and
Keras
- A notebook on Fine-tuning for Image Classification with Hugging Face Transformers
- A notebook on how to Fine-tune the Vision Transformer on CIFAR-10 with the Hugging Face Trainer
- A notebook on how to Fine-tune the Vision Transformer on CIFAR-10 with PyTorch Lightning
⚗️ Optimization
- A blog post on how to Accelerate Vision Transformer (ViT) with Quantization using Optimum
⚡️ Inference
- A notebook on Quick demo: Vision Transformer (ViT) by Google Brain
🚀 Deploy
- A blog post on Deploying Tensorflow Vision Models in Hugging Face with TF Serving
- A blog post on Deploying Hugging Face ViT on Vertex AI
- A blog post on Deploying Hugging Face ViT on Kubernetes with TF Serving
[[autodoc]] ViTConfig
[[autodoc]] ViTFeatureExtractor - call
[[autodoc]] ViTImageProcessor - preprocess
[[autodoc]] ViTImageProcessorFast - preprocess
[[autodoc]] ViTModel - forward
[[autodoc]] ViTForMaskedImageModeling - forward
[[autodoc]] ViTForImageClassification - forward
[[autodoc]] TFViTModel - call
[[autodoc]] TFViTForImageClassification - call
[[autodoc]] FlaxViTModel - call
[[autodoc]] FlaxViTForImageClassification - call