EfficientNetV2: Smaller Models and Faster Training
Abstract
This paper introduces EfficientNetV2, a new family of convolutional networks with faster training speed and better parameter efficiency than previous models. To improve the training speed and parameter efficiency of EfficientNets (Tan & Le, 2019a), we start by systematically studying the training bottlenecks in EfficientNets. Our study shows that in EfficientNets: (1) training with very large image sizes is slow; (2) depthwise convolutions are slow in early layers; (3) equally scaling up every stage is sub-optimal. Based on these observations, we design a search space enriched with additional ops such as Fused-MBConv, and apply training-aware NAS and scaling to jointly optimize model accuracy, training speed, and parameter size. The resulting networks, named EfficientNetV2, train up to 4x faster than prior models (Figure 3), while being up to 6.8x smaller in parameter size.
Our training can be further sped up by progressively increasing the image size during training. Many previous works, such as progressive resizing (Howard, 2018), FixRes (Touvron et al., 2019), and Mix&Match (Hoffer et al., 2019), have used smaller image sizes in training; however, they usually keep the same regularization for all image sizes, causing a drop in accuracy. We argue that keeping the same regularization for different image sizes is not ideal: for the same network, a small image size leads to small network capacity and thus requires weak regularization; conversely, a large image size requires stronger regularization to combat overfitting (see Section 4.1). Based on this insight, we propose an improved method of progressive learning: in the early training epochs, we train the network with a small image size and weak regularization (e.g., dropout and data augmentation); we then gradually increase the image size and add stronger regularization. Built upon progressive resizing (Howard, 2018), but with dynamically adjusted regularization, our approach speeds up training without causing an accuracy drop.

With the improved progressive learning, our EfficientNetV2 achieves strong results on the ImageNet, CIFAR-10, CIFAR-100, Cars, and Flowers datasets. On ImageNet, we achieve 85.7% top-1 accuracy while training 3x-9x faster and being up to 6.8x smaller than previous models (Figure 1). Our EfficientNetV2 and progressive learning also make it easier to train models on larger datasets. For example, ImageNet21k (Russakovsky et al., 2015) is about 10x larger than ImageNet ILSVRC2012, but our EfficientNetV2 can finish training within two days using moderate computing resources of 32 TPUv3 cores. By pretraining on the public ImageNet21k (we do not compare results on non-public JFT or Instagram data), our EfficientNetV2 achieves 87.3% top-1 accuracy on ImageNet ILSVRC2012, outperforming the recent ViT-L/16 by 2.0% accuracy while training 5x-11x faster (Figure 1).

Our contributions are threefold:

• We introduce EfficientNetV2, a new family of smaller and faster models. Found by our training-aware NAS and scaling, EfficientNetV2 outperforms previous models in both training speed and parameter efficiency.

• We propose an improved method of progressive learning, which adaptively adjusts regularization along with image size. We show that it speeds up training and simultaneously improves accuracy.

• We demonstrate up to 11x faster training speed and up to 6.8x better parameter efficiency on the ImageNet, CIFAR, Cars, and Flowers datasets than prior art.

2. Related work

Training and parameter efficiency: Many works, such as DenseNet (Huang et al., 2017) and EfficientNet (Tan & Le, 2019a), focus on parameter efficiency, aiming to achieve better accuracy with fewer parameters. More recent works aim to improve training or inference speed instead of parameter efficiency. For example, RegNet (Radosavovic et al., 2020), ResNeSt (Zhang et al., 2020), TResNet (Ridnik et al., 2020), and EfficientNet-X (Li et al., 2021) focus on GPU and/or TPU inference speed; Lambda Networks (Bello, 2021), NFNets (Brock et al., 2021), BoTNets (Srinivas et al., 2021), and ResNet-RS (Bello et al., 2021) focus on TPU training speed. However, their training speed often comes at the cost of more parameters. This paper aims to significantly improve both training and parameter efficiency over prior art.

Progressive training: Previous works have proposed different kinds of progressive training, which dynamically change the training settings or networks, for GANs (Karras et al., 2018), transfer learning (Karras et al., 2018), adversarial learning (Yu et al., 2019), and language models (Press et al., 2021). Progressive resizing (Howard, 2018) is most closely related to our approach, as it also aims to improve training speed. However, it usually comes at the cost of an accuracy drop. For example, the fast.ai team used progressive resizing in the DAWNBench competition for fast training, but it had to increase the final image size, with higher inference cost, to meet the accuracy constraint (Howard, 2018). Another closely related work is Mix&Match (Hoffer et al., 2019), which randomly samples a different image size for each batch. Both progressive resizing and Mix&Match use the same regularization for all image sizes, causing a drop in accuracy. Our main difference is that we adaptively adjust regularization as well, so that we can improve both training speed and accuracy. Our approach is also partially inspired by curriculum learning (Bengio et al., 2009), which schedules training examples from easy to hard. Our approach also gradually increases learning difficulty by adding more regularization, but we do not selectively pick training examples.

Neural architecture search (NAS): By automating the network design process, NAS has been used to optimize the network architecture for image classification (Zoph et al., 2018), object detection (Chen et al., 2019; Tan et al., 2020), segmentation (Liu et al., 2019), hyperparameters (Dong et al., 2020), and other applications (Elsken et al., 2019). Previous NAS works mostly focus on improving FLOPs efficiency (Tan & Le, 2019b;a) or inference efficiency (Tan et al., 2019; Cai et al., 2019; Wu et al., 2019; Li et al., 2021). Unlike prior works, this paper uses NAS to optimize training and parameter efficiency.
3. EfficientNetV2 Architecture Design

In this section, we study the training bottlenecks of EfficientNet (Tan & Le, 2019a) and introduce our training-aware NAS and scaling, as well as the EfficientNetV2 models.

3.1. Review of EfficientNet

EfficientNet (Tan & Le, 2019a) is a family of models optimized for FLOPs and parameter efficiency. It leverages NAS to search for the baseline EfficientNet-B0 model, which has a good trade-off between accuracy and FLOPs. The baseline model is then scaled up with a simple compound scaling strategy to obtain the family of models B1-B7. While many recent works have claimed large gains in training or inference speed, they are often much worse than EfficientNet in terms of parameter and FLOPs efficiency (Table 1). In this paper, we aim to improve the training speed while maintaining the parameter efficiency.

Table 1. EfficientNets have good parameter and FLOPs efficiency.

Model | Top-1 Acc. | Params | FLOPs
EfficientNet-B6 (Tan & Le, 2019a) | 84.3% | 43M | 19B
ResNet-RS-420 (Bello et al., 2021) | 84.4% | 192M | 128B
NFNet-F1 (Brock et al., 2021) | 84.7% | 133M | 36B

3.2. Understanding Training Efficiency

We study the training bottlenecks of EfficientNet (Tan & Le, 2019a), henceforth also called EfficientNetV1, and a few simple techniques to improve training speed.

Training with very large image sizes is slow: As shown in Table 2, training EfficientNet-B6 with its very large image size leads to substantial memory usage and forces smaller batch sizes, which drastically slows down training. Training with a smaller image size enables larger batch sizes, runs substantially faster, and even slightly improves accuracy. In Section 4, we will explore a more advanced training approach that progressively adjusts image size and regularization during training.

Table 2. EfficientNet-B6 accuracy and training throughput for different batch sizes and image sizes.

Setting | Top-1 Acc. | TPUv3 imgs/sec/core (batch=32) | TPUv3 imgs/sec/core (batch=128) | V100 imgs/sec/gpu (batch=12) | V100 imgs/sec/gpu (batch=24)
train size=512 | 84.3% | 42 | OOM | 29 | OOM
train size=380 | 84.6% | 76 | 93 | 37 | 52

Depthwise convolutions are slow in early layers: Another training bottleneck of EfficientNet comes from its extensive use of depthwise convolutions (Sifre, 2014). Depthwise convolutions have fewer parameters and FLOPs than regular convolutions, but they often cannot fully utilize modern accelerators. Recently, Fused-MBConv was proposed in (Gupta & Tan, 2019) and later used in (Gupta & Akin, 2020; Xiong et al., 2020; Li et al., 2021) to better utilize mobile or server accelerators. It replaces the depthwise conv3x3 and expansion conv1x1 in MBConv (Sandler et al., 2018; Tan & Le, 2019a) with a single regular conv3x3, as shown in Figure 2. To systematically compare these two building blocks, we gradually replace the original MBConv in EfficientNet-B4 with Fused-MBConv (Table 3). When applied in the early stages 1-3, Fused-MBConv improves training speed with a small overhead in parameters and FLOPs, but if we replace all blocks with Fused-MBConv (stages 1-7), it significantly increases parameters and FLOPs while also slowing down training. Finding the right combination of these two building blocks, MBConv and Fused-MBConv, is non-trivial, which motivates us to leverage neural architecture search to automatically find the best combination.

Figure 2. Structure of MBConv and Fused-MBConv. Both blocks expand the input to an H,W,4C feature map and apply squeeze-and-excitation (SE) before the conv1x1 projection; MBConv produces the expanded features with a conv1x1 followed by a depthwise conv3x3, while Fused-MBConv uses a single regular conv3x3.
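To make the two blocks in Figure 2 concrete, the following is a minimal Keras-style sketch (not the official implementation). The swish activation, BatchNorm placement, and SE reduction ratio are illustrative assumptions; Figure 2 draws SE in both blocks, while the searched EfficientNetV2-S (Table 4) uses SE only in its MBConv stages.

```python
# Minimal sketch of MBConv vs. Fused-MBConv as drawn in Figure 2 (illustrative only).
import tensorflow as tf
from tensorflow.keras import layers


def squeeze_excite(x, reduced_filters):
    """Squeeze-and-excitation: global pool, reduce, expand, then rescale x."""
    s = layers.GlobalAveragePooling2D(keepdims=True)(x)
    s = layers.Conv2D(reduced_filters, 1, activation="swish")(s)
    s = layers.Conv2D(x.shape[-1], 1, activation="sigmoid")(s)
    return x * s


def mbconv(x, out_filters, expand_ratio=4, stride=1):
    """MBConv: conv1x1 expansion -> depthwise conv3x3 -> SE -> conv1x1 projection."""
    in_filters = x.shape[-1]
    h = layers.Conv2D(in_filters * expand_ratio, 1, padding="same", use_bias=False)(x)
    h = layers.Activation("swish")(layers.BatchNormalization()(h))
    h = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(h)
    h = layers.Activation("swish")(layers.BatchNormalization()(h))
    h = squeeze_excite(h, max(1, in_filters // 4))             # SE0.25
    h = layers.Conv2D(out_filters, 1, padding="same", use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    if stride == 1 and in_filters == out_filters:
        h = layers.Add()([x, h])                               # residual shortcut
    return h


def fused_mbconv(x, out_filters, expand_ratio=4, stride=1):
    """Fused-MBConv: one regular conv3x3 replaces the expansion conv1x1 + depthwise conv3x3."""
    in_filters = x.shape[-1]
    h = layers.Conv2D(in_filters * expand_ratio, 3, strides=stride,
                      padding="same", use_bias=False)(x)
    h = layers.Activation("swish")(layers.BatchNormalization()(h))
    h = layers.Conv2D(out_filters, 1, padding="same", use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    if stride == 1 and in_filters == out_filters:
        h = layers.Add()([x, h])
    return h


# Example: one block of each kind on a 56x56x24 feature map.
inputs = tf.keras.Input((56, 56, 24))
model = tf.keras.Model(inputs, fused_mbconv(mbconv(inputs, 24), 24))
```

The sketch makes the trade-off visible: the fused variant trades the cheap depthwise conv3x3 for a denser regular conv3x3, which costs more FLOPs but maps better onto accelerator matrix units.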
Equally scaling up every stage is sub-optimal: EfficientNet equally scales up all stages using a simple compound scaling rule. For example, when the depth coefficient is 2, all stages in the network double their number of layers. However, these stages do not contribute equally to training speed and parameter efficiency. In this paper, we will use a non-uniform scaling strategy to gradually add more layers to later stages. In addition, EfficientNets aggressively scale up image size, leading to large memory consumption and slow training. To address this issue, we slightly modify the scaling rule and restrict the maximum image size to a smaller value.

3.3. Training-Aware NAS and Scaling

To this end, we have learned multiple design choices for improving training speed. To search for the best combination of those choices, we now propose a training-aware NAS.

NAS Search: Our training-aware NAS framework is largely based on previous NAS works (Tan et al., 2019; Tan & Le, 2019a), but aims to jointly optimize accuracy, parameter efficiency, and training efficiency on modern accelerators. Specifically, we use EfficientNet as our backbone. Our search space is a stage-based factorized space similar to (Tan et al., 2019), which consists of design choices for convolutional operation type {MBConv, Fused-MBConv}, number of layers, kernel size {3x3, 5x5}, and expansion ratio {1, 4, 6}. On the other hand, we reduce the search space size by (1) removing unnecessary search options such as pooling skip ops, since they are never used in the original EfficientNets, and (2) reusing the same channel sizes from the backbone, as they were already searched in (Tan & Le, 2019a). Since the search space is smaller, we can simply apply random search on much larger networks of comparable size to EfficientNet-B4. Specifically, we sample up to 1000 models and train each model for about 10 epochs with a reduced training image size. Our search reward combines the model accuracy A, the normalized training step time S, and the parameter size P, using a simple weighted product A · S^w · P^v, where w = -0.07 and v = -0.05 are hyperparameters empirically determined to balance the trade-offs, similar to (Tan et al., 2019).
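For intuition, here is a plain-Python sketch of how such a weighted-product reward behaves; only the form A · S^w · P^v with w = -0.07 and v = -0.05 comes from the text, while the function name and the example values are our own illustrative assumptions.

```python
def search_reward(accuracy, step_time, param_size, w=-0.07, v=-0.05):
    """Weighted-product reward A * S^w * P^v used to rank sampled models.

    accuracy:   top-1 accuracy A in [0, 1]
    step_time:  normalized training step time S (lower is better)
    param_size: normalized parameter size P (lower is better)
    The negative exponents penalize slower or larger models.
    """
    return accuracy * (step_time ** w) * (param_size ** v)


# A slightly less accurate but faster and smaller candidate can win the ranking.
print(search_reward(accuracy=0.82, step_time=1.0, param_size=1.0))  # ~0.82
print(search_reward(accuracy=0.81, step_time=0.5, param_size=0.7))  # ~0.87
```

The small exponents mean accuracy dominates the reward, with training time and parameter size acting as soft tie-breakers.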
Table 4. EfficientNetV2-S architecture – MBConv and Fused-MBConv blocks are described in Figure 2.

Stage | Operator | Stride | #Channels | #Layers
0 | Conv3x3 | 2 | 24 | 1
1 | Fused-MBConv1, k3x3 | 1 | 24 | 2
2 | Fused-MBConv4, k3x3 | 2 | 48 | 4
3 | Fused-MBConv4, k3x3 | 2 | 64 | 4
4 | MBConv4, k3x3, SE0.25 | 2 | 128 | 6
5 | MBConv6, k3x3, SE0.25 | 1 | 160 | 9
6 | MBConv6, k3x3, SE0.25 | 2 | 272 | 15
7 | Conv1x1 & Pooling & FC | - | 1792 | 1

EfficientNetV2 Architecture: Table 4 shows the architecture of our searched model, EfficientNetV2-S. Compared to the EfficientNet backbone, our searched EfficientNetV2 has several major distinctions: (1) EfficientNetV2 extensively uses both MBConv (Sandler et al., 2018; Tan & Le, 2019a) and the newly added Fused-MBConv (Gupta & Tan, 2019) in the early layers. (2) EfficientNetV2 prefers a smaller expansion ratio for MBConv, since smaller expansion ratios tend to have less memory access overhead. (3) EfficientNetV2 prefers smaller 3x3 kernel sizes, but adds more layers to compensate for the reduced receptive field resulting from the smaller kernel size. (4) Lastly, EfficientNetV2 completely removes the last stride-1 stage in the original EfficientNet, perhaps due to its large parameter size and memory access overhead.
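For reference, the stage layout of Table 4 can be written down as a plain configuration structure; the sketch below is only a transcription of the table with field names of our own choosing, not the official model definition.

```python
# EfficientNetV2-S stage configuration transcribed from Table 4.
# Each entry: (operator, expansion, kernel, stride, channels, layers, use_se)
EFFICIENTNETV2_S_STAGES = [
    ("conv",         None, 3, 2,   24,  1, False),  # stage 0: stem Conv3x3
    ("fused_mbconv", 1,    3, 1,   24,  2, False),  # stage 1
    ("fused_mbconv", 4,    3, 2,   48,  4, False),  # stage 2
    ("fused_mbconv", 4,    3, 2,   64,  4, False),  # stage 3
    ("mbconv",       4,    3, 2,  128,  6, True),   # stage 4, SE0.25
    ("mbconv",       6,    3, 1,  160,  9, True),   # stage 5, SE0.25
    ("mbconv",       6,    3, 2,  272, 15, True),   # stage 6, SE0.25
    ("head",         None, 1, 1, 1792,  1, False),  # stage 7: Conv1x1 & Pooling & FC
]

# Total number of MBConv/Fused-MBConv blocks listed in the table:
print(sum(layers for op, _, _, _, _, layers, _ in EFFICIENTNETV2_S_STAGES
          if op in ("mbconv", "fused_mbconv")))  # 40
```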
EfficientNetV2 Scaling: We scale up EfficientNetV2-S to obtain EfficientNetV2-M/L using compound scaling similar to (Tan & Le, 2019a), with a few additional optimizations: (1) we restrict the maximum inference image size to 480, as very large images often lead to expensive memory and training speed overhead; (2) as a heuristic, we also gradually add more layers to later stages (e.g., stages 5 and 6 in Table 4) in order to increase the network capacity without adding much runtime overhead.

Figure 3. ImageNet accuracy and training step time on TPUv3 – Lower step time is better; all models are trained with fixed image size without progressive learning. (x-axis: step time in ms with batch 32 per core; y-axis: ImageNet top-1 accuracy; compared models: EffNetV2, NFNet, LambdaResNet, BoTNet, ResNet-RS, and the EfficientNet baseline and reproduction.)

Training Speed Comparison: Figure 3 compares the training step time of our new EfficientNetV2, where all models are trained with a fixed image size without progressive learning. For EfficientNet (Tan & Le, 2019a), we show two curves: one trained with the original inference size, and the other trained with an about 30% smaller image size, the same as NFNet (Touvron et al., 2019; Brock et al., 2021). All models are trained for 350 epochs, except NFNets, which are trained for 360 epochs, so all models have a similar number of training steps. Interestingly, we observe that when trained properly, EfficientNets still achieve a pretty strong performance trade-off; nevertheless, with our training-aware NAS and scaling, the proposed EfficientNetV2 models train much faster than the other recent models.
4. Progressive Learning

4.1. Motivation

As discussed in Section 3, image size plays an important role in training efficiency. In addition to FixRes (Touvron et al., 2019), many other works dynamically change image sizes during training (Howard, 2018; Hoffer et al., 2019), but they often cause a drop in accuracy.

We hypothesize that the accuracy drop comes from unbalanced regularization: when training with different image sizes, we should also adjust the regularization strength accordingly (instead of using fixed regularization as in previous works). In fact, it is common for large models to require stronger regularization to combat overfitting: for example, EfficientNet-B7 uses larger dropout and stronger data augmentation than B0. In this paper, we argue that even for the same network, a smaller image size leads to smaller network capacity and thus needs weaker regularization; vice versa, a larger image size leads to more computation with larger capacity, and is thus more vulnerable to overfitting.

To validate our hypothesis, we train a model, sampled from our search space, with different image sizes and data augmentations (Table 5). When the image size is small, it has the best accuracy with weak augmentation; but for larger images, it performs better with stronger augmentation. This insight motivates us to adaptively adjust regularization along with image size during training, leading to our improved method of progressive learning.

Table 5. ImageNet top-1 accuracy. We use RandAug (Cubuk et al., 2020), and report mean and stdev for 3 runs.

Setting | Size=128 | Size=192 | Size=300
RandAug magnitude=5 | 78.3 ±0.16 | 81.2 ±0.06 | 82.5 ±0.05
RandAug magnitude=10 | 78.0 ±0.08 | 81.6 ±0.08 | 82.7 ±0.08
RandAug magnitude=15 | 77.7 ±0.15 | 81.5 ±0.05 | 83.2 ±0.09

4.2. Progressive Learning with Adaptive Regularization

Figure 4 illustrates the training process of our improved progressive learning: in the early training epochs, we train the network with smaller images and weak regularization, so that the network can learn simple representations easily and fast. Then, we gradually increase the image size while also making learning more difficult by adding stronger regularization. Our approach is built upon (Howard, 2018), which progressively changes image size, but here we adaptively adjust regularization as well.

Figure 4. Training process in our improved progressive learning – It starts with a small image size and weak regularization (epoch=1), and then gradually increases the learning difficulty with larger image sizes and stronger regularization: larger dropout rate, RandAugment magnitude, and mixup ratio (e.g., epoch=300).

Formally, suppose the whole training has N total steps, the target image size is S_e, and the target regularization magnitudes are Φ_e = {φ_e^k}, where k indexes a type of regularization such as the dropout rate or mixup ratio. We divide the training into M stages: for each stage 1 ≤ i ≤ M, the model is trained with image size S_i and regularization magnitudes Φ_i = {φ_i^k}. The last stage M uses the target image size S_e and regularization Φ_e. For simplicity, we heuristically pick the initial image size S_0 and regularization Φ_0, and then use linear interpolation to determine the values for each stage. Algorithm 1 summarizes the procedure. At the beginning of each stage, the network inherits all weights from the previous stage. Unlike transformers, whose weights (e.g., position embeddings) may depend on input length, ConvNet weights are independent of image size and thus can be inherited easily.

Algorithm 1 Progressive learning with adaptive regularization.
  Input: Initial image size S_0 and regularization {φ_0^k}.
  Input: Final image size S_e and regularization {φ_e^k}.
  Input: Number of total training steps N and stages M.
  for i = 0 to M − 1 do
    Image size: S_i ← S_0 + (S_e − S_0) · i/(M−1)
    Regularization: R_i ← {φ_i^k = φ_0^k + (φ_e^k − φ_0^k) · i/(M−1)}
    Train the model for N/M steps with S_i and R_i.
  end for

Our improved progressive learning is generally compatible with existing regularization. For simplicity, this paper mainly studies the following three types of regularization:

• Dropout (Srivastava et al., 2014): a network-level regularization, which reduces co-adaptation by randomly dropping channels. We adjust the dropout rate γ.

• RandAugment (Cubuk et al., 2020): a per-image data augmentation, with an adjustable magnitude.

• Mixup (Zhang et al., 2018): a cross-image data augmentation. Given two images with labels (x_i, y_i) and (x_j, y_j), it combines them with mixup ratio λ: x̃_i = λx_j + (1 − λ)x_i and ỹ_i = λy_j + (1 − λ)y_i. We adjust the mixup ratio λ during training.
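Below is a minimal Python sketch of Algorithm 1's schedule: it linearly interpolates the image size and each regularization magnitude (dropout, RandAugment magnitude, mixup ratio) across the M stages. The function name, the rounding of image sizes, and the way a stage would be "trained" are our own assumptions; the interpolation itself follows the algorithm.

```python
def progressive_schedule(s0, se, reg0, rege, num_stages):
    """Yield (image_size, regularization) for each stage of Algorithm 1.

    s0, se:      initial and final image sizes (S_0, S_e)
    reg0, rege:  dicts of initial/final regularization magnitudes
    num_stages:  number of training stages M
    """
    for i in range(num_stages):
        t = i / (num_stages - 1) if num_stages > 1 else 1.0
        size = int(round(s0 + (se - s0) * t))
        regs = {k: reg0[k] + (rege[k] - reg0[k]) * t for k in reg0}
        yield size, regs


# Example with the EfficientNetV2-S ranges from Table 6 (four stages):
reg_min = {"randaug": 5, "mixup": 0.0, "dropout": 0.1}
reg_max = {"randaug": 15, "mixup": 0.0, "dropout": 0.3}
for stage, (size, regs) in enumerate(progressive_schedule(128, 300, reg_min, reg_max, 4)):
    # In real training, each stage would run for N/M steps with these settings,
    # inheriting all weights from the previous stage.
    print(stage, size, regs)
```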
5. Main Results

This section presents our experimental setups, the main results on ImageNet, and the transfer learning results on CIFAR-10, CIFAR-100, Cars, and Flowers.

5.1. ImageNet ILSVRC2012

Setup: ImageNet ILSVRC2012 (Russakovsky et al., 2015) contains about 1.28M training images and 50,000 validation images with 1000 classes. During architecture search or hyperparameter tuning, we reserve 25,000 images (about 2%) from the training set as a minival split for accuracy evaluation. We also use minival to perform early stopping. Our ImageNet training settings largely follow EfficientNets (Tan & Le, 2019a): RMSProp optimizer with decay 0.9 and momentum 0.9; batch norm momentum 0.99; weight decay 1e-5. Each model is trained for 350 epochs with a total batch size of 4096. The learning rate is first warmed up from 0 to 0.256, and then decayed by 0.97 every 2.4 epochs. We use exponential moving average with 0.9999 decay rate, RandAugment (Cubuk et al., 2020), Mixup (Zhang et al., 2018), Dropout (Srivastava et al., 2014), and stochastic depth (Huang et al., 2016) with 0.8 survival probability.
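The learning-rate schedule above (warmup to 0.256, then a 0.97 decay every 2.4 epochs) can be written down compactly. The sketch below is a plain-Python illustration of that schedule; the peak rate and decay follow the stated settings, while the warmup length and function name are our own assumptions, not the paper's code.

```python
def learning_rate(step, steps_per_epoch, warmup_epochs=5,
                  peak_lr=0.256, decay_rate=0.97, decay_epochs=2.4):
    """Warmup followed by staircase exponential decay, as used with RMSProp.

    peak_lr=0.256 and the 0.97-every-2.4-epochs decay follow the text;
    the 5-epoch warmup length is an assumption.
    """
    epoch = step / steps_per_epoch
    if epoch < warmup_epochs:
        return peak_lr * epoch / warmup_epochs          # linear warmup from 0
    decay_steps = (epoch - warmup_epochs) // decay_epochs
    return peak_lr * (decay_rate ** decay_steps)        # staircase decay


# Example: with batch size 4096 on ImageNet (~1.28M images), one epoch is ~312 steps.
steps_per_epoch = 1_281_167 // 4096
for e in (0, 5, 50, 350):
    print(e, round(learning_rate(e * steps_per_epoch, steps_per_epoch), 5))
```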
Table 6. Progressive training settings for EfficientNetV2.

Setting | S (min) | S (max) | M (min) | M (max) | L (min) | L (max)
Image Size | 128 | 300 | 128 | 380 | 128 | 380
RandAugment | 5 | 15 | 5 | 20 | 5 | 25
Mixup alpha | 0 | 0 | 0 | 0.2 | 0 | 0.4
Dropout rate | 0.1 | 0.3 | 0.1 | 0.4 | 0.1 | 0.5

For progressive learning, we divide the training process into four stages of about 87 epochs per stage: the early stages use a small image size with weak regularization, while the later stages use larger image sizes with stronger regularization, as described in Algorithm 1. Table 6 shows the minimum (for the first stage) and maximum (for the last stage) values of image size and regularization. For simplicity, all models use the same minimum values of size and regularization, but they adopt different maximum values, as larger models generally require more regularization to combat overfitting. Following (Touvron et al., 2020), our maximum image size for training is about 20% smaller than for inference, but we do not finetune any layers after training.

Results: As shown in Table 7, our EfficientNetV2 models outperform previous models in accuracy, training speed, and parameter efficiency. Notably, this speedup is a combination of progressive training and better networks, and we will study the individual impact of each of them in our ablation studies.

Recently, Vision Transformers have demonstrated impressive results on ImageNet accuracy and training speed. However, here we show that properly designed ConvNets with an improved training method can still largely outperform vision transformers in both accuracy and training efficiency. In particular, our EfficientNetV2-L achieves 85.7% top-1 accuracy, surpassing ViT-L/16(21k), a much larger transformer model pretrained on the larger ImageNet21k dataset. Here, ViTs are not well tuned on ImageNet ILSVRC2012; DeiTs use the same architectures as ViTs, but achieve better results by adding more regularization.

Although our EfficientNetV2 models are optimized for training, they also perform well for inference, because training speed often correlates with inference speed. Figure 5 visualizes the model size, FLOPs, and inference latency based on Table 7. Since latency often depends on hardware and software, we use the same PyTorch Image Models codebase (Wightman, 2021) and run all models on the same machine with batch size 16. In general, our models have slightly better parameter/FLOPs efficiency than EfficientNets, but our inference latency is up to 3x faster than EfficientNets. Compared to the recent ResNeSt, which is specially optimized for GPUs, our EfficientNetV2-M achieves 0.6% better accuracy with 2.8x faster inference speed.

5.2. ImageNet21k

Setup: ImageNet21k (Russakovsky et al., 2015) contains about 13M training images with 21,841 classes. The original ImageNet21k does not have a train/eval split, so we reserve 100,000 randomly picked images as the validation set and use the remaining images as the training set. We largely reuse the same training settings as for ImageNet ILSVRC2012, with a few changes: (1) we change the training epochs to 60 or 30 to reduce training time, and use cosine learning rate decay, which can adapt to different numbers of steps without extra tuning; (2) since each image has multiple labels, we normalize the labels to sum to 1 before computing the softmax loss. After pretraining on ImageNet21k, each model is finetuned on ILSVRC2012 for 15 epochs using cosine learning rate decay.
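A small sketch of point (2) above, normalizing a multi-hot label vector to sum to 1 before the softmax cross-entropy, is shown below in plain NumPy; it illustrates the stated loss formulation and is not the actual training code.

```python
import numpy as np

def multilabel_softmax_loss(logits, multi_hot):
    """Softmax cross-entropy against labels normalized to sum to 1.

    logits:    (num_classes,) raw model outputs
    multi_hot: (num_classes,) 0/1 vector with possibly several positive labels
    """
    target = multi_hot / multi_hot.sum()                  # labels sum to 1
    logits = logits - logits.max()                        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())     # log-softmax
    return -(target * log_probs).sum()                    # cross-entropy


# Example: an ImageNet21k-style image tagged with two of five classes.
loss = multilabel_softmax_loss(np.array([2.0, 0.5, -1.0, 0.1, 0.0]),
                               np.array([1, 0, 0, 1, 0], dtype=np.float64))
print(round(loss, 4))
```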
Table 7. EfficientNetV2 Performance Results on ImageNet (Russakovsky et al., 2015) – Infer-time is measured on V100 GPU FP16 with batch size 16 using the same codebase (Wightman, 2021); Train-time is the total training time normalized for 32 TPU cores. Models marked with (21k) are pretrained on ImageNet21k with 13M images, and others are directly trained on ImageNet ILSVRC2012 with 1.28M images from scratch. All EfficientNetV2 models are trained with our improved method of progressive learning. We do not include models pretrained on non-public Instagram/JFT images, or models with extra distillation or ensembles.

ConvNets & Hybrid:
Model | Top-1 Acc. | Params | FLOPs | Infer-time (ms) | Train-time (hours)
EfficientNet-B3 (Tan & Le, 2019a) | 81.5% | 12M | 1.9B | 19 | 10
EfficientNet-B4 (Tan & Le, 2019a) | 82.9% | 19M | 4.2B | 30 | 21
EfficientNet-B5 (Tan & Le, 2019a) | 83.7% | 30M | 10B | 60 | 43
EfficientNet-B6 (Tan & Le, 2019a) | 84.3% | 43M | 19B | 97 | 75
EfficientNet-B7 (Tan & Le, 2019a) | 84.7% | 66M | 38B | 170 | 139
RegNetY-8GF (Radosavovic et al., 2020) | 81.7% | 39M | 8B | 21 | -
RegNetY-16GF (Radosavovic et al., 2020) | 82.9% | 84M | 16B | 32 | -
ResNeSt-101 (Zhang et al., 2020) | 83.0% | 48M | 13B | 31 | -
ResNeSt-200 (Zhang et al., 2020) | 83.9% | 70M | 36B | 76 | -
ResNeSt-269 (Zhang et al., 2020) | 84.5% | 111M | 78B | 160 | -
TResNet-L (Ridnik et al., 2020) | 83.8% | 56M | - | 45 | -
TResNet-XL (Ridnik et al., 2020) | 84.3% | 78M | - | 66 | -
EfficientNet-X (Li et al., 2021) | 84.7% | 73M | 91B | - | -
NFNet-F0 (Brock et al., 2021) | 83.6% | 72M | 12B | 30 | 8.9
NFNet-F1 (Brock et al., 2021) | 84.7% | 133M | 36B | 70 | 20
NFNet-F2 (Brock et al., 2021) | 85.1% | 194M | 63B | 124 | 36
NFNet-F3 (Brock et al., 2021) | 85.7% | 255M | 115B | 203 | 65
NFNet-F4 (Brock et al., 2021) | 85.9% | 316M | 215B | 309 | 126
ResNet-RS (Bello et al., 2021) | 84.4% | 192M | 128B | - | 61
LambdaResNet-420-hybrid (Bello, 2021) | 84.9% | 125M | - | - | 67
BotNet-T7-hybrid (Srinivas et al., 2021) | 84.7% | 75M | 46B | - | 95
BiT-M-R152x2 (21k) (Kolesnikov et al., 2020) | 85.2% | 236M | 135B | 500 | -

Vision Transformers:
ViT-B/32 (Dosovitskiy et al., 2021) | 73.4% | 88M | 13B | 13 | -
ViT-B/16 (Dosovitskiy et al., 2021) | 74.9% | 87M | 56B | 68 | -
DeiT-B (ViT+reg) (Touvron et al., 2021) | 81.8% | 86M | 18B | 19 | -
DeiT-B-384 (ViT+reg) (Touvron et al., 2021) | 83.1% | 86M | 56B | 68 | -
T2T-ViT-19 (Yuan et al., 2021) | 81.4% | 39M | 8.4B | - | -
T2T-ViT-24 (Yuan et al., 2021) | 82.2% | 64M | 13B | - | -
ViT-B/16 (21k) (Dosovitskiy et al., 2021) | 84.6% | 87M | 56B | 68 | -
ViT-L/16 (21k) (Dosovitskiy et al., 2021) | 85.3% | 304M | 192B | 195 | 172

ConvNets (ours):
EfficientNetV2-S | 83.9% | 24M | 8.8B | 24 | 7.1
EfficientNetV2-M | 85.1% | 55M | 24B | 57 | 13
EfficientNetV2-L | 85.7% | 121M | 53B | 98 | 24
EfficientNetV2-S (21k) | 85.0% | 24M | 8.8B | 24 | 9.0
EfficientNetV2-M (21k) | 86.1% | 55M | 24B | 57 | 15
EfficientNetV2-L (21k) | 86.8% | 121M | 53B | 98 | 26
Figure 5. Model Size, FLOPs, and Inference Latency – Panels: (a) Parameters, (b) FLOPs, (c) GPU V100 latency (batch 16); y-axis: ImageNet ILSVRC top-1 accuracy (%). Latency is measured with batch size 16 on a V100 GPU. 21k denotes models pretrained on ImageNet21k images; others are trained only on ImageNet ILSVRC2012. Our EfficientNetV2 has slightly better parameter efficiency than EfficientNet, but runs 3x faster for inference.
Table 8. Transfer Learning Performance Comparison – All models are pretrained on ImageNet ILSVRC2012 and finetuned on downstream datasets. Transfer learning accuracy is averaged over five runs.

ConvNets:
Model | Params | ImageNet Acc. | CIFAR-10 | CIFAR-100 | Flowers | Cars
GPipe (Huang et al., 2019) | 556M | 84.4 | 99.0 | 91.3 | 98.8 | 94.7
EfficientNet-B7 (Tan & Le, 2019a) | 66M | 84.7 | 98.9 | 91.7 | 98.8 | 94.7

Vision Transformers:
ViT-B/32 (Dosovitskiy et al., 2021) | 88M | 73.4 | 97.8 | 86.3 | 85.4 | -
ViT-B/16 (Dosovitskiy et al., 2021) | 87M | 74.9 | 98.1 | 87.1 | 89.5 | -
ViT-L/32 (Dosovitskiy et al., 2021) | 306M | 71.2 | 97.9 | 87.1 | 86.4 | -
ViT-L/16 (Dosovitskiy et al., 2021) | 306M | 76.5 | 97.9 | 86.4 | 89.7 | -
DeiT-B (ViT+regularization) (Touvron et al., 2021) | 86M | 81.8 | 99.1 | 90.8 | 98.4 | 92.1
DeiT-B-384 (ViT+regularization) (Touvron et al., 2021) | 86M | 83.1 | 99.1 | 90.8 | 98.5 | 93.3

ConvNets (ours):
EfficientNetV2-S | 24M | 83.2 | 98.7±0.04 | 91.5±0.11 | 97.9±0.13 | 93.8±0.11
EfficientNetV2-M | 55M | 85.1 | 99.0±0.08 | 92.2±0.08 | 98.5±0.08 | 94.6±0.10
EfficientNetV2-L | 121M | 85.7 | 99.1±0.03 | 92.3±0.13 | 98.8±0.05 | 95.1±0.10
The ImageNet21k results in Table 7 highlight two observations:

• Scaling up data size is more effective than simply scaling up model size in the high-accuracy regime: when the top-1 accuracy is beyond 85%, it is very difficult to further improve it by simply increasing model size, due to severe overfitting. However, the extra ImageNet21k pretraining can significantly improve accuracy. The effectiveness of large datasets has also been observed in previous works (Mahajan et al., 2018; Xie et al., 2020; Dosovitskiy et al., 2021).

• Pretraining on ImageNet21k can be quite efficient. Although ImageNet21k has 10x more data, our training approach enables us to finish the pretraining of EfficientNetV2 within two days using 32 TPU cores (instead of weeks for ViT (Dosovitskiy et al., 2021)). This is more effective than training larger models on ImageNet. We suggest that future research on large-scale models use the public ImageNet21k as a default dataset.

5.3. Transfer Learning Datasets

Setup: We evaluate our models on four transfer learning datasets: CIFAR-10, CIFAR-100, Flowers, and Cars. Table 9 lists the statistics of these datasets.

Table 9. Transfer learning datasets.

Dataset | Train images | Eval images | Classes
CIFAR-10 (Krizhevsky & Hinton, 2009) | 50,000 | 10,000 | 10
CIFAR-100 (Krizhevsky & Hinton, 2009) | 50,000 | 10,000 | 100
Flowers (Nilsback & Zisserman, 2008) | 2,040 | 6,149 | 102
Cars (Krause et al., 2013) | 8,144 | 8,041 | 196

For this experiment, we use the checkpoints trained on ImageNet ILSVRC2012. For a fair comparison, no ImageNet21k images are used here. Our finetuning settings are mostly the same as for ImageNet training, with a few modifications similar to (Dosovitskiy et al., 2021; Touvron et al., 2021): we use a smaller batch size of 512 and a smaller initial learning rate of 0.001 with cosine decay. For all datasets, we train each model for a fixed 10,000 steps. Since each model is finetuned with very few steps, we disable weight decay and use a simple cutout data augmentation, sketched below.
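As a reference for the augmentation just mentioned, here is a minimal NumPy sketch of cutout (zeroing one randomly positioned square patch); the patch size and function name are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def cutout(image, patch_size=56, rng=None):
    """Zero out one randomly positioned square patch of an HxWxC image."""
    rng = rng if rng is not None else np.random.default_rng()
    h, w = image.shape[:2]
    # Sample the patch center, then clip the patch to the image borders.
    cy, cx = rng.integers(0, h), rng.integers(0, w)
    y0, y1 = max(0, cy - patch_size // 2), min(h, cy + patch_size // 2)
    x0, x1 = max(0, cx - patch_size // 2), min(w, cx + patch_size // 2)
    out = image.copy()
    out[y0:y1, x0:x1] = 0
    return out


# Example: apply cutout to a random 224x224 RGB image.
img = np.random.default_rng(0).random((224, 224, 3))
print(cutout(img, patch_size=56).shape)  # (224, 224, 3)
```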
Results: Table 8 compares the transfer learning performance. In general, our models outperform previous ConvNets and Vision Transformers on all these datasets, sometimes by a non-trivial margin: for example, on CIFAR-100, EfficientNetV2-L achieves 0.6% better accuracy than prior GPipe/EfficientNets and 1.5% better accuracy than prior ViT/DeiT models. These results suggest that our models also generalize well beyond ImageNet.

6. Ablation Studies

6.1. Comparison to EfficientNet

In this section, we compare our EfficientNetV2 (V2 for short) with EfficientNets (Tan & Le, 2019a) (V1 for short) under the same training and inference settings.

Performance with the same training: Table 10 shows the performance comparison using the same progressive learning settings. When we apply the same progressive learning to EfficientNet, its training speed (reduced from 139h to 54h) and accuracy (improved from 84.7% to 85.0%) are better than in the original paper (Tan & Le, 2019a). However, as shown in Table 10, our EfficientNetV2 models still outperform EfficientNets by a large margin: EfficientNetV2-M reduces parameters by 17% and FLOPs by 37%, while running 4.1x faster in training and 3.1x faster in inference than EfficientNet-B7. Since we are using the same training settings here, we attribute the gains to the EfficientNetV2 architecture.

Table 10. Comparison with the same training settings – Our new EfficientNetV2-M runs faster with fewer parameters.

Model | Acc. (%) | Params (M) | FLOPs (B) | TrainTime (h) | InferTime (ms)
V1-B7 | 85.0 | 66 | 38 | 54 | 170
V2-M (ours) | 85.1 | 55 (-17%) | 24 (-37%) | 13 (-76%) | 57 (-66%)

Scaling Down: Previous sections mostly focus on large-scale models. Here we compare smaller models by scaling down our EfficientNetV2-S using similar compound scaling
coefficients as EfficientNet. For easy comparison, all models are trained without progressive learning. Compared to these small-size EfficientNets (V1), our new EfficientNetV2 models are generally faster while maintaining comparable parameter efficiency.

Table 11. Scaling down model size – We measure the inference throughput on a V100 FP16 GPU with batch size 128.

Model | Top-1 Acc. | Parameters | Throughput (imgs/sec)
V1-B1 | 79.0% | 7.8M | 2675
V2-7M | 78.7% | 7.4M | 5739 (2.1x)
V1-B2 | 79.8% | 9.1M | 2003
V2-8M | 79.8% | 8.1M | 3983 (2.0x)
V1-B4 | 82.9% | 19M | 628
V2-14M | 82.1% | 14M | 1693 (2.7x)
V1-B5 | 83.7% | 30M | 291
V2-S | 83.6% | 24M | 901 (3.1x)

6.2. Progressive Learning for Different Networks

We ablate the performance of our progressive learning for different networks. Table 12 shows the performance comparison between our progressive training and the baseline training, using the same ResNet and EfficientNet models. Here, the baseline ResNets have higher accuracy than in the original paper (He et al., 2016) because they are trained with our improved training settings (see Section 5) using more epochs and better optimizers. We also increase the image size from 224 to 380 for ResNets to further increase the network capacity and accuracy.

6.3. Importance of Adaptive Regularization

A key insight from our training approach is adaptive regularization, which dynamically adjusts regularization according to image size. This paper chooses a simple progressive approach for its simplicity, but it is also a general method that can be combined with other approaches as well.

Table 13 studies our adaptive regularization in two training settings: one progressively increases image size from small to large (Howard, 2018), and the other randomly samples a different image size for each batch, as proposed in Mix&Match (Hoffer et al., 2019). Because TPUs need to recompile the graph for each new size, here we randomly sample an image size every eight epochs instead of every batch. Compared to the vanilla approaches of progressive or random resizing, which use the same regularization for all image sizes, our adaptive regularization improves the accuracy by 0.7%. Figure 6 further compares the training curves for the progressive approach. Our adaptive regularization uses much weaker regularization for small images in the early training epochs, allowing models to converge faster and achieve better final accuracy.

Table 13. Adaptive regularization – We compare ImageNet top-1 accuracy based on the average of three runs.

Setting | Vanilla | +our adaptive reg
Progressive resize (Howard, 2018) | 84.3±0.14 | 85.1±0.07 (+0.8)
Random resize (Hoffer et al., 2019) | 83.5±0.11 | 84.2±0.10 (+0.7)

Figure 6. Training curves of progressive resizing with and without adaptive regularization (y-axis: ImageNet top-1 accuracy (%)).
References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: A system for large-scale machine learning. OSDI, 2016.

Bello, I. LambdaNetworks: Modeling long-range interactions without attention. ICLR, 2021.

Bello, I., Fedus, W., Du, X., Cubuk, E. D., Srinivas, A., Lin, T.-Y., Shlens, J., and Zoph, B. Revisiting ResNets: Improved training and scaling strategies. arXiv preprint arXiv:2103.07579, 2021.

Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum learning. ICML, 2009.

Brock, A., De, S., Smith, S. L., and Simonyan, K. High-performance large-scale image recognition without normalization. arXiv preprint arXiv:2102.06171, 2021.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. NeurIPS, 2020.

Cai, H., Zhu, L., and Han, S. ProxylessNAS: Direct neural architecture search on target task and hardware. ICLR, 2019.

Chen, Y., Yang, T., Zhang, X., Meng, G., Pan, C., and Sun, J. DetNAS: Neural architecture search on object detection. NeurIPS, 2019.

Cubuk, E. D., Zoph, B., Shlens, J., and Le, Q. V. RandAugment: Practical automated data augmentation with a reduced search space. ECCV, 2020.

Dong, X., Tan, M., Yu, A. W., Peng, D., Gabrys, B., and Le, Q. V. AutoHAS: Efficient hyperparameter and architecture search. arXiv preprint arXiv:2006.03656, 2020.

Gupta, S. and Akin, B. Accelerator-aware neural network design using AutoML. On-device Intelligence Workshop in SysML, 2020.

Gupta, S. and Tan, M. EfficientNet-EdgeTPU: Creating accelerator-optimized neural networks with AutoML. https://ai.googleblog.com/2019/08/efficientnet-edgetpu-creating.html, 2019.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. CVPR, pp. 770–778, 2016.

Hoffer, E., Weinstein, B., Hubara, I., Ben-Nun, T., Hoefler, T., and Soudry, D. Mix & Match: training convnets with mixed image sizes for improved accuracy, speed and scale resiliency. arXiv preprint arXiv:1908.08986, 2019.

Howard, J. Training ImageNet in 3 hours for $25. https://www.fast.ai/2018/04/30/dawnbench-fastai/, 2018.

Huang, G., Sun, Y., Liu, Z., Sedra, D., and Weinberger, K. Q. Deep networks with stochastic depth. ECCV, pp. 646–661, 2016.

Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. CVPR, 2017.

Huang, Y., Cheng, Y., Chen, D., Lee, H., Ngiam, J., Le, Q. V., and Chen, Z. GPipe: Efficient training of giant neural networks using pipeline parallelism. NeurIPS, 2019.

Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. ICLR, 2018.

Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, J., Gelly, S., and Houlsby, N. Big Transfer (BiT): General visual representation learning. ECCV, 2020.

Krause, J., Deng, J., Stark, M., and Fei-Fei, L. Collecting a large-scale dataset of fine-grained cars. Second Workshop on Fine-Grained Visual Categorization, 2013.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical Report, 2009.

Li, S., Tan, M., Pang, R., Li, A., Cheng, L., Le, Q., and Jouppi, N. Searching for fast model families on datacenter accelerators. arXiv preprint arXiv:2102.05610, 2021.

Liu, C., Chen, L.-C., Schroff, F., Adam, H., Hua, W., Yuille, A., and Fei-Fei, L. Auto-DeepLab: Hierarchical neural architecture search for semantic image segmentation. CVPR, 2019.

Mahajan, D., Girshick, R., Ramanathan, V., He, K., Paluri, M., Li, Y., Bharambe, A., and van der Maaten, L. Exploring the limits of weakly supervised pretraining. arXiv preprint arXiv:1805.00932, 2018.

Nilsback, M.-E. and Zisserman, A. Automated flower classification over a large number of classes. ICVGIP, pp. 722–729, 2008.

Press, O., Smith, N. A., and Lewis, M. Shortformer: Better language modeling using shorter inputs. arXiv preprint arXiv:2012.15832, 2021.

Radosavovic, I., Kosaraju, R. P., Girshick, R., He, K., and Dollár, P. Designing network design spaces. CVPR, 2020.

Ridnik, T., Lawen, H., Noy, A., Baruch, E. B., Sharir, G., and Friedman, I. TResNet: High performance GPU-dedicated architecture. arXiv preprint arXiv:2003.13630, 2020.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. MobileNetV2: Inverted residuals and linear bottlenecks. CVPR, 2018.

Sifre, L. Rigid-motion scattering for image classification. Ph.D. thesis, section 6.2, 2014.

Srinivas, A., Lin, T.-Y., Parmar, N., Shlens, J., Abbeel, P., and Vaswani, A. Bottleneck transformers for visual recognition. arXiv preprint arXiv:2101.11605, 2021.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Tan, M. and Le, Q. V. EfficientNet: Rethinking model scaling for convolutional neural networks. ICML, 2019a.

Tan, M. and Le, Q. V. MixConv: Mixed depthwise convolutional kernels. BMVC, 2019b.

Tan, M., Chen, B., Pang, R., Vasudevan, V., and Le, Q. V. MnasNet: Platform-aware neural architecture search for mobile. CVPR, 2019.

Tan, M., Pang, R., and Le, Q. V. EfficientDet: Scalable and efficient object detection. CVPR, 2020.

Touvron, H., Vedaldi, A., Douze, M., and Jégou, H. Fixing the train-test resolution discrepancy. arXiv preprint arXiv:1906.06423, 2019.

Touvron, H., Vedaldi, A., Douze, M., and Jégou, H. Fixing the train-test resolution discrepancy: FixEfficientNet. arXiv preprint arXiv:2003.08237, 2020.

Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., and Jégou, H. Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877, 2021.

Wightman, R. PyTorch image models. https://github.com/rwightman/pytorch-image-models, accessed on Feb. 18, 2021.

Wu, B., Dai, X., Zhang, P., Wang, Y., Sun, F., Wu, Y., Tian, Y., Vajda, P., Jia, Y., and Keutzer, K. FBNet: Hardware-aware efficient convnet design via differentiable neural architecture search. CVPR, 2019.

Xie, Q., Luong, M.-T., Hovy, E., and Le, Q. V. Self-training with noisy student improves ImageNet classification. CVPR, 2020.

Xiong, Y., Liu, H., Gupta, S., Akin, B., Bender, G., Kindermans, P.-J., Tan, M., Singh, V., and Chen, B. MobileDets: Searching for object detection architectures for mobile accelerators. arXiv preprint arXiv:2004.14525, 2020.

Yu, H., Liu, A., Liu, X., Li, G., Luo, P., Cheng, R., Yang, J., and Zhang, C. PDA: Progressive data augmentation for general robustness of deep neural networks. arXiv preprint arXiv:1909.04839, 2019.

Yuan, L., Chen, Y., Wang, T., Yu, W., Shi, Y., Tay, F. E., Feng, J., and Yan, S. Tokens-to-Token ViT: Training vision transformers from scratch on ImageNet. arXiv preprint arXiv:2101.11986, 2021.

Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. Mixup: Beyond empirical risk minimization. ICLR, 2018.

Zhang, H., Wu, C., Zhang, Z., Zhu, Y., Lin, H., Zhang, Z., Sun, Y., He, T., Mueller, J., Manmatha, R., Li, M., and Smola, A. ResNeSt: Split-attention networks. arXiv preprint arXiv:2012.12877, 2020.

Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition. CVPR, 2018.
Appendix
A1. Source Images
Figure 7 provides the original Panda and Snoek images used for mixup in Figure 4. These two images are first augmented
with RandAug and then combined with Mixup to generate the final image in Figure 4.
Table 15. Throughput (imgs/sec/gpu) comparison (higher is better). All models have similar ImageNet top-1 accuracy.

Model | batch=1 | batch=4 | batch=16 | batch=64 | batch=256
EfficientNetV2-M | 23 | 91 | 281 | 350 | 352
EfficientNet-B7 | 24 | 80 | 94 | 108 | OOM
NFNet-F2 | 16 | 62 | 129 | 189 | 190
Batch size is another often overlooked factor. We observe that different models can behave differently as the batch size changes. Table 15 shows a few examples: compared to NFNet and EfficientNetV2-M, the latency of EfficientNet-B7 becomes significantly worse at large batch sizes. We suspect this is because EfficientNet-B7 uses a much larger image size, leading to expensive memory access overhead. This paper follows previous works (Zhang et al., 2020) and uses batch size 16 by default unless explicitly specified.