DSFD: Dual Shot Face Detector


Jian Li† Yabiao Wang‡ Changan Wang‡ Ying Tai‡ Jianjun Qian†∗ Jian Yang†∗ Chengjie Wang‡ Jilin Li‡ Feiyue Huang‡

†PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education,
Jiangsu Key Lab of Image and Video Understanding for Social Security,
School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China
‡Youtu Lab, Tencent

[email protected], {csjqian, csjyang}@njust.edu.cn
{casewang, changanwang, yingtai, jasoncjwang, jerolinli, garyhuang}@tencent.com

Figure 1: Visual results. Our method is robust to variations in scale, blur, illumination, pose, occlusion, reflection and makeup.

Abstract

In this paper, we propose a novel face detection network with three novel contributions that address three key aspects of face detection: better feature learning, progressive loss design, and anchor assign based data augmentation. First, we propose a Feature Enhance Module (FEM) for enhancing the original feature maps to extend the single shot detector to a dual shot detector. Second, we adopt Progressive Anchor Loss (PAL), computed by two different sets of anchors, to effectively facilitate the features. Third, we use Improved Anchor Matching (IAM), which integrates a novel anchor assign strategy into data augmentation to provide better initialization for the regressor. Since these techniques are all related to the two-stream design, we name the proposed network the Dual Shot Face Detector (DSFD). Extensive experiments on the popular benchmarks WIDER FACE and FDDB demonstrate the superiority of DSFD over the state-of-the-art face detectors.

1. Introduction

Face detection is a fundamental step for various facial applications, like face alignment [26], parsing [3], recognition [34], and verification [6]. As the pioneering work for face detection, Viola-Jones [29] adopts the AdaBoost algorithm with hand-crafted features, which have since been replaced by deeply learned features from convolutional neural networks (CNN) [10] that achieve great progress.

∗Jianjun Qian and Jian Yang are corresponding authors. This work was supported by the National Science Fund of China under Grant Nos. 61876083, U1713208, and Program for Changjiang Scholars. This work was done when Jian Li was an intern at Tencent Youtu Lab.

Although CNN-based face detectors have been extensively studied, detecting faces with a high degree of variability in scale, pose, occlusion, expression, appearance and illumination in real-world scenarios remains a challenge.

Previous state-of-the-art face detectors can be roughly divided into two categories. The first one is mainly based on the Region Proposal Network (RPN) adopted in Faster R-CNN [24] and employs two-stage detection schemes [30, 33, 36]. RPN is trained end-to-end and generates high-quality region proposals which are further refined by the Fast R-CNN detector. The other one is the Single Shot Detector (SSD) [20] based family of one-stage methods, which get rid of RPN and directly predict the bounding boxes and confidences [4, 27, 39]. Recently, the one-stage face detection framework has attracted more attention due to its higher inference efficiency and straightforward system deployment.

Despite the progress achieved by the above methods, some problems still exist in three aspects:

Feature learning The feature extraction part is essential for a face detector. Currently, the Feature Pyramid Network (FPN) [17] is widely used in state-of-the-art face detectors for rich features. However, FPN just aggregates hierarchical feature maps between high and low-level output layers, which does not consider the current layer's information, and the context relationship between anchors is ignored.

Loss design The conventional loss functions used in object detection include a regression loss for the face region and a classification loss for identifying whether a face is detected or not. To further address the class imbalance problem, Lin et al. [18] propose Focal Loss to focus training on a sparse set of hard examples. To use all original and enhanced features, Zhang et al. propose Hierarchical Loss to effectively learn the network [37]. However, the above loss functions do not consider the progressive learning ability of feature maps across different levels and shots.

Anchor matching Basically, pre-set anchors for each feature map are generated by regularly tiling a collection of boxes with different scales and aspect ratios on the image. Some works [27, 39] analyze a series of reasonable anchor scales and an anchor compensation strategy to increase positive anchors. However, such a strategy ignores random sampling in data augmentation, which still causes an imbalance between positive and negative anchors.

In this paper, we propose three novel techniques to address the above three issues, respectively. First, we introduce a Feature Enhance Module (FEM) to enhance the discriminability and robustness of the features, which combines the advantages of the FPN in PyramidBox and the Receptive Field Block (RFB) in RFBNet [19]. Second, motivated by the hierarchical loss [37] and the pyramid anchor [27] in PyramidBox, we design Progressive Anchor Loss (PAL) that uses progressive anchor sizes for not only different levels, but also different shots. Specifically, we assign smaller anchor sizes in the first shot, and use larger sizes in the second shot. Third, we propose Improved Anchor Matching (IAM), which integrates an anchor partition strategy and anchor-based data augmentation to better match anchors and ground truth faces, and thus provides better initialization for the regressor. The three aspects are complementary, so these techniques can work together to further improve performance. Besides, since these techniques are all related to the two-stream design, we name the proposed network the Dual Shot Face Detector (DSFD). Fig. 1 shows the effectiveness of DSFD under various variations, especially on extremely small faces and heavily occluded faces.

In summary, the main contributions of this paper include:
• A novel Feature Enhance Module to utilize different level information and thus obtain more discriminative and robust features.
• Auxiliary supervisions introduced in early layers via a set of smaller anchors to effectively facilitate the features.
• An improved anchor matching strategy to match anchors and ground truth faces as far as possible to provide better initialization for the regressor.
• Comprehensive experiments conducted on the popular benchmarks FDDB and WIDER FACE to demonstrate the superiority of our proposed DSFD network compared with the state-of-the-art methods.

2. Related work

We review the prior works from three perspectives.

Feature Learning Early works on face detection mainly rely on hand-crafted features, such as Haar-like features [29], control point set [1], and edge orientation histograms [13]. However, the design of hand-crafted features lacks guidance. With the great progress of deep learning, hand-crafted features have been replaced by Convolutional Neural Networks (CNN). For example, Overfeat [25], Cascade-CNN [14], and MTCNN [38] adopt a CNN as a sliding window detector on an image pyramid to build a feature pyramid. However, using an image pyramid is slow and memory inefficient. As a result, most two-stage detectors extract features at a single scale. R-CNN [7, 8] obtains region proposals by selective search [28], and then forwards each normalized image region through a CNN to classify it. Faster R-CNN [24] and R-FCN [5] employ a Region Proposal Network (RPN) to generate initial region proposals. Besides, RoI-pooling [24] and position-sensitive RoI pooling [5] are applied to extract features from each region.

More recently, some research indicates that multi-scale features perform better for tiny objects. Specifically, SSD [20], MS-CNN [2], SSH [23], and S3FD [39] predict boxes on multiple layers of the feature hierarchy. FCN [22], Hypercolumns [9], and Parsenet [21] fuse multiple layer features in segmentation. FPN [15, 17], a top-down architecture, integrates high-level semantic information into all scales.
Figure 2: Our DSFD framework uses a Feature Enhance Module (b) on top of a feedforward VGG/ResNet architecture to generate the enhanced features (c) from the original features (a), along with two loss layers named first shot PAL for the original features and second shot PAL for the enhanced features. With a 640×640 input image, the six detection layers (conv3_3, conv4_3, conv5_3, conv_fc7, conv6_2, conv7_2) yield feature maps of sizes 160×160, 80×80, 40×40, 20×20, 10×10 and 5×5.
FPN-based methods, such as FAN [31] and PyramidBox [27], achieve significant improvement on detection. However, these methods do not consider the current layer's information. Different from the above methods that ignore the context relationship between anchors, we propose a feature enhance module that incorporates multi-level dilated convolutional layers to enhance the semantics of the features.

Loss Design Generally, the objective loss in detection is a weighted sum of a classification loss (e.g. softmax loss) and a box regression loss (e.g. L2 loss). Girshick et al. [7] propose the smooth L1 loss to prevent exploding gradients. Lin et al. [18] discover that class imbalance is one obstacle to better performance in one-stage detectors, hence they propose focal loss, a dynamically scaled cross entropy loss. Besides, Wang et al. [32] design RepLoss for pedestrian detection, which improves performance in occlusion scenarios. FANet [37] creates a hierarchical feature pyramid and presents a hierarchical loss for its architecture. However, the anchors used in FANet are kept the same size in different stages. In this work, we adaptively choose different anchor sizes in different stages to facilitate the features.

Anchor Matching To make the model more robust, most detection methods [20, 35, 39] use data augmentation, such as color distortion, horizontal flipping, random crop and multi-scale training. Zhang et al. [39] propose an anchor compensation strategy so that tiny faces match enough anchors during training. Wang et al. [35] propose random crop to generate a large number of occluded faces for training. However, these methods ignore random sampling in data augmentation, while ours combines it with anchor assignment to provide better data initialization for anchor matching.

3. Dual Shot Face Detector

We first introduce the pipeline of our proposed framework, DSFD, and then describe in detail our feature enhance module in Sec. 3.2, progressive anchor loss in Sec. 3.3 and improved anchor matching in Sec. 3.4, respectively.

3.1. Pipeline of DSFD

The framework of DSFD is illustrated in Fig. 2. Our architecture uses the same extended VGG16 backbone as PyramidBox [27] and S3FD [39], which is truncated before the classification layers and supplemented with some auxiliary structures. We select conv3_3, conv4_3, conv5_3, conv_fc7, conv6_2 and conv7_2 as the first shot detection layers to generate six original feature maps named of1, of2, of3, of4, of5, of6. Then, our proposed FEM transfers these original feature maps into six enhanced feature maps named ef1, ef2, ef3, ef4, ef5, ef6, which have the same sizes as the original ones and are fed into SSD-style heads to construct the second shot detection layers. Note that the input size of the training image is 640, which means the feature map size ranges from 160 in the lowest-level layer to 5 in the highest-level layer. Different from S3FD and PyramidBox, after we utilize the receptive field enlargement in FEM and the new anchor design strategy, it is unnecessary for the three sizes of stride, anchor and receptive field to satisfy the equal-proportion interval principle. Therefore, our DSFD is more flexible and robust. Besides, the original and enhanced shots have two different losses, respectively named the First Shot progressive anchor Loss (FSL) and the Second Shot progressive anchor Loss (SSL).

Table 1: The stride size, feature map size, anchor scale, ratio, and number of the six original/enhanced features for the two shots.

Feature | Stride | Size | Scale | Ratio | Number
ef1 (of1) | 4 | 160 × 160 | 16 (8) | 1.5 : 1 | 25600
ef2 (of2) | 8 | 80 × 80 | 32 (16) | 1.5 : 1 | 6400
ef3 (of3) | 16 | 40 × 40 | 64 (32) | 1.5 : 1 | 1600
ef4 (of4) | 32 | 20 × 20 | 128 (64) | 1.5 : 1 | 400
ef5 (of5) | 64 | 10 × 10 | 256 (128) | 1.5 : 1 | 100
ef6 (of6) | 128 | 5 × 5 | 512 (256) | 1.5 : 1 | 25
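To make the anchor layout in Table 1 concrete, the following is a minimal Python sketch, not the authors' released code, of how one anchor per feature-map cell could be tiled over a 640×640 input; treating the scale in Table 1 as the anchor width and applying the 1.5:1 ratio to the height is our assumption:

    import numpy as np

    # Strides and second-shot anchor scales from Table 1; the first shot
    # uses half of each scale (the parenthesized values in the table).
    STRIDES = [4, 8, 16, 32, 64, 128]
    SCALES = [16, 32, 64, 128, 256, 512]
    RATIO = 1.5  # assumed height:width = 1.5:1, from face scale statistics

    def tile_anchors(input_size=640, first_shot=False):
        """Tile one anchor per feature-map cell, centered on the cell."""
        anchors = []
        for stride, scale in zip(STRIDES, SCALES):
            s = scale // 2 if first_shot else scale
            fmap = input_size // stride  # 160, 80, 40, 20, 10, 5
            w, h = s, s * RATIO
            for i in range(fmap):
                for j in range(fmap):
                    cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
                    anchors.append([cx, cy, w, h])
        return np.array(anchors, dtype=np.float32)

    # 160^2 + 80^2 + 40^2 + 20^2 + 10^2 + 5^2 = 34125 anchors per shot,
    # matching the per-layer counts in the Number column of Table 1.
    print(tile_anchors().shape)  # (34125, 4)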
3.2. Feature Enhance Module

The Feature Enhance Module (FEM for short) is able to enhance the original features to make them more discriminable and robust. For enhancing an original neuron cell oc_{(i,j,l)}, FEM utilizes different dimension information, including the upper layer original neuron cell oc_{(i,j,l+1)} and the current layer non-local neuron cells nc_{(i-ε,j-ε,l)}, nc_{(i-ε,j,l)}, ..., nc_{(i,j+ε,l)}, nc_{(i+ε,j+ε,l)}. Specifically, the enhanced neuron cell ec_{(i,j,l)} can be mathematically defined as follows:

ec_{(i,j,l)} = f_{concat}(f_{dilation}(nc_{(i,j,l)})), \quad nc_{(i,j,l)} = f_{prod}(oc_{(i,j,l)}, f_{up}(oc_{(i,j,l+1)})),   (1)

where c_{(i,j,l)} is a cell located at coordinate (i, j) of the feature maps in the l-th layer, and f denotes a set of basic dilated convolution, element-wise product, up-sampling and concatenation operations. Fig. 3 illustrates the idea of FEM, which is inspired by FPN [17] and RFB [19]. Here, we first use a 1×1 convolutional kernel to normalize the feature maps. Then, we up-sample the upper feature maps to perform an element-wise product with the current ones. Finally, we split the feature maps into three parts, followed by three sub-networks containing different numbers of dilated convolutional layers.

Figure 3: Illustration of the Feature Enhance Module, in which the current feature map cell interacts with its neighbors in the current feature maps and the up feature maps (1×1 conv and up-sampling, element-wise product, dilated 3×3 convolutions with rate 3 over three channel parts, then concatenation).
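As a concrete illustration of Eq. (1) and Fig. 3, below is a minimal PyTorch sketch of such a module; it is not the authors' released implementation, and the channel arithmetic, the ReLU placement and the choice of 1, 2 and 3 stacked dilated convolutions per branch are our assumptions:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FEM(nn.Module):
        """Sketch of the Feature Enhance Module: 1x1 convs normalize the
        current and upper feature maps, the upper map is up-sampled and
        multiplied element-wise into the current one, and the product is
        split into three parts processed by dilated 3x3 sub-networks."""
        def __init__(self, channels):
            super().__init__()
            assert channels % 3 == 0
            self.norm_cur = nn.Conv2d(channels, channels, kernel_size=1)
            self.norm_up = nn.Conv2d(channels, channels, kernel_size=1)
            third = channels // 3
            def branch(n):  # n stacked dilated 3x3 convs (rate 3)
                layers = []
                for _ in range(n):
                    layers += [nn.Conv2d(third, third, 3, padding=3, dilation=3),
                               nn.ReLU(inplace=True)]
                return nn.Sequential(*layers)
            self.branches = nn.ModuleList([branch(n) for n in (1, 2, 3)])

        def forward(self, cur, up):
            up = F.interpolate(self.norm_up(up), size=cur.shape[-2:],
                               mode='bilinear', align_corners=False)
            nc = self.norm_cur(cur) * up       # f_prod(oc_l, f_up(oc_{l+1}))
            parts = torch.chunk(nc, 3, dim=1)  # split into three parts
            out = [b(p) for b, p in zip(self.branches, parts)]
            return torch.cat(out, dim=1)       # f_concat(f_dilation(nc))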
3.3. Progressive Anchor Loss

Different from the traditional detection loss, we design progressive anchor sizes for not only different levels, but also different shots in our framework. Motivated by the statement in [24] that low-level features are more suitable for small faces, we assign smaller anchor sizes in the first shot, and use larger sizes in the second shot. First, our Second Shot anchor-based multi-task Loss function is defined as:

L_{SSL}(p_i, p_i^*, t_i, g_i, a_i) = \frac{1}{N_{conf}} \sum_i L_{conf}(p_i, p_i^*) + \frac{\beta}{N_{loc}} \sum_i p_i^* L_{loc}(t_i, g_i, a_i),   (2)

where N_{conf} and N_{loc} indicate the number of positive and negative anchors, and the number of positive anchors, respectively; L_{conf} is the softmax loss over two classes (face vs. background), and L_{loc} is the smooth L1 loss between the parameterizations of the predicted box t_i and the ground-truth box g_i using the anchor a_i. When p_i^* = 1 (p_i^* ∈ {0, 1}), the anchor a_i is positive and the localization loss is activated; β is a weight to balance the effects of the two terms. Compared to the enhanced feature maps at the same level, the original feature maps have less semantic information for classification but more high-resolution location information for detection. Therefore, we believe that the original feature maps can detect and classify smaller faces. As a result, we propose the First Shot multi-task Loss with a set of smaller anchors as follows:

L_{FSL}(p_i, p_i^*, t_i, g_i, sa_i) = \frac{1}{N_{conf}} \sum_i L_{conf}(p_i, p_i^*) + \frac{\beta}{N_{loc}} \sum_i p_i^* L_{loc}(t_i, g_i, sa_i),   (3)

where sa indicates the smaller anchors in the first shot layers. The two shot losses can be weighted and summed into a whole Progressive Anchor Loss as follows:

L_{PAL} = L_{FSL}(sa) + λ L_{SSL}(a).   (4)

Note that the anchor size in the first shot is half of that in the second shot, and λ is a weight factor. The detailed assignment of the anchor sizes is described in Sec. 3.4. In the prediction process, we only use the output of the second shot, which means no additional computational cost is introduced.
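To make Eqs. (2)-(4) concrete, here is a minimal sketch of the loss computation; the tensor layout, the use of plain cross entropy, and the assumption that `labels` already contains only the anchors selected for the confidence loss are ours, not the paper's released code:

    import torch
    import torch.nn.functional as F

    def multitask_loss(cls_logits, loc_preds, labels, loc_targets, beta=1.0):
        """Eqs. (2)/(3): softmax loss over face vs. background, plus a
        smooth-L1 term activated only on positive anchors (labels == 1)."""
        pos = labels == 1
        n_conf = max(labels.numel(), 1)  # positive + negative anchors
        n_loc = max(int(pos.sum()), 1)   # positive anchors only
        l_conf = F.cross_entropy(cls_logits, labels, reduction='sum') / n_conf
        l_loc = F.smooth_l1_loss(loc_preds[pos], loc_targets[pos],
                                 reduction='sum') / n_loc
        return l_conf + beta * l_loc

    def progressive_anchor_loss(first_shot, second_shot, lam=1.0):
        """Eq. (4): L_PAL = L_FSL(sa) + lambda * L_SSL(a). Each argument is
        a tuple (cls_logits, loc_preds, labels, loc_targets); the first-shot
        targets are computed w.r.t. the half-size anchors."""
        return multitask_loss(*first_shot) + lam * multitask_loss(*second_shot)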
3.4. Improved Anchor Matching

The current anchor matching method is bidirectional between the anchor and the ground-truth face. Therefore, anchor design and face sampling during augmentation are collaborative in matching the anchors and faces as far as possible for better initialization of the regressor. Our IAM targets the contradiction between the discrete anchor scales and the continuous face scales, in which the faces are augmented by S_{input} ∗ S_{face}/S_{anchor} (S indicates the spatial size) with a probability of 40%, so as to increase the positive anchors, stabilize the training and thus improve the results. Table 1 shows the details of our anchor design and how each feature map cell is associated with a fixed-shape anchor. We set the anchor ratio to 1.5:1 based on face scale statistics. The anchor size for the original feature is one half of that for the enhanced feature. Additionally, with a probability of 2/5, we utilize anchor-based sampling like the data-anchor-sampling in PyramidBox, which randomly selects a face in an image, crops a sub-image containing the face, and sets the size ratio between the sub-image and the selected face to 640/rand(16, 32, 64, 128, 256, 512). For the remaining 3/5 probability, we adopt data augmentation similar to SSD [20]. In order to improve the recall rate of faces and ensure the anchor classification ability simultaneously, we set the Intersection-over-Union (IoU) threshold to 0.4 to assign anchors to their ground-truth faces.
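The anchor-based sampling branch described above could be sketched as follows; choosing a square crop centered on the selected face, and leaving out boundary clipping, are simplifications of ours rather than the paper's exact procedure:

    import math
    import random

    ANCHOR_SCALES = [16, 32, 64, 128, 256, 512]

    def data_anchor_crop(face_box, input_size=640):
        """With probability 2/5: pick a target anchor scale, then crop a
        square whose side is face_size * input_size / target, so that after
        resizing the crop to input_size the face is about `target` pixels."""
        x1, y1, x2, y2 = face_box
        face_size = math.sqrt((x2 - x1) * (y2 - y1))
        target = random.choice(ANCHOR_SCALES)
        crop_size = face_size * input_size / target
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        return (cx - crop_size / 2, cy - crop_size / 2,
                cx + crop_size / 2, cy + crop_size / 2)

This pulls augmented face scales onto the discrete anchor scales of Table 1, which is how IAM increases the number of positive anchors.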
4. Experiments

4.1. Implementation Details

First, we present the details of implementing our network. The backbone networks are initialized by VGG/ResNet pre-trained on ImageNet. All newly added convolution layers' parameters are initialized by the 'xavier' method. We use SGD with 0.9 momentum and 0.0005 weight decay to fine-tune our DSFD model. The batch size is set to 16. The learning rate is set to 10^-3 for the first 40k steps, and we decay it to 10^-4 and 10^-5 for two further 10k-step stages.

During inference, the first shot's outputs are ignored and the second shot predicts the top 5k high-confidence detections. Non-maximum suppression is applied with a Jaccard overlap of 0.3 to produce the top 750 high-confidence bounding boxes per image. For the 4 bounding box coordinates, we round down the top-left coordinates and round up the width and height to expand the detection bounding box. The official code has been released at: https://github.com/TencentYoutuResearch/FaceDetection-DSFD.
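The inference-time post-processing described above might look like the following sketch; a plain greedy NMS is shown for clarity, and the released repository may organize this differently:

    import numpy as np

    def jaccard(box, boxes):
        """Overlap between one (x1, y1, x2, y2) box and an array of boxes."""
        x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
        x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
        return inter / (area(box) + area(boxes) - inter)

    def postprocess(boxes, scores, pre_nms=5000, keep=750, thresh=0.3):
        """Keep the top-5k detections, run NMS at overlap 0.3, keep the top
        750 boxes, then expand each box by flooring the top-left corner and
        ceiling the width and height, as described in Sec. 4.1."""
        order = np.argsort(-scores)[:pre_nms]
        boxes, scores, out = boxes[order], scores[order], []
        while len(boxes) and len(out) < keep:
            b, s = boxes[0], scores[0]
            x1, y1 = np.floor(b[:2])
            w, h = np.ceil(b[2:] - np.floor(b[:2]))
            out.append((x1, y1, x1 + w, y1 + h, s))
            m = jaccard(b, boxes[1:]) <= thresh
            boxes, scores = boxes[1:][m], scores[1:][m]
        return out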
4.2. Analysis on DSFD

In this subsection, we conduct extensive experiments and ablation studies on the WIDER FACE dataset to evaluate the effectiveness of several contributions of our proposed framework, including the feature enhance module, the progressive anchor loss, and the improved anchor matching. For fair comparisons, we use the same parameter settings for all the experiments, except for the specified changes to the components. All models are trained on the WIDER FACE training set and evaluated on the validation set. To better understand DSFD, we select different baselines to ablate each component and see how it affects the final performance.

Feature Enhance Module First, we adopt the anchor design in S3FD [39] and PyramidBox [27] and six original feature maps generated by VGG16 to perform classification and regression, which is named Face SSD (FSSD) and serves as the baseline. We then use VGG16-based FSSD as the baseline to add the feature enhance module for comparison. Table 2 shows that our feature enhance module can improve VGG16-based FSSD from 92.6%, 90.2%, 79.1% to 93.0%, 91.4%, 84.6%.

Table 2: Effectiveness of the Feature Enhance Module on the AP performance.
Component | Easy | Medium | Hard
FSSD+VGG16 | 92.6% | 90.2% | 79.1%
FSSD+VGG16+FEM | 93.0% | 91.4% | 84.6%

Progressive Anchor Loss Second, we use Res50-based FSSD as the baseline to add the progressive anchor loss for comparison. We use four residual blocks' outputs in ResNet to replace the outputs of conv3_3, conv4_3, conv5_3, conv_fc7 in VGG. Except for VGG16, we do not perform layer normalization. Table 3 shows that our progressive anchor loss can improve Res50-based FSSD using FEM from 95.0%, 94.1%, 88.0% to 95.3%, 94.4%, 88.6%.

Table 3: Effectiveness of Progressive Anchor Loss on the AP performance.
Component | Easy | Medium | Hard
FSSD+RES50 | 93.7% | 92.2% | 81.8%
FSSD+RES50+FEM | 95.0% | 94.1% | 88.0%
FSSD+RES50+FEM+PAL | 95.3% | 94.4% | 88.6%

Improved Anchor Matching To evaluate our improved anchor matching strategy, we use Res101-based FSSD without anchor compensation as the baseline. Table 4 shows that our improved anchor matching can improve Res101-based FSSD using FEM from 95.8%, 95.1%, 89.7% to 96.1%, 95.2%, 90.0%. Finally, we can improve our DSFD to 96.6%, 95.7%, 90.4% with ResNet152 as the backbone.

Table 4: Effectiveness of Improved Anchor Matching on the AP performance.
Component | Easy | Medium | Hard
FSSD+RES101 | 95.1% | 93.6% | 83.7%
FSSD+RES101+FEM | 95.8% | 95.1% | 89.7%
FSSD+RES101+FEM+IAM | 96.1% | 95.2% | 90.0%
FSSD+RES101+FEM+IAM+PAL | 96.3% | 95.4% | 90.1%
FSSD+RES152+FEM+IAM+PAL | 96.6% | 95.7% | 90.4%
FSSD+RES152+FEM+IAM+PAL+LargeBS | 96.4% | 95.7% | 91.2%

Besides, Fig. 4 shows that our improved anchor matching strategy greatly increases the number of ground truth faces that are close to the anchor scales, which can reduce the contradiction between the discrete anchor scales and the continuous face scales. Moreover, Fig. 5 shows the distribution of the number of matched anchors for ground truth faces, which indicates that our improved anchor matching can significantly increase the number of matched anchors; the average number of matched anchors for different scales of faces is improved from 6.4 to about 6.9.

Figure 4: The number distribution of different scales of faces compared between traditional anchor matching (Left) and our improved anchor matching (Right).

Figure 5: Comparisons of the distribution of the number of matched anchors for ground truth faces between traditional anchor matching (blue line) and our improved anchor matching (red line). We actually set the IoU threshold to 0.35 for the traditional version. That means even with a higher threshold (i.e., 0.4), using our IAM, we can still achieve more matched anchors. Here, we choose a slightly higher threshold in IAM so as to better balance the number and quality of the matched faces.

Comparison with RFB Our FEM differs from RFB in two aspects. First, our FEM is based on FPN to make full use of feature information from different spatial levels, while RFB ignores it. Second, our FEM adopts stacked dilated convolutions in a multi-branch structure, which efficiently leads to larger Receptive Fields (RF) than RFB, which only uses one dilation layer in each branch, e.g., R^3 in FEM compared to R in RFB, where R indicates the RF of one dilated convolution. Tab. 6 clearly demonstrates the superiority of our FEM over RFB, even when RFB is equipped with FPN.

Table 6: FEM vs. RFB on WIDER FACE.
Backbone - ResNet101 (%) | Easy | Medium | Hard
DSFD (RFB) | 96.0 | 94.5 | 87.2
DSFD (FPN) / (FPN+RFB) | 96.2 / 96.2 | 95.1 / 95.3 | 89.7 / 89.9
DSFD (FEM) | 96.3 | 95.4 | 90.1
From the above analysis and results, some promising conclusions can be drawn: 1) Feature enhancement is crucial. We use a more robust and discriminative feature enhance module to improve the feature representation ability, especially for hard faces. 2) The auxiliary loss based on progressive anchors is used to train all 12 different-scale detection feature maps, and it improves the performance on easy, medium and hard faces simultaneously. 3) Our improved anchor matching provides better initial anchors and ground-truth faces for the regressor, which achieves improvements of 0.3%, 0.1%, 0.3% on the three settings, respectively. Additionally, when we enlarge the training batch size (i.e., LargeBS), the result in the hard setting can reach 91.2% AP.

Effects of Different Backbones To better understand our DSFD, we further conducted experiments to examine how different backbones affect classification and detection performance. Specifically, we use the same settings except for the feature extraction network, and implement SE-ResNet101, DPN-98, and SE-ResNeXt101_32×4d following the ResNet101 setting in our DSFD. From Table 5, DSFD with SE-ResNeXt101_32×4d got 95.7%, 94.8%, 88.9% on the easy, medium and hard settings respectively, which indicates that a more complex model and higher Top-1 ImageNet classification accuracy may not benefit face detection AP. Therefore, in our DSFD framework, better performance on classification is not necessary for better performance on detection, which is consistent with the conclusion claimed in [11, 16]. Our DSFD enjoys high inference speed, benefiting from simply using the second shot detection results. For VGA resolution inputs to Res50-based DSFD, it runs at 22 FPS on an NVIDIA P40 GPU during inference.

Table 5: Effectiveness of different backbones.
Component | Params | ACC@Top-1 | Easy | Medium | Hard
FSSD+RES101+FEM+IAM+PAL | 399M | 77.44% | 96.3% | 95.4% | 90.1%
FSSD+RES152+FEM+IAM+PAL | 459M | 78.42% | 96.6% | 95.7% | 90.4%
FSSD+SE-RES101+FEM+IAM+PAL | 418M | 78.39% | 95.7% | 94.7% | 88.6%
FSSD+DPN98+FEM+IAM+PAL | 515M | 79.22% | 96.3% | 95.5% | 90.4%
FSSD+SE-RESNeXt101_32×4d+FEM+IAM+PAL | 416M | 80.19% | 95.7% | 94.8% | 88.9%

4.3. Comparisons with State-of-the-Art Methods

We evaluate the proposed DSFD on two popular face detection benchmarks, WIDER FACE [35] and the Face Detection Data Set and Benchmark (FDDB) [12]. Our model is trained only on the training set of WIDER FACE, and is then evaluated on both benchmarks without any further fine-tuning. We also follow the similar way used in [31] to build the image pyramid for multi-scale testing, and use a more powerful backbone similar to [4].

WIDER FACE Dataset It contains 393,703 annotated faces with large variations in scale, pose and occlusion in a total of 32,203 images. For each of the 60 event classes, 40%, 10%, 50% of the images in the database are randomly selected as the training, validation and testing sets. Besides, each subset is further divided into three levels of difficulty, 'Easy', 'Medium' and 'Hard', based on the detection rate of a baseline detector. As shown in Fig. 6, our DSFD achieves the best performance among all of the state-of-the-art face detectors based on the average precision (AP) across the three subsets, i.e., 96.6% (Easy), 95.7% (Medium) and 90.4% (Hard) on the validation set, and 96.0% (Easy), 95.3% (Medium) and 90.0% (Hard) on the test set. Fig. 8 shows more examples to demonstrate the effects of DSFD on handling faces with various variations, in which the blue bounding boxes indicate that the detector confidence is above 0.8.

Figure 6: Precision-recall curves on the WIDER FACE validation and testing subsets (easy, medium and hard).

Figure 8: Illustration of our DSFD under various large variations on scale, pose, occlusion, blur, makeup, illumination, modality and reflection. Blue bounding boxes indicate the detector confidence is above 0.8.

FDDB Dataset It contains 5,171 faces in 2,845 images taken from the Faces in the Wild data set. Since WIDER FACE has bounding box annotations while faces in FDDB are represented by ellipses, we learn a post-hoc ellipse regressor to transform the final prediction results. As shown in Fig. 7, our DSFD achieves state-of-the-art performance on both the discontinuous and continuous ROC curves, i.e. 99.1% and 86.2% when the number of false positives equals 1,000. After adding additional annotations to those unlabeled faces [39], the false positives of our model can be further reduced, outperforming all other methods.

Figure 7: Comparisons with popular state-of-the-art methods on the FDDB dataset. The first row shows the ROC results without additional annotations, and the second row shows the ROC results with additional annotations.

5. Conclusions

This paper introduces a novel face detector named the Dual Shot Face Detector (DSFD). In this work, we propose a novel Feature Enhance Module that utilizes different level information and thus obtains more discriminative and robust features. Auxiliary supervisions introduced in early layers by using smaller anchors are adopted to effectively facilitate the features. Moreover, an improved anchor matching method is introduced to match anchors and ground truth faces as far as possible to provide better initialization for the regressor. Comprehensive experiments are conducted on the popular face detection benchmarks FDDB and WIDER FACE to demonstrate the superiority of our proposed DSFD compared with the state-of-the-art face detectors, e.g., SRN and PyramidBox.
References

[1] Yotam Abramson, Bruno Steux, and Hicham Ghorayeb. Yet even faster (yef) real-time object detection. International Journal of Intelligent Systems Technologies and Applications, 2(2-3):102–112, 2007. 2
[2] Zhaowei Cai, Quanfu Fan, Rogerio S Feris, and Nuno Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In Proceedings of European Conference on Computer Vision (ECCV), 2016. 2
[3] Yu Chen, Ying Tai, Xiaoming Liu, Chunhua Shen, and Jian Yang. Fsrnet: End-to-end learning face super-resolution with facial priors. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 1
[4] Cheng Chi, Shifeng Zhang, Junliang Xing, Zhen Lei, Stan Z Li, and Xudong Zou. Selective refinement network for high performance face detection. In Proceedings of Association for the Advancement of Artificial Intelligence (AAAI), 2019. 2, 7
[5] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-fcn: Object detection via region-based fully convolutional networks. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2016. 2
[6] Jiankang Deng, Jia Guo, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. arXiv:1801.07698v1, 2018. 1
[7] Ross Girshick. Fast r-cnn. In Proceedings of IEEE International Conference on Computer Vision (ICCV), 2015. 2, 3
[8] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 580–587, 2014. 2
[9] Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Hypercolumns for object segmentation and fine-grained localization. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. 2
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 1
[11] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, and Kevin Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 7
[12] Vidit Jain and Erik Learned-Miller. Fddb: A benchmark for face detection in unconstrained settings. Technical Report UM-CS-2010-009, University of Massachusetts, Amherst, 2010. 7
[13] Kobi Levi and Yair Weiss. Learning object detection from a small number of examples: the importance of good features. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2004. 2
[14] Haoxiang Li, Zhe Lin, Xiaohui Shen, Jonathan Brandt, and Gang Hua. A convolutional neural network cascade for face detection. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. 2
[15] Jian Li, Jianjun Qian, and Jian Yang. Object detection via feature fusion based single network. In IEEE International Conference on Image Processing (ICIP), 2017. 2
[16] Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, and Jian Sun. Detnet: A backbone network for object detection. In Proceedings of European Conference on Computer Vision (ECCV), 2018. 7
[17] Tsung-Yi Lin, Piotr Dollár, Ross B Girshick, Kaiming He, Bharath Hariharan, and Serge J Belongie. Feature pyramid networks for object detection. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 2, 4
[18] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of IEEE International Conference on Computer Vision (ICCV), 2017. 2, 3
[19] Songtao Liu, Di Huang, and Yunhong Wang. Receptive field block net for accurate and fast object detection. In Proceedings of European Conference on Computer Vision (ECCV), 2018. 2, 4
[20] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In Proceedings of European Conference on Computer Vision (ECCV), 2016. 2, 3, 5
[21] Wei Liu, Andrew Rabinovich, and Alexander Berg. Parsenet: Looking wider to see better. In Proceedings of International Conference on Learning Representations Workshop, 2016. 2
[22] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. 2
[23] Mahyar Najibi, Pouya Samangouei, Rama Chellappa, and Larry S Davis. Ssh: Single stage headless face detector. In Proceedings of IEEE International Conference on Computer Vision (ICCV), 2017. 2
[24] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2015. 2, 4
[25] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In Proceedings of International Conference on Learning Representations (ICLR), 2014. 2
[26] Ying Tai, Yicong Liang, Xiaoming Liu, Lei Duan, Jilin Li, Chengjie Wang, Feiyue Huang, and Yu Chen. Towards highly accurate and stable face alignment for high-resolution videos. In The AAAI Conference on Artificial Intelligence (AAAI), 2019. 1
[27] Xu Tang, Daniel K Du, Zeqiang He, and Jingtuo Liu. Pyramidbox: A context-assisted single shot face detector. In Proceedings of European Conference on Computer Vision (ECCV), 2018. 2, 3, 5
[28] Jasper RR Uijlings, Koen EA Van De Sande, Theo Gevers, and Arnold WM Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013. 2
[29] Paul Viola and Michael J Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004. 1, 2
[30] Hao Wang, Zhifeng Li, Xing Ji, and Yitong Wang. Face r-cnn. arXiv preprint arXiv:1706.01061, 2017. 2
[31] Jianfeng Wang, Ye Yuan, and Gang Yu. Face attention network: An effective face detector for the occluded faces. arXiv preprint arXiv:1711.07246, 2017. 3, 7
[32] Xinlong Wang, Tete Xiao, Yuning Jiang, Shuai Shao, Jian Sun, and Chunhua Shen. Repulsion loss: Detecting pedestrians in a crowd. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 3
[33] Yitong Wang, Xing Ji, Zheng Zhou, Hao Wang, and Zhifeng Li. Detecting faces using region-based fully convolutional networks. arXiv preprint arXiv:1709.05256, 2017. 2
[34] Jian Yang, Lei Luo, Jianjun Qian, Ying Tai, Fanlong Zhang, and Yong Xu. Nuclear norm based matrix regression with applications to face recognition with occlusion and illumination changes. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 39(1):156–171, 2017. 1
[35] Shuo Yang, Ping Luo, Chen-Change Loy, and Xiaoou Tang. Wider face: A face detection benchmark. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 3, 7
[36] Changzheng Zhang, Xiang Xu, and Dandan Tu. Face detection using improved faster rcnn. arXiv preprint arXiv:1802.02142, 2018. 2
[37] Jialiang Zhang, Xiongwei Wu, Jianke Zhu, and Steven CH Hoi. Feature agglomeration networks for single stage face detection. arXiv preprint arXiv:1712.00721, 2017. 2, 3
[38] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, 2016. 2
[39] Shifeng Zhang, Xiangyu Zhu, Zhen Lei, Hailin Shi, Xiaobo Wang, and Stan Z Li. S^3fd: Single shot scale-invariant face detector. In Proceedings of IEEE International Conference on Computer Vision (ICCV), 2017. 2, 3, 5, 8
