DSFD: Dual Shot Face Detector
‡Youtu Lab, Tencent
†[email protected], {csjqian, csjyang}@njust.edu.cn
‡{casewang, changanwang, yingtai, jasoncjwang, jerolinli, garyhuang}@tencent.com
Figure 1: Visual results. Our method is robust to variations in scale, blur, illumination, pose, occlusion, reflection and makeup.
1. Introduction
Although CNN based face detectors have been extensively studied, detecting faces with a high degree of variability in scale, pose, occlusion, expression, appearance and illumination in real-world scenarios remains a challenge.

Previous state-of-the-art face detectors can be roughly divided into two categories. The first is mainly based on the Region Proposal Network (RPN) adopted in Faster R-CNN [24] and employs two-stage detection schemes [30, 33, 36]. The RPN is trained end-to-end and generates high-quality region proposals, which are further refined by the Fast R-CNN detector. The other is the family of one-stage methods based on the Single Shot Detector (SSD) [20], which dispense with the RPN and directly predict bounding boxes and confidences [4, 27, 39]. Recently, the one-stage face detection framework has attracted more attention due to its higher inference efficiency and straightforward system deployment.

Despite the progress achieved by the above methods, problems remain in three aspects:

Feature learning. Feature extraction is essential for a face detector. Currently, the Feature Pyramid Network (FPN) [17] is widely used in state-of-the-art face detectors to obtain rich features. However, FPN only aggregates hierarchical feature maps between high- and low-level output layers; it considers neither the current layer's information nor the context relationship between anchors.

Loss design. The conventional loss functions used in object detection include a regression loss for the face region and a classification loss for identifying whether a face is detected. To further address the class imbalance problem, Lin et al. [18] propose Focal Loss to focus training on a sparse set of hard examples. To use all original and enhanced features, Zhang et al. propose Hierarchical Loss to effectively learn the network [37]. However, the above loss functions do not consider the progressive learning ability of feature maps across different levels and shots.

Anchor matching. Basically, pre-set anchors for each feature map are generated by regularly tiling a collection of boxes with different scales and aspect ratios over the image. Some works [27, 39] analyze a series of reasonable anchor scales and an anchor compensation strategy to increase the number of positive anchors. However, such a strategy ignores random sampling in data augmentation, which still causes an imbalance between positive and negative anchors.

In this paper, we propose three novel techniques to address the above three issues, respectively. First, we introduce a Feature Enhance Module (FEM) to enhance the discriminability and robustness of the features, which combines the advantages of the FPN in PyramidBox and the Receptive Field Block (RFB) in RFBNet [19]. Second, motivated by the hierarchical loss [37] and the pyramid anchor [27] in PyramidBox, we design a Progressive Anchor Loss (PAL) that uses progressive anchor sizes not only for different levels, but also for different shots. Specifically, we assign smaller anchor sizes in the first shot and larger sizes in the second shot. Third, we propose Improved Anchor Matching (IAM), which integrates an anchor partition strategy and anchor-based data augmentation to better match anchors and ground-truth faces, and thus provides better initialization for the regressor. The three aspects are complementary, so these techniques can work together to further improve performance. Besides, since these techniques are all related to the two-stream design, we name the proposed network the Dual Shot Face Detector (DSFD). Fig. 1 shows the effectiveness of DSFD under various variations, especially on extremely small or heavily occluded faces.

In summary, the main contributions of this paper include:
• A novel Feature Enhance Module to utilize information from different levels and thus obtain more discriminative and robust features.
• Auxiliary supervisions introduced in early layers via a set of smaller anchors to effectively facilitate the features.
• An improved anchor matching strategy to match anchors and ground-truth faces as far as possible, providing better initialization for the regressor.
• Comprehensive experiments on the popular benchmarks FDDB and WIDER FACE that demonstrate the superiority of our proposed DSFD network over the state-of-the-art methods.

2. Related work

We review the prior works from three perspectives.

Feature Learning. Early works on face detection mainly rely on hand-crafted features, such as Haar-like features [29], control point sets [1] and edge orientation histograms [13]. However, the design of hand-crafted features lacks guidance. With the great progress of deep learning, hand-crafted features have been replaced by Convolutional Neural Networks (CNN). For example, OverFeat [25], Cascade-CNN [14] and MTCNN [38] adopt a CNN as a sliding-window detector on an image pyramid to build a feature pyramid. However, using an image pyramid is slow and memory inefficient. As a result, most two-stage detectors extract features at a single scale. R-CNN [7, 8] obtains region proposals by selective search [28] and then forwards each normalized image region through a CNN for classification. Faster R-CNN [24] and R-FCN [5] employ a Region Proposal Network (RPN) to generate initial region proposals. Besides, RoI-pooling [24] and position-sensitive RoI pooling [5] are applied to extract features from each region.

More recently, some research indicates that multi-scale features perform better for tiny objects. Specifically, SSD [20], MS-CNN [2], SSH [23] and S3FD [39] predict boxes on multiple layers of the feature hierarchy. FCN [22], Hypercolumns [9] and ParseNet [21] fuse features from multiple layers for segmentation. FPN [15, 17], a top-down architecture, integrates high-level semantic information into features at all scales.
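To make this top-down fusion concrete, the sketch below shows the basic upsample-and-add pattern that FPN-style detectors use to propagate high-level semantics to shallower maps. It is a minimal illustration rather than the exact FPN [17] or DSFD implementation; the channel sizes and layer count are assumptions for the example.

```python
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    """Minimal FPN-style fusion: 1x1 lateral convs + upsample-and-add.

    `in_channels` lists the channel counts of the backbone feature maps,
    ordered from the shallowest (highest resolution) to the deepest level.
    """
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.laterals = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels])
        self.smooth = nn.ModuleList(
            [nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
             for _ in in_channels])

    def forward(self, feats):
        # feats: list of backbone feature maps, shallow -> deep
        laterals = [conv(f) for conv, f in zip(self.laterals, feats)]
        # propagate semantic information from deep levels to shallow levels
        for i in range(len(laterals) - 1, 0, -1):
            upsampled = F.interpolate(laterals[i],
                                      size=laterals[i - 1].shape[-2:],
                                      mode='nearest')
            laterals[i - 1] = laterals[i - 1] + upsampled
        # 3x3 smoothing convs reduce aliasing from upsampling
        return [conv(l) for conv, l in zip(self.smooth, laterals)]
```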
Figure: (a) Original feature shot, built on the conv3_3, conv4_3, conv5_3, conv_fc7, conv6_2 and conv7_2 feature maps of the backbone applied to the input image.
Table 6: FEM vs. RFB on WIDER FACE (ResNet101 backbone, AP in %).
Method          | Easy | Medium | Hard
DSFD (RFB)      | 96.0 | 94.5   | 87.2
DSFD (FPN)      | 96.2 | 95.1   | 89.7
DSFD (FPN+RFB)  | 96.2 | 95.3   | 89.9
DSFD (FEM)      | 96.3 | 95.4   | 90.1

Comparison with RFB. Our FEM differs from RFB in two aspects. First, our FEM is based on FPN to make full use of feature information from different spatial levels, while RFB ignores it. Second, our FEM adopts stacked dilated convolutions in a multi-branch structure, which efficiently leads to larger Receptive Fields (RF) than RFB, which only uses one dilation layer in each branch, e.g., R³ in FEM compared to R in RFB, where R indicates the RF of one dilated convolution. Tab. 6 clearly demonstrates the superiority of our FEM over RFB, even when RFB is equipped with FPN.
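To illustrate the difference described above, the following is a rough PyTorch sketch of a multi-branch block whose branches stack several dilated convolutions, in contrast to an RFB-style branch that applies a single dilated layer. It is a schematic reconstruction under our own assumptions (branch count, dilation rates and channel sizes are illustrative), not the released DSFD code.

```python
import torch
import torch.nn as nn

class StackedDilationBranch(nn.Module):
    """One branch with `depth` stacked 3x3 dilated convs.

    Stacking dilated convs compounds the receptive field (the R³ vs. R
    comparison above), whereas an RFB-style branch applies a single
    dilated conv after its bottleneck.
    """
    def __init__(self, channels, dilation, depth=3):
        super().__init__()
        layers = []
        for _ in range(depth):
            layers += [nn.Conv2d(channels, channels, kernel_size=3,
                                 padding=dilation, dilation=dilation),
                       nn.ReLU(inplace=True)]
        self.branch = nn.Sequential(*layers)

    def forward(self, x):
        return self.branch(x)

class FeatureEnhanceBlock(nn.Module):
    """Multi-branch enhancement applied to an (already FPN-fused) feature map."""
    def __init__(self, channels=256, dilations=(1, 2, 3), inter_channels=64):
        super().__init__()
        self.branches = nn.ModuleList()
        for d in dilations:
            self.branches.append(nn.Sequential(
                nn.Conv2d(channels, inter_channels, kernel_size=1),
                nn.ReLU(inplace=True),
                StackedDilationBranch(inter_channels, d)))
        # concatenate branch outputs and project back to the input width
        self.fuse = nn.Conv2d(inter_channels * len(dilations), channels,
                              kernel_size=1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```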
Besides, Fig. 4 shows that our improved anchor matching strategy greatly increases the number of ground-truth faces that are close to an anchor, which reduces the contradiction between the discrete anchor scales and the continuous face scales. Moreover, Fig. 5 shows the distribution of the number of matched anchors per ground-truth face, which indicates that our improved anchor matching can significantly increase the number of matched anchors; the average number of matched anchors for faces of different scales improves from 6.4 to about 6.9.
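The matched-anchor statistic behind Fig. 5 can be reproduced with a simple IoU count per ground-truth face, as sketched below. The 0.35 IoU threshold and the helper names are assumptions for illustration, not values taken from the paper.

```python
import numpy as np

def iou_matrix(anchors, gt_boxes):
    """Pairwise IoU between anchors (N, 4) and ground-truth boxes (M, 4), xyxy format."""
    x1 = np.maximum(anchors[:, None, 0], gt_boxes[None, :, 0])
    y1 = np.maximum(anchors[:, None, 1], gt_boxes[None, :, 1])
    x2 = np.minimum(anchors[:, None, 2], gt_boxes[None, :, 2])
    y2 = np.minimum(anchors[:, None, 3], gt_boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def matched_anchors_per_face(anchors, gt_boxes, iou_thresh=0.35):
    """Count anchors whose IoU with each ground-truth face exceeds the threshold."""
    ious = iou_matrix(anchors, gt_boxes)
    return (ious > iou_thresh).sum(axis=0)

# Averaging matched_anchors_per_face over a dataset yields the kind of statistic
# reported above (e.g., an average that rises from ~6.4 to ~6.9 with IAM).
```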
From the above analysis and results, some promising conclusions can be drawn: 1) Feature enhancement is crucial. We use a more robust and discriminative feature enhance module to improve the feature representation ability, especially for hard faces. 2) The auxiliary loss based on progressive anchors is used to train all 12 detection feature maps of different scales, and it improves the performance on easy, medium and hard faces simultaneously. 3) Our improved anchor matching provides better initialization between anchors and ground-truth faces for the regressor, which achieves improvements of 0.3%, 0.1% and 0.3% on the three settings, respectively. Additionally, when we enlarge the training batch size (i.e., LargeBS), the result in the hard setting reaches 91.2% AP.

Effects of Different Backbones. To better understand our DSFD, we further conducted experiments to examine how different backbones affect classification and detection performance. Specifically, we use the same settings except for the feature extraction network: we implement SE-ResNet101, DPN-98 and SE-ResNeXt101 32×4d following the ResNet101 setting in our DSFD. From Table 5, DSFD with SE-ResNeXt101 32×4d obtains 95.7%, 94.8% and 88.9% on the easy, medium and hard settings respectively, which indicates that a more complex model with higher Top-1 ImageNet classification accuracy may not benefit face detection AP. Therefore, in our DSFD framework, better performance on classification is not necessary for better performance on detection, which is consistent with the conclusion in [11, 16]. Our DSFD enjoys high inference speed, benefiting from simply using the second shot detection results. For VGA resolution inputs to the ResNet-50-based DSFD, it runs at 22 FPS on an NVIDIA P40 GPU during inference.

4.3. Comparisons with State-of-the-Art Methods

We evaluate the proposed DSFD on two popular face detection benchmarks, WIDER FACE [35] and the Face Detection Data Set and Benchmark (FDDB) [12]. Our model is trained only on the training set of WIDER FACE and then evaluated on both benchmarks without any further fine-tuning. We also follow a similar approach to [31] to build an image pyramid for multi-scale testing and use a more powerful backbone similar to [4].
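A common way to realize such multi-scale testing is sketched below: run the detector on rescaled copies of the image, map the boxes back to the original resolution, and merge them with NMS. This is a generic sketch rather than the exact protocol of [31]; the `detect` callable, the scale set and the NMS threshold are placeholders.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; boxes are (N, 4) in xyxy format."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_r - inter)
        order = order[1:][iou <= iou_thresh]
    return keep

def multi_scale_detect(image, detect, scales=(0.5, 1.0, 1.5, 2.0)):
    """Run `detect(image, scale)` at several scales and merge the results.

    `detect` is assumed to return (boxes, scores) already mapped back to the
    original image coordinates; it stands in for a single-scale detector pass.
    """
    all_boxes, all_scores = [], []
    for s in scales:
        boxes, scores = detect(image, s)
        all_boxes.append(boxes)
        all_scores.append(scores)
    boxes = np.concatenate(all_boxes, axis=0)
    scores = np.concatenate(all_scores, axis=0)
    keep = nms(boxes, scores)
    return boxes[keep], scores[keep]
```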
WIDER FACE Dataset. It contains 393,703 annotated faces with large variations in scale, pose and occlusion across a total of 32,203 images. For each of the 60 event classes, 40%, 10% and 50% of the images are randomly selected as the training, validation and testing sets. Besides, each subset is further divided into three levels of difficulty, 'Easy', 'Medium' and 'Hard', based on the detection rate of a baseline detector. As shown in Fig. 6, our DSFD achieves the best performance among all of the state-of-the-art face detectors in terms of average precision (AP) across the three subsets, i.e., 96.6% (Easy), 95.7% (Medium) and 90.4% (Hard) on the validation set, and 96.0% (Easy), 95.3% (Medium) and
Figure 8: Illustration of our DSFD's robustness to large variations in scale, pose, occlusion, blur, makeup, illumination, modality and reflection. Blue bounding boxes indicate detections with confidence above 0.8.
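For reference, the visualization rule in the caption, keeping only detections above a 0.8 confidence, can be reproduced as follows; the detector output format (xyxy boxes with per-box scores) is an assumption.

```python
import cv2  # opencv-python

def draw_confident_faces(image, boxes, scores, thresh=0.8):
    """Draw blue (BGR) boxes for detections whose confidence exceeds `thresh`."""
    for (x1, y1, x2, y2), score in zip(boxes, scores):
        if score >= thresh:
            cv2.rectangle(image, (int(x1), int(y1)), (int(x2), int(y2)),
                          color=(255, 0, 0), thickness=2)
    return image
```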