Feature Pyramid Networks for Object Detection
Abstract

Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But recent deep learning object detectors have avoided pyramid representations, in part because they are compute and memory intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners. In addition, our method can run at 6 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection. Code will be made publicly available.

Figure 1. (a) Using an image pyramid to build a feature pyramid. Features are computed on each of the image scales independently, which is slow. (b) Recent detection systems have opted to use only single scale features for faster detection. (c) An alternative is to reuse the pyramidal feature hierarchy computed by a ConvNet as if it were a featurized image pyramid. (d) Our proposed Feature Pyramid Network (FPN) is fast like (b) and (c), but more accurate. In this figure, feature maps are indicated by blue outlines and thicker outlines denote semantically stronger features.
Training deep networks end-to-end on an image pyramid is infeasible in terms of memory, and so, if exploited, image pyramids are used only at test time [15, 11, 16, 35], which creates an inconsistency between train/test-time inference. For these reasons, Fast and Faster R-CNN [11, 29] opt to not use featurized image pyramids under default settings.

However, image pyramids are not the only way to compute a multi-scale feature representation. A deep ConvNet computes a feature hierarchy layer by layer, and with subsampling layers the feature hierarchy has an inherent multi-scale, pyramidal shape. This in-network feature hierarchy produces feature maps of different spatial resolutions, but introduces large semantic gaps caused by different depths. The high-resolution maps have low-level features that harm their representational capacity for object recognition.

Figure 2. Top: a top-down architecture with skip connections, where predictions are made on the finest level (e.g., [28]). Bottom: our model that has a similar structure but leverages it as a feature pyramid, with predictions made independently at all levels.
The Single Shot Detector (SSD) [22] is one of the first attempts at using a ConvNet's pyramidal feature hierarchy as if it were a featurized image pyramid (Fig. 1(c)). Ideally, the SSD-style pyramid would reuse the multi-scale feature maps from different layers computed in the forward pass and thus come free of cost. But to avoid using low-level features SSD foregoes reusing already computed layers and instead builds the pyramid starting from high up in the network (e.g., conv4_3 of VGG nets [36]) and then by adding several new layers. Thus it misses the opportunity to reuse the higher-resolution maps of the feature hierarchy. We show that these are important for detecting small objects.

The goal of this paper is to naturally leverage the pyramidal shape of a ConvNet's feature hierarchy while creating a feature pyramid that has strong semantics at all scales. To achieve this goal, we rely on an architecture that combines low-resolution, semantically strong features with high-resolution, semantically weak features via a top-down pathway and lateral connections (Fig. 1(d)). The result is a feature pyramid that has rich semantics at all levels and is built quickly from a single input image scale. In other words, we show how to create in-network feature pyramids that can be used to replace featurized image pyramids without sacrificing representational power, speed, or memory.

Similar architectures adopting top-down and skip connections are popular in recent research [28, 17, 8, 26]. Their goals are to produce a single high-level feature map of a fine resolution on which the predictions are to be made (Fig. 2 top). On the contrary, our method leverages the architecture as a feature pyramid where predictions (e.g., object detections) are independently made on each level (Fig. 2 bottom). Our model echoes a featurized image pyramid, which has not been explored in these works.

We evaluate our method, called a Feature Pyramid Network (FPN), in various systems for detection and segmentation [11, 29, 27]. Without bells and whistles, we report a state-of-the-art single-model result on the challenging COCO detection benchmark [21] simply based on FPN and a basic Faster R-CNN detector [29], surpassing all existing heavily-engineered single-model entries of competition winners. In ablation experiments, we find that for bounding box proposals, FPN significantly increases the Average Recall (AR) by 8.0 points; for object detection, it improves the COCO-style Average Precision (AP) by 2.3 points and PASCAL-style AP by 3.8 points, over a strong single-scale baseline of Faster R-CNN on ResNets [16]. Our method is also easily extended to mask proposals and improves both instance segmentation AR and speed over state-of-the-art methods that heavily depend on image pyramids.

In addition, our pyramid structure can be trained end-to-end with all scales and is used consistently at train/test time, which would be memory-infeasible using image pyramids. As a result, FPNs are able to achieve higher accuracy than all existing state-of-the-art methods. Moreover, this improvement is achieved without increasing testing time over the single-scale baseline. We believe these advances will facilitate future research and applications. Our code will be made publicly available.

2. Related Work

Hand-engineered features and early neural networks. SIFT features [25] were originally extracted at scale-space extrema and used for feature point matching. HOG features [5], and later SIFT features as well, were computed densely over entire image pyramids. These HOG and SIFT pyramids have been used in numerous works for image classification, object detection, human pose estimation, and more. There has also been significant interest in computing featurized image pyramids quickly. Dollár et al. [6] demonstrated fast pyramid computation by first computing a sparsely sampled (in scale) pyramid and then interpolating missing levels. Before HOG and SIFT, early work on face detection with ConvNets [38, 32] computed shallow networks over image pyramids to detect faces across scales.
Deep ConvNet object detectors. With the development of modern deep ConvNets [19], object detectors like OverFeat [34] and R-CNN [12] showed dramatic improvements in accuracy. OverFeat adopted a strategy similar to early neural network face detectors by applying a ConvNet as a sliding window detector on an image pyramid. R-CNN adopted a region proposal-based strategy [37] in which each proposal was scale-normalized before classifying with a ConvNet.
The upsampled map is then merged with the corresponding bottom-up map (which undergoes a 1×1 convolutional layer to reduce channel dimensions) by element-wise addition. This process is iterated until the finest resolution map is generated. To start the iteration, we simply attach a 1×1 convolutional layer on C5 to produce the coarsest resolution map. Finally, we append a 3×3 convolution on each merged map to generate the final feature map, which is to reduce the aliasing effect of upsampling. This final set of feature maps is called {P2, P3, P4, P5}, corresponding to {C2, C3, C4, C5} that are respectively of the same spatial sizes.

Because all levels of the pyramid use shared classifiers/regressors as in a traditional featurized image pyramid, we fix the feature dimension (number of channels, denoted as d) in all the feature maps. We set d = 256 in this paper and thus all extra convolutional layers have 256-channel outputs. There are no non-linearities in these extra layers, which we have empirically found to have minor impacts.

Simplicity is central to our design and we have found that our model is robust to many design choices. We have experimented with more sophisticated blocks (e.g., using multi-layer residual blocks [16] as the connections) and observed marginally better results. Designing better connection modules is not the focus of this paper, so we opt for the simple design described above.
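To make this construction concrete, the sketch below builds {P2, P3, P4, P5} from bottom-up maps {C2, ..., C5} in the way just described. It is a minimal PyTorch-style illustration, not the authors' released code: the module name, the ResNet-style input channel counts, and the use of nearest-neighbor upsampling are our assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class FPNTopDown(nn.Module):
    """Minimal sketch of the top-down pathway with lateral connections (d = 256).
    Assumes bottom-up maps C2..C5 with ResNet-style channel counts are given."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), d=256):
        super().__init__()
        # 1x1 lateral convolutions reduce the channel dimension of each Ci to d.
        self.lateral = nn.ModuleList([nn.Conv2d(c, d, kernel_size=1) for c in in_channels])
        # 3x3 convolutions on each merged map reduce the aliasing effect of upsampling.
        self.output = nn.ModuleList([nn.Conv2d(d, d, kernel_size=3, padding=1) for _ in in_channels])

    def forward(self, c2, c3, c4, c5):
        laterals = [conv(c) for conv, c in zip(self.lateral, (c2, c3, c4, c5))]
        # Start from the coarsest map (from C5) and iterate toward the finest resolution:
        # upsample the coarser map (nearest neighbor, assumed here) and merge by addition.
        merged = [laterals[-1]]
        for lateral in reversed(laterals[:-1]):
            top_down = F.interpolate(merged[0], size=lateral.shape[-2:], mode="nearest")
            merged.insert(0, lateral + top_down)
        p2, p3, p4, p5 = [conv(m) for conv, m in zip(self.output, merged)]
        return p2, p3, p4, p5  # same spatial sizes as C2..C5, all with 256 channels
```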
4. Applications

Our method is a generic solution for building feature pyramids inside deep ConvNets. In the following we adopt our method in RPN [29] for bounding box proposal generation and in Fast R-CNN [11] for object detection. To demonstrate the simplicity and effectiveness of our method, we make minimal modifications to the original systems of [29, 11] when adapting them to our feature pyramid.
4.1. Feature Pyramid Networks for RPN

RPN [29] is a sliding-window class-agnostic object detector. In the original RPN design, a small subnetwork is evaluated on dense 3×3 sliding windows, on top of a single-scale convolutional feature map, performing object/non-object binary classification and bounding box regression. This is realized by a 3×3 convolutional layer followed by two sibling 1×1 convolutions for classification and regression, which we refer to as a network head. The object/non-object criterion and bounding box regression target are defined with respect to a set of reference boxes called anchors [29]. The anchors are of multiple pre-defined scales and aspect ratios in order to cover objects of different shapes.

We adapt RPN by replacing the single-scale feature map with our FPN. We attach a head of the same design (3×3 conv and two sibling 1×1 convs) to each level on our feature pyramid. Because the head slides densely over all locations in all pyramid levels, it is not necessary to have multi-scale anchors on a specific level. Instead, we assign anchors of a single scale to each level. Formally, we define the anchors to have areas of {32², 64², 128², 256², 512²} pixels on {P2, P3, P4, P5, P6} respectively. (Here we introduce P6 only for covering a larger anchor scale of 512²; P6 is simply a stride-two subsampling of P5 and is not used by the Fast R-CNN detector in the next section.) As in [29] we also use anchors of multiple aspect ratios {1:2, 1:1, 2:1} at each level. So in total there are 15 anchors over the pyramid.

We assign training labels to the anchors based on their Intersection-over-Union (IoU) ratios with ground-truth bounding boxes as in [29]. Formally, an anchor is assigned a positive label if it has the highest IoU for a given ground-truth box or an IoU over 0.7 with any ground-truth box, and a negative label if it has IoU lower than 0.3 for all ground-truth boxes. Note that scales of ground-truth boxes are not explicitly used to assign them to the levels of the pyramid; instead, ground-truth boxes are associated with anchors, which have been assigned to pyramid levels. As such, we introduce no extra rules in addition to those in [29].

We note that the parameters of the heads are shared across all feature pyramid levels; we have also evaluated the alternative without sharing parameters and observed similar accuracy. The good performance of sharing parameters indicates that all levels of our pyramid share similar semantic levels. This advantage is analogous to that of using a featurized image pyramid, where a common head classifier can be applied to features computed at any image scale.

With the above adaptations, RPN can be naturally trained and tested with our FPN, in the same fashion as in [29]. We elaborate on the implementation details in the experiments.
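A small sketch of the per-level anchor configuration and the shared head described above follows; the objectness formulation (one logit per anchor) and the height/width convention for the aspect ratios are our assumptions for illustration, not details quoted from this paper.

```python
import torch.nn as nn

# One anchor scale per pyramid level and three aspect ratios per location: 15 anchor shapes in total.
ANCHOR_AREAS = {"P2": 32 ** 2, "P3": 64 ** 2, "P4": 128 ** 2, "P5": 256 ** 2, "P6": 512 ** 2}
ASPECT_RATIOS = (0.5, 1.0, 2.0)  # {1:2, 1:1, 2:1}, read here as height/width (assumed convention)

def anchor_shapes(level):
    """(width, height) of the three anchors used at every position of one pyramid level."""
    area = ANCHOR_AREAS[level]
    return [((area / r) ** 0.5, (area * r) ** 0.5) for r in ASPECT_RATIOS]

class SharedRPNHead(nn.Module):
    """Sketch of the network head shared across all pyramid levels:
    a 3x3 conv followed by two sibling 1x1 convs for classification and regression."""

    def __init__(self, d=256, num_anchors=len(ASPECT_RATIOS)):
        super().__init__()
        self.conv = nn.Conv2d(d, d, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.objectness = nn.Conv2d(d, num_anchors, kernel_size=1)      # object / non-object
        self.box_deltas = nn.Conv2d(d, num_anchors * 4, kernel_size=1)  # bounding box regression

    def forward(self, pyramid):
        # The same parameters are applied to every level P2..P6.
        outputs = []
        for feature_map in pyramid:
            x = self.relu(self.conv(feature_map))
            outputs.append((self.objectness(x), self.box_deltas(x)))
        return outputs
```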
4.2. Feature Pyramid Networks for Fast R-CNN

Fast R-CNN [11] is a region-based object detector in which Region-of-Interest (RoI) pooling is used to extract features. Fast R-CNN is most commonly performed on a single-scale feature map. To use it with our FPN, we need to assign RoIs of different scales to the pyramid levels.

We view our feature pyramid as if it were produced from an image pyramid. Thus we can adapt the assignment strategy of region-based detectors [15, 11] in the case when they are run on image pyramids. Formally, we assign an RoI of width w and height h (on the input image to the network) to the level Pk of our feature pyramid by:

k = ⌊k0 + log2(√(wh)/224)⌋.    (1)

Here 224 is the canonical ImageNet pre-training size, and k0 is the target level on which an RoI with w × h = 224² should be mapped into. Analogous to the ResNet-based Faster R-CNN system [16] that uses C4 as the single-scale feature map, we set k0 to 4. Intuitively, Eqn. (1) means that if the RoI's scale becomes smaller (say, 1/2 of 224), it should be mapped into a finer-resolution level (say, k = 3).
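Read concretely, Eqn. (1) can be implemented as in the sketch below; clamping k to the levels actually used by the detector (P2 through P5, since P6 is not used by Fast R-CNN) is our assumption rather than something stated explicitly above.

```python
import math

def roi_to_pyramid_level(w, h, k0=4, canonical_size=224, k_min=2, k_max=5):
    """Assign an RoI of width w and height h (on the input image) to pyramid level Pk
    using Eqn. (1): k = floor(k0 + log2(sqrt(w * h) / canonical_size))."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical_size))
    return min(max(k, k_min), k_max)  # clamp to the available levels (assumed)

# An RoI of 224x224 maps to k0 = 4 (P4); one at half that scale (112x112) maps to P3.
assert roi_to_pyramid_level(224, 224) == 4
assert roi_to_pyramid_level(112, 112) == 3
```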
We attach predictor heads (in Fast R-CNN the heads are class-specific classifiers and bounding box regressors) to all RoIs of all levels. Again, the heads all share parameters, regardless of their levels. In [16], a ResNet's conv5 layers (a 9-layer deep subnetwork) are adopted as the head on top of the conv4 features, but our method has already harnessed conv5 to construct the feature pyramid. So unlike [16], we simply adopt RoI pooling to extract 7×7 features, and attach two hidden 1,024-d fully-connected (fc) layers (each followed by ReLU) before the final classification and bounding box regression layers. These layers are randomly initialized, as there are no pre-trained fc layers available in ResNets. Note that compared to the standard conv5 head, our 2-fc MLP head is lighter weight and faster.

Based on these adaptations, we can train and test Fast R-CNN on top of the feature pyramid. Implementation details are given in the experimental section.
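A sketch of the 2-fc MLP head described above follows. The RoI pooling step itself is omitted, and the output sizes (80 COCO classes plus background, class-specific box regression) are conventional choices we assume for illustration.

```python
import torch.nn as nn

class TwoFCHead(nn.Module):
    """Sketch of the lightweight head: 7x7 RoI-pooled features with d = 256 channels,
    two hidden 1,024-d fc layers (each followed by ReLU), then sibling classification
    and bounding box regression layers (randomly initialized)."""

    def __init__(self, d=256, roi_size=7, num_classes=81):  # 80 classes + background (assumed)
        super().__init__()
        self.fc1 = nn.Linear(d * roi_size * roi_size, 1024)
        self.fc2 = nn.Linear(1024, 1024)
        self.relu = nn.ReLU(inplace=True)
        self.cls_score = nn.Linear(1024, num_classes)
        self.bbox_pred = nn.Linear(1024, num_classes * 4)  # class-specific boxes (assumed)

    def forward(self, roi_features):
        # roi_features: (num_rois, d, 7, 7), pooled from the level chosen by Eqn. (1).
        x = roi_features.flatten(start_dim=1)
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        return self.cls_score(x), self.bbox_pred(x)
```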
5. Experiments on Object Detection

We perform experiments on the 80 category COCO detection dataset [21]. We train using the union of 80k train images and a 35k subset of val images (trainval35k [2]), and report ablations on a 5k subset of val images (minival). We also report final results on the standard test set (test-std) [21] which has no disclosed labels.

As is common practice [12], all network backbones are pre-trained on the ImageNet1k classification set [33] and then fine-tuned on the detection dataset. We use the pre-trained ResNet-50 and ResNet-101 models that are publicly available (https://github.com/kaiminghe/deep-residual-networks). Our code is a reimplementation of py-faster-rcnn (https://github.com/rbgirshick/py-faster-rcnn) using Caffe2 (https://github.com/caffe2/caffe2).

5.1. Region Proposal with RPN

We evaluate the COCO-style Average Recall (AR) and AR on small, medium, and large objects (ARs, ARm, and ARl) following the definitions in [21]. We report results for 100 and 1000 proposals per image (AR100 and AR1k).

Implementation details. All architectures in Table 1 are trained end-to-end. The input image is resized such that its shorter side has 800 pixels. We adopt synchronized SGD training on 8 GPUs. A mini-batch involves 2 images per GPU and 256 anchors per image. We use a weight decay of 0.0001 and a momentum of 0.9. The learning rate is 0.02 for the first 30k mini-batches and 0.002 for the next 10k. For all RPN experiments (including baselines), we include the anchor boxes that are outside the image for training, which is unlike [29] where these anchor boxes are ignored. Other implementation details are as in [29]. Training RPN with FPN on 8 GPUs takes about 8 hours on COCO.

5.1.1 Ablation Experiments

Comparisons with baselines. For fair comparisons with original RPNs [29], we run two baselines (Table 1(a, b)) using the single-scale map of C4 (the same as [16]) or C5, both using the same hyper-parameters as ours, including using 5 scale anchors of {32², 64², 128², 256², 512²}. Table 1(b) shows no advantage over (a), indicating that a single higher-level feature map is not enough because there is a trade-off between coarser resolutions and stronger semantics.

Placing FPN in RPN improves AR1k to 56.3 (Table 1(c)), an increase of 8.0 points over the single-scale RPN baseline (Table 1(a)). In addition, the performance on small objects (AR1k_s) is boosted by a large margin of 12.9 points. Our pyramid representation greatly improves RPN's robustness to object scale variation.

How important is top-down enrichment? Table 1(d) shows the results of our feature pyramid without the top-down pathway. With this modification, the 1×1 lateral connections followed by 3×3 convolutions are attached to the bottom-up pyramid. This architecture simulates the effect of reusing the pyramidal feature hierarchy (Fig. 1(b)).

The results in Table 1(d) are just on par with the RPN baseline and lag far behind ours. We conjecture that this is because there are large semantic gaps between different levels on the bottom-up pyramid (Fig. 1(b)), especially for very deep ResNets. We have also evaluated a variant of Table 1(d) without sharing the parameters of the heads, but observed similarly degraded performance. This issue cannot be simply remedied by level-specific heads.

How important are lateral connections? Table 1(e) shows the ablation results of a top-down feature pyramid without the 1×1 lateral connections. This top-down pyramid has strong semantic features and fine resolutions. But we argue that the locations of these features are not precise, because these maps have been downsampled and upsampled several times. More precise locations of features can be directly passed from the finer levels of the bottom-up maps via the lateral connections to the top-down maps. As a result, FPN has an AR1k score 10 points higher than Table 1(e).

How important are pyramid representations? Instead of resorting to pyramid representations, one can attach the head to the highest-resolution, strongly semantic feature maps of P2 (i.e., the finest level in our pyramids). Similar to the single-scale baselines, we assign all anchors to the P2 feature map. This variant (Table 1(f)) is better than the baseline but inferior to our approach. RPN is a sliding window detector with a fixed window size, so scanning over pyramid levels can increase its robustness to scale variance.

In addition, we note that using P2 alone leads to more anchors (750k, Table 1(f)) caused by its large spatial resolution. This result suggests that a larger number of anchors is not sufficient in itself to improve accuracy.
| RPN | feature | # anchors | lateral? | top-down? | AR100 | AR1k | AR1k_s | AR1k_m | AR1k_l |
| (a) baseline on conv4 | C4 | 47k | | | 36.1 | 48.3 | 32.0 | 58.7 | 62.2 |
| (b) baseline on conv5 | C5 | 12k | | | 36.3 | 44.9 | 25.3 | 55.5 | 64.2 |
| (c) FPN | {Pk} | 200k | ✓ | ✓ | 44.0 | 56.3 | 44.9 | 63.4 | 66.2 |
Ablation experiments follow:
| (d) bottom-up pyramid | {Pk} | 200k | ✓ | | 37.4 | 49.5 | 30.5 | 59.9 | 68.0 |
| (e) top-down pyramid, w/o lateral | {Pk} | 200k | | ✓ | 34.5 | 46.1 | 26.5 | 57.4 | 64.7 |
| (f) only finest level | P2 | 750k | ✓ | ✓ | 38.4 | 51.3 | 35.1 | 59.7 | 67.6 |

Table 1. Bounding box proposal results using RPN [29], evaluated on the COCO minival set. All models are trained on trainval35k. The columns "lateral" and "top-down" denote the presence of lateral and top-down connections, respectively. The column "feature" denotes the feature maps on which the heads are attached. All results are based on ResNet-50 and share the same hyper-parameters.
| Fast R-CNN | proposals | feature | head | lateral? | top-down? | [email protected] | AP | APs | APm | APl |
| (a) baseline on conv4 | RPN, {Pk} | C4 | conv5 | | | 54.7 | 31.9 | 15.7 | 36.5 | 45.5 |
| (b) baseline on conv5 | RPN, {Pk} | C5 | 2fc | | | 52.9 | 28.8 | 11.9 | 32.4 | 43.4 |
| (c) FPN | RPN, {Pk} | {Pk} | 2fc | ✓ | ✓ | 56.9 | 33.9 | 17.8 | 37.7 | 45.8 |
Ablation experiments follow:
| (d) bottom-up pyramid | RPN, {Pk} | {Pk} | 2fc | ✓ | | 44.9 | 24.9 | 10.9 | 24.4 | 38.5 |
| (e) top-down pyramid, w/o lateral | RPN, {Pk} | {Pk} | 2fc | | ✓ | 54.0 | 31.3 | 13.3 | 35.2 | 45.3 |
| (f) only finest level | RPN, {Pk} | P2 | 2fc | ✓ | ✓ | 56.3 | 33.4 | 17.3 | 37.3 | 45.6 |

Table 2. Object detection results using Fast R-CNN [11] on a fixed set of proposals (RPN, {Pk}, Table 1(c)), evaluated on the COCO minival set. Models are trained on the trainval35k set. All results are based on ResNet-50 and share the same hyper-parameters.
| Faster R-CNN | proposals | feature | head | lateral? | top-down? | [email protected] | AP | APs | APm | APl |
| (*) baseline from He et al. [16]† | RPN, C4 | C4 | conv5 | | | 47.3 | 26.3 | – | – | – |
| (a) baseline on conv4 | RPN, C4 | C4 | conv5 | | | 53.1 | 31.6 | 13.2 | 35.6 | 47.1 |
| (b) baseline on conv5 | RPN, C5 | C5 | 2fc | | | 51.7 | 28.0 | 9.6 | 31.9 | 43.1 |
| (c) FPN | RPN, {Pk} | {Pk} | 2fc | ✓ | ✓ | 56.9 | 33.9 | 17.8 | 37.7 | 45.8 |

Table 3. Object detection results using Faster R-CNN [29] evaluated on the COCO minival set. The backbone network for RPN is consistent with that of Fast R-CNN. Models are trained on the trainval35k set and use ResNet-50. † Provided by authors of [16].
5.2. Object Detection with Fast/Faster R-CNN

Next we investigate FPN for region-based (non-sliding window) detectors. We evaluate object detection by the COCO-style Average Precision (AP) and PASCAL-style AP (at a single IoU threshold of 0.5). We also report COCO AP on objects of small, medium, and large sizes (namely, APs, APm, and APl) following the definitions in [21].

Implementation details. The input image is resized such that its shorter side has 800 pixels. Synchronized SGD is used to train the model on 8 GPUs. Each mini-batch involves 2 images per GPU and 512 RoIs per image. We use a weight decay of 0.0001 and a momentum of 0.9. The learning rate is 0.02 for the first 60k mini-batches and 0.002 for the next 20k. We use 2000 RoIs per image for training and 1000 for testing. Training Fast R-CNN with FPN takes about 10 hours on the COCO dataset.

5.2.1 Fast R-CNN (on fixed proposals)

To better investigate FPN's effects on the region-based detector alone, we conduct ablations of Fast R-CNN on a fixed set of proposals. We choose to freeze the proposals as computed by RPN on FPN (Table 1(c)), because it has good performance on small objects that are to be recognized by the detector. For simplicity we do not share features between Fast R-CNN and RPN, except when specified.

As a ResNet-based Fast R-CNN baseline, following [16], we adopt RoI pooling with an output size of 14×14 and attach all conv5 layers as the hidden layers of the head. This gives an AP of 31.9 in Table 2(a). Table 2(b) is a baseline exploiting an MLP head with 2 hidden fc layers, similar to the head in our architecture. It gets an AP of 28.8, indicating that the 2-fc head does not give us any orthogonal advantage over the baseline in Table 2(a).

Table 2(c) shows the results of our FPN in Fast R-CNN. Comparing with the baseline in Table 2(a), our method improves AP by 2.0 points and small object AP by 2.1 points. Comparing with the baseline that also adopts a 2fc head (Table 2(b)), our method improves AP by 5.1 points. (We expect a stronger architecture of the head [30] will improve upon our results, which is beyond the focus of this paper.) These comparisons indicate that our feature pyramid is superior to single-scale features for a region-based object detector.

Table 2(d) and (e) show that removing top-down connections or removing lateral connections leads to inferior results, similar to what we have observed in the above subsection for RPN.
| method | backbone | competition | image pyramid | test-dev ([email protected] / AP / APs / APm / APl) | test-std ([email protected] / AP / APs / APm / APl) |
| ours, Faster R-CNN on FPN | ResNet-101 | – | | 59.1 / 36.2 / 18.2 / 39.0 / 48.2 | 58.5 / 35.8 / 17.5 / 38.7 / 47.8 |
Competition-winning single-model results follow:
| G-RMI† | Inception-ResNet | 2016 | | – / 34.7 / – / – / – | – / – / – / – / – |
| AttractioNet‡ [10] | VGG16 + Wide ResNet§ | 2016 | ✓ | 53.4 / 35.7 / 15.6 / 38.0 / 52.7 | 52.9 / 35.3 / 14.7 / 37.6 / 51.9 |
| Faster R-CNN +++ [16] | ResNet-101 | 2015 | ✓ | 55.7 / 34.9 / 15.6 / 38.7 / 50.9 | – / – / – / – / – |
| Multipath [40] (on minival) | VGG-16 | 2015 | | 49.6 / 31.5 / – / – / – | – / – / – / – / – |
| ION‡ [2] | VGG-16 | 2015 | | 53.4 / 31.2 / 12.8 / 32.9 / 45.2 | 52.9 / 30.7 / 11.8 / 32.8 / 44.8 |

Table 4. Comparisons of single-model results on the COCO detection benchmark. Some results were not available on the test-std set, so we also include the test-dev results (and for Multipath [40] on minival). †: http://image-net.org/challenges/talks/2016/GRMI-COCO-slidedeck.pdf. ‡: http://mscoco.org/dataset/#detections-leaderboard. §: This entry of AttractioNet [10] adopts VGG-16 for proposals and Wide ResNet [39] for object detection, so is not strictly a single-model result.
On the test-dev set, our method increases over the existing best results by 0.5 points of AP (36.2 vs. 35.7) and 3.4 points of [email protected] (59.1 vs. 55.7). It is worth noting that our method does not rely on image pyramids and only uses a single input image scale, but still has outstanding AP on small-scale objects. This could only be achieved by high-resolution image inputs with previous methods.

Moreover, our method does not exploit many popular improvements, such as iterative regression [9], hard negative mining [35], context modeling [16], stronger data augmentation [22], etc. These improvements are complementary to FPNs and should boost accuracy further.

Recently, FPN has enabled new top results in all tracks of the COCO competition, including detection, instance segmentation, and keypoint estimation. See [14] for details.

6. Extensions: Segmentation Proposals

Our method is a generic pyramid representation and can be used in applications other than object detection. In this section we use FPNs to generate segmentation proposals, following the DeepMask/SharpMask framework [27, 28].

DeepMask/SharpMask were trained on image crops for predicting instance segments and object/non-object scores. At inference time, these models are run convolutionally to generate dense proposals in an image. To generate segments at multiple scales, image pyramids are necessary [27, 28].

It is easy to adapt FPN to generate mask proposals. We use a fully convolutional setup for both training and inference. We construct our feature pyramid as in Sec. 5.1 and set d = 128. On top of each level of the feature pyramid, we apply a small 5×5 MLP to predict 14×14 masks and object scores in a fully convolutional fashion, see Fig. 4. Additionally, motivated by the use of 2 scales per octave in the image pyramid of [27, 28], we use a second MLP of input size 7×7 to handle half octaves. The two MLPs play a similar role as anchors in RPN. The architecture is trained end-to-end; full implementation details are given in the appendix.

Figure 4. FPN for object segment proposals. The feature pyramid is constructed with identical structure as for object detection. We apply a small MLP on 5×5 windows to generate dense object segments with output dimension of 14×14. Shown in orange are the size of the image regions the mask corresponds to for each pyramid level (levels P3−5 are shown here). Both the corresponding image region size (light orange) and canonical object size (dark orange) are shown. Half octaves are handled by an MLP on 7×7 windows (7 ≈ 5√2), not shown here. Details are in the appendix.
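A minimal sketch of the per-level mask head just described (and detailed further in the appendix) follows: a 5×5 MLP applied fully convolutionally that emits a 14×14 mask and an object score at every position. The 512 hidden channels are taken from the appendix, and the ReLU between the two layers is our assumption; the 7×7 MLP for half octaves differs only in its input window size.

```python
import torch.nn as nn

class MaskProposalHead(nn.Module):
    """Sketch of the mask MLP run fully convolutionally on one pyramid level (d = 128):
    a 5x5 conv with 512 outputs, then sibling 1x1 convs predicting a 14x14 mask
    (14 * 14 = 196 outputs) and an object/non-object score (1 output) per position."""

    def __init__(self, d=128, hidden=512, mask_size=14, window=5):
        super().__init__()
        self.hidden = nn.Conv2d(d, hidden, kernel_size=window)
        self.relu = nn.ReLU(inplace=True)  # non-linearity between the two layers (assumed)
        self.mask = nn.Conv2d(hidden, mask_size * mask_size, kernel_size=1)
        self.score = nn.Conv2d(hidden, 1, kernel_size=1)

    def forward(self, feature_map):
        x = self.relu(self.hidden(feature_map))
        return self.mask(x), self.score(x)

# The second MLP that handles half octaves is identical apart from its 7x7 input window:
# half_octave_head = MaskProposalHead(window=7)
```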
6.1. Segmentation Proposal Results

Results are shown in Table 6. We report segment AR and segment AR on small, medium, and large objects, always for 1000 proposals. Our baseline FPN model with a single 5×5 MLP achieves an AR of 43.4. Switching to a slightly larger 7×7 MLP leaves accuracy largely unchanged. Using both MLPs together increases accuracy to 45.7 AR. Increasing mask output size from 14×14 to 28×28 increases AR another point (larger sizes begin to degrade accuracy). Finally, doubling the training iterations increases AR to 48.1.

| | image pyramid | AR | ARs | ARm | ARl | time (s) |
| DeepMask [27] | ✓ | 37.1 | 15.8 | 50.1 | 54.9 | 0.49 |
| SharpMask [28] | ✓ | 39.8 | 17.4 | 53.1 | 59.1 | 0.77 |
| InstanceFCN [4] | ✓ | 39.2 | – | – | – | 1.50† |
FPN Mask Results:
| single MLP [5×5] | | 43.4 | 32.5 | 49.2 | 53.7 | 0.15 |
| single MLP [7×7] | | 43.5 | 30.0 | 49.6 | 57.8 | 0.19 |
| dual MLP [5×5, 7×7] | | 45.7 | 31.9 | 51.5 | 60.8 | 0.24 |
| + 2x mask resolution | | 46.7 | 31.7 | 53.1 | 63.2 | 0.25 |
| + 2x train schedule | | 48.1 | 32.6 | 54.2 | 65.6 | 0.25 |

Table 6. Instance segmentation proposals evaluated on the first 5k COCO val images. All models are trained on the train set. DeepMask, SharpMask, and FPN use ResNet-50 while InstanceFCN uses VGG-16. DeepMask and SharpMask performance is computed with models available from https://github.com/facebookresearch/deepmask (both are the 'zoom' variants). † Runtimes are measured on an NVIDIA M40 GPU, except the InstanceFCN timing which is based on the slower K40.

We also report comparisons to DeepMask [27], SharpMask [28], and InstanceFCN [4], the previous state of the art methods in mask proposal generation. We outperform the accuracy of these approaches by over 8.3 points AR. In particular, we nearly double the accuracy on small objects.

Existing mask proposal methods [27, 28, 4] are based on densely sampled image pyramids (e.g., scaled by 2^{−2:0.5:1} in [27, 28]), making them computationally expensive. Our approach, based on FPNs, is substantially faster (our models run at 6 to 7 FPS). These results demonstrate that our model is a generic feature extractor and can replace image pyramids for other multi-scale detection problems.

7. Conclusion

We have presented a clean and simple framework for building feature pyramids inside ConvNets. Our method shows significant improvements over several strong baselines and competition winners. Thus, it provides a practical solution for research and applications of feature pyramids, without the need of computing image pyramids. Finally, our study suggests that despite the strong representational power of deep ConvNets and their implicit robustness to scale variation, it is still critical to explicitly address multi-scale problems using pyramid representations.
A. Implementation of Segmentation Proposals

We use our feature pyramid networks to efficiently generate object segment proposals, adopting an image-centric training strategy popular for object detection [11, 29]. Our FPN mask generation model inherits many of the ideas and motivations from DeepMask/SharpMask [27, 28]. However, in contrast to these models, which were trained on image crops and used a densely sampled image pyramid for inference, we perform fully-convolutional training for mask prediction on a feature pyramid. While this requires changing many of the specifics, our implementation remains similar in spirit to DeepMask. Specifically, to define the label of a mask instance at each sliding window, we think of this window as being a crop on the input image, allowing us to inherit definitions of positives/negatives from DeepMask. We give more details next, see also Fig. 4 for a visualization.

We construct the feature pyramid with P2−6 using the same architecture as described in Sec. 5.1. We set d = 128. Each level of our feature pyramid is used for predicting masks at a different scale. As in DeepMask, we define the scale of a mask as the max of its width and height. Masks with scales of {32, 64, 128, 256, 512} pixels map to {P2, P3, P4, P5, P6}, respectively, and are handled by a 5×5 MLP. As DeepMask uses a pyramid with half octaves, we use a second slightly larger MLP of size 7×7 (7 ≈ 5√2) to handle half octaves in our model (e.g., a 128√2 scale mask is predicted by the 7×7 MLP on P4). Objects at intermediate scales are mapped to the nearest scale in log space.
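As a concrete reading of this mapping, the sketch below assigns an object scale to a pyramid level and to one of the two MLPs; rounding "to the nearest scale in log space" in half-octave steps, and clamping out-of-range scales, are our interpretation rather than stated implementation details.

```python
import math

PYRAMID_LEVELS = (2, 3, 4, 5, 6)  # canonical scales 32, 64, 128, 256, 512 pixels

def assign_mask_scale(object_scale):
    """Map an object scale (max of mask width and height, in pixels) to the pyramid level
    whose canonical scale is nearest in log space, and to the MLP that handles it:
    whole octaves go to the 5x5 MLP, half octaves to the larger 7x7 MLP."""
    steps = round(2 * math.log2(object_scale / 32.0))          # half-octave steps above 32 px
    steps = max(0, min(steps, 2 * (len(PYRAMID_LEVELS) - 1)))  # clamp to the covered range (assumed)
    level = PYRAMID_LEVELS[steps // 2]
    mlp = "5x5" if steps % 2 == 0 else "7x7"
    return level, mlp

# A 64 px object maps to (3, '5x5'); a ~181 px object (128 * sqrt(2)) maps to (4, '7x7').
assert assign_mask_scale(64) == (3, "5x5")
assert assign_mask_scale(181) == (4, "7x7")
```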
As the MLP must predict objects at a range of scales for each pyramid level (specifically a half-octave range), some padding must be given around the canonical object size. We use 25% padding. This means that the mask output over {P2, P3, P4, P5, P6} maps to {40, 80, 160, 320, 640} sized image regions for the 5×5 MLP (and to √2 larger corresponding sizes for the 7×7 MLP).

Each spatial position in the feature map is used to predict a mask at a different location. Specifically, at scale Pk, each spatial position in the feature map is used to predict the mask whose center falls within 2^k pixels of that location (corresponding to ±1 cell offset in the feature map). If no object center falls within this range, the location is considered a negative, and, as in DeepMask, is used only for training the score branch and not the mask branch.
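A sketch of this labeling rule for a single position follows; reading "within 2^k pixels" as a per-axis check and returning the first matching object are our assumptions.

```python
def label_position(level, position_xy, object_centers_xy):
    """Label one sliding-window position on pyramid level Pk: it is positive if some object
    mask center falls within 2**level pixels of the position (about +/-1 cell at that stride);
    otherwise it is negative and, as in DeepMask, trains only the score branch."""
    radius = 2 ** level
    px, py = position_xy
    for index, (ox, oy) in enumerate(object_centers_xy):
        if abs(ox - px) <= radius and abs(oy - py) <= radius:
            return "positive", index   # train both the mask branch and the score branch
    return "negative", None            # train the score branch only
```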
The MLP we use for predicting the mask and score is fairly simple. We apply a 5×5 kernel with 512 outputs, followed by sibling fully connected layers to predict a 14×14 mask (14² outputs) and object score (1 output). The model is implemented in a fully convolutional manner (using 1×1 convolutions in place of fully connected layers). The 7×7 MLP for handling objects at half octave scales is identical to the 5×5 MLP except for its larger input region.

During training, we randomly sample 2048 examples per mini-batch (128 examples per image from 16 images) with a positive/negative sampling ratio of 1:3. The mask loss is given 10× higher weight than the score loss. This model is trained end-to-end on 8 GPUs using synchronized SGD (2 images per GPU). We start with a learning rate of 0.03 and train for 80k mini-batches, dividing the learning rate by 10 after 60k mini-batches. The image scale is set to 800 pixels during training and testing (we do not use scale jitter). During inference our fully-convolutional model predicts scores at all positions and scales and masks at the 1000 highest scoring locations. We do not perform any non-maximum suppression or post-processing.

References

[1] E. H. Adelson, C. H. Anderson, J. R. Bergen, P. J. Burt, and J. M. Ogden. Pyramid methods in image processing. RCA engineer, 1984.
[2] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In CVPR, 2016.
[3] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In ECCV, 2016.
[4] J. Dai, K. He, Y. Li, S. Ren, and J. Sun. Instance-sensitive fully convolutional networks. In ECCV, 2016.
[5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[6] P. Dollár, R. Appel, S. Belongie, and P. Perona. Fast feature pyramids for object detection. TPAMI, 2014.
[7] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 2010.
[8] G. Ghiasi and C. C. Fowlkes. Laplacian pyramid reconstruction and refinement for semantic segmentation. In ECCV, 2016.
[9] S. Gidaris and N. Komodakis. Object detection via a multi-region & semantic segmentation-aware CNN model. In ICCV, 2015.
[10] S. Gidaris and N. Komodakis. Attend refine repeat: Active box proposal generation via in-out localization. In BMVC, 2016.
[11] R. Girshick. Fast R-CNN. In ICCV, 2015.
[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[13] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, 2015.
[14] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. arXiv:1703.06870, 2017.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, 2014.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[17] S. Honari, J. Yosinski, P. Vincent, and C. Pal. Recombinator networks: Learning coarse-to-fine feature aggregation. In CVPR, 2016.
[18] T. Kong, A. Yao, Y. Chen, and F. Sun. Hypernet: Towards accurate region proposal generation and joint object detection. In CVPR, 2016.
[19] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[20] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E.
Howard, W. Hubbard, and L. D. Jackel. Backpropagation
applied to handwritten zip code recognition. Neural compu-
tation, 1989.
[21] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ra-
manan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Com-
mon objects in context. In ECCV, 2014.
[22] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. Reed.
SSD: Single shot multibox detector. In ECCV, 2016.
[23] W. Liu, A. Rabinovich, and A. C. Berg. ParseNet: Looking
wider to see better. In ICLR workshop, 2016.
[24] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional
networks for semantic segmentation. In CVPR, 2015.
[25] D. G. Lowe. Distinctive image features from scale-invariant
keypoints. IJCV, 2004.
[26] A. Newell, K. Yang, and J. Deng. Stacked hourglass net-
works for human pose estimation. In ECCV, 2016.
[27] P. O. Pinheiro, R. Collobert, and P. Dollar. Learning to seg-
ment object candidates. In NIPS, 2015.
[28] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learn-
ing to refine object segments. In ECCV, 2016.
[29] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: To-
wards real-time object detection with region proposal net-
works. In NIPS, 2015.
[30] S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun. Object
detection networks on convolutional feature maps. PAMI,
2016.
[31] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolu-
tional networks for biomedical image segmentation. In MIC-
CAI, 2015.
[32] H. Rowley, S. Baluja, and T. Kanade. Human face detec-
tion in visual scenes. Technical Report CMU-CS-95-158R,
Carnegie Mellon University, 1995.
[33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh,
S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein,
A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual
Recognition Challenge. IJCV, 2015.
[34] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus,
and Y. LeCun. Overfeat: Integrated recognition, localization
and detection using convolutional networks. In ICLR, 2014.
[35] A. Shrivastava, A. Gupta, and R. Girshick. Training region-
based object detectors with online hard example mining. In
CVPR, 2016.
[36] K. Simonyan and A. Zisserman. Very deep convolutional
networks for large-scale image recognition. In ICLR, 2015.
[37] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. IJCV, 2013.
[38] R. Vaillant, C. Monrocq, and Y. LeCun. Original approach for the localisation of objects in images. IEE Proc. on Vision, Image, and Signal Processing, 1994.
[39] S. Zagoruyko and N. Komodakis. Wide residual networks. In BMVC, 2016.
[40] S. Zagoruyko, A. Lerer, T.-Y. Lin, P. O. Pinheiro, S. Gross, S. Chintala, and P. Dollár. A multipath network for object detection. In BMVC, 2016.