\[
\begin{cases}
(1 - n_i \cdot n_j)^2 & \text{if } (p_j - p_i) \cdot n_j > 0 \\
1 - n_i \cdot n_j & \text{otherwise}
\end{cases}
\qquad (1)
\]
where the squared term serves to penalize convex edges less than concave edges. Note that a perfectly planar patch will produce edges with weight 0.
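As a concrete illustration of Equation 1, the following Python sketch computes the weight of a single mesh edge from the endpoint positions and normals; the function name and array conventions are our own and not part of the original implementation.

```python
import numpy as np

def edge_weight(p_i, p_j, n_i, n_j):
    """Edge weight between two adjacent mesh vertices (Equation 1).

    p_i, p_j: 3D vertex positions; n_i, n_j: unit normals.
    Convex edges receive the squared penalty, which is smaller whenever
    the normals are similar, so convex boundaries are penalized less.
    """
    w = 1.0 - np.dot(n_i, n_j)
    if np.dot(p_j - p_i, n_j) > 0:   # convex edge
        return w ** 2
    return w                         # concave (or planar) edge

# A nearly planar pair of points yields a weight close to 0.
print(edge_weight(np.zeros(3), np.array([0.01, 0.0, 0.0]),
                  np.array([0.0, 0.0, 1.0]), np.array([0.0, 0.0, 1.0])))
```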
We experimented with incorporating color into the similarity metric between points, but found that our attempts consistently lowered the overall performance of the system. We speculate that this could be due to significant lighting variations present in our dataset. More generally, we do not make use of color information throughout the algorithm, but still display colored meshes in figures for ease of interpretation.
Segment post-processing. Following the original implementation of the graph segmentation algorithm, we place a hard threshold on the minimum number of points m_size that are allowed to constitute a valid segment and greedily merge any smaller segments to neighboring segments. We use m_size = 500, which with our data density corresponds to a shape about half the size of a computer mouse.
Hard thresholding. For added efficiency, we reject any segments that are more than 1m in size, or less than 2cm thin. In addition, denoting the eigenvalues of the scatter matrix in decreasing order as λ_0, λ_1, λ_2, we also reject segments that are, in relative terms, too thin (λ_1/λ_0 < 0.05) or too flat (λ_2/λ_0 < 0.001). These threshold settings are conservative and are not high enough to filter out thin objects such as monitors.
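A minimal sketch of this screening step is given below; it assumes a segment arrives as a raw point array, applies the 2 cm check to the smallest bounding-box dimension (our reading of the text), and uses covariance eigenvalues in place of the scatter matrix, which does not change the ratios.

```python
import numpy as np

def passes_hard_thresholds(points, min_points=500, max_extent=1.0,
                           min_extent=0.02, thin_ratio=0.05, flat_ratio=0.001):
    """Screen a segment with the absolute and relative size thresholds (sketch)."""
    if len(points) < min_points:
        return False
    extent = points.max(axis=0) - points.min(axis=0)
    if extent.max() > max_extent or extent.min() < min_extent:
        return False
    centered = points - points.mean(axis=0)
    lam = np.sort(np.linalg.eigvalsh(np.cov(centered.T)))[::-1]  # lam0 >= lam1 >= lam2
    if lam[1] / lam[0] < thin_ratio:    # too thin (elongated)
        return False
    if lam[2] / lam[0] < flat_ratio:    # too flat (planar)
        return False
    return True
```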
Non-maximum suppression. Inevitably, some segments will be obtained multiple times across different settings of the segmentation algorithm's granularity parameter. We detect such cases by computing the intersection-over-union of the vertices belonging to all segments. If two segments are found to be too similar (we use a threshold of 0.7), we greedily retain the more object-like segment, scored as the average of the segment's shape measures. We explain these measures next.
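A greedy version of this suppression could look like the sketch below, where each segment is assumed to be a set of vertex indices and scores holds its average shape measure; this is our own rendering of the procedure rather than the authors' code.

```python
def non_max_suppression(segments, scores, iou_threshold=0.7):
    """Greedily keep the most object-like of near-duplicate segments.

    segments: list of sets of vertex indices; scores: objectness scores.
    Returns indices of the segments that survive the suppression.
    """
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        duplicate = False
        for j in keep:
            inter = len(segments[i] & segments[j])
            union = len(segments[i] | segments[j])
            if union and inter / union > iou_threshold:
                duplicate = True
                break
        if not duplicate:
            keep.append(i)
    return keep
```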
Fig. 4. Example of one of the segmentations of a scene. At this threshold, some objects are correctly identified while others, such as the headphones and monitor, are over-segmented.
B. Objectness measures
Every segment identified during the segmentation step is evaluated using six objectness measures: five shape measures that are evaluated on every segment individually and a shape recurrence measure. The recurrence measure is inspired by prior work [23], [22] that has identified the repeated presence of a piece of geometry across space or time as evidence for objectness. We now explain all measures in more detail.
Compactness rewards segments that contain structure in a compact volume. Intuitively, this captures the bias of most objects towards being approximately spherical. We quantify this notion by computing the ratio of the total surface area of the segment's mesh to the surface area of its smallest bounding sphere.
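The sketch below shows one way this ratio could be computed from a triangle mesh; for brevity the bounding sphere is centered at the centroid with radius set by the farthest vertex, which only approximates the smallest enclosing sphere referred to in the text.

```python
import numpy as np

def compactness(vertices, triangles):
    """Ratio of mesh surface area to the area of an enclosing sphere (sketch).

    vertices: (N, 3) float array; triangles: (M, 3) integer index array.
    """
    a, b, c = (vertices[triangles[:, k]] for k in range(3))
    mesh_area = 0.5 * np.linalg.norm(np.cross(b - a, c - a), axis=1).sum()
    center = vertices.mean(axis=0)
    radius = np.linalg.norm(vertices - center, axis=1).max()
    sphere_area = 4.0 * np.pi * radius ** 2
    return mesh_area / sphere_area
```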
Symmetry. Objects often exhibit symmetries, and their role in visual perception has been explored in psychology [27]. Since the computational complexity of our method is a design consideration, we only consider evaluating reflective symmetry along the three principal axes of each segment. More specifically, we reflect the segment along a principal axis and measure the overlap between the original segment and its reflection. That is, denoting Λ = λ_x + λ_y + λ_z to be the sum of the eigenvalues of the scatter matrix, and r_x, r_y, r_z to be the extent of the segment along each of its principal axes, we calculate the symmetry of a cloud C as:
\[
\mathrm{Symmetry}(C) = \sum_{d \in \{x,y,z\}} \frac{\lambda_d}{\Lambda} \left[ O(C, C_d, r_d) + O(C_d, C, r_d) \right]
\]
where C_d denotes the reflection of cloud C along direction d.
The one-sided overlap O between two clouds is calculated by summing up the difference in the position and direction of the normal from a point in one cloud to its nearest neighbor in the other:

\[
O(C_1, C_2, r) = \sum_{i=1}^{|C_1|} \left[ \frac{1}{r} \left\lVert p^{C_1}_i - p^{C_2}_{N(C_2, p_i)} \right\rVert + \beta \left( 1 - n^{C_1}_i \cdot n^{C_2}_{N(C_2, p_i)} \right) \right]
\]

Fig. 3. A visualization of our algorithm: every 3D input mesh is over-segmented into a large collection of segments. Each segment is ranked using our objectness measures and the final ranking is computed. The last image highlights the top 5 objects found in the example scene.
where p^C_i denotes the ith point in cloud C, n^C_i is similarly the normal at point p_i, and N(C, p) evaluates to the index of the point closest to p in cloud C. Note that r is used to normalize the distances based on the segment's absolute size. Finally, β is a tunable parameter that trades off the relative importance of the two contributions (we use β = 0.2).
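Putting the two formulas together, a rough Python rendering of the symmetry measure might look as follows; it assumes the segment has already been rotated into its principal-axis frame and uses a KD-tree for the nearest-neighbor lookups, both of which are implementation choices of this sketch rather than details from the text.

```python
import numpy as np
from scipy.spatial import cKDTree

def one_sided_overlap(p1, n1, p2, n2, r, beta=0.2):
    """O(C1, C2, r): summed position and normal disagreement from C1 to C2."""
    dist, idx = cKDTree(p2).query(p1)                 # nearest neighbors in C2
    return np.sum(dist / r + beta * (1.0 - np.sum(n1 * n2[idx], axis=1)))

def symmetry(points, normals, beta=0.2):
    """Reflective symmetry along the principal axes (lower = more symmetric).

    Assumes points/normals are expressed in the principal-axis frame, so the
    per-axis variances stand in for the scatter-matrix eigenvalues lambda_d.
    """
    centered = points - points.mean(axis=0)
    lam = centered.var(axis=0)
    extents = centered.max(axis=0) - centered.min(axis=0)
    score = 0.0
    for d in range(3):
        p_ref = centered.copy(); p_ref[:, d] *= -1.0  # reflect positions
        n_ref = normals.copy();  n_ref[:, d] *= -1.0  # reflect normals
        score += (lam[d] / lam.sum()) * (
            one_sided_overlap(centered, normals, p_ref, n_ref, extents[d], beta) +
            one_sided_overlap(p_ref, n_ref, centered, normals, extents[d], beta))
    return score
```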
Smoothness stipulates that every point on the mesh should have mass spread out uniformly around it. Intuitively, the presence of thin regions will cause a segment to score low, while neatly connected surfaces will score high. To compute the value of this measure at a single point p, we first project points in a local neighborhood around p to the tangent plane defined by its normal. Next, we quantize the angle of the projected points in the local 2D coordinate system into b bins and compute the entropy of the distribution. Here, high entropy indicates high smoothness. We repeat this procedure at each point and average the result across all points in the segment. In practice, we use b = 8 and local neighborhoods with radius 1cm.
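One possible implementation of this per-point entropy computation is sketched below; the choice of in-plane basis vectors and the handling of sparse neighborhoods are simplifications of ours.

```python
import numpy as np
from scipy.spatial import cKDTree

def smoothness(points, normals, radius=0.01, bins=8):
    """Average angular-entropy smoothness over all points (sketch)."""
    tree = cKDTree(points)
    scores = []
    for p, n in zip(points, normals):
        idx = tree.query_ball_point(p, radius)
        if len(idx) < 3:                 # too few neighbors to form a histogram
            scores.append(0.0)
            continue
        # Build an arbitrary orthonormal basis (u, v) of the tangent plane at p.
        helper = np.array([1.0, 0, 0]) if abs(n[0]) < 0.9 else np.array([0, 1.0, 0])
        u = np.cross(n, helper); u /= np.linalg.norm(u)
        v = np.cross(n, u)
        d = points[idx] - p
        angles = np.arctan2(d @ v, d @ u)        # angle of each projected neighbor
        hist, _ = np.histogram(angles, bins=bins, range=(-np.pi, np.pi))
        prob = hist[hist > 0] / hist.sum()
        scores.append(-np.sum(prob * np.log(prob)))   # entropy of the angle bins
    return float(np.mean(scores))
```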
Local Convexity. Surfaces of objects are often made up of locally convex regions. We determine the convexity of each polygon edge as given by the predicate in Equation 1 and score each segment by the percentage of its edges that are convex.
Global Convexity. Visual perception studies have shown that the human visual system uses a global convexity prior when inferring 3D shape [28], [29]. Taking inspiration from these results, we also consider measuring the degree to which an object's convex hull is an approximation of the object's shape. To evaluate this measure, we compute the convex hull and record the average distance from a point on the object to the closest point on the convex hull.
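One way to approximate this quantity is to measure, for every point, the distance to the nearest supporting plane of the convex hull, as in the following sketch; this plane-based distance is a convenient stand-in for the exact point-to-hull distance.

```python
import numpy as np
from scipy.spatial import ConvexHull

def global_convexity(points):
    """Average distance from each point to the convex hull surface (sketch)."""
    hull = ConvexHull(points)
    normals = hull.equations[:, :3]        # outward unit facet normals from Qhull
    offsets = hull.equations[:, 3]
    # n.x + d <= 0 for interior points, so |n.x + d| is the distance to each facet plane.
    signed = points @ normals.T + offsets  # (num_points, num_facets)
    return float(np.abs(signed).min(axis=1).mean())
```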
Recurrence. Segments that are commonly found in other scenes are more likely to be objects than segmentation artifacts. Thus, for every segment we measure the average distance to the top k most similar segments in other scenes. In our experiments, we use k = 10.
There are several approaches one could use to quantify the distance between two segments. Prior work [22] has proposed computing local features on every object and using RANSAC followed by the Iterative Closest Point algorithm to compute a rigid alignment. However, we found this strategy to be computationally too expensive. In Computer Vision, a standard approach is to compute visual bag-of-words representations from FPFH features [30] or spin images and match them using chi-squared kernels, but we found that while this approach gave reasonable results, it was also computationally too expensive.
To keep the computational costs low, we found it sufficient to retrieve segments of comparable size that have similar shape measures. Concretely, to retrieve the most similar segments to a given query segment, we consider all segments whose extent along each principal direction is within 25% of the query's, and measure the Euclidean distance between their normalized shape measures. Each measure is normalized to be zero mean and unit variance during the retrieval. As a result, our recurrence measure does not enforce exact alignment but merely identifies segments that have commonly occurring statistical properties, as defined by our shape measures. Examples of nearest neighbor retrievals with this measure can be seen in Figure 6.
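The retrieval just described can be sketched as follows; the size filter interprets 'within 25%' as a per-axis tolerance on the principal-axis extents, which is our reading of the text, and all array layouts are illustrative.

```python
import numpy as np

def recurrence_scores(measures, extents, scene_ids, k=10, size_tol=0.25):
    """Recurrence score per segment: mean distance to the k nearest matches.

    measures: (N, 5) shape measures, extents: (N, 3) principal-axis extents,
    scene_ids: (N,) scene index per segment. Lower scores = more recurrent.
    """
    z = (measures - measures.mean(axis=0)) / measures.std(axis=0)
    scores = np.full(len(z), np.inf)
    for i in range(len(z)):
        similar_size = np.all(np.abs(extents - extents[i]) <= size_tol * extents[i], axis=1)
        candidates = similar_size & (scene_ids != scene_ids[i])   # other scenes only
        if not candidates.any():
            continue
        d = np.linalg.norm(z[candidates] - z[i], axis=1)
        scores[i] = np.sort(d)[:k].mean()
    return scores
```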
Fig. 6. We show a query segment (left) and its 10 closest matches among all segments (right). These segments are retrieved from the entire set of 1836 segments across 58 scenes. Note that the third row contains all cups, the 5th row all mice, and the 8th row mostly staplers.
Fig. 5. In every example scene above we highlight the top few object hypotheses, using the linear SVM as the predictor.
C. Data-driven combination
We consider several options for combining the proposed
measures into one objectness score: Simple averaging, Naive
Bayes, Linear Support Vector Machine, RBF Kernel Support
Vector Machine, and Nearest Neighbor.
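As a minimal sketch of two of these options, the unsupervised average and a linear SVM could be implemented as follows; scikit-learn is used purely for illustration and the variable names are ours.

```python
import numpy as np
from sklearn.svm import LinearSVC

def averaged_objectness(X):
    """Unsupervised combination: average the z-score normalized measures."""
    return X.mean(axis=1)

def linear_svm_objectness(X_train, y_train, X_test):
    """Supervised combination: decision values of a linear SVM."""
    clf = LinearSVC(C=1.0).fit(X_train, y_train)
    return clf.decision_function(X_test)
```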
To obtain ground truth training labels, we manually an-
notated all extracted segments as being an object or not.
The labeling protocol we used is as follows. A segment
is annotated as an object when it is an exact and full
segmentation of a semantically interpretable part of the
scene. If the segment contains surrounding clutter in addition
to the object, it is marked false. If a segment is only an object
part that does not normally occur in the world in isolation,
it is also marked false (for example, the top part of a stapler,
the keypad on a telephone, the cover page of a book, etc.).
V. RESULTS
We evaluated our method on the dataset described in Section III. Over-segmentation of all 58 scenes leads to a total of 1836 segments, of which we identified 303 as objects using the labeling protocol described in Section IV-C.
We treat the task of identifying objects as a binary classification problem. To construct the data matrix we concatenate all measures into a 1836 × 6 matrix and normalize each column to be zero mean and unit variance. Next, we randomly assign half of the data to the training set and half to the testing set. We perform 5-fold cross-validation on all classifier parameters using grid search. The entire process is repeated 20 times for different random splits of the data and the average result is reported. Quantitative analysis of the performance is shown in Figure 7. Example results for object hypotheses can be seen visually in Figure 5.
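The evaluation protocol can be summarized by the sketch below, which mirrors the repeated 50/50 splits and cross-validated grid search described above; the parameter grid and the use of scikit-learn are our own choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import average_precision_score
from sklearn.svm import SVC

def repeated_split_ap(X, y, repeats=20, seed=0):
    """Mean Average Precision of an RBF-SVM over repeated random 50/50 splits."""
    grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
    aps = []
    for r in range(repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.5, random_state=seed + r, stratify=y)
        clf = GridSearchCV(SVC(kernel="rbf"), grid, cv=5).fit(X_tr, y_tr)
        aps.append(average_precision_score(y_te, clf.decision_function(X_te)))
    return float(np.mean(aps))
```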
Limitations. The system is capable of reliably distinguishing objects once they are identified as potential object candidates, but there are a few common failure cases that cause the system to incorrectly miss an object candidate during the over-segmentation stage:
3D mesh reconstruction: A few failure cases are tied directly to the mesh reconstruction step. Due to the limited resolution of Kinect Fusion's volumetric representation, small neighboring objects may be fused together, causing the algorithm to undersegment these regions. Moreover, RGB-D sensors do not handle transparent objects well, but transparent objects (bottles, plastic cups, glass tables) are relatively frequent in regular scenes. This can lead to noisy segments with large discontinuities in the reconstruction that cause our algorithm to over-segment these regions. Lastly, the resolution of the Marching Cubes reconstruction is limited by GPU memory. Low-resolution reconstruction can cause thin objects such as paper notebooks or thin keyboards to fuse into their supporting plane and not get discovered.
Non-maximum suppression: An object part that occupies a large fraction of the entire object can be judged by the algorithm to be much more object-like, which can cause the algorithm to incorrectly reject the entire object as a candidate. For instance, the two ear pieces of a pair of headphones tend to appear more object-like in isolation than when connected by a thin plastic band. Similarly, the cylindrical portion of a mug often appears more object-like than it would with the handle attached.
Fig. 7. Precision vs. Recall curves for objectness measures and their combinations. Color-coded bar plots show Average Precisions.
Segmentation algorithm: Our segmentation algorithm is
a compromise between speed and accuracy. Due to its
limitations, it is particularly prone to over-segmenting
extended objects that contain intricate structure. An
example of such an object is a plant with many leaves. In
addition, the segmentation algorithm will never consider
joining two pieces that are not physically connected.
For example, a transparent bottle with some liquid can easily become two disconnected segments: the body and the floating cap. As a result, the algorithm will never consider joining these segments into one candidate object.
To estimate the extent of the problem quantitatively, we
manually analyzed the recall of the system by counting the
number of objects in each scene that should reasonably be identified as objects. We count on the order of 400 objects
present in our dataset. Since we have 303 positive labels, we
estimate the recall of the system to be roughly 75%. Figure
8 illustrates examples of failure cases visually.
Quantitative analysis. As can be seen in Figure 7, the individual measures perform relatively poorly alone, but
their combinations achieve impressive levels of performance.
Moreover, it is interesting to note that even an unsupervised
combination of our measures by means of simple averag-
ing performs competitively: the top performer (RBF kernel
SVM) achieves 0.92 Average Precision, while averaging
achieves 0.86.
We further seek to understand the contribution of individual measures by repeating the entire experiment with and without them. We use the RBF kernel SVM for these experiments as it works best on our data. First, removing Recurrence decreases the performance of the system from 0.92 to 0.90 AP. Removing Symmetry, Local Convexity, or Global Convexity similarly decreases performance by 2-3 points, but removing Compactness or Smoothness decreases performance more significantly, to 0.85 and 0.82 respectively. This hints that Compactness and Smoothness may be two of our strongest measures. However, using Compactness and Smoothness alone only achieves 0.81 AP, which indicates that the other measures still contribute meaningful information to the final result.
Fig. 8. Examples of limitations. 1: Only the main part of the headphones will be identified as a candidate object. 2: Cups are fused and get segmented together as a single object candidate. 3: The armrest of the chair will be incorrectly identified as a strong object. 4: Due to transparency, the top will appear to be floating and gets disconnected from the bottle. 5: The plant is too intricate and contains too much variation to be selected as a single object. 6: The interior of the cup will be selected as a separate segment because the curvature changes too quickly around its rim.
Computational complexity. As motivated in the introduction, an important consideration in the design of our algorithm is its computational complexity.
Fig. 9. Confusion matrix for the RBF Kernel SVM.
Asymptotic analysis. Denoting N to be the number of scenes and S to be the average number of segments per scene (in this work N = 58 and S = 31), the complexity of the method is O(N) for mesh reconstruction, O(SN) to evaluate the shape measures on all segments individually, and O((SN)^2) to evaluate the recurrence measure. Even though a naive implementation of the recurrence measure is quadratic in the total number of segments, it is empirically the most efficient measure to compute on a dataset of our size. Efficient k-nearest-neighbor algorithms such as FLANN [31] can be used to further speed up the retrieval process.
Kinect Fusion. We computed the 3D meshes using the open-source Kinect Fusion implementation [3] on an 8-core 2.2GHz laptop with a GTX 570M GPU. The process of converting the RGB-D video into a 3D mesh took 2 minutes per scene on average.
Measure computation. We further report computational
time for an average scene with 200,000 vertices and 400,000
polygons on a 2.8GHz workstation, using a single-threaded
implementation in C++:
Step                  Time (s)
Over-segmentation     1.5
Compactness           0.1
Symmetry              3.8
Global Convexity      13.3
Local Convexity       1.3
Smoothness            2.5
Recurrence            0.1
Total                 25
The entire 58 scene dataset can therefore be processed in
about 25 minutes. As can be seen in the table above, the
global convexity measure is by far the slowest step as it
requires computing the convex hull.
VI. CONCLUSION
We presented an approach for object discovery in a collection of 3D meshes. Our algorithm is computationally efficient (running at about 25 seconds per scene on an average
computer) and can process scenes independently and online.
The core of the method relies on a set of proposed objectness
measures that evaluate how likely a single mesh segment
is to be an object. We demonstrated that these measures
can be averaged to reliably identify objects in scenes and
showed that a supervised combination can further increase
performance up to 0.92 Average Precision. We released a
dataset of 58 challenging environments to the public.
We estimated the overall recall of the system to be around
75% and qualitatively analyzed sources of error. The most
common sources of error can be traced to limitations in data
acquisition when dealing with transparent materials and the
resolution of the resulting 3D mesh. While the simplicity of
the segmentation algorithm allowed us to process scenes at
very fast rates (segmenting an entire scene 10 times using
different thresholds in 1.5 seconds), a more sophisticated
formulation is necessary to ensure that complicated objects
(such as the plant example in Figure 8) are segmented as a
single candidate object.
Future work includes increasing the recall of the system
by improving the segmentation stage of the algorithm and
by reasoning about segments in the context of the scene in
which they were found.
ACKNOWLEDGMENT
This research is partially supported by an Intel ISTC
research grant. Stephen Miller is supported by the Hertz
Foundation Google Fellowship and the Stanford Graduate
Fellowship.
REFERENCES
[1] R. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon, "KinectFusion: Real-time dense surface mapping and tracking," in Mixed and Augmented Reality (ISMAR), 2011 10th IEEE International Symposium on. IEEE, 2011, pp. 127–136.
[2] T. Whelan, J. McDonald, M. Kaess, M. Fallon, H. Johannsson, and J. Leonard, "Kintinuous: Spatially extended KinectFusion," in RSS Workshop on RGB-D: Advanced Reasoning with Depth Cameras, Sydney, Australia, Jul. 2012.
[3] R. B. Rusu and S. Cousins, "3D is here: Point Cloud Library (PCL)," in IEEE International Conference on Robotics and Automation (ICRA), Shanghai, China, May 9-13 2011.
[4] H. Arora, N. Loeff, D. Forsyth, and N. Ahuja, "Unsupervised segmentation of objects using efficient learning," in Computer Vision and Pattern Recognition, 2007. CVPR '07. IEEE Conference on. IEEE, 2007, pp. 1–7.
[5] M. Fritz and B. Schiele, "Decomposition, discovery and detection of visual categories using topic models," in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008, pp. 1–8.
[6] G. Kim, C. Faloutsos, and M. Hebert, "Unsupervised modeling of object categories using link analysis techniques," in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008, pp. 1–8.
[7] B. Russell, W. Freeman, A. Efros, J. Sivic, and A. Zisserman, "Using multiple segmentations to discover objects and their extent in image collections," in Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 2. IEEE, 2006, pp. 1605–1614.
[8] N. Payet and S. Todorovic, "From a set of shapes to object discovery," Computer Vision – ECCV 2010, pp. 57–70, 2010.
[9] J. Sivic, B. Russell, A. Efros, A. Zisserman, and W. Freeman, "Discovering objects and their location in images," in Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, vol. 1. IEEE, 2005, pp. 370–377.
[10] J. Sivic, B. Russell, A. Zisserman, W. Freeman, and A. Efros, "Unsupervised discovery of visual object class hierarchies," in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008, pp. 1–8.
[11] M. Weber, M. Welling, and P. Perona, "Towards automatic discovery of object categories," in Computer Vision and Pattern Recognition, 2000. Proceedings. IEEE Conference on, vol. 2. IEEE, 2000, pp. 101–108.
[12] I. Endres and D. Hoiem, "Category independent object proposals," Computer Vision – ECCV 2010, pp. 575–588, 2010.
[13] S. Vicente, C. Rother, and V. Kolmogorov, "Object cosegmentation," in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 2217–2224.
[14] H. Kang, M. Hebert, and T. Kanade, "Discovering object instances from scenes of daily living," in Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011, pp. 762–769.
[15] M. Cho, Y. Shin, and K. Lee, "Unsupervised detection and segmentation of identical objects," in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 1617–1624.
[16] T. Tuytelaars, C. Lampert, M. Blaschko, and W. Buntine, "Unsupervised object discovery: A comparison," International Journal of Computer Vision, vol. 88, no. 2, pp. 284–302, 2010.
[17] B. Alexe, T. Deselaers, and V. Ferrari, "What is an object?" in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 73–80.
[18] F. Endres, C. Plagemann, C. Stachniss, and W. Burgard, "Unsupervised discovery of object classes from range data using latent Dirichlet allocation," in Proc. of Robotics: Science and Systems, 2009.
[19] R. Triebel, J. Shin, and R. Siegwart, "Segmentation and unsupervised part-based discovery of repetitive objects," in Robotics: Science and Systems, vol. 2, 2010.
[20] J. Shin, R. Triebel, and R. Siegwart, "Unsupervised 3D object discovery and categorization for mobile robots," in Proc. of the 15th International Symposium on Robotics Research (ISRR), 2011.
[21] A. Collet, S. Srinivasa, and M. Hebert, "Structure discovery in multi-modal data: a region-based approach," in Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE, 2011, pp. 5695–5702.
[22] J. Shin, R. Triebel, and R. Siegwart, "Unsupervised discovery of repetitive objects," in Robotics and Automation (ICRA), 2010 IEEE International Conference on. IEEE, 2010, pp. 5041–5046.
[23] E. Herbst, P. Henry, X. Ren, and D. Fox, "Toward object discovery and modeling via 3-D scene comparison," in Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE, 2011, pp. 2623–2629.
[24] E. Herbst, X. Ren, and D. Fox, "RGB-D object discovery via multi-scene analysis," in Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on. IEEE, 2011, pp. 4850–4856.
[25] P. Felzenszwalb and D. Huttenlocher, "Efficient graph-based image segmentation," International Journal of Computer Vision, vol. 59, no. 2, pp. 167–181, 2004.
[26] F. Moosmann, O. Pink, and C. Stiller, "Segmentation of 3D lidar data in non-flat urban environments using a local convexity criterion," in Intelligent Vehicles Symposium, 2009 IEEE, June 2009, pp. 215–220.
[27] S. Palmer, "The role of symmetry in shape perception," Acta Psychologica, vol. 59, no. 1, pp. 67–90, 1985.
[28] M. Langer and H. Bulthoff, "A prior for global convexity in local shape-from-shading," Perception, vol. 30, no. 4, pp. 403–410, 2001.
[29] D. Kersten, P. Mamassian, and A. Yuille, "Object perception as Bayesian inference," Annu. Rev. Psychol., vol. 55, pp. 271–304, 2004.
[30] R. B. Rusu, N. Blodow, and M. Beetz, "Fast point feature histograms (FPFH) for 3D registration," in IEEE International Conference on Robotics and Automation (ICRA), Kobe, Japan, May 2009. [Online]. Available: http://files.rbrusu.com/publications/Rusu09ICRA.pdf
[31] M. Muja and D. G. Lowe, "Fast approximate nearest neighbors with automatic algorithm configuration," in International Conference on Computer Vision Theory and Application (VISSAPP'09). INSTICC Press, 2009, pp. 331–340.