1 Introduction
The direct integration of k-nearest neighbors (kNN) with support vector machines (SVM) has been proposed in [1]. The algorithm, which belongs to the class of local learning algorithms [2], is called kNNSVM, and it builds a maximal margin classifier on the neighborhood of a test sample in the feature space induced by a kernel function. Theoretically, it permits better generalization power than SVM because, like all local learning algorithms, the locality parameter allows finding a lower minimum of the guaranteed risk [3,4], and because, for some values of k, it can have a lower radius/margin bound [5]. It has been successfully applied to remote sensing tasks [1] and to 13 small benchmark datasets [6], confirming the potential of this approach. kNNSVM can be seen as a method for integrating locality into kernel methods that is compatible with the traditional strategy of using local non-stationary kernel functions [7], and it is particularly indicated for problems that are not high-dimensional, i.e. for data requiring some non-linear mapping (kernel) to be successfully tackled.
The main drawback of the original idea of Local SVM concerns its computational performance. The prediction phase is in fact very slow, since for each query point it is necessary to select its k-nearest neighbors and to train a specific SVM on them before performing the classification. A similar method has been independently proposed in [8]; there, however, the distance function for the kNN operations is computed in the input space and is approximated with a "crude" distance metric in order to improve the computational performance.
In this work we develop a fast local support vector machine classifier, called FastLSVM, introducing various modifications to the Local SVM approach in order to make it scalable and thus suitable for large datasets. Unlike [8], we maintain the feature-space metric for the nearest neighbor operations and we do not adopt any approximation of the distance function, and thus of the neighborhood selection. We aim, in fact, to remain as close as possible to the original formulation of kNNSVM in order to preserve its theoretical and empirical advantages over SVM. Moreover, our intuition is that, in general, as the number of samples in the training set increases, the positive effect of locality on classification accuracy also increases. Roughly speaking, the idea is to precompute a set of local SVMs covering (with redundancy) the whole training set and to apply to a query point the model to which its nearest neighbor in the training set has been assigned. The training time complexity analysis reveals that the approach is asymptotically faster than state-of-the-art accurate SVM solvers, and the training of the local models can be very easily parallelized. Notice that scalability is particularly appealing for the local SVM approach also because, in our intuition, locality can play a more crucial role as the problem becomes larger and larger and the ideal decision function becomes complex and highly non-linear.
The source code of FastLSVM is part of the Fast Local Kernel Machine Library (FaLKM-lib) [9], freely available for research and education purposes; the FastLSVM implementation we use in this work is a preliminary version of the FaLK-SVM classifier available in FaLKM-lib.
In the rest of the introduction we briefly review the related work and the main
topics necessary to understand the FastLSVM approach discussed in Section 2.
Section 3 details the experimental evaluation we conducted before drawing some
conclusions and discussing further extensions in Section 4.
very peripheral regions of the local models. Moreover, the clusters have class balancing problems and their size cannot be controlled, so the SVM optimization problems are not assured to be small enough. The computational performance (only empirically tested on a small dataset) is in fact much worse than that of SVM (although better than that of their local approach) and seems to degrade asymptotically much faster than SVM.
Multiple approaches have been proposed to overcome the SVM computational limitations on large datasets by approximating the traditional approach. Two of the most popular and effective techniques are Core Vector Machines (CVM) [11], based on minimum enclosing ball algorithms, and LaSVM [12], which introduces an online support vector removal step in the optimization. Other proposed approaches are based on parallel mixtures of SVMs trained on subsets of the training set [13,14], on using editing or clustering techniques to select the most informative samples [15], on training SVMs between the clusters of different classes nearest to the query point [16], and on parallel algorithms for the training phase [17,18].
Recently, very fast algorithms have been proposed for linear SVM, like SVMPerf [19] and LibLinear [20]. However, we focus here on large datasets that are not high-dimensional, for which the use of a non-linear kernel is crucial. It is important to underline, however, that what we propose here is not a method to approximate SVM in order to enhance performance. Our main purpose is to make kNNSVM, which has been shown to be more accurate than SVM on small datasets, suitable for large-scale problems. Indirectly, since the method is asymptotically faster than SVM, it can be seen as an alternative to SVM for large datasets on which traditional SVM algorithms cannot be directly applied.
In this way, $x_{r_x(j)}$ is the point of the set $X$ in the $j$-th position in terms of distance from $x$, namely its $j$-th nearest neighbor, $\|x_{r_x(j)} - x\|$ is its distance from $x$, and $y_{r_x(j)}$ is its class, with $y_{r_x(j)} \in \{-1, 1\}$. In other terms: $j < k \Rightarrow \|x_{r_x(j)} - x\| \leq \|x_{r_x(k)} - x\|$. With this definition, the majority decision rule of kNN for binary classification is defined by

$$kNN(x) = \operatorname{sign}\Big( \sum_{i=1}^{k} y_{r_x(i)} \Big).$$
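A minimal sketch of this ordering and majority rule (illustrative only, not the paper's code) could look as follows in Python; note that `np.sign` returns 0 on a tie, so a real implementation would need a tie-breaking policy.

```python
import numpy as np

def knn_majority(X, y, x, k):
    """Majority rule of kNN: order the training points by distance from the
    query x (the ordering r_x above) and sum the labels of the first k.
    y is assumed to take values in {-1, +1}."""
    r_x = np.argsort(np.linalg.norm(X - x, axis=1))  # r_x(1), ..., r_x(n)
    return int(np.sign(y[r_x[:k]].sum()))
```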
SVMs [21] are classifiers with sound foundations in statistical learning theory [4]. The decision rule is $SVM(x) = \operatorname{sign}(\langle w, \Phi(x) \rangle_{\mathcal{H}} + b)$, where $\Phi : \mathbb{R}^p \to \mathcal{H}$ is a mapping into a transformed Hilbert feature space $\mathcal{H}$ with inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$. The parameters $w \in \mathcal{H}$ and $b \in \mathbb{R}$ are such that they minimize an upper bound on the expected risk while minimizing the empirical risk. The empirical risk is controlled through the set of constraints $y_i(\langle w, \Phi(x_i) \rangle_{\mathcal{H}} + b) \geq 1 - \xi_i$ with $\xi_i \geq 0$, $i = 1, \dots, n$, where $y_i \in \{-1, +1\}$ is the class label of the $i$-th training sample. The presence of the slack variables $\xi_i$ allows some misclassification on the training set. Reformulating such an optimization problem with Lagrange multipliers $\alpha_i$ ($i = 1, \dots, n$), and introducing a positive definite (PD) kernel function¹ $K(\cdot, \cdot)$ that substitutes the scalar product in the feature space $\langle \Phi(x_i), \Phi(x) \rangle_{\mathcal{H}}$, the decision rule can be expressed as

$$SVM(x) = \operatorname{sign}\Big( \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b \Big).$$
PD kernels avoid the explicit definition of $\mathcal{H}$ and $\Phi$ [22]; the most popular are the linear (LIN) kernel $k^{lin}(x, x') = \langle x, x' \rangle$, the radial basis function (RBF) kernel $k^{rbf}(x, x') = \exp\big(-\|x - x'\|^2 / \sigma\big)$, where $\sigma$ is a positive constant, and the inhomogeneous polynomial (IPOL) kernel $k^{ipol}(x, x') = (\langle x, x' \rangle + 1)^d$, where $d$ is the degree of the kernel. SVM has been shown to have important generalization properties and nice bounds on the VC dimension [4].
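For concreteness, the three kernels can be written down directly from the definitions above (with the minus sign in the RBF exponent); this is a plain sketch, not library code.

```python
import numpy as np

def k_lin(x, x2):
    """Linear kernel <x, x'>."""
    return float(np.dot(x, x2))

def k_rbf(x, x2, sigma=1.0):
    """RBF kernel exp(-||x - x'||^2 / sigma), with sigma > 0."""
    return float(np.exp(-np.linalg.norm(x - x2) ** 2 / sigma))

def k_ipol(x, x2, d=2):
    """Inhomogeneous polynomial kernel (<x, x'> + 1)^d."""
    return float((np.dot(x, x2) + 1.0) ** d)
```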
Computationally, an accurate solver for SVM takes $O(n^2)$ time for computing the kernel values, $O(n^3)$ time for solving the problem and $O(n^2)$ space for storing the kernel values, as discussed in [11,23]; empirical evidence highlights that modern accurate SVM solvers like LibSVM [24] scale effectively between $n^2$ and $n^3$, depending mainly on $C$ (the higher the value of $C$, the closer the scaling to $n^3$). Approximate solutions (see Section 1.1) can of course lower the computational complexity.
¹ We refer to kernel functions with $K$ and to the number of nearest neighbors with $k$.
where $r_x : \{1, \dots, n\} \to \{1, \dots, n\}$ is a function that reorders the indexes of the training points as follows:

$$
\begin{cases}
\; r_x(1) = \underset{i = 1, \dots, n}{\operatorname{argmin}} \; \|\Phi(x_i) - \Phi(x)\|^2 \\[6pt]
\; r_x(j) = \underset{\substack{i = 1, \dots, n \\ i \neq r_x(1), \dots, r_x(j-1)}}{\operatorname{argmin}} \; \|\Phi(x_i) - \Phi(x)\|^2 \quad \text{for } j = 2, \dots, n
\end{cases}
$$
In this way, $x_{r_x(j)}$ is the point of the set $X$ in the $j$-th position in terms of distance from $x$, and thus $j < k \Rightarrow \|\Phi(x_{r_x(j)}) - \Phi(x)\| \leq \|\Phi(x_{r_x(k)}) - \Phi(x)\|$. The distance is computed through the kernel as $\|\Phi(x) - \Phi(x')\|^2 = \langle \Phi(x), \Phi(x) \rangle_{\mathcal{H}} + \langle \Phi(x'), \Phi(x') \rangle_{\mathcal{H}} - 2 \cdot \langle \Phi(x), \Phi(x') \rangle_{\mathcal{H}} = K(x, x) + K(x', x') - 2 \cdot K(x, x')$. If the kernel is the RBF kernel or any polynomial kernel of degree 1, the ordering function can be built using the Euclidean metric. For non-linear kernels (other than the RBF kernel) the ordering function can be quite different from the one produced using the Euclidean metric. The decision rule of this method is:
$$kNNSVM(x) = \operatorname{sign}\Big( \sum_{i=1}^{k} \alpha_{r_x(i)} y_{r_x(i)} K(x_{r_x(i)}, x) + b \Big) \qquad (1)$$

For $k = n$, kNNSVM becomes the usual SVM, whereas for $k = 2$ with the LIN or RBF kernels it corresponds to the NN classifier. The method is computationally expensive because, for each test point, it computes the kNN in $\mathcal{H}$, trains an SVM and finally performs the SVM prediction. Implementing kNN simply by sorting the distances, kNNSVM takes $O\big((n \log n + k^3) \cdot m\big)$ time for $m$ testing samples.
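The whole procedure can be sketched as follows. This is an illustrative stand-in rather than the FkNNSVM implementation: it relies on scikit-learn's `SVC` (which wraps LibSVM) with a Gram-matrix callable kernel, and it assumes that both classes appear among the k neighbors of the query.

```python
import numpy as np
from sklearn.svm import SVC  # scikit-learn wraps LibSVM

def feature_space_order(X, x, kernel):
    """Indexes of X sorted by distance from x in the feature space of `kernel`
    (a callable returning Gram matrices), i.e. the ordering r_x above."""
    # ||Phi(x) - Phi(x_i)||^2 = K(x, x) + K(x_i, x_i) - 2 K(x, x_i)
    k_xx = kernel(x[None, :], x[None, :])[0, 0]
    k_ii = np.diag(kernel(X, X))     # O(n^2) just for the diagonal: fine for a sketch
    k_xi = kernel(x[None, :], X)[0]
    return np.argsort(k_xx + k_ii - 2.0 * k_xi)

def knnsvm_predict(X, y, x, k, kernel, **svc_params):
    """Decision rule (1): train an SVM on the k nearest neighbors (in feature
    space) of the query point x and use it to classify x itself."""
    r_x = feature_space_order(X, x, kernel)
    local = SVC(kernel=kernel, **svc_params)   # fails if the k neighbors are single-class
    local.fit(X[r_x[:k]], y[r_x[:k]])
    return local.predict(x[None, :])[0]
```

A Gram-matrix callable for the RBF kernel defined above could be, for instance, `lambda A, B: np.exp(-cdist(A, B, "sqeuclidean") / sigma)` with `cdist` from `scipy.spatial.distance`.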
Like all local learning algorithms, kNNSVM states the learning problem in a different setting, as detailed in [3]. Basically, instead of estimating a global decision function with the aim of minimizing the probability of error on all possible unseen samples, kNNSVM tries to estimate a decision function that maximizes the probability of correctly labelling a given test point. Notice that for kNN (the simplest local learning algorithm) this learning statement is crucial, because the majority rule is effective only locally (globally it reduces to predicting the class with the highest cardinality). With respect to a global SVM, the possibility of estimating a different maximal margin hyperplane for each test point can thus achieve a lower probability of misclassification on the whole test set. These considerations are formalized in the theory of local structural risk minimization for local learning algorithms [3], which is a generalization of structural risk minimization [4]. The main idea is that, in addition to the complexity of the class of possible functions and of the function itself, the choice of the locality parameter ($k$ for kNNSVM) can help to lower the guaranteed risk.
An implementation of kNNSVM, called FkNNSVM, is included in the freely available Fast Local Kernel Machine Library (FaLKM-lib) [9].
the decision rule of kNNSVM considering the case in which the local model is trained on a set of points that are the $k$-nearest neighbors of a point that, in general, is different from the query point. A modified decision function for a query point $x$ and another (possibly different) point $t$ is:

$$kNNSVM_t(x) = \operatorname{sign}\Big( \sum_{i=1}^{k} \alpha_{r_t(i)} y_{r_t(i)} K(x_{r_t(i)}, x) + b \Big)$$

where $r_t(i)$ is the kNNSVM ordering function (see above) and $\alpha_{r_t(i)}$ and $b$ come from the training of an SVM on the $k$-nearest neighbors of $t$ in the feature space. In the following we will refer to $kNNSVM_t(x)$ as being centered in $t$, and to $t$ as the center of the model. The original decision function of kNNSVM corresponds to the case in which $t = x$, and thus $kNNSVM_x(x) = kNNSVM(x)$.
With the previous modification of kNNSVM we made the prediction step much more computationally efficient, but a considerable overhead is added to the training phase. In fact, training an SVM for every point of the training set can be slower than training a unique global SVM (especially for non-small values of $k$), so we introduce another modification of the method which aims to
Definition 1 means that the union of the sets of the $k'$-nearest neighbors of the points of $C_{k'}$ corresponds to the whole training set. Theoretically, for a fixed $k'$, the minimization of the number of local SVMs that we need to train can be obtained by computing the SVMs centered on the points contained in the minimal $k'$-neighborhood covering set of centers² $C$. However, since computing the minimal $C$ is not a simple and computationally easy task, we choose to select each $c_i \in C$ as follows:

$$c_i = x_j \in X \quad \text{with } j = \min \Big\{ z \in \{1, \dots, n\} \;\Big|\; x_z \in X \setminus \bigcup_{l < i} X_{c_l} \Big\} \qquad (2)$$

where $X_{c_l} = \big\{ x_{r_{c_l}(h)} \;\big|\; h = 1, \dots, k' \big\}$.
The idea of this definition is to recursively take as centers those points which are not $k'$-neighbors of any point that has already been taken as a center. So $c_1 = x_1$ corresponds to the first point of $X$ since, $c_1$ being the first center, the union of the neighborhoods of the previously chosen centers is empty; $c_2$, instead, is the point with the minimum index taken from the set obtained by eliminating from $X$ all the $k'$-neighbors of $c_1$. The procedure is repeated until all the training points are removed from $X$. $X$ must be thought of here as a random reordering of the training set. This is done in order to avoid the possibility that a training set in which the points are inserted with a particular spatial strategy affects the spatial distribution of the $k'$-neighborhood covering centers.
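A direct (and deliberately naive) sketch of this greedy selection is given below; it reuses `feature_space_order` from the kNNSVM sketch and recomputes the neighborhoods from scratch for every center, whereas an efficient implementation would rely on nearest neighbor data structures (see the discussion of cover trees below). Here `k1` plays the role of $k'$.

```python
import numpy as np

def select_centers(X, k1, kernel, seed=0):
    """Greedy selection of Eq. (2): scan a random reordering of X and take as a
    new center every point not yet covered by the k'-neighborhood (k1 points)
    of a previously chosen center."""
    n = len(X)
    rng = np.random.default_rng(seed)
    covered = np.zeros(n, dtype=bool)
    centers, neighborhoods = [], []
    for j in rng.permutation(n):        # X treated as a random reordering
        if covered[j]:
            continue                    # j is a k'-neighbor of an earlier center
        centers.append(j)
        nbrs = feature_space_order(X, X[j], kernel)[:k1]
        neighborhoods.append(nbrs)
        covered[nbrs] = True            # every point ends up covered at least once
    return centers, neighborhoods
```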
The reason why we adopt this non-standard clustering method is twofold: on one side, we want each cluster to contain exactly $k$ samples in order to be able to derive rigorous complexity bounds; on the other side, in this way we are able to select a variable number of samples that are in the central region (at least from a neighborhood viewpoint) of each cluster. Moreover, the proposed clustering strategy follows quite naturally from the kNNSVM approach.

² From now on we simply denote $C_{k'}$ with $C$ because we do not discuss here particular values for $k'$.
Differently from the first approximation, in which a local SVM is trained for each training sample, in this case we need to train only $|C|$ SVMs, centered on each $c \in C$, obtaining the following models:

$$kNNSVM_c(x), \quad \forall c \in C.$$
Now we have to link the points of the training set with the precomputed SVM models. This is necessary because a point can lie in the $k'$-neighborhood of more than one center. In particular, we want to assign each training point to a unique model such that the point is in the $k'$-neighborhood of the center on which the model is built. Formally this is done with the function $cnt : X \to C$ that assigns each point in the training set to a center:

$$cnt(x_i) = x_j \in C \quad \text{with } j = \min \big\{ z \in \{1, \dots, n\} \;\big|\; x_z \in C \text{ and } x_i \in X_{x_z} \big\} \qquad (3)$$

where $X_{x_z} = \big\{ x_{r_{x_z}(h)} \;\big|\; h = 1, \dots, k' \big\}$.

With the $cnt$ function, each training point is assigned to the first center whose set of $k'$-nearest neighbors includes the training point itself. The order of the $c_i$ points derives from the randomization of $X$ used for defining $C$. In this way each training point is uniquely assigned to a center, and so the decision function of this approximation of Local SVM, called FastLSVM, is simply:
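$$\mathrm{FastLSVM}(x) = kNNSVM_{cnt(x_{r_x(1)})}(x),$$

i.e., a query point $x$ is classified by the precomputed local model assigned, through $cnt$, to its nearest neighbor $x_{r_x(1)}$ in the training set, as anticipated in the introduction.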
a global SVM computable, as expected, in $O(n \log n + n^3) = O(n^3)$ since $|C| = 1$. FastLSVM testing is instead slightly slower than SVM testing: $O(n \cdot k \cdot m)$ against $O(n \cdot m)$. Although not considered in the implemented version, FastLSVM can take great advantage of data structures supporting nearest neighbor searches [26]. For example, using the recently developed cover tree data structure [27], which allows kNN searches in $O(k \log n)$ time with $O(n \log n)$ construction time, FastLSVM can further decrease its training computational complexity to $O(n \log n + |C| \cdot k \log n + |C| \cdot k^3)$, which is much lower than the SVM complexity for fixed and non-high values of $k$. Similarly, for testing, the required time becomes $O(k \log n \cdot m)$. Another modification able to reduce the computational complexity, not included in the implemented version either, consists in avoiding the training of local SVMs with samples of one class only.
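Putting the pieces together, a simplified end-to-end sketch of FastLSVM (training and prediction) might look as follows. This is not the FaLKM-lib implementation: it uses scikit-learn's `BallTree` as a stand-in for the cover trees cited above, assumes the RBF kernel so that the feature-space ordering coincides with the Euclidean one, and does not implement the single-class refinement just mentioned (a single-class neighborhood would make `SVC.fit` fail).

```python
import numpy as np
from sklearn.neighbors import BallTree
from sklearn.svm import SVC

class FastLSVMSketch:
    """Precompute local SVMs on the k-neighborhoods of a k'-covering set of
    centers, assign every training point to one model via cnt(.), and classify
    a query with the model of its nearest training point."""

    def __init__(self, k, k1, C=1.0, gamma=1.0):
        self.k, self.k1, self.C, self.gamma = k, k1, C, gamma  # k1 plays the role of k'

    def fit(self, X, y):
        n = len(X)
        self.tree = BallTree(X)                  # stand-in for a cover tree
        self.cnt = np.full(n, -1)                # cnt(x_i): index of the assigned model
        self.models = []
        for j in np.random.permutation(n):       # random reordering of X, as in Eq. (2)
            if self.cnt[j] != -1:
                continue                         # covered by a previous center
            nbrs_k = self.tree.query(X[j:j + 1], k=self.k)[1][0]
            model = SVC(kernel="rbf", C=self.C, gamma=self.gamma)
            model.fit(X[nbrs_k], y[nbrs_k])      # the local model kNNSVM_c with c = x_j
            self.models.append(model)
            nbrs_k1 = nbrs_k[:self.k1]           # the k' nearest neighbors of the center
            free = nbrs_k1[self.cnt[nbrs_k1] == -1]
            self.cnt[free] = len(self.models) - 1   # Eq. (3): first covering center wins
        return self

    def predict(self, Xq):
        nn = self.tree.query(Xq, k=1)[1][:, 0]   # nearest training point of each query
        return np.array([self.models[self.cnt[i]].predict(Xq[j:j + 1])[0]
                         for j, i in enumerate(nn)])
```

Since each local `fit` call is independent of the others, the loop over the centers is exactly the part that can be distributed over several processors, as discussed next.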
Moreover, FastLSVM can be very easily parallelized, differently from SVM, for which parallelization, although possible [17,18], is a rather critical aspect; for FastLSVM it is sufficient that, every time the points for a model are retrieved, the training of the local SVM is performed on a different processor. In this way the training of the local models can proceed concurrently.
3 Empirical Evaluation
In this work we used LibSVM (version 2.85) [24] for SVM, enabling shrinking and caching, and our implementation of FastLSVM and kNNSVM, which uses LibSVM for training and prediction of the local SVMs and a simple brute-force implementation of kNN. The experiments are carried out on an AMD Athlon™ 64 X2 Dual Core Processor 5000+, 2600MHz, with 3.56Gb of RAM.
using $c = 1/500$ for the first class ($y_i = +1$) and $c = -1/500$ for the second class ($y_i = -1$). The points are sampled with intervals of $\pi/5000$ on the $\tau$ parameter, obtaining 50000 points for each class. Gaussian noise with zero mean and variance proportional to the distance between the point and the nearest internal twist is added on both dimensions. With this procedure we generated two different datasets of 100000 points each, one for training and one for testing.
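The exact parametric form of the two spirals is defined earlier in the paper and is not reproduced in this excerpt; assuming the standard Archimedean parametrization $x(\tau) = (c\,\tau \cos\tau,\; c\,\tau \sin\tau)$, a 2SPIRAL-like dataset could be generated as follows (the way the noise variance grows along the spiral is also an assumption here).

```python
import numpy as np

def make_two_spirals(n_per_class=50000, c=1.0 / 500, tau_step=np.pi / 5000,
                     noise=0.1, seed=0):
    """Two noisy spirals with labels +1 (parameter c) and -1 (parameter -c);
    the noise std grows along the spiral as a proxy for the distance from the
    nearest internal twist."""
    rng = np.random.default_rng(seed)
    tau = tau_step * np.arange(1, n_per_class + 1)
    X, y = [], []
    for label, c_cls in ((+1, c), (-1, -c)):
        pts = np.stack([c_cls * tau * np.cos(tau), c_cls * tau * np.sin(tau)], axis=1)
        pts += rng.normal(scale=noise * np.abs(c_cls) * tau[:, None], size=pts.shape)
        X.append(pts)
        y.append(np.full(n_per_class, label))
    return np.vstack(X), np.concatenate(y)
```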
Table 2. Percentage accuracy and computational results (in seconds) for SVM and FastLSVM on the 2SPIRAL dataset. The parameters reported are the ones permitting the lowest empirical risk, found with 5-fold CV.
We compare FastLSVM and SVM using the LIN, RBF and IPOL (with degree 2) kernels. Since the LIN and IPOL kernels can only build linear and quadratic decision functions in the input space, they cannot give satisfactory results for a global SVM, and thus we do not lose generality in presenting SVM results with the RBF kernel only. For model selection we adopt grid search with 5-fold CV. For both methods, $C$ and the $\sigma$ of the RBF kernel are chosen in $\{2^{-10}, 2^{-9}, \dots, 2^{9}, 2^{10}\}$. It is possible that values higher than $2^{10}$ for $C$ and lower than $2^{-10}$ for $\sigma$ could give higher validation accuracy, but the computational overhead of SVM becomes too high to be suitable in practice (e.g. RBF-SVM with $C = 2^{11}$, $\sigma = 2^{-11}$ requires more than 24 hours). For FastLSVM we fix $k' = k/4$ (intuitively a good compromise between accuracy and performance), while $k$ is chosen among $\{0.25\%, 0.5\%, 1\%, 2\%, 4\%, 8\%, 16\%, 32\%\}$ of the training set size.
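The SVM side of this model selection can be reproduced with a standard grid search; the sketch below uses scikit-learn's `GridSearchCV` (5-fold CV) over the same exponential grid. Note that scikit-learn's `gamma` corresponds to $1/\sigma$ in the RBF definition used above, and that `X_train`, `y_train` are placeholders for the 2SPIRAL training data.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

grid = {"C": [2.0 ** e for e in range(-10, 11)],
        "gamma": [2.0 ** e for e in range(-10, 11)]}   # gamma = 1/sigma
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)   # 5-fold cross-validation
# search.fit(X_train, y_train)                         # X_train, y_train: placeholders
# print(search.best_params_, search.best_score_)
```

For FastLSVM, $k$ would be selected analogously over the listed fractions of the training set size, keeping $k' = k/4$ fixed.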
Table 2 shows the results obtained for SVM and FastLSVM. The results highlight that RBF-FastLSVM improves over RBF-SVM in test accuracy by 8.65%, LIN-FastLSVM by 8.69% and IPOL-FastLSVM by 8.63%. The improvements in classification accuracy are accompanied by a dramatic improvement of the computational performance of the training phase: while the time needed to compute the global SVM on the training set is more than 100 minutes (6185 seconds), the training of FastLSVM requires no more than 4 minutes. The best prediction time, instead, is achieved by SVM, although RBF-FastLSVM and LIN-FastLSVM give comparable performance; the prediction time of IPOL-FastLSVM is, instead, about an order of magnitude higher than that of RBF-SVM.
Table 3. Percentage accuracies and performances (in seconds) of SVM and FastLSVM
on CoverType data
Table 4. Training times of LaSVM (from [12]), CVM (from [11]) and FastLSVM normalized by the training time of LibSVM on 100000 samples taken from the corresponding works (the LibSVM training time on 100000 samples is 10310s in [12], about 20000s in [11], and 7583s in this work)
model on the whole training set faster than SVM on 100000 samples (less than 1/5 of the data); moreover, starting from n = 30000, FastLSVM training is at least one order of magnitude faster than SVM training, and the difference becomes more and more relevant as n increases. It is important to underline that the whole dataset permits a much higher classification accuracy than the random sub-sampled sets, so it is highly desirable to consider all the data. Since we implemented FastLSVM without supporting data structures for nearest neighbors, we must compute, for all test points, the distances to all the training points, leading to a rather high testing time.
The binary CoverType dataset allows us to make some comparisons with state-of-the-art SVM optimization approaches: CVM [11] and LaSVM [12]. Since CVM and LaSVM have been tested on different hardware systems, but both have been compared with LibSVM on a reduced training set of 100000 samples, a fair comparison is possible by normalizing all training times with the training time of LibSVM on 100000 samples reported in the same work⁴. Table 4 reports the comparison, and it is clear that FastLSVM is considerably faster than LaSVM and CVM, both on the reduced training set of 100000 samples and on the complete dataset of more than 500000 samples. From the generalization accuracy viewpoint, taking as reference the 100000-sample training set, we can
notice that FastLSVM is more accurate than LibSVM, whereas LaSVM is less accurate than LibSVM (see [12]) and CVM seems to be as accurate as LibSVM (see [11]).

⁴ The LibSVM version used here is more recent than the versions used in [11] and [12] and, since the latest version is the fastest, the comparison can be slightly penalizing for FastLSVM.
4 Conclusions
References
1. Blanzieri, E., Melgani, F.: An adaptive SVM nearest neighbor classifier for remotely
sensed imagery. In: IEEE Int. Conf. on Geoscience and Remote Sensing Symposium
(IGARSS 2006), pp. 3931–3934 (2006)
2. Bottou, L., Vapnik, V.: Local learning algorithms. Neural Computation 4(6), 888–
900 (1992)
3. Vapnik, V.N., Bottou, L.: Local algorithms for pattern recognition and dependen-
cies estimation. Neural Computation 5(6), 893–909 (1993)
4. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, Heidelberg
(2000)
5. Blanzieri, E., Melgani, F.: Nearest neighbor classification of remote sensing images
with the maximal margin principle. IEEE Transactions on Geoscience and Remote
Sensing 46(6), 1804–1811 (2008)
6. Segata, N., Blanzieri, E.: Empirical assessment of classification accuracy of Lo-
cal SVM. In: The 18th Annual Belgian-Dutch Conference on Machine Learning
(Benelearn 2009) (2009) (accepted for publication)
7. Brailovsky, V.L., Barzilay, O., Shahave, R.: On global, local, mixed and neighbor-
hood kernels for support vector machines. Pattern Recognition Letters 20(11-13),
1183–1190 (1999)
8. Zhang, H., Berg, A.C., Maire, M., Malik, J.: SVM-KNN: Discriminative nearest
neighbor classification for visual category recognition. In: Proc. of the 2006 IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2,
pp. 2126–2136 (2006)
9. Segata, N.: FaLKM-lib v1.0: a Library for Fast Local Kernel Machines. Tech-
nical report, number DISI-09-025. DISI, University of Trento, Italy (2009),
http://disi.unitn.it/~segata/FaLKM-lib
10. Cheng, H., Tan, P.N., Jin, R.: Localized Support Vector Machine and Its Efficient
Algorithm. In: Proc. SIAM Intl. Conf. Data Mining (2007)
11. Tsang, I.W., Kwok, J.T., Cheung, P.M.: Core Vector Machines: Fast SVM Training
on Very Large Data Sets. The Journal of Machine Learning Research 6, 363–392
(2005)
12. Bordes, A., Ertekin, S., Weston, J., Bottou, L.: Fast kernel classifiers with online
and active learning. Journal of Machine Learning Research 6, 1579–1619 (2005)
13. Collobert, R., Bengio, S., Bengio, Y.: A parallel mixture of SVMs for very large
scale problems. Neural Computation 14(5), 1105–1114 (2002)
14. Collobert, R., Bengio, Y., Bengio, S.: Scaling Large Learning Problems with Hard
Parallel Mixtures. International Journal of Pattern Recognition and Artificial In-
telligence 17(3), 349–365 (2003)
15. Yu, H., Yang, J., Han, J., Li, X.: Making SVMs Scalable to Large Data Sets using
Hierarchical Cluster Indexing. Data Mining and Knowledge Discovery 11(3), 295–
321 (2005)
16. Dong, M., Wu, J.: Localized Support Vector Machines for Classification. In: Inter-
national Joint Conference on Neural Networks, IJCNN 2006, pp. 799–805 (2006)
17. Zanni, L., Serafini, T., Zanghirati, G.: Parallel Software for Training Large Scale
Support Vector Machines on Multiprocessor Systems. The Journal of Machine
Learning Research 7, 1467–1492 (2006)
18. Dong, J.X., Krzyzak, A., Suen, C.Y.: Fast SVM training algorithm with decom-
position on very large data sets. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(4), 603–618 (2005)
19. Joachims, T.: Training linear SVMs in linear time. In: Proceedings of the 12th
ACM SIGKDD international conference on Knowledge discovery and data mining,
pp. 217–226. ACM, New York (2006)
20. Hsieh, C.J., Chang, K.W., Lin, C.J., Keerthi, S.S., Sundararajan, S.: A Dual Coor-
dinate Descent Method for Large-scale Linear SVM. In: Proceedings of the Twenty
Fifth International Conference on Machine Learning (ICML) (2008)
21. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20(3), 273–297
(1995)
22. Schölkopf, B., Smola, A.J.: Learning with kernels: support vector machines, regu-
larization, optimization, and beyond. MIT Press, Cambridge (2002)
23. Bottou, L., Lin, C.J.: Support Vector Machine Solvers. Large-Scale Kernel Ma-
chines (2007)
24. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001),
http://www.csie.ntu.edu.tw/~cjlin/libsvm
25. Chang, Q., Chen, Q., Wang, X.: Scaling Gaussian RBF kernel width to improve SVM classification. In: International Conference on Neural Networks and Brain, ICNN&B 2005, October 13-15, vol. 1, pp. 19–22 (2005)
26. Chávez, E., Navarro, G., Baeza-Yates, R., Marroquín, J.L.: Searching in metric
spaces. ACM Computing Surveys (CSUR) 33(3), 273–321 (2001)
27. Beygelzimer, A., Kakade, S., Langford, J.: Cover Trees for Nearest Neighbor. In:
Proceedings of the 23rd International Conference on Machine learning, Pittsburgh,
PA, pp. 97–104 (2006)
28. Ridella, S., Rovetta, S., Zunino, R.: Circular backpropagation networks for classi-
fication. IEEE Transactions on Neural Networks 8(1), 84–97 (1997)
29. Suykens, J.A.K., Vandewalle, J.: Least Squares Support Vector Machine Classifiers.
Neural Processing Letters 9(3), 293–300 (1999)