Training Data Selection for Support Vector Machines

Jigang Wang, Predrag Neskovic, and Leon N Cooper
1 Introduction
2 Related Background
Given a set of training data $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$, support vector machines seek to construct an optimal separating hyperplane by solving the following quadratic optimization problem:

$$\min_{w,b,\xi} \; \frac{1}{2}\langle w, w\rangle + C \sum_{i=1}^{n} \xi_i \qquad (1)$$

subject to

$$y_i(\langle w, x_i\rangle + b) \geq 1 - \xi_i \;\text{ and }\; \xi_i \geq 0, \quad \forall i = 1, \ldots, n. \qquad (2)$$

Introducing Lagrange multipliers $\alpha_i$ for the margin constraints leads to the dual problem

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j\rangle \qquad (3)$$

subject to

$$0 \leq \alpha_i \leq C \;\; \forall i = 1, \ldots, n \;\text{ and }\; \sum_{i=1}^{n} \alpha_i y_i = 0. \qquad (4)$$

Solving the dual problem, one obtains the multipliers $\alpha_i$, $i = 1, \ldots, n$, which give $w$ as an expansion

$$w = \sum_{i=1}^{n} \alpha_i y_i x_i. \qquad (5)$$
To train a SVM classifier, one therefore needs to solve the dual quadratic
programming problem (3) under the constraints (4). For a small training set,
standard QP solvers, such as CPLEX, LOQO, MINOS and Matlab QP routines,
can be readily used to obtain the solution. However, for a large training set, they
quickly become intractable because of the large memory requirements and the
enormous amount of training time involved. To alleviate this problem, a number
of solutions have been proposed that exploit the sparsity of the SVM solution
and the KKT conditions.
The first such solution, known as chunking [13], uses the fact that only the
support vectors are relevant for the final solution. At each step, chunking solves
a QP problem that consists of all non-zero Lagrange multipliers αi from the last
step and some of the αi that violate the KKT conditions. The size of the QP
problem varies but eventually equals the number of non-zero Lagrange multipliers.
At the last step, the entire set of non-zero Lagrange multipliers is identified and
the QP problem is solved. Another solution, proposed in [14], solves the large
QP problem by breaking it down into a series of smaller QP sub-problems. This
decomposition method is justified by the observation that solving a sequence of
QP sub-problems that always contain at least one training example that violates
the KKT conditions will eventually lead to the optimal solution. Recently, a
method called sequential minimal optimization (SMO) was proposed by Platt
[15], which approaches the problem by iteratively solving a QP sub-problem of
size 2. The key idea is that a QP sub-problem of size 2 can be solved analytically
without invoking a quadratic optimizer. This method has been reported to be
several orders of magnitude faster than the classical chunking algorithm.
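To make the size-2 analytic step concrete, the following is a minimal sketch of the *simplified* SMO variant with a linear kernel. The function name, the random choice of the second multiplier, and the toy data are ours; Platt's full algorithm [15] uses more elaborate working-set heuristics and an error cache.

```python
import random
import numpy as np

def smo_train(X, y, C=1.0, tol=1e-4, max_passes=10):
    """Simplified SMO with a linear kernel: repeatedly pick a pair of
    multipliers that violates the KKT conditions and solve the resulting
    size-2 QP analytically."""
    rng = random.Random(0)
    n = len(y)
    alpha, b = np.zeros(n), 0.0
    K = X @ X.T                                # linear kernel matrix

    def f(i):                                  # current decision value
        return (alpha * y) @ K[:, i] + b

    quiet = 0
    for _ in range(500):                       # hard cap on sweeps
        changed = 0
        for i in range(n):
            Ei = f(i) - y[i]
            # select i only if it violates the KKT conditions
            if not ((y[i] * Ei < -tol and alpha[i] < C) or
                    (y[i] * Ei > tol and alpha[i] > 0)):
                continue
            j = rng.randrange(n - 1)           # random second multiplier
            if j >= i:
                j += 1
            Ej = f(j) - y[j]
            ai, aj = alpha[i], alpha[j]
            # box [L, H] keeps 0 <= alpha <= C and sum(alpha * y) = 0
            if y[i] != y[j]:
                L, H = max(0.0, aj - ai), min(C, C + aj - ai)
            else:
                L, H = max(0.0, ai + aj - C), min(C, ai + aj)
            eta = 2 * K[i, j] - K[i, i] - K[j, j]
            if L == H or eta >= 0:
                continue
            # analytic solution of the size-2 QP, clipped to the box
            alpha[j] = np.clip(aj - y[j] * (Ei - Ej) / eta, L, H)
            if abs(alpha[j] - aj) < 1e-7:
                continue
            alpha[i] = ai + y[i] * y[j] * (aj - alpha[j])
            # update the threshold so the KKT conditions hold for i or j
            b1 = b - Ei - y[i]*(alpha[i]-ai)*K[i, i] - y[j]*(alpha[j]-aj)*K[i, j]
            b2 = b - Ej - y[i]*(alpha[i]-ai)*K[i, j] - y[j]*(alpha[j]-aj)*K[j, j]
            b = b1 if 0 < alpha[i] < C else b2 if 0 < alpha[j] < C else (b1 + b2) / 2
            changed += 1
        quiet = quiet + 1 if changed == 0 else 0
        if quiet >= max_passes:
            break
    w = (alpha * y) @ X                        # the expansion of Eq. (5)
    return w, b, alpha

# toy, linearly separable problem (data of our choosing)
X = np.array([[2., 2.], [3., 3.], [2., 3.], [-2., -2.], [-3., -3.], [-2., -3.]])
y = np.array([1., 1., 1., -1., -1., -1.])
w, b, alpha = smo_train(X, y, C=10.0)
```

On this toy set the learned hyperplane separates the two classes, and the equality constraint of Eq. (4) is preserved by every pairwise update.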
All the above training methods make use of the whole training set. However,
according to the KKT optimality conditions, the final separating hyperplane is
fully determined by the support vectors. In many real-world applications, the
number of support vectors is expected to be much smaller than the total number
of training examples. Therefore, the speed of SVM training will be significantly
improved if only the set of support vectors is used for training, and the solution
will be exactly the same as if the whole training set had been used.
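This equivalence can be checked empirically. The sketch below assumes scikit-learn is available; the synthetic two-Gaussian dataset is of our choosing.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 1.0, (40, 2)),    # positive class
               rng.normal(-2.0, 1.0, (40, 2))])  # negative class
y = np.array([1] * 40 + [-1] * 40)

full = SVC(kernel="linear", C=1.0).fit(X, y)     # train on all 80 examples
sv = full.support_                               # indices of the support vectors
reduced = SVC(kernel="linear", C=1.0).fit(X[sv], y[sv])

# the two classifiers agree on every training point
agree = np.array_equal(full.predict(X), reduced.predict(X))
```

Because the non-support vectors have zero multipliers and satisfy their constraints strictly, dropping them leaves the original optimum optimal for the reduced problem.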
In theory, one has to solve the full QP problem in order to identify the
support vectors. However, it is easy to see that the support vectors are the training
examples that lie close to the decision boundary. Therefore, if there exists a
computationally efficient way to find a small set of training data such that with high
probability it contains the desired support vectors, the speed of SVM training
will be improved without degrading the generalization performance. The size
of the reduced training set can still be larger than the set of desired support
vectors. However, as long as its size is much smaller than the size of the total
training set, the SVM training speed will be significantly improved because most
SVM training algorithms scale quadratically with the number of training examples
on many problems [4]. In the next section, we propose two new data selection
strategies to explore this possibility.
Our second data selection strategy is based on the Hausdorff distance. In the
separable case, it has been shown that the optimal SVM separating hyperplane
is identical to the hyperplane that bisects the line segment which connects the
two closest points of the convex hulls of the positive and of the negative training
examples [16, 17]. The problem of finding the two closest points in the convex
hulls can be formulated as
$$\min_{z^+, z^-} \; \|z^+ - z^-\|^2 \qquad (7)$$

subject to

$$z^+ = \sum_{i: y_i = 1} \alpha_i x_i \;\text{ and }\; z^- = \sum_{i: y_i = -1} \alpha_i x_i, \qquad (8)$$

where $\alpha_i \geq 0$ satisfies the constraints $\sum_{i: y_i = 1} \alpha_i = 1$ and $\sum_{i: y_i = -1} \alpha_i = 1$.
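As an illustration, one simple way to solve Eqs. (7)-(8) numerically is the Frank-Wolfe method over the two simplices of weights. This particular solver choice and the function name are ours; [16, 17] discuss dedicated nearest-point algorithms for this problem.

```python
import numpy as np

def closest_hull_points(P, N, iters=2000):
    """Approximate the closest pair of points of the convex hulls of the
    positive examples P and negative examples N (Eqs. (7)-(8)) with the
    Frank-Wolfe method over the two simplices of weights alpha."""
    u = np.full(len(P), 1.0 / len(P))   # weights alpha_i, positive class
    v = np.full(len(N), 1.0 / len(N))   # weights alpha_i, negative class
    for t in range(iters):
        g = u @ P - v @ N               # z_plus - z_minus (half the gradient)
        i = np.argmin(P @ g)            # steepest-descent vertex of hull(P)
        j = np.argmax(N @ g)            # steepest-descent vertex of hull(N)
        gamma = 2.0 / (t + 2.0)         # standard step-size schedule
        u *= 1.0 - gamma
        u[i] += gamma
        v *= 1.0 - gamma
        v[j] += gamma
    return u @ P, v @ N

# positive hull: segment x = 0, y in [0, 4]; negative hull: y = 1, x in [3, 5]
P = np.array([[0.0, 0.0], [0.0, 4.0]])
N = np.array([[3.0, 1.0], [5.0, 1.0]])
z_plus, z_minus = closest_hull_points(P, N)   # distance approaches 3
```

In the separable case, the optimal separating hyperplane bisects the segment from `z_minus` to `z_plus`.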
Based on this geometrical interpretation, the support vectors are the training
examples that are vertices of the convex hulls that are closest to the convex hull
of the training examples from the opposite class. For the non-separable case, a
similar result holds by replacing the convex hulls with the reduced convex hulls
[16, 17]. Therefore, a good heuristic that can be used to determine whether a
training example is likely to be a support vector is the distance to the convex
hull of the training examples of the opposite class. Computing the distance from
a training example xi to the convex hull of the training examples of the opposite
class involves solving a smaller quadratic programming problem. To simplify
the computation, the distance from a training example to the closest training
examples of the opposite class can be used as an approximation. We denote the
minimal distance as
$$d(x_i) = \min_{j: y_j \neq y_i} \|x_i - x_j\|\,, \qquad (9)$$
which is also the Hausdorff distance between the training example xi and the set
of training examples that belong to a different class. To select a subset of training
examples, we sort the training set according to d(xi) and select examples with
the smallest Hausdorff distances d(xi) as the reduced training set. This method
will be referred to as the Hausdorff distance-based selection method.
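In code, the selection rule of Eq. (9) amounts to a nearest-opposite-class-neighbor computation; a minimal sketch follows (the function name and toy data are ours).

```python
import numpy as np

def hausdorff_select(X, y, k):
    """Select the k examples with the smallest distance d(x_i) of Eq. (9),
    i.e., the distance to the nearest example of the opposite class."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    D[y[:, None] == y[None, :]] = np.inf   # mask same-class pairs
    d = D.min(axis=1)                      # d(x_i) = min_{j: y_j != y_i} ||x_i - x_j||
    return np.argsort(d)[:k]               # smallest distances first

# toy data of our choosing: two roughly parallel rows of points
X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0],
              [0.5, 1.0], [2.0, 1.0], [6.0, 1.0]])
y = np.array([1, 1, 1, -1, -1, -1])
idx = hausdorff_select(X, y, 4)   # the 4 examples closest to the other class
```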
Note that Eq. (10) takes the class information into account: training examples
that are misclassified by the desired separating hyperplane have negative
distances. According to the KKT conditions, support vectors are training
examples that have relatively small values of the distance f(xi). We sort the
training examples according to their distances to the separating hyperplane and
select a subset of training examples with the smallest distances as the reduced
training set. This strategy, although impractical because one needs to solve the
full QP problem first, is ideal for comparison purposes as the distance from
a training example to the desired separating hyperplane provides the optimal
criterion for selecting the support vectors.
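A sketch of this reference strategy is given below, assuming Eq. (10) takes the usual signed-distance form $f(x_i) = y_i(\langle w, x_i\rangle + b)/\|w\|$ for the known optimal $(w, b)$; the exact form of Eq. (10) is defined earlier in the paper, and the function name is ours.

```python
import numpy as np

def output_based_select(X, y, w, b, k):
    """Rank training examples by their signed distance to the known optimal
    hyperplane and keep the k smallest; misclassified examples come out
    negative. The signed-distance form of f is our assumption for Eq. (10)."""
    f = y * (X @ w + b) / np.linalg.norm(w)
    return np.argsort(f)[:k]

# hyperplane x_1 = 0 with normal w; the third point is misclassified
w, b = np.array([1.0, 0.0]), 0.0
X = np.array([[3.0, 0.0], [0.5, 0.0], [-1.0, 0.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, 1.0, -1.0])
idx = output_based_select(X, y, w, b, 2)   # picks the misclassified point first
```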
Table 1. Error rates of SVMs on the Breast Cancer dataset when trained with reduced
training sets of various sizes
From Table 1 one can see that a significant amount of data can be removed
from the training set without degrading the performance of the resulting SVM
classifier. When more than 10% of the training data is selected, the confidence-
based data selection method outperforms the other two methods. Its performance
is actually as good as that of the method based on the desired SVM outputs.
The method based on the Hausdorff distance gives the worst results. When the
data reduction rate is high, e.g., when less than 10% of the training data
is selected, the results obtained by the Hausdorff distance-based method and
random sampling are much better than those based on the confidence measure
and the desired SVM outputs.
Table 2 shows the corresponding results obtained on the BUPA Liver dataset,
which consists of 345 examples, with each example having 6 attributes. The sizes
of the training and test sets in each iteration are 276 and 69, respectively. The
average number of support vectors is 222.2, which is 80.51% of the size of the
training sets. Interestingly, as we can see, the method based on the desired
SVM outputs has the worst overall results. When less than 80% of the data is
selected for training, the Hausdorff distance-based method and random sampling
have similar performance and outperform the methods based on the confidence
measure and the desired SVM outputs.
Table 3 provides the results on the Ionosphere dataset, which has a total of
351 examples, with each example having 34 attributes. The sizes of the training
and test sets in each iteration are 281 and 70, respectively. The average number
of support vectors is 159.8, which is 56.87% of the size of the training sets. From
Table 3 we see that the data selection method based on the desired SVM outputs
gives the best results when more than 20% of the data is selected. When more
than 50% of the data is selected, the results of the confidence-based method are
very close to the best achievable results. However, when the reduction rate is
high, the performance of random sampling is the best. The Hausdorff distance-
based method has the worst overall results.
An interesting finding of the experiments is that the performance of the
SVM classifiers deteriorates significantly when the reduction rate is high, e.g.,
when the size of the reduced training set is much smaller than the number of
the desired support vectors. This is especially true for data selection strategies
that are based on the desired SVM outputs and the proposed heuristics. On the
other hand, the effect is less significant for random sampling, as we have seen
that random sampling usually has better relative performance at higher data
reduction rates. From a theoretical point of view, this is not surprising because
when only a subset of the support vectors is chosen as the reduced training set,
there is no guarantee that the solution of the reduced QP problem will still be
the same. In fact, if the reduction rate is high and the criterion is based on
the desired SVM outputs or the proposed heuristics, the reduced training set
is likely to be dominated by 'outliers', therefore leading to worse classification
performance. To overcome this problem, we can remove those training examples
that lie far inside the margin area since they are likely to be 'outliers'. For the
data selection strategy based on the desired SVM outputs, it means that we can
discard part of the training data that has extremely small values of the distance
to the desired separating hyperplane (see Eq. (10)). For the methods based on
the confidence measure and Hausdorff distance, we can similarly discard the part
of the training data that has extremely small values of N (xi ) and the Hausdorff
distance.
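This modified selection rule can be sketched generically: the `scores` array can hold the Hausdorff distances, the confidence values N(x_i), or the distances of Eq. (10), and the fraction discarded as probable outliers is a tunable assumption of ours.

```python
import numpy as np

def select_with_outlier_removal(scores, k, drop_frac=0.05):
    """Sort by score, discard the drop_frac fraction with the very smallest
    scores as probable outliers, then keep the next k smallest."""
    order = np.argsort(scores)
    n_drop = int(round(drop_frac * len(scores)))
    return order[n_drop:n_drop + k]

# the first score is extremely small and is treated as an outlier
scores = np.array([-5.0, 0.1, 0.2, 0.3, 1.0])
keep = select_with_outlier_removal(scores, k=2, drop_frac=0.2)
```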
In Table 4 we show the results of the proposed solution on the Breast Cancer
dataset. Comparing Tables 1 and 4, it is easy to see that, when only a very small
subset of the training data (compared to the number of the desired support vec-
tors) is selected for SVM training, removing training patterns that are extremely
close to the decision boundary according to the confidence measure or accord-
ing to the underlying SVM outputs significantly improves the performance of
the resulting SVM classifiers. The effect is less obvious for the methods based
on the Hausdorff measure and random sampling. Similar results have also been
observed on other datasets but are not reported here due to space limitations.
5 Conclusion
In this paper we presented two new data selection methods for SVM training.
To analyze their effectiveness in terms of their ability to reduce the training data
Table 4. Results on the Breast Cancer dataset