

Neurocomputing 394 (2020) 51–60


Comparison of base classifiers for multi-label learning


Edward K. Y. Yapp a,∗, Xiang Li a, Wen Feng Lu b, Puay Siew Tan a

a Singapore Institute of Manufacturing Technology, 2 Fusionopolis Way, #08-04, Innovis, 138634 Singapore
b Department of Mechanical Engineering, National University of Singapore, 9 Engineering Drive 1, Block EA, #07-08, 117575 Singapore

Article info

Article history:
Received 27 September 2019
Revised 20 November 2019
Accepted 28 January 2020
Available online 6 February 2020
Communicated by Dr. Oneto Luca

Keywords:
Classification
Classifier
Experimental comparison
Multi-label
Multilabel

Abstract

Multi-label learning methods can be categorised into algorithm adaptation, problem transformation and ensemble methods. Some of these methods depend on a base classifier and the relationship is not well understood. In this paper the sensitivity of five problem transformation and two ensemble methods to four types of classifiers is studied. Their performance across 11 benchmark datasets is measured using 16 evaluation metrics. The best classifier is shown to depend on the method: Support Vector Machines (SVM) for binary relevance, classifier chains, calibrated label ranking, quick weighted multi-label learning and RAndom k-labELsets; k-Nearest Neighbours (k-NN) and Naïve Bayes (NB) for Hierarchy Of Multilabel classifiERs; and Decision Trees (DT) for ensemble of classifier chains. The statistical performance of a classifier is also found to be generally consistent across the metrics for any given method. Overall, DT and SVM have the best performance–computational time trade-off followed by k-NN and NB.

© 2020 Elsevier B.V. All rights reserved.

1. Introduction

Multi-label learning is a supervised learning problem where each training example is associated with multiple labels (binary or multi-class). In contrast, the traditional problem involves learning from single-label data. The multi-label learning problem has attracted interest from a wide range of domains such as text classification [1–4]; scene classification [5] and annotation of images [6]; emotion detection in music [7]; detection of semantic concepts in videos [8]; gene functional classification [9]; and more recently recommendation of food trucks to customers based on their information and preferences [10].

The different methods for multi-label learning can be categorised into [11,12]: algorithm adaptation, problem transformation and ensemble methods. Algorithm adaptation involves modifying the algorithm to make multi-label predictions. In problem transformation, the multi-label problem is transformed into single-label problems and the standard classifier is applied; the results are then transformed into multi-label predictions. Ensemble methods combine multiple algorithm adaptation or problem transformation methods to make a prediction.

An excellent review of the paradigm formalism and algorithmic details of eight representative multi-label learning algorithms can be found in [13]. A preliminary analysis was conducted in [14] showing that the performance of a multi-label learning method depends on the choice of base classifier. However, the analysis was limited to just three datasets, three multi-label learning methods and four evaluation metrics. An extensive comparison of 12 multi-label learning methods using 16 evaluation metrics over 11 benchmark datasets was made in [12]; however, a single base classifier was assumed for all the problem transformation methods and for the ensemble methods RAndom k-labELsets (RAkEL) [15] and Ensemble of Classifier Chains (ECC) [16], which may not be optimal [16]. Similarly, a default base classifier is usually assumed in the literature (see e.g. [4,16,17]).

In this paper we compare four base classifiers—k-nearest neighbours, decision trees, Naïve Bayes and support vector machines—across 11 datasets, 7 multi-label learning methods and 16 evaluation metrics. The technical contribution is that, to the best of our knowledge, this is the most extensive study of the performance of multi-label learning methods conducted from the perspective of the choice of base classifier—4928 possible combinations. These results allow us to come up with a set of robust recommendations on the best choice of base classifier for each multi-label learning method, independent of the dataset and evaluation metric. Note that the scope of this paper is limited to the study of the single-label base classifier, and not the methods used to transform the multi-label problem into one or more single-label problems. Algorithm adaptation methods are not considered as they inherently do not depend on a base classifier. However, a comparison of such methods would be an interesting avenue of future research as they may have specific classifier advantages. We also developed one of the problem transformation methods (quick weighted multi-label learning [18,19]) without which the modelling of certain datasets was not possible.

The paper is organised as follows: Section 2 is a background on the multi-label learning methods used in this study. Section 3 provides details on the evaluation metrics, datasets, and the setup and method used. The key results and discussion are in Section 4, followed by the conclusions in Section 5. The complete set of results may be found in the Supplementary Material.

∗ Corresponding author.
E-mail addresses: [email protected] (E.K. Y. Yapp), [email protected] (X. Li), [email protected] (W.F. Lu), [email protected] (P.S. Tan).

https://doi.org/10.1016/j.neucom.2020.01.102

Table 1
Summary of the problem transformation (rows 1–5) and ensemble (rows 6 and 7) methods used in this study.

| Methods [Ref.] | Descriptions | Advantages | Disadvantages |
| Binary Relevance (BR) [11] | Individual classifier for each label | Simple | Ignores label correlations |
| Classifier Chain (CC) [16] | Extension of BR | Models label correlations | Performance depends on the order of classifiers in the chain |
| Calibrated Label Ranking (CLR) [20] | Ranking by pairwise comparison method | Mitigates class imbalance issue | Number of classifiers scales quadratically with the number of labels |
| Quick Weighted Multi-label Learning (QWML) [19] | Extension of CLR | Reduces the number of classifiers that are evaluated to make a prediction | Number of classifiers that has to be stored is still quadratic |
| Hierarchy Of Multilabel classifiERs (HOMER) [4] | Label powerset method | Computationally efficient for datasets with a large number of labels | Additional parameter to tune, i.e. the number of clusters |
| RAndom k-labELsets (RAkEL) [15] | Extension of the label powerset method | Improvement over label powerset method for a large number of labels and training examples | Random nature may include models that affect ensemble in a negative way |
| Ensemble of Classifier Chains (ECC) [16] | Extension of CC | Chain ordering is less likely to negatively affect performance | Potentially large redundancy in learning space |

2. Background

In this section, we provide a brief description of the five problem transformation and two ensemble methods used in this study, which all depend on a base classifier. From here on, it is assumed that binary labels are assigned to each training example. A summary of each method, as well as its advantages and disadvantages, may be found in Table 1.

2.1. Problem transformation methods

Binary Relevance (BR) considers each label as an independent binary problem and, in a one-vs-all (one-vs-the-rest) strategy, one classifier is fitted per label [11]. Classifier Chains (CC) is an extension of BR where classifiers are linked along a chain (of length equal to the number of labels q) and the feature space of each link is extended with the label relevances of all previous links. By passing information between classifiers it is capable of exploiting label correlations, unlike BR [16].
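To make these two transformations concrete, the following minimal Python sketch builds them with scikit-learn. This is not the implementation used in this paper (which relies on MEKA/WEKA under Java); the toy data, classifier choices and parameter values are illustrative assumptions only.

```python
# Hedged sketch: Binary Relevance and Classifier Chains on toy data.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import ClassifierChain
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))            # toy feature matrix (p = 100 examples, 20 features)
Y = (rng.random((100, 4)) < 0.3).astype(int)   # toy binary label matrix, q = 4 labels

# Binary Relevance: one independent binary classifier per label.
br = OneVsRestClassifier(LinearSVC()).fit(X, Y)

# Classifier Chain: each link sees the features plus the relevances of all previous links.
cc = ClassifierChain(DecisionTreeClassifier(), order="random", random_state=0).fit(X, Y)

print(br.predict(X[:3]))
print(cc.predict(X[:3]))
```

Swapping the base estimator (k-NN, DT, NB or SVM) in such a sketch is exactly the degree of freedom that this paper studies.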
In Ranking by Pairwise Comparison (RPC), the multi-label dataset is first transformed into q(q − 1)/2 binary training sets, one for each label pair, where only training examples with distinct relevance are included. Then, in a one-vs-one strategy, one classifier is constructed for each label pair. To predict the labels of a new example, all the classifiers are invoked and the labels are ranked according to the votes received (soft or hard output). A thresholding function bipartitions the list of ranked labels into relevant and irrelevant labels [21].
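The pairwise decomposition can be sketched in a few lines. The following hedged Python illustration (function names are ours, not from MULAN) trains one classifier per label pair on the examples with distinct relevance and ranks labels by accumulated hard votes; CLR would additionally train q classifiers against the calibration label.

```python
# Hedged from-scratch sketch of Ranking by Pairwise Comparison (RPC).
import itertools
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_rpc(X, Y):
    """One binary classifier per label pair, trained only on examples with distinct relevance."""
    q = Y.shape[1]
    models = {}
    for a, b in itertools.combinations(range(q), 2):
        mask = Y[:, a] != Y[:, b]       # keep examples where exactly one of the two labels applies
        if mask.sum() == 0:
            continue                    # no distinct relevances for this pair
        models[(a, b)] = DecisionTreeClassifier().fit(X[mask], Y[mask, a])
    return models, q

def rpc_rank(models, q, x):
    """Invoke every pairwise classifier and rank labels by accumulated (hard) votes."""
    votes = np.zeros(q)
    for (a, b), clf in models.items():
        if clf.predict(x.reshape(1, -1))[0] == 1:
            votes[a] += 1               # the pairwise classifier prefers label a over label b
        else:
            votes[b] += 1
    return np.argsort(-votes)           # labels ordered from most to least preferred
```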
It was argued that RPC does not have a natural "zero-point"; therefore, Calibrated Label Ranking (CLR) was proposed, where a calibration (virtual) label is introduced which represents a split point between relevant and irrelevant labels [20]. This approach also negates the need for a thresholding function. In addition to the q(q − 1)/2 classifiers trained in the RPC step, q classifiers are trained, one for each label, where each example which is annotated with a label is a positive example for the label and negative for the calibration label, and vice versa. This in fact directly corresponds to the classifiers trained in BR.

However, during the testing phase, a quadratic number of classifiers has to be evaluated. It is possible to evaluate a smaller subset of them to determine the label with the highest accumulated voting mass. The quick weighted algorithm excludes (exactly prunes) labels from the set of possible top-ranked labels once they can no longer become top-ranked even if they receive the maximal voting mass in the remaining evaluations [18]. In the context of Quick Weighted Multi-label Learning (QWML), there are two variants: (1) one where the above process is repeated until the top label that is returned is the calibration label and all remaining labels are irrelevant, and (2) an improved version where evaluation of the current top-ranked label is stopped once it has received a higher voting mass than the calibration label [19]. The latter is the variant used in this paper.

Hierarchy Of Multilabel classifiERs (HOMER) [4] was developed to transform a large set of labels into a tree-shaped hierarchy of simpler multi-label classification tasks, each dealing with a much smaller number of labels and a more balanced example distribution. In a top-down, depth-first approach starting with the root, HOMER recursively and evenly distributes the labels into k children nodes. This is achieved using a balanced k-means algorithm which is repeated for a number of iterations. A training example is annotated with a meta-label if it is annotated with at least one of the labels at the respective node. A classifier is then trained at each node, apart from the leaves (single labels), for the meta-labels of its children.

2.2. Ensemble methods

The label powerset method considers each subset of the labels in the training set as a different class of a single-label classification problem. However, there are two limitations: (1) it is unable to predict label sets not in the training set, and (2) the computational complexity grows exponentially with the number of labels. RAndom k-labELsets (RAkEL) [15] constructs an ensemble of m label powerset classifiers (addressing the first limitation), each trained on a small random subset of size k of the label set (addressing the second limitation). The average binary decisions across the m classifiers for each label which exceed a threshold determine the final prediction (see [22], Table 1, for a worked example).
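A from-scratch sketch of the RAkEL idea is given below, assuming overlapping labelsets and a 0.5 voting threshold; all helper names are illustrative and the MULAN implementation used in this study differs in its details.

```python
# Hedged sketch of RAkEL: an ensemble of label-powerset classifiers, each trained on a
# random k-labelset; per-label votes are averaged and thresholded.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_rakel(X, Y, m=10, k=3, seed=0):
    rng = np.random.default_rng(seed)
    q = Y.shape[1]
    members = []
    for _ in range(m):
        subset = rng.choice(q, size=min(k, q), replace=False)
        # Label powerset on the subset: each distinct label combination becomes one class.
        classes, y_lp = np.unique(Y[:, subset], axis=0, return_inverse=True)
        clf = DecisionTreeClassifier().fit(X, y_lp)
        members.append((subset, classes, clf))
    return members, q

def predict_rakel(members, q, X, threshold=0.5):
    votes = np.zeros((X.shape[0], q))
    counts = np.zeros(q)
    for subset, classes, clf in members:
        votes[:, subset] += classes[clf.predict(X)]   # decode the powerset class back to bits
        counts[subset] += 1
    return (votes / np.maximum(counts, 1) > threshold).astype(int)
```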
The order of classifiers in CC is random and could be poorly ordered. Ensemble of Classifier Chains (ECC) [16] trains m classifier chains, each with a random chain order, on a random subset of the training set (analogous to the label set subsampling in RAkEL). Sampling can be performed without replacement [16], or with replacement [23] using the bagging scheme for higher predictive performance [24], as will be used in this paper.
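The ensemble of chains can be sketched as follows; bagging with replacement follows [23,24], and the names and parameter values are illustrative rather than those of the MEKA implementation used here.

```python
# Hedged sketch of an Ensemble of Classifier Chains: m chains, each with a random label
# order, trained on a bootstrap sample; the averaged votes are thresholded.
import numpy as np
from sklearn.multioutput import ClassifierChain
from sklearn.tree import DecisionTreeClassifier

def fit_ecc(X, Y, m=10, seed=0):
    rng = np.random.default_rng(seed)
    chains = []
    for i in range(m):
        idx = rng.integers(0, X.shape[0], size=X.shape[0])    # sample with replacement
        chain = ClassifierChain(DecisionTreeClassifier(), order="random", random_state=i)
        chains.append(chain.fit(X[idx], Y[idx]))
    return chains

def predict_ecc(chains, X, threshold=0.5):
    votes = np.mean([c.predict(X) for c in chains], axis=0)
    return (votes > threshold).astype(int)
```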
3. Methodology

In this section, the evaluation metrics used to assess the performance of the multi-label learning methods are presented, followed by the datasets and the computational setup used.

3.1. Evaluation metrics

The metrics are categorised into: (1) example-based, where the metric is first computed for each example across the labels, then averaged across all the examples; (2) label-based, where the metric is first computed for each label across the examples, then averaged across all the labels; and (3) ranking-based, as a confidence is associated with the prediction of each label which may be used to rank the labels.

3.1.1. Example-based metrics

Hamming loss is the fraction of misclassified example-label pairs, i.e. a relevant label is missed or an irrelevant label is predicted [25]:

hammingLoss = \frac{1}{p} \frac{1}{q} \sum_{i=1}^{p} |h(x_i) \,\Delta\, Y_i|,   (1)

where p is the number of examples, q is the total number of possible labels, h(x_i) is the set of labels predicted by the classifier h for the example x_i, and Y_i is the set of labels associated with x_i. |X| is the cardinality or the number of elements in the set X. \Delta is the symmetric difference of the two sets, i.e. the set of elements which are in either of the sets and not in their intersection.

Accuracy (or Jaccard index) compares the similarity and diversity of the predicted and true label sets [26]:

accuracy = \frac{1}{p} \sum_{i=1}^{p} \frac{TP_i}{TP_i + FP_i + FN_i} = \frac{1}{p} \sum_{i=1}^{p} \frac{|h(x_i) \cap Y_i|}{|h(x_i) \cup Y_i|},   (2)

where TP, FP and FN are the number of true positives, false positives and false negatives, respectively.¹

Precision is the ratio of true positives to the number of positive predictions across all labels [26]:

precision = \frac{1}{p} \sum_{i=1}^{p} \frac{TP_i}{TP_i + FP_i} = \frac{1}{p} \sum_{i=1}^{p} \frac{|h(x_i) \cap Y_i|}{|h(x_i)|}.   (3)

Recall is the ratio of true positives to the sum of true positives and false negatives (FN) across all labels [26]:

recall = \frac{1}{p} \sum_{i=1}^{p} \frac{TP_i}{TP_i + FN_i} = \frac{1}{p} \sum_{i=1}^{p} \frac{|h(x_i) \cap Y_i|}{|Y_i|}.   (4)

F1 is the harmonic mean of precision and recall [26]:

F1 = \frac{1}{p} \sum_{i=1}^{p} \frac{2 \times precision \times recall}{precision + recall} = \frac{1}{p} \sum_{i=1}^{p} \frac{2 \times |h(x_i) \cap Y_i|}{|h(x_i)| + |Y_i|}.   (5)

Subset accuracy (or exact match) is the fraction of correctly classified examples, i.e. the predicted label set is identical to the true label set [27]:

subsetAccuracy = \frac{1}{p} \sum_{i=1}^{p} [\![\, h(x_i) = Y_i \,]\!],   (6)

where the Iverson bracket [\![\, P \,]\!] converts the logical proposition P to 1 if the proposition is satisfied, and 0 otherwise.

¹ Note that the accuracy metric in the MEKA and MULAN software packages tends to be conservative as the number of true negatives is neglected.
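For reference, the example-based metrics of Eqs. (1)–(6) can be computed directly from binary indicator matrices, as in the NumPy sketch below; treating empty predicted or true label sets as contributing zero is our assumption, not a convention stated in the paper.

```python
# Hedged NumPy sketch of the example-based metrics in Eqs. (1)-(6).
# Y_true and Y_pred are p x q binary indicator matrices.
import numpy as np

def example_based_metrics(Y_true, Y_pred):
    p, q = Y_true.shape
    inter = np.logical_and(Y_true, Y_pred).sum(axis=1)
    union = np.logical_or(Y_true, Y_pred).sum(axis=1)
    sym_diff = np.logical_xor(Y_true, Y_pred).sum(axis=1)
    pred_size = Y_pred.sum(axis=1)
    true_size = Y_true.sum(axis=1)
    eps = np.finfo(float).eps                              # guard against empty sets
    return {
        "hammingLoss":    np.mean(sym_diff) / q,                            # Eq. (1)
        "accuracy":       np.mean(inter / np.maximum(union, eps)),          # Eq. (2)
        "precision":      np.mean(inter / np.maximum(pred_size, eps)),      # Eq. (3)
        "recall":         np.mean(inter / np.maximum(true_size, eps)),      # Eq. (4)
        "F1":             np.mean(2 * inter / np.maximum(pred_size + true_size, eps)),  # Eq. (5)
        "subsetAccuracy": np.mean(np.all(Y_true == Y_pred, axis=1)),        # Eq. (6)
    }
```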


3.1.2. Label-based metrics

There are two ways of averaging across the labels [28]:

Macro-averaging:

B_{macro} = \frac{1}{q} \sum_{j=1}^{q} B(TP_j, FP_j, FN_j),   (7)

Micro-averaging:

B_{micro} = B\left( \sum_{j=1}^{q} TP_j, \; \sum_{j=1}^{q} FP_j, \; \sum_{j=1}^{q} FN_j \right),   (8)

where B is the precision, recall or F1 as defined earlier in Eqs. (3)–(5).
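The difference between the two averaging schemes is easiest to see in code. The sketch below computes macro- and micro-averaged F1 from per-label counts (precision and recall follow the same pattern); it is an illustration of Eqs. (7) and (8), not the evaluation code used in this study.

```python
# Hedged sketch of macro- and micro-averaged F1 over a p x q indicator matrix.
import numpy as np

def label_based_f1(Y_true, Y_pred):
    tp = np.logical_and(Y_true == 1, Y_pred == 1).sum(axis=0).astype(float)
    fp = np.logical_and(Y_true == 0, Y_pred == 1).sum(axis=0).astype(float)
    fn = np.logical_and(Y_true == 1, Y_pred == 0).sum(axis=0).astype(float)
    eps = np.finfo(float).eps
    f1_per_label = 2 * tp / np.maximum(2 * tp + fp + fn, eps)
    macro = f1_per_label.mean()                                            # Eq. (7): average per-label scores
    micro = 2 * tp.sum() / max(2 * tp.sum() + fp.sum() + fn.sum(), eps)    # Eq. (8): pool the counts first
    return macro, micro
```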
3.1.3. Ranking-based metrics

One error evaluates the fraction of examples where the top-ranked label is not in the set of relevant labels [25]:

oneError = \frac{1}{p} \sum_{i=1}^{p} [\![\, [\arg\max_{y \in Y} f(x_i, y)] \notin Y_i \,]\!],   (9)

where the classification function f(x, y) returns the confidence of y being a proper label of x.

Coverage is a measure of the number of steps required, on average, to go down the list of ranked labels in order to cover all the relevant labels [25]:

coverage = \frac{1}{p} \sum_{i=1}^{p} \max_{y \in Y_i} rank_f(x_i, y) - 1,   (10)

where rank_f(x, y) is the rank of y sorted in descending order based on confidences from f(x, y). As there are different ways to rank labels with the same confidence, we provide the definition in Algorithm 1.

Algorithm 1 Convert predictions to ranking
Input: f(x, y)
Output: rank_f(x, y)
1: function PredictionsToRanking(f(x, y))
2:   rank_f(x, y) ← vector of length equal to the number of labels q
3:   for i = 0 to q do
4:     counter ← 1
5:     for j = 0 to q do
6:       if f(x, y_i) ≤ f(x, y_j) then
7:         counter ← counter + 1
8:     rank_f(x, y_i) ← counter − 1
9:   return rank_f(x, y)

Note that multiple labels with the same confidence are assigned the same rank and this algorithm is conservative in assigning the minimum possible rank to the labels. The purpose of the −1 term on the right-hand side of Eq. (10) is so that the coverage is 0 in the limiting case of no classification errors.

Ranking loss is the average fraction of reversely-ordered label pairs [25]:

rankingLoss = \frac{1}{p} \sum_{i=1}^{p} \frac{1}{|Y_i| |\bar{Y}_i|} \left| \{ (y', y'') \mid f(x_i, y') \le f(x_i, y''), \; (y', y'') \in Y_i \times \bar{Y}_i \} \right|,   (11)

given that y' ∈ Y_i and y'' ∈ \bar{Y}_i (the complementary set of irrelevant labels [16]). The classification function f misorders the pair if f(x_i, y') ≤ f(x_i, y'').

Average precision evaluates the average fraction of relevant labels ranked above a particular relevant label y ∈ Y_i [25]:

averagePrecision = \frac{1}{p} \sum_{i=1}^{p} \frac{1}{|Y_i|} \sum_{y \in Y_i} \frac{\left| \{ y' \mid rank_f(x_i, y') \le rank_f(x_i, y), \; y' \in Y_i \} \right|}{rank_f(x_i, y)}.   (12)

In summary, lower values are desired for hammingLoss, oneError, coverage and rankingLoss in Eqs. (1) and (9)–(11), respectively; and higher values for the rest.
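The ranking-based metrics of Eqs. (9)–(12) can be computed from a matrix of confidences as in the following sketch, which resolves ties exactly as in Algorithm 1; the variable names and the loop-based implementation are illustrative assumptions.

```python
# Hedged sketch of the ranking-based metrics in Eqs. (9)-(12), given the true p x q
# indicator matrix Y and a p x q matrix F of confidences f(x_i, y_j).
import numpy as np

def ranking_metrics(Y, F):
    p, q = Y.shape
    one_err, cov, rloss, avg_prec = 0.0, 0.0, 0.0, 0.0
    for i in range(p):
        rel = np.flatnonzero(Y[i] == 1)
        irr = np.flatnonzero(Y[i] == 0)
        # rank_f as in Algorithm 1: 1 = highest confidence, tied labels share the same rank.
        rank = np.array([(F[i] >= F[i, j]).sum() for j in range(q)])
        one_err += float(np.argmax(F[i]) not in rel)                            # Eq. (9)
        if len(rel) > 0:
            cov += rank[rel].max() - 1                                          # Eq. (10)
            if len(irr) > 0:
                rloss += np.mean([F[i, u] <= F[i, v] for u in rel for v in irr])  # Eq. (11)
            avg_prec += np.mean([(rank[rel] <= rank[j]).sum() / rank[j] for j in rel])  # Eq. (12)
    return one_err / p, cov / p, rloss / p, avg_prec / p
```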

3.2. Datasets

The multi-label datasets used in this study² and their associated statistics are shown in Table 2. Label cardinality is the average number of labels per example [11]:

labelCardinality = \frac{1}{p} \sum_{i=1}^{p} |Y_i|.   (13)

Label density is the label cardinality normalised by the total number of possible labels [11]:

labelDensity = \frac{1}{q} \times labelCardinality = \frac{1}{pq} \sum_{i=1}^{p} |Y_i|.   (14)

Label diversity, or the number of distinct label sets, is [29]:

labelDiversity = |\{ Y \mid \exists x : (x, Y) \in D \}|,   (15)

where D is the set of training examples. The proportion of distinct label sets is [16]:

proportionLabelDiversity = \frac{1}{p} \times labelDiversity = \frac{1}{p} |\{ Y \mid \exists x : (x, Y) \in D \}|.   (16)

Table 2
Multi-label datasets and various statistics. Rows 1–5 correspond to simpler datasets where 10-fold cross validation was used, while rows 6–11 correspond to datasets which are more complex and a train-test split was used instead.

| Name | Domain | Train examples | Test examples | Nominal features | Numeric features | Labels | Label cardinality | Label density | Distinct proportion | Complexity |
| emotions | Audio | 391 | 202 | 0 | 72 | 6 | 1.87 | 0.311 | 0.046 | 1.69E5 |
| scene | Image | 1221 | 1196 | 0 | 294 | 6 | 1.07 | 0.179 | 0.006 | 2.14E6 |
| yeast | Biology | 1500 | 917 | 0 | 103 | 14 | 4.24 | 0.303 | 0.082 | 2.16E6 |
| medical | Text | 645 | 333 | 103 | 0 | 45 | 1.25 | 0.028 | 0.096 | 4.21E7 |
| enron | Text | 1123 | 579 | 1001 | 0 | 53 | 3.38 | 0.064 | 0.442 | 5.96E7 |
| tmc2007 | Text | 21,519 | 7077 | 500 | 0 | 22 | 2.22 | 0.101 | 0.041 | 2.37E8 |
| mediamill | Video | 30,933 | 12,914 | 0 | 120 | 101 | 4.38 | 0.043 | 0.149 | 3.76E8 |
| corel5k | Image | 4500 | 500 | 499 | 0 | 374 | 3.52 | 0.009 | 0.635 | 8.40E8 |
| bibtex | Text | 4880 | 2515 | 1836 | 0 | 159 | 2.40 | 0.015 | 0.386 | 1.42E9 |
| delicious | Text | 12,920 | 3185 | 500 | 0 | 983 | 19.02 | 0.019 | 0.981 | 6.35E9 |
| bookmarks | Text | 60,000 | 27,856 | 2150 | 0 | 208 | 2.03 | 0.010 | 0.213 | 2.68E10 |

The datasets have been sorted according to their complexity (defined as the product of the number of training examples, features and labels [16]) in increasing order from top to bottom. Note that all of the datasets, with the exception of bookmarks, have been pre-split into training and test sets with an approximate 2:1 ratio in most cases. For the bookmarks dataset, the first 60,000 examples were taken as the training set while the remaining 27,856 examples were taken as the test set [12]. This was performed as the original dataset (tas: contains the tags that a particular user has assigned to a particular item) from which the bookmarks and bibtex datasets are derived is not available [3]. A shorter version of the tmc2007 dataset was used, where the top 500 of the original 49,060 features were selected [2]. Further details of each dataset can be found in [12].

² https://github.com/tsoumakas/mulan
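The dataset statistics of Eqs. (13)–(16) reduce to a few lines of NumPy over the binary training label matrix, as in the sketch below (function and variable names are illustrative).

```python
# Hedged sketch of the label statistics in Eqs. (13)-(16) for a p x q label matrix Y.
import numpy as np

def label_statistics(Y):
    p, q = Y.shape
    cardinality = Y.sum(axis=1).mean()            # Eq. (13): average labels per example
    density = cardinality / q                     # Eq. (14): cardinality normalised by q
    diversity = np.unique(Y, axis=0).shape[0]     # Eq. (15): number of distinct label sets
    proportion = diversity / p                    # Eq. (16): distinct label sets per example
    return cardinality, density, diversity, proportion
```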
3.3. Setup and method

All experiments were performed on an Intel Xeon Phi 7210 with Intel KNL (Knights Landing) architecture and a RedHat Enterprise Linux 7 operating system. Each experiment was allowed up to 128 GB RAM and 14 days of wall clock time.

All of the algorithms were evaluated using the following libraries under Java JDK 10.0.2. MULAN² is a Java library for multi-label learning and only provides an application programming interface [30]—release 1.5.0 was used for CLR, QWML, HOMER and RAkEL. WEKA is a data mining software implemented in Java and provides both graphical and command-line interfaces [31]—development version 3.9.2 was used for the base classifiers k-Nearest Neighbours (k-NN) [32], Decision Trees (DT) [33], Naïve Bayes (NB) [34] and Support Vector Machines (SVM) [35]. MEKA³ is a multi-label/multi-target extension to WEKA [36] and provides a wrapper around MULAN—release 1.9.3 was used for BR, CC and ECC.

³ https://github.com/Waikato/meka

Table 3
The multi-label learning methods (rows 1–7) and base classifiers (rows 8–11) and their associated parameters used in this work.

| Classifiers | Parameters (values) |
| Binary Relevance (BR) | None |
| Classifier Chains (CC) | None |
| Calibrated Label Ranking (CLR) | Type of output (voteLabels ∈ {0, 1}) |
| Quick Weighted Multi-label Learning (QWML) | Type of output (voteLabels ∈ {0, 1}) |
| Hierarchy Of Multilabel classifiERs (HOMER) | Number of clusters (3); maximum number of iterations (10) |
| RAndom k-labELsets (RAkEL) | Number of models (m = min(2q, 100)ᵃ); size of the label set (k = q/2) |
| Ensemble of Classifier Chains (ECC) | Ensemble iterations (10); bag size (100%) |
| k-Nearest Neighbours (k-NN) | Number of neighbours to use (1); distance function (Euclidean) |
| Decision Trees (DT) | Confidence factor (0.25); minimum number of instances per leaf (2) |
| Naïve Bayes (NB) | Use both nominal and numeric features; assume normal distribution for numeric features |
| Support Vector Machines (SVM) | Complexity parameter (1.0); kernel (polynomial with exponent 1) |

ᵃ Except for the mediamill, delicious and bookmarks datasets, where m = min(2q, 10).

In QWML, pairwise classifiers are selected based on the amount of potential voting mass a label has not received, or the voting loss defined as l_i := p_i − v_i, where p_i is the number of evaluated classifiers and v_i is the number of votes received by label y_i. Two changes had to be made to the algorithm, without which the modelling of certain datasets was not possible. First, in the case that no data exist for the one-vs-one learning between labels y_a and y_b, i.e. no distinct relevances, the number of possible evaluated classifiers is no longer equal to the number of labels. It is tracked separately for each label and adjusted accordingly. Second, an explicit check is made that all of the possible classifiers have been evaluated. The modified QWML method is shown in Algorithm 2.

Algorithm 2 Modified Quick Weighted Multi-label Learning (QWML)
Input: example x; classifiers {h_{u,v} | u < v, y_u, y_v ∈ Y}; losses l_0, ..., l_q = 0; number of possible evaluated classifiers n_0, ..., n_q = q
1:  v_0 ← 0, P ← ∅
2:  for j = 0 to q do
3:    l_j ← h_{0,j}(x)    ⊳ Losses based on classifiers of calibrated label y_0
4:    v_0 ← v_0 + (1 − l_j)    ⊳ Votes for y_0
5:  repeat
6:    while y_top not determined do
7:      y_a ← arg min_{y_j ∈ Y} l_j
8:      if p_a < n_a then    ⊳ Check of the number of evaluated classifiers of y_a
9:        y_b ← arg min_{y_i ∈ Y \ {y_a}} l_i such that h_{a,b} has not yet been evaluated
10:       if v_a ≥ v_0 or no y_b exists then
11:         y_top ← y_a
12:       else
13:         if h_{a,b} exists then
14:           v_ab ← h_{a,b}(x)
15:           v_a ← v_a + v_ab
16:           v_b ← v_b + (1 − v_ab)
17:           l_a ← l_a + (1 − v_ab)
18:           l_b ← l_b + v_ab
19:         else    ⊳ No distinct relevances
20:           n_a ← n_a − 1
21:           n_b ← n_b − 1
22:    if v_top ≥ v_0 then
23:      P ← P ∪ {y_top}
24:      l_top ← +∞
25:  until v_top ≥ v_0 and |P| < n
26:  return P

10-fold cross validation was performed on the emotions, scene, yeast, medical and enron datasets; however, a simple train-test split was used for the remaining datasets as these simulations were too computationally expensive.

Instead of fine-tuning the hyperparameters of the base classifiers to optimise the performance of the multi-label learning method, default values were used, as shown in Table 3. This enables a fair comparison amongst the methods, and the hyperparameters are supposed to generalise across the different applications [37]. Note that this implies that a linear kernel was used for SVM, as was used in [16,23].

A thresholding function bipartitions the list of ranked labels into relevant and irrelevant labels. With the exception of CLR and QWML, which do not rely on a thresholding function, the threshold is automatically calibrated to minimise the difference between the label cardinality of the training set and that of the predictions on the test set [23].
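A simple way to realise this calibration is a grid search over candidate thresholds, as in the hedged sketch below; the grid, the helper name and the inputs are our assumptions, and the MEKA implementation may calibrate the threshold differently.

```python
# Hedged sketch of threshold calibration: choose the threshold whose predictions on the
# test set best match the label cardinality observed on the training set [23].
import numpy as np

def calibrate_threshold(train_cardinality, confidences, grid=np.linspace(0.0, 1.0, 101)):
    """confidences is a p x q matrix of per-label confidences on the test set."""
    best_t, best_gap = 0.5, np.inf
    for t in grid:
        pred_cardinality = (confidences >= t).sum(axis=1).mean()
        gap = abs(pred_cardinality - train_cardinality)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t
```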
from large to small datasets; vice versa for DT-based methods. The
same cannot be said for k-NN- and NB-based methods: k-NN-based
10-fold cross validation was performed on the emotions, scene, methods display a similar trend as DT-based methods, but do not
yeast, medical and enron datasets; however, a simple train-test perform particularly well for medium datasets; there is no signifi-
split was used for the remaining datasets as the computational ex- cant difference in ranking between small and large datasets for NB-
pense of these simulations was too expensive. based methods, but the ranking for medium datasets is the best.

Fig. 1. Rankings of base classifiers for datasets of varying orders of magnitude of examples, where a complete set of results was obtained for the multi-label learning methods and base classifiers.

Fig. 2. Comparison of all base classifiers against each other with the Nemenyi post-hoc test, with one panel per multi-label learning method: (a) Binary Relevance, (b) Classifier Chains, (c) Calibrated Label Ranking, (d) Quick Weighted Multi-label Learning, (e) Hierarchy Of Multilabel classifiERs, (f) RAndom k-labELsets, (g) Ensemble of Classifier Chains. The test is performed based on completed datasets for the different multi-label learning methods and for the accuracy metric. Groups of classifiers that are not significantly different (at a confidence level of 0.05) are connected by the thick line; CD is the critical difference.

4.2. Experiment 2: relationship between base classifier and multi-label learning method

Next, independent of the dataset, can the performance of the different base classifiers be related to the multi-label learning method?

To explore this question we perform the corrected Friedman test [38,39] followed by the Nemenyi post-hoc test [40] for each multi-label learning method, to detect whether there is a significant difference between the base classifiers across the datasets and, if so, where this difference lies. Fig. 2 compares the average ranking of the base classifiers at a confidence level of 0.05 for the different multi-label learning methods and for the accuracy metric as an example. The rankings of two classifiers are significantly different if they differ by at least the critical difference, defined as [40]:

criticalDifference = q_\alpha \sqrt{\frac{k(k+1)}{6N}},   (17)

where q_\alpha is the critical value at the confidence level \alpha, k is the number of base classifiers and N is the number of datasets. (q_{0.05} = 2.569 and q_{0.10} = 2.291 for k = 4 for the two-tailed Nemenyi test.)
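The statistical machinery of this experiment can be reproduced with SciPy, as in the sketch below: per-dataset ranks, the Friedman test and the Nemenyi critical difference of Eq. (17). The toy score matrix is illustrative, and the Iman–Davenport correction of the Friedman statistic [39] used in the paper is omitted for brevity.

```python
# Hedged sketch of the rank-based comparison of base classifiers across datasets.
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

def nemenyi_cd(q_alpha, k, N):
    """Critical difference of Eq. (17): q_alpha * sqrt(k(k+1)/(6N))."""
    return q_alpha * np.sqrt(k * (k + 1) / (6.0 * N))

# Rows are datasets, columns are base classifiers (k-NN, DT, NB, SVM); higher is better.
scores = np.array([[0.52, 0.55, 0.41, 0.58],
                   [0.61, 0.63, 0.50, 0.66],
                   [0.47, 0.50, 0.39, 0.49]])
stat, p_value = friedmanchisquare(*scores.T)             # uncorrected Friedman test
ranks = np.vstack([rankdata(-row) for row in scores])    # rank 1 = best on each dataset
avg_ranks = ranks.mean(axis=0)
cd = nemenyi_cd(2.569, k=scores.shape[1], N=scores.shape[0])   # q_0.05 for k = 4
# Two classifiers differ significantly if their average ranks differ by at least cd.
print(avg_ranks, cd, p_value)
```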

Table 4 shows a summary of the results for all the evaluation metrics, where "None" indicates that the post-hoc test is not powerful enough to detect any significant differences between the base classifiers. Given four classifiers a, b, c and d, "{a, b} ≻ {c}" indicates that: (1) classifiers a and b are significantly better than c, (2) there is no significant difference between a and b, and (3) there is insufficient data to reach any conclusion about d.

Table 4
Nemenyi post-hoc test at a 0.05 confidence level based on completed datasets for the different multi-label learning methods across the different evaluation metrics.

We can make the following observations. First, for any given multi-label learning method, the performance of the base classifier is generally consistent across all the evaluation metrics, i.e., a base classifier which is statistically superior for a given evaluation metric tends to also do better in terms of another metric. This gives some assurance, as comparisons in the literature are usually based on a few popular metrics such as Hamming loss and accuracy. Second, there is a direct relationship between the base classifier and the multi-label learning method: SVM is the best classifier for BR, CC, CLR, QWML and RAkEL; k-NN and NB for HOMER (DT has a significantly longer training time); and DT for ECC. The results at a confidence level of 0.1 (see Table 5) are similar.

4.3. Experiment 3: general results

Based on the results in Table 4, a similar technique as in [41] is employed to determine the overall score of each base classifier across all the multi-label learning methods. The overall score is defined as the number of wins (significantly better results) less the number of losses. As shown in Table 6, DT has the best score across all the metrics, but the trade-off is that it has the lowest score for training time. Conversely, NB has the fastest training time but the worst performance. This is not surprising given its simplistic assumptions that features are independent and normally distributed, which generally do not hold in practice. Despite SVM being the best classifier for most of the multi-label learning methods, its performance score is lower than DT's. This is because it is penalised for its performance with HOMER; k-NN's performance–training time trade-off is somewhere between DT and SVM. Differences between base classifiers are more obvious at a confidence level of 0.1, as shown in Table 7.

The delicious and bookmarks datasets have the highest complexity, and not all of the simulations were able to finish within the given memory and time constraints. The above analysis was repeated for all datasets, where a ranking was only assigned to simulation runs which finished—similar results were found.
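The overall score of Tables 6 and 7 amounts to counting wins and losses over all cells of the post-hoc comparison. The sketch below illustrates one plausible counting convention (one win per classifier beaten in a cell); the exact convention and the input format are our assumptions, not taken from the paper.

```python
# Hedged sketch of the "wins minus losses" overall score for the base classifiers.
def overall_scores(cells, classifiers=("k-NN", "DT", "NB", "SVM")):
    scores = {c: 0 for c in classifiers}
    for better, worse in cells:          # each cell is e.g. ({"DT", "SVM"}, {"NB"})
        for w in better:
            scores[w] += len(worse)      # one win per classifier it is significantly better than
        for l in worse:
            scores[l] -= len(better)     # one loss per classifier it is significantly worse than
    return scores

# Example: a single cell of the form "{DT, SVM} > NB"
print(overall_scores([({"DT", "SVM"}, {"NB"})]))
```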
Table 5
Nemenyi post-hoc test at a 0.1 confidence level based on completed datasets for the different multi-label learning methods across the different evaluation metrics.

| Metrics | BR | CC | CLR | QWML | HOMER | RAkEL | ECC |
Example-based:
| Accuracy | SVM ≻ NB | SVM ≻ NB | SVM ≻ {k-NN, NB} | SVM ≻ NB | {k-NN, DT, NB} ≻ SVM | SVM ≻ {k-NN, NB} | {DT, SVM} ≻ NB |
| Hamming loss | SVM ≻ {k-NN, NB} | {k-NN, DT, SVM} ≻ NB | SVM ≻ {k-NN, NB} | {DT, SVM} ≻ NB | {k-NN, DT, NB} ≻ SVM | SVM ≻ {k-NN, NB} | {DT, SVM} ≻ NB |
| Precision | SVM ≻ {k-NN, NB} | SVM ≻ NB | None | {DT, SVM} ≻ NB | {k-NN, DT, NB} ≻ SVM | SVM ≻ {k-NN, NB} | SVM ≻ {k-NN, NB}; DT ≻ NB |
| Recall | None | NB ≻ DT | SVM ≻ NB | None | None | SVM ≻ {k-NN, NB} | DT ≻ {k-NN, NB} |
| Subset accuracy | SVM ≻ {DT, NB} | {k-NN, SVM} ≻ NB | None | {k-NN, SVM} ≻ NB | {k-NN, DT, NB} ≻ SVM | SVM ≻ NB | {k-NN, SVM} ≻ NB |
| F1 | SVM ≻ NB | SVM ≻ NB | SVM ≻ {k-NN, NB} | {DT, SVM} ≻ NB | {k-NN, DT, NB} ≻ SVM | SVM ≻ {k-NN, NB} | {DT, SVM} ≻ NB |
Label-based:
| Macro precision | SVM ≻ k-NN | {k-NN, SVM} ≻ NB | SVM ≻ NB | SVM ≻ NB | {k-NN, DT, NB} ≻ SVM | SVM ≻ NB | None |
| Macro recall | None | NB ≻ DT | None | None | None | None | None |
| Macro F1 | None | None | None | SVM ≻ NB | {k-NN, DT, NB} ≻ SVM | SVM ≻ NB | None |
| Micro precision | SVM ≻ {k-NN, NB} | {DT, SVM} ≻ NB | None | SVM ≻ NB | {k-NN, DT, NB} ≻ SVM | SVM ≻ {k-NN, NB} | {DT, SVM} ≻ NB |
| Micro recall | None | NB ≻ DT | SVM ≻ NB | None | None | SVM ≻ {k-NN, NB} | DT ≻ {k-NN, NB} |
| Micro F1 | SVM ≻ {k-NN, NB} | SVM ≻ NB | SVM ≻ {k-NN, NB} | SVM ≻ NB | {k-NN, DT, NB} ≻ SVM | SVM ≻ {k-NN, NB}; DT ≻ NB | {DT, SVM} ≻ NB |
Ranking-based:
| One error | None | SVM ≻ NB | SVM ≻ {k-NN, NB}; DT ≻ NB | SVM ≻ {k-NN, NB}; DT ≻ NB | k-NN ≻ SVM | {DT, SVM} ≻ k-NN; {DT, SVM} ≻ NB | {DT, SVM} ≻ NB |
| Coverage | NB ≻ {k-NN, SVM}; DT ≻ SVM | None | SVM ≻ {k-NN, DT, NB} | DT ≻ {k-NN, SVM} | {k-NN, DT, NB} ≻ SVM | DT ≻ {k-NN, NB, SVM} | DT ≻ {k-NN, NB} |
| Rank loss | NB ≻ k-NN | None | SVM ≻ {k-NN, NB}; DT ≻ NB | DT ≻ k-NN | {k-NN, DT, NB} ≻ SVM | DT ≻ {k-NN, NB} | DT ≻ {k-NN, NB} |
| Average precision | None | SVM ≻ NB | SVM ≻ NB | None | {k-NN, DT, NB} ≻ SVM | {DT, SVM} ≻ k-NN; DT ≻ NB | DT ≻ {k-NN, NB}; SVM ≻ NB |
| Training time | NB ≻ {k-NN, DT, SVM} | NB ≻ {k-NN, DT, SVM} | NB ≻ k-NN | NB ≻ {DT, SVM} | {k-NN, NB} ≻ DT; NB ≻ SVM | k-NN ≻ {DT, SVM}; NB ≻ SVM | NB ≻ {k-NN, DT, SVM} |

Table 6
Overall scores from the Nemenyi post-hoc test at a 0.05 confidence level based on completed datasets for the different base classifiers across the different evaluation metrics.

| Metric | k-NN | DT | NB | SVM |
Example-based:
| Accuracy | 0 | 2 | -5 | 3 |
| Hamming loss | -1 | 3 | -7 | 5 |
| Precision | -1 | 2 | -6 | 5 |
| Recall | -2 | 1 | -2 | 3 |
| Subset accuracy | 3 | 1 | -5 | 1 |
| F1 | 0 | 2 | -5 | 3 |
Label-based:
| Macro precision | 1 | 1 | -2 | 0 |
| Macro recall | 0 | -1 | 1 | 0 |
| Macro F1 | 1 | 1 | 1 | -3 |
| Micro precision | -1 | 2 | -5 | 4 |
| Micro recall | -1 | 1 | -2 | 2 |
| Micro F1 | -1 | 2 | -5 | 4 |
Ranking-based:
| One error | -1 | 3 | -5 | 3 |
| Coverage | -2 | 5 | 0 | -3 |
| Rank loss | 0 | 5 | -3 | -2 |
| Average precision | 0 | 4 | -3 | -1 |
| Training time | -1 | -6 | 13 | -6 |
| Total | -6 | 28 | -40 | 18 |

Table 7
Overall scores from the Nemenyi post-hoc test at a 0.1 confidence level based on completed datasets for the different base classifiers across the different evaluation metrics.

| Metric | k-NN | DT | NB | SVM |
Example-based:
| Accuracy | -1 | 2 | -6 | 5 |
| Hamming loss | -1 | 4 | -9 | 6 |
| Precision | -2 | 3 | -6 | 5 |
| Recall | -2 | 1 | -2 | 3 |
| Subset accuracy | 4 | 0 | -7 | 3 |
| F1 | -1 | 3 | -7 | 5 |
Label-based:
| Macro precision | 1 | 1 | -4 | 2 |
| Macro recall | 0 | -1 | 1 | 0 |
| Macro F1 | 1 | 1 | -1 | -1 |
| Micro precision | -1 | 3 | -6 | 4 |
| Micro recall | -2 | 1 | -2 | 3 |
| Micro F1 | -2 | 3 | -7 | 6 |
Ranking-based:
| One error | -3 | 5 | -9 | 7 |
| Coverage | -4 | 8 | 0 | -4 |
| Rank loss | -4 | 7 | -2 | -1 |
| Average precision | -2 | 5 | -4 | 1 |
| Training time | -1 | -7 | 15 | -7 |
| Total | -20 | 39 | -56 | 37 |

5. Conclusions

The performance of seven multi-label learning methods in relation to four types of base classifiers was studied using 11 benchmark datasets and 16 evaluation metrics. The corrected Friedman test with the corresponding Nemenyi post-hoc test were used for comparison of base classifiers over multiple datasets; statistical significance was determined at the 0.05 and 0.1 confidence levels. As not all the methods were able to finish within the given memory and time constraints, the analysis was first performed only for datasets where a complete set of results was obtained from all methods (and base classifiers), and then across all datasets. The following conclusions are reached:

(1) Base classifiers show some dependence on the size of the dataset. DT-based methods perform better for larger datasets while SVM-based methods perform better for smaller datasets. k-NN- and NB-based methods do not show a clear dependence on the size of the dataset.
(2) Independent of the dataset, there is a correlation between the base classifier and multi-label learning method. SVM is the best classifier for BR, CC, CLR, QWML and RAkEL; k-NN and NB for HOMER; and DT for ECC.
(3) A performance–training time trade-off is observed for the base classifiers. While DT is a good overall base classifier, the recommended choice depends on the application. Where computational speed is a concern (and some performance can be sacrificed), as in online machine learning, we recommend NB; where performance is important and one can afford to train a model for days to weeks, we recommend SVM (except for HOMER).

For datasets which are large (O(10^5) examples) and have a high complexity, many of the state-of-the-art methods studied are not able to finish even within two weeks of wall clock time. A future area of research is improving the computational speed of these algorithms. Approaches such as the use of deep belief networks [37] to reduce the complexity of the feature space representation are a step in the right direction.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Edward K. Y. Yapp: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing - original draft, Visualization, Project administration. Xiang Li: Writing - review & editing, Supervision. Wen Feng Lu: Writing - review & editing. Puay Siew Tan: Writing - review & editing, Supervision, Funding acquisition.

Acknowledgments

The authors would like to thank Peter Reutemann and Jesse Read for their support in the development of MEKA. This work was supported by the A∗STAR Computational Resource Centre through the use of its high performance computing facilities; and the SERC Strategic Funding (A1718g0040).

Supplementary material

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.neucom.2020.01.102.

References

[1] B. Klimt, Y. Yang, The Enron corpus: A new dataset for email classification research, in: Machine Learning: ECML 2004, Lecture Notes in Computer Science, vol. 3201, Springer, Berlin, Heidelberg, 2004, pp. 217–226, doi:10.1007/978-3-540-30115-8_22.
[2] A.N. Srivastava, B. Zane-Ulman, Discovering recurring anomalies in text reports regarding complex space systems, in: Proceedings of the IEEE Aerospace Conference, Big Sky, MT, USA, 2005, pp. 3853–3862, doi:10.1109/AERO.2005.1559692.
[3] I. Katakis, G. Tsoumakas, I. Vlahavas, Multilabel text classification for automated tag suggestion, in: Proceedings of the ECML/PKDD Discovery Challenge, Antwerp, Belgium, 2008, pp. 1–9.
[4] G. Tsoumakas, I. Katakis, I. Vlahavas, Effective and efficient multilabel classification in domains with large number of labels, in: Proceedings of the ECML/PKDD Workshop on Mining Multidimensional Data (MMD'08), Antwerp, Belgium, 2008, pp. 1–15.
[5] M.R. Boutell, J. Luo, X. Shen, C.M. Brown, Learning multi-label scene classification, Pattern Recogn. 37 (9) (2004) 1757–1771, doi:10.1016/j.patcog.2004.03.009.

[6] P. Duygulu, K. Barnard, J.F.G. de Freitas, D.A. Forsyth, Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary, in: Computer Vision — ECCV 2002, Lecture Notes in Computer Science, vol. 2353, Springer, Berlin, Heidelberg, 2002, pp. 97–112, doi:10.1007/3-540-47979-1_7.
[7] K. Trohidis, G. Tsoumakas, G. Kalliris, I. Vlahavas, Multi-label classification of music by emotion, EURASIP J. Audio Speech Music Process. 2011 (1) (2011) 4, doi:10.1186/1687-4722-2011-426793.
[8] C.G.M. Snoek, M. Worring, J.C. van Gemert, J.-M. Geusebroek, A.W.M. Smeulders, The challenge problem for automated detection of 101 semantic concepts in multimedia, in: Proceedings of the 14th ACM International Conference on Multimedia (MM'06), ACM Press, Santa Barbara, CA, USA, 2006, pp. 421–430, doi:10.1145/1180639.1180727.
[9] A. Elisseeff, J. Weston, A kernel method for multi-labelled classification, in: Advances in Neural Information Processing Systems 14, vol. 1, MIT Press, Cambridge, MA, USA, 2001, pp. 681–687.
[10] A. Rivolli, L.C. Parker, A.C.P.L.F. de Carvalho, Food truck recommendation using multi-label classification, in: Progress in Artificial Intelligence (EPIA 2017), Lecture Notes in Computer Science, vol. 10423, Springer, Berlin, Heidelberg, 2017, pp. 585–596, doi:10.1007/978-3-319-65340-2_48.
[11] G. Tsoumakas, I. Katakis, Multi-label classification: An overview, Int. J. Data Warehous. Min. 3 (3) (2007) 1–13, doi:10.4018/jdwm.2007070101.
[12] G. Madjarov, D. Kocev, D. Gjorgjevikj, S. Džeroski, An extensive experimental comparison of methods for multi-label learning, Pattern Recogn. 45 (9) (2012) 3084–3104, doi:10.1016/j.patcog.2012.03.004.
[13] M.-L. Zhang, Z.-H. Zhou, A review on multi-label learning algorithms, IEEE Trans. Knowl. Data Eng. 26 (8) (2014) 1819–1837, doi:10.1109/TKDE.2013.39.
[14] G. Tsoumakas, I. Katakis, Multi-label classification, in: Data Warehousing and Mining: Concepts, Methodologies, Tools and Applications, IGI Global, Hershey, PA, USA, 2008, pp. 64–74, doi:10.4018/978-1-59904-951-9.ch006.
[15] G. Tsoumakas, I. Vlahavas, Random k-labelsets: An ensemble method for multilabel classification, in: Machine Learning: ECML 2007, Lecture Notes in Computer Science, vol. 4701, Springer, Berlin, Heidelberg, 2007, pp. 406–417, doi:10.1007/978-3-540-74958-5_38.
[16] J. Read, B. Pfahringer, G. Holmes, E. Frank, Classifier chains for multi-label classification, in: Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2009), Lecture Notes in Computer Science, vol. 5782, Springer, Berlin, Heidelberg, 2009, pp. 254–269, doi:10.1007/978-3-642-04174-7_17.
[17] J.M. Moyano, E.L. Gibaja, K.J. Cios, S. Ventura, Review of ensembles of multi-label classifiers: Models, experimental study and prospects, Inf. Fusion 44 (2018) 33–45, doi:10.1016/j.inffus.2017.12.001.
[18] S.-H. Park, J. Fürnkranz, Efficient pairwise classification, in: Machine Learning: ECML 2007, Lecture Notes in Computer Science, vol. 4701, Springer, Berlin, Heidelberg, 2007, pp. 658–665, doi:10.1007/978-3-540-74958-5_65.
[19] E. Loza Mencía, S.-H. Park, J. Fürnkranz, Efficient voting prediction for pairwise multilabel classification, Neurocomputing 73 (7–9) (2010) 1164–1176, doi:10.1016/j.neucom.2009.11.024.
[20] J. Fürnkranz, E. Hüllermeier, E. Loza Mencía, K. Brinker, Multilabel classification via calibrated label ranking, Mach. Learn. 73 (2) (2008) 133–153, doi:10.1007/s10994-008-5064-8.
[21] E. Hüllermeier, J. Fürnkranz, W. Cheng, K. Brinker, Label ranking by learning pairwise preferences, Artif. Intell. 172 (16–17) (2008) 1897–1916, doi:10.1016/j.artint.2008.08.002.
[22] G. Tsoumakas, I. Katakis, I. Vlahavas, Random k-labelsets for multilabel classification, IEEE Trans. Knowl. Data Eng. 23 (7) (2011) 1079–1089, doi:10.1109/TKDE.2010.164.
[23] J. Read, B. Pfahringer, G. Holmes, E. Frank, Classifier chains for multi-label classification, Mach. Learn. 85 (3) (2011) 333–359, doi:10.1007/s10994-011-5256-5.
[24] L. Breiman, Bagging predictors, Mach. Learn. 24 (2) (1996) 123–140, doi:10.1007/BF00058655.
[25] R.E. Schapire, Y. Singer, BoosTexter: A boosting-based system for text categorization, Mach. Learn. 39 (2/3) (2000) 135–168, doi:10.1023/A:1007649029923.
[26] S. Godbole, S. Sarawagi, Discriminative methods for multi-labeled classification, in: Advances in Knowledge Discovery and Data Mining (PAKDD 2004), Lecture Notes in Computer Science, vol. 3056, Springer, Berlin, Heidelberg, 2004, pp. 22–30, doi:10.1007/978-3-540-24775-3_5.
[27] N. Ghamrawi, A. McCallum, Collective multi-label classification, in: Proceedings of the 14th ACM International Conference on Information and Knowledge Management (CIKM'05), ACM Press, Bremen, Germany, 2005, pp. 195–200, doi:10.1145/1099554.1099591.
[28] Y. Yang, An evaluation of statistical approaches to text categorization, Inf. Retr. 1 (1–2) (1999) 69–90, doi:10.1023/A:1009982220290.
[29] G. Tsoumakas, I. Katakis, I. Vlahavas, Mining multi-label data, in: Data Mining and Knowledge Discovery Handbook, Springer, Boston, MA, USA, 2009, pp. 667–685, doi:10.1007/978-0-387-09823-4_34.
[30] G. Tsoumakas, E. Spyromitros-Xioufis, J. Vilcek, I. Vlahavas, MULAN: A Java library for multi-label learning, J. Mach. Learn. Res. 12 (2011) 2411–2414.
[31] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, The WEKA data mining software, ACM SIGKDD Explorations Newsletter 11 (1) (2009) 10–18, doi:10.1145/1656274.1656278.
[32] D.W. Aha, D. Kibler, M.K. Albert, Instance-based learning algorithms, Mach. Learn. 6 (1) (1991) 37–66, doi:10.1023/A:1022689900470.
[33] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, CA, USA, 1993.
[34] G.H. John, P. Langley, Estimating continuous distributions in Bayesian classifiers, in: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (UAI'95), Morgan Kaufmann Publishers, Montréal, Qué., Canada, 1995, pp. 338–345.
[35] J.C. Platt, Fast training of support vector machines using sequential minimal optimization, in: Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, USA, 1999, pp. 185–208.
[36] J. Read, P. Reutemann, B. Pfahringer, G. Holmes, MEKA: A multi-label/multi-target extension to WEKA, J. Mach. Learn. Res. 17 (2016) 1–5.
[37] J. Read, F. Perez-Cruz, Deep learning for multi-label classification, 2014.
[38] M. Friedman, A comparison of alternative tests of significance for the problem of m rankings, Ann. Math. Stat. 11 (1) (1940) 86–92, doi:10.1214/aoms/1177731944.
[39] R.L. Iman, J.M. Davenport, Approximations of the critical region of the Friedman statistic, Commun. Stat. Theory Methods 9 (6) (1980) 571–595, doi:10.1080/03610928008827904.
[40] P.B. Nemenyi, Distribution-free Multiple Comparisons, Ph.D. thesis, Princeton University, 1963.
[41] E. Spyromitros, G. Tsoumakas, I. Vlahavas, An empirical study of lazy multilabel classification algorithms, in: Artificial Intelligence: Theories, Models and Applications (SETN 2008), Lecture Notes in Computer Science, vol. 5138, Springer, Berlin, Heidelberg, 2008, pp. 401–406, doi:10.1007/978-3-540-87881-0_40.

Dr Edward K. Y. Yapp is a Scientist at the Singapore Institute of Manufacturing Technology (SIMTech). He received his B.Eng. and B.Fin. degrees from The University of Adelaide in 2010, and his Ph.D. degree from the University of Cambridge in 2016. His research interests are in combustion and artificial intelligence.

Dr Xiang Li is currently a Senior Scientist and Team Lead at the Singapore Institute of Manufacturing Technology (SIMTech). She has more than 20 years of experience in research on machine learning, data mining and artificial intelligence. Her research interests include big data analytics, machine learning, deep learning, data mining, decision support systems, and knowledge-based systems.

Dr Wen Feng Lu is currently an Associate Professor in the Department of Mechanical Engineering at the National University of Singapore (NUS). He has about 30 years of research experience in intelligent manufacturing, including using machine learning in data analytics. His research interests include machine learning, data analytics, intelligent manufacturing, engineering design technology, and 3D printing.

Dr Puay Siew Tan leads the Manufacturing Control Tower™ (MCT™) responsible for the setup of Model Factory@SIMTech. Her research has been in the cross-field disciplines of Computer Science and Operations Research for cyber physical production system (CPPS) collaboration, in particular sustainable complex manufacturing and supply chain operations. To this end, she has been active in using context-aware and services techniques.