Comparison of Base Classifiers for Multi-Label Learning

E.K.Y. Yapp, X. Li, W.F. Lu, P.S. Tan

Neurocomputing 394 (2020) 51–60
doi:10.1016/j.neucom.2020.01.102

Article history: Received 27 September 2019; Revised 20 November 2019; Accepted 28 January 2020; Available online 6 February 2020. Communicated by Dr. Oneto Luca.

Keywords: Classification; Classifier; Experimental comparison; Multi-label; Multilabel

Abstract: Multi-label learning methods can be categorised into algorithm adaptation, problem transformation and ensemble methods. Some of these methods depend on a base classifier and the relationship is not well understood. In this paper the sensitivity of five problem transformation and two ensemble methods to four types of classifiers is studied. Their performance across 11 benchmark datasets is measured using 16 evaluation metrics. The best classifier is shown to depend on the method: Support Vector Machines (SVM) for binary relevance, classifier chains, calibrated label ranking, quick weighted multi-label learning and RAndom k-labELsets; k-Nearest Neighbours (k-NN) and Naïve Bayes (NB) for Hierarchy Of Multilabel classifiERs; and Decision Trees (DT) for ensemble of classifier chains. The statistical performance of a classifier is also found to be generally consistent across the metrics for any given method. Overall, DT and SVM have the best performance–computational time trade-off, followed by k-NN and NB.

© 2020 Elsevier B.V. All rights reserved.
Table 1
Summary of the problem transformation (rows 1–5) and ensemble (rows 6 and 7) methods used in this study.

| Method | Description | Advantages | Disadvantages |
|---|---|---|---|
| Binary Relevance (BR) [11] | Individual classifier for each label | Simple | Ignores label correlations |
| Classifier Chain (CC) [16] | Extension of BR | Models label correlations | Performance depends on the order of classifiers in the chain |
| Calibrated Label Ranking (CLR) [20] | Ranking by pairwise comparison method | Mitigates class imbalance issue | Number of classifiers scales quadratically with the number of labels |
| Quick Weighted Multi-label Learning (QWML) [19] | Extension of CLR | Reduces the number of classifiers that are evaluated to make a prediction | Number of classifiers that has to be stored is still quadratic |
| Hierarchy Of Multilabel classifiERs (HOMER) [4] | Label powerset method | Computationally efficient for datasets with a large number of labels | Additional parameter to tune, i.e. the number of clusters |
| RAndom k-labELsets (RAkEL) [15] | Extension of the label powerset method | Improvement over the label powerset method for a large number of labels and training examples | Random nature may include models that affect the ensemble in a negative way |
| Ensemble of Classifier Chains (ECC) [16] | Extension of CC | Chain ordering is less likely to negatively affect performance | Potentially large redundancy in learning space |
an interesting avenue of future research as they may have specific classifier advantages. We also developed one of the problem transformation methods (quick weighted multi-label learning [18,19]) without which the modelling of certain datasets was not possible.

The paper is organised as follows: Section 2 is a background on the multi-label learning methods used in this study. Section 3 provides details on the evaluation metrics, datasets, and the setup and method used. The key results and discussion are in Section 4, followed by the conclusions in Section 5. The complete set of results may be found in the Supplementary Material.

2. Background

In this section, we provide a brief description of the five problem transformation and two ensemble methods used in this study, which all depend on a base classifier. From here on, it is assumed that binary labels are assigned to each training example. A summary of each method as well as its advantages and disadvantages may be found in Table 1.

2.1. Problem transformation methods

Binary Relevance (BR) considers each label as an independent binary problem and, in a one-vs-all or one-vs-the-rest strategy, one classifier is fitted per label [11]. Classifier Chains (CC) is an extension of BR where classifiers are linked along a chain (of length equal to the number of labels q) and the feature space of each link is extended with the label relevances of all previous links. By passing information between classifiers, it is capable of exploiting label correlations, unlike BR [16].
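The chaining mechanism is straightforward to make concrete. Below is a minimal Python sketch of CC (not the MEKA implementation used in this paper); scikit-learn's LogisticRegression is an assumed stand-in base classifier, and BR is recovered by dropping the two np.hstack lines so that every label is predicted from the original features alone.

```python
# Illustrative sketch only: a classifier chain over a binary label matrix.
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

class ClassifierChain:
    def __init__(self, base=None, order=None):
        self.base = base if base is not None else LogisticRegression()
        self.order = order                       # chain order; defaults to 0..q-1

    def fit(self, X, Y):
        """X: (n, d) feature matrix; Y: (n, q) binary label matrix."""
        q = Y.shape[1]
        self.order_ = self.order if self.order is not None else list(range(q))
        self.models_, Z = [], X
        for j in self.order_:
            self.models_.append(clone(self.base).fit(Z, Y[:, j]))
            Z = np.hstack([Z, Y[:, [j]]])        # extend features with label j
        return self

    def predict(self, X):
        Y, Z = np.zeros((X.shape[0], len(self.order_)), dtype=int), X
        for model, j in zip(self.models_, self.order_):
            Y[:, j] = model.predict(Z)           # pass relevance down the chain
            Z = np.hstack([Z, Y[:, [j]]])
        return Y
```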
In Ranking by Pairwise Comparison (RPC), the multi-label dataset is first transformed into q(q−1)/2 binary training sets, one for each label pair, where only training examples with distinct relevance are included. Then, in a one-vs-one strategy, one classifier is constructed for each label pair. To predict the labels of a new example, all the classifiers are invoked and the labels are ranked according to the votes received (soft or hard output). A thresholding function bipartitions the list of ranked labels into relevant and irrelevant labels [21].

It was argued that RPC does not have a natural "zero-point"; therefore, Calibrated Label Ranking (CLR) was proposed, where a calibration/virtual label is introduced which represents a split point between relevant and irrelevant labels [20]. This approach also negates the need for a thresholding function. In addition to the q(q−1)/2 classifiers trained in the RPC step, q classifiers are trained, one for each label, where each example which is annotated with a label is a positive example for the label and negative for the calibration label, and vice versa. This in fact directly corresponds to the classifiers trained in BR.

However, during the testing phase, a quadratic number of classifiers has to be evaluated. It is possible to evaluate a smaller subset of it to determine the label with the highest accumulated voting mass. The quick weighted algorithm excludes, or exactly prunes, labels from the set of possible top-ranked labels if they cannot reach the maximal voting mass in the remaining evaluations [18]. In the context of Quick Weighted Multi-label Learning (QWML), there are two variants: (1) where the above process is repeated until the top label that is returned is the calibration label and all remaining labels are irrelevant, and (2) an improved version where evaluation of the current top-ranked label is stopped once it has received a higher voting mass than the calibration label [19]. The latter is the variant that will be used in this paper.

Hierarchy Of Multilabel classifiERs (HOMER) [4] was developed to transform a large set of labels into a tree-shaped hierarchy of simpler multi-label classification tasks, each dealing with a much smaller number of labels and a more balanced example distribution. In a top-down, depth-first approach starting with the root, HOMER recursively and evenly distributes the labels into k children nodes. This is achieved using a balanced k-means algorithm which is repeated for a number of iterations. A training example is annotated with a meta-label if it is annotated with at least one of the labels at the respective nodes. A classifier is then trained at each node, apart from the leaves (single label), for the meta-labels of its children.
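The meta-label construction admits a short sketch. The following assumes the labels at a node have already been distributed into k groups (the paper uses the balanced k-means for this; fixed groups are hard-coded here purely for illustration):

```python
# Illustrative sketch only: HOMER's meta-label annotation rule.
import numpy as np

def meta_labels(Y, groups):
    """Y: (p, q) binary label matrix; groups: one list of label indices per child.
    An example is annotated with a child's meta-label if it is annotated with
    at least one of that child's labels."""
    return np.stack([Y[:, g].any(axis=1) for g in groups], axis=1).astype(int)

# e.g. q = 6 labels distributed evenly into k = 3 children nodes:
# meta = meta_labels(Y, [[0, 1], [2, 3], [4, 5]])
# A multi-label classifier is trained on (X, meta) at the node, and HOMER
# recurses into each child using only the examples annotated with its meta-label.
```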
2.2. Ensemble methods

The label powerset method considers each subset of the labels in the training set as a different class of a single-label classification problem. However, there are two limitations: (1) it is unable to predict label sets not in the training set, and (2) the computational complexity grows exponentially with the number of labels. RAndom k-labELsets (RAkEL) [15] constructs an ensemble of m label powerset classifiers (addresses the first limitation) trained on small random subsets of size k of the label set (addresses the second limitation). The average binary decisions across the m classifiers for each label which exceed a threshold determine the final prediction (see [22], Table 1, for a worked example).

The order of classifiers in CC is random and could be poorly ordered. Ensemble of Classifier Chains (ECC) [16] trains m classifier chains, each with a random chain order, on a random subset of the training set (analogous to the label set subsampling in RAkEL). Sampling can be performed without replacement [16], or with replacement [23] using the bagging scheme for higher predictive performance [24], as will be used in this paper.
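The ensemble construction can be sketched on top of the ClassifierChain sketch from Section 2.1. The ensemble size m = 10, the 0.5 threshold and the full-size bootstrap are illustrative defaults rather than the settings of this paper, and degenerate bootstrap samples where a label has only one class are not handled.

```python
# Illustrative sketch only: ECC with bagging and random chain orders.
import numpy as np

def ecc_fit_predict(X, Y, X_test, m=10, threshold=0.5, seed=0):
    rng = np.random.default_rng(seed)
    n, q = Y.shape
    votes = np.zeros((X_test.shape[0], q))
    for _ in range(m):
        idx = rng.integers(0, n, size=n)         # sampling with replacement
        order = list(rng.permutation(q))         # random chain order
        chain = ClassifierChain(order=order).fit(X[idx], Y[idx])
        votes += chain.predict(X_test)
    return (votes / m >= threshold).astype(int)  # average votes, then threshold
```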
3. Methodology

In this section, the evaluation metrics used to assess the performance of the multi-label learning methods are presented, followed by the datasets and the computational setup used.

3.1. Evaluation metrics

The metrics are categorised into: (1) example-based, where the metric is first computed for each example across the labels, then averaged across all the examples, (2) label-based, where the metric is first computed for each label across the examples, then averaged across all the labels, and (3) ranking-based, as a confidence is associated with the prediction of each label which may be used to rank the labels.

3.1.1. Example-based metrics

Subset accuracy is the fraction of correctly classified examples, i.e. the predicted label set is identical to the true label set [27]:

subsetAccuracy = \frac{1}{p} \sum_{i=1}^{p} [\![\, h(x_i) = Y_i \,]\!],  (6)

where the Iverson bracket [\![ P ]\!] converts the logical proposition P to 1 if the proposition is satisfied, and 0 otherwise.
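Equation (6) is a strict exact-match ratio, which a two-line sketch makes plain (illustrative Python, with Y_true and Y_pred as binary label matrices):

```python
import numpy as np

def subset_accuracy(Y_true, Y_pred):
    # Eq. (6): an example counts only if its whole label vector matches.
    return (Y_true == Y_pred).all(axis=1).mean()
```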
3.1.2. Label-based metrics

There are two ways of averaging across the labels [28]:

Macro-averaging:

B_{macro} = \frac{1}{q} \sum_{j=1}^{q} B(TP_j, FP_j, FN_j),  (7)

Micro-averaging:

B_{micro} = B\left( \sum_{j=1}^{q} TP_j, \; \sum_{j=1}^{q} FP_j, \; \sum_{j=1}^{q} FN_j \right),  (8)

where B(TP, FP, FN) denotes a binary evaluation metric (e.g. precision, recall or F1) computed from the true positive, false positive and false negative counts, here of label y_j.
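The difference between Eqs. (7) and (8) is easiest to see in code. A minimal sketch with B taken to be the F1 measure (any binary metric of TP, FP and FN works the same way):

```python
import numpy as np

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_micro_f1(Y_true, Y_pred):
    """Y_true, Y_pred: (p, q) binary label matrices."""
    tp = ((Y_true == 1) & (Y_pred == 1)).sum(axis=0)   # per-label counts
    fp = ((Y_true == 0) & (Y_pred == 1)).sum(axis=0)
    fn = ((Y_true == 1) & (Y_pred == 0)).sum(axis=0)
    macro = np.mean([f1(t, f, n) for t, f, n in zip(tp, fp, fn)])  # Eq. (7)
    micro = f1(tp.sum(), fp.sum(), fn.sum())                       # Eq. (8)
    return macro, micro
```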
Table 2
Multi-label datasets and various statistics. Rows 1–5 correspond to simpler datasets where 10-fold cross validation was used, while rows 6–11 correspond to datasets which are more complex and a train-test split was used instead. Columns: name, domain, number of training and test examples, numbers of nominal and numeric features, number of labels, label cardinality, label density, proportion of distinct label sets, and complexity.
3.1.3. Ranking-based metrics

Ranking loss is the average fraction of reversely-ordered label pairs [25]:

rankingLoss = \frac{1}{p} \sum_{i=1}^{p} \frac{1}{|Y_i| |\bar{Y}_i|} \left|\{ (y', y'') \mid f(x_i, y') \le f(x_i, y''), \; (y', y'') \in Y_i \times \bar{Y}_i \}\right|,  (11)

given that y' ∈ Y_i and y'' ∈ \bar{Y}_i (the complementary set of irrelevant labels). The classification function f misorders the pair if f(x_i, y') ≤ f(x_i, y'').

Average precision evaluates the average fraction of relevant labels ranked above a particular relevant label y ∈ Y_i [25]:

averagePrecision = \frac{1}{p} \sum_{i=1}^{p} \frac{1}{|Y_i|} \sum_{y \in Y_i} \frac{\left|\{ y' \mid \mathrm{rank}_f(x_i, y') \le \mathrm{rank}_f(x_i, y), \; y' \in Y_i \}\right|}{\mathrm{rank}_f(x_i, y)}.  (12)

In summary, lower values are desired for hammingLoss, oneError, coverage and rankingLoss in Eqs. (1) and (9)–(11), respectively; and higher values for the rest.
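A hedged Python sketch of Eqs. (11) and (12), assuming a score matrix F with F[i, j] = f(x_i, y_j), where higher means more relevant:

```python
import numpy as np

def ranking_loss(Y, F):
    """Eq. (11): average fraction of misordered (relevant, irrelevant) pairs."""
    losses = []
    for y, f in zip(Y, F):
        rel, irr = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
        if len(rel) and len(irr):
            misordered = sum(f[r] <= f[i] for r in rel for i in irr)
            losses.append(misordered / (len(rel) * len(irr)))
    return float(np.mean(losses))

def average_precision(Y, F):
    """Eq. (12): fraction of relevant labels ranked at or above each relevant label."""
    scores = []
    for y, f in zip(Y, F):
        rel = np.flatnonzero(y == 1)
        if len(rel):
            rank = (-f).argsort().argsort() + 1   # rank 1 = highest score
            scores.append(np.mean([(rank[rel] <= rank[r]).sum() / rank[r]
                                   for r in rel]))
    return float(np.mean(scores))
```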
3.2. Datasets

The multi-label datasets used in this study² and their associated statistics are shown in Table 2. Label cardinality is the average number of labels per example [11]:

labelCardinality = \frac{1}{p} \sum_{i=1}^{p} |Y_i|.  (13)

Label density is the label cardinality normalised by the total number of possible labels [11]:

labelDensity = \frac{1}{q} \times labelCardinality = \frac{1}{pq} \sum_{i=1}^{p} |Y_i|.  (14)

Label diversity, or the number of distinct label sets, is [29]:

labelDiversity = |\{ Y \mid \exists x : (x, Y) \in D \}|,  (15)

where D is the set of training examples. The proportion of distinct label sets is [16]:

proportionLabelDiversity = \frac{1}{p} \times labelDiversity = \frac{|\{ Y \mid \exists x : (x, Y) \in D \}|}{p}.  (16)

The datasets have been sorted according to their complexity (defined as the product of the number of training examples, features and labels [16]) in increasing order from top to bottom. Note that all of the datasets, with the exception of bookmarks, have been pre-split into training and test sets with an approximate 2:1 ratio in most cases. For the bookmarks dataset, the first 60000 examples were taken as the training set while the remaining 27856 examples were taken as the test set [12]. This was performed as the original dataset (tas: contains the tags that a particular user has assigned to a particular item) from which the bookmarks and bibtex datasets are derived is not available [3]. A shorter version of the tmc2007 dataset was used, where the top 500 of the original 49060 features were selected [2]. Further details of each dataset can be found in [12].

² https://github.com/tsoumakas/mulan.
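Eqs. (13)–(16) reduce to a few lines; an illustrative sketch over a binary label matrix:

```python
import numpy as np

def label_statistics(Y):
    """Y: (p, q) binary label matrix of the training set."""
    p, q = Y.shape
    cardinality = Y.sum(axis=1).mean()          # Eq. (13)
    density = cardinality / q                   # Eq. (14)
    diversity = len({tuple(row) for row in Y})  # Eq. (15)
    proportion = diversity / p                  # Eq. (16)
    return cardinality, density, diversity, proportion
```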
3.3. Setup and method

All experiments were performed on an Intel Xeon Phi 7210 with the Intel KNL (Knights Landing) architecture and the RedHat Enterprise Linux 7 operating system. Each experiment was allowed up to 128 GB RAM and 14 days of wall clock time.

All of the algorithms were evaluated using the following libraries under Java JDK 10.0.2. MULAN² is a Java library for multi-label learning and only provides an application programming interface [30]—release 1.5.0 was used for CLR, QWML, HOMER and RAkEL. WEKA is a data mining software implemented in Java and provides both graphical and command-line interfaces [31]—development build 3.9.2 was used for the base classifiers k-Nearest Neighbours (k-NN) [32], Decision Trees (DT) [33], Naïve Bayes (NB) [34] and Support Vector Machines (SVM) [35]. MEKA³ is a multi-label/multi-target extension to WEKA [36] and provides a wrapper around MULAN—release 1.9.3 was used for BR, CC and ECC.

³ https://github.com/Waikato/meka.

In QWML, pairwise classifiers are selected based on the amount of potential voting mass a label has not received, or the voting loss defined as l_i := p_i − v_i, where p_i is the number of evaluated classifiers and v_i is the number of votes received by label y_i. Two changes had to be made to the algorithm without which the modelling of certain datasets was not possible. First, in the case that no data exist for the one-vs-one learning between labels y_a and y_b, i.e. no distinct relevances, the number of possible evaluated classifiers is no longer equal to the number of labels; it is tracked separately for each label and adjusted accordingly. Second, an explicit check is made that all of the possible classifiers have been evaluated. The modified QWML method is shown in Algorithm 2.

Algorithm 2 Modified Quick Weighted Multi-label Learning (QWML)
Input: example x; classifiers {h_{u,v} | u < v; y_u, y_v ∈ Y}; losses l_0, ..., l_q = 0; number of possible evaluated classifiers n_0, ..., n_q = q
 1: v_0 ← 0, P ← ∅
 2: for j = 1 to q do
 3:   l_j ← h_{0,j}(x)                      ▷ Losses based on classifiers of calibrated label y_0
 4:   v_0 ← v_0 + (1 − l_j)                 ▷ Votes for y_0
 5: repeat
 6:   while y_top not determined do
 7:     y_a ← arg min_{y_j ∈ Y} l_j
 8:     if p_a < n_a then                   ▷ Check of the number of evaluated classifiers of y_a
 9:       y_b ← arg min_{y_j ∈ Y \ {y_a}} l_j such that h_{a,b} not yet evaluated
10:       if v_a ≥ v_0 or no y_b exists then
11:         y_top ← y_a
12:       else
13:         if h_{a,b} exists then
14:           v_{ab} ← h_{a,b}(x)
15:           v_a ← v_a + v_{ab}
16:           v_b ← v_b + (1 − v_{ab})
17:           l_a ← l_a + (1 − v_{ab})
18:           l_b ← l_b + v_{ab}
19:         else                            ▷ No distinct relevances
20:           n_a ← n_a − 1
21:           n_b ← n_b − 1
22:   if v_top ≥ v_0 then
23:     P ← P ∪ {y_top}
24:     l_top ← +∞
25: until v_top < v_0 or |P| = q
26: return P

Table 3
The multi-label learning methods (rows 1–7) and base classifiers (rows 8–11) and their associated parameters used in this work.

Instead of fine tuning the hyperparameters of the base classifiers to optimise the performance of the multi-label learning method, default values were used, as shown in Table 3. This enables a fair comparison amongst the methods, and the hyperparameters are supposed to generalise across the different applications [37]. Note that this implies that a linear kernel was used for SVM, as was used in [16,23].

A thresholding function bipartitions the list of ranked labels into relevant and irrelevant labels. With the exception of CLR and QWML, which do not rely on a thresholding function, the threshold is automatically calibrated to minimise the difference between the label cardinality of the training set and that of the predictions on the test set [23].
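The calibration can be sketched as a simple one-dimensional search; the uniform grid of candidate thresholds below is an assumption for illustration, not the exact procedure of [23]:

```python
import numpy as np

def calibrate_threshold(train_cardinality, test_scores, grid=None):
    """Pick the threshold whose predicted label cardinality on the test
    scores is closest to the label cardinality of the training set."""
    grid = np.linspace(0.0, 1.0, 101) if grid is None else grid
    gaps = [abs((test_scores >= t).sum(axis=1).mean() - train_cardinality)
            for t in grid]
    return float(grid[int(np.argmin(gaps))])
```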
10-fold cross validation was performed on the emotions, scene, yeast, medical and enron datasets; however, a simple train-test split was used for the remaining datasets as the computational expense of these simulations was too high.

4. Results and discussion

The complete set of results for the different datasets, multi-label learning methods and base classifiers across the different evaluation metrics (including training time) may be found in Supplementary Material Tables S1 to S17. Unless stated otherwise, the discussion is based only on the datasets where a complete set of results is obtained for all multi-label learning methods and base classifiers.

4.1. Experiment 1: relationship between base classifier and size of dataset

We would like to understand how the different base classifiers perform. Ref. [12] argues that the performance of the classifier is related to the size of the dataset, where SVM-based methods perform better for smaller datasets while DT-based methods perform better for larger datasets. However, the base classifier was not systematically varied; therefore, the effect of the multi-label learning method was not separated from the base classifier. To further test this hypothesis, we rank the base classifiers across all the multi-label learning methods and evaluation metrics, and compare the average rankings for small, medium and large datasets as defined by the varying orders of magnitude of examples. The results are shown in Fig. 1, where the capped bars represent ± two standard deviations around the average ranking. There is indeed a clear trend where the ranking of SVM-based methods improves from large to small datasets, and vice versa for DT-based methods. The same cannot be said for k-NN- and NB-based methods: k-NN-based methods display a similar trend as DT-based methods, but do not perform particularly well for medium datasets; there is no significant difference in ranking between small and large datasets for NB-based methods, but the ranking for medium datasets is the best.
Fig. 1. Rankings of base classifiers for datasets of varying orders of magnitude of examples, where a complete set of results was obtained for the multi-label learning methods and base classifiers.
4.2. Experiment 2: does the best base classifier depend on the multi-label learning method?

To explore this question we perform the corrected Friedman test [38,39] followed by the Nemenyi post-hoc test [40] for each multi-label learning method to detect whether there is a significant difference between the base classifiers across the datasets and, if so, where this difference lies. Fig. 2 compares the average ranking of the base classifiers at a confidence level of 0.05 for the different multi-label learning methods. Two classifiers are significantly different if they differ by at least the critical difference defined as [40]:

criticalDifference = q_\alpha \sqrt{\frac{k(k+1)}{6N}},  (17)

where q_\alpha is the critical value at the confidence level \alpha, k is the number of base classifiers and N is the number of datasets (q_{0.05} = 2.569 and q_{0.10} = 2.291 for k = 4 for the two-tailed Nemenyi test).
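For concreteness, Eq. (17) with the critical values quoted above; the dataset count N = 10 is purely illustrative:

```python
import math

def critical_difference(q_alpha, k, N):
    return q_alpha * math.sqrt(k * (k + 1) / (6 * N))  # Eq. (17)

# Four base classifiers (k = 4) over an assumed N = 10 datasets:
cd_005 = critical_difference(2.569, k=4, N=10)  # q_0.05 = 2.569
cd_010 = critical_difference(2.291, k=4, N=10)  # q_0.10 = 2.291
```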
Table 4 shows the results for the different multi-label learning methods across the different evaluation metrics, where "None" indicates that the post-hoc test is not powerful enough to detect any significant differences between the base classifiers. Given four classifiers a, b, c and d, {a, b}{c} indicates that: (1) classifiers a and b are significantly better than c, (2) there is no significant difference between a and b, and (3) there is insufficient data to reach any conclusion about d.

Table 4
Nemenyi post-hoc test at a 0.05 confidence level based on completed datasets for the different multi-label learning methods (BR, CC, CLR, QWML, HOMER, RAkEL, ECC) across the different evaluation metrics (Hamming loss, accuracy, precision, macro/micro precision, recall and F1, one error, coverage, rank loss, the ranking-based metrics and training time). Entries take the form "SVM NB" or "DT {k-NN, NB}", or "None" where no significant difference was detected.
We can make the following observations. First, for any given multi-label learning method, the performance of the base classifier is generally consistent across all the evaluation metrics, i.e., a base classifier which is statistically superior for a given evaluation metric tends to also do better in terms of another metric. This gives some assurance as comparisons in the literature are usually based on a few popular metrics such as Hamming loss and accuracy. Second, there is a direct relationship between the base classifier and the multi-label learning method: SVM is the best classifier for BR, CC, CLR, QWML and RAkEL; k-NN and NB for HOMER (DT has a significantly longer training time); and DT for ECC. The results at a confidence level of 0.1 (see Table 5) are similar.
4.3. Experiment 3: general results

SVM has the highest score for all the metrics, but the tradeoff is that it has the lowest score for training time. Conversely, NB has the fastest training time but has lower scores across the other metrics. The remaining datasets are of higher complexity, and not all of the simulations were able to finish within the given memory and time constraints. The above analysis was repeated for all datasets, where a ranking was only assigned to simulation runs which finished—similar results were found.
Table 5
Nemenyi post-hoc test at a 0.1 confidence level based on completed datasets for the different multi-label learning methods across the different evaluation metrics.
References

[6] P. Duygulu, K. Barnard, J.F.G. de Freitas, D.A. Forsyth, Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary, in: A. Heyden, G. Sparr, M. Nielsen, P. Johansen (Eds.), Computer Vision — ECCV 2002. Lecture Notes in Computer Science, volume 2353, Springer, Berlin, Heidelberg, 2002, pp. 97–112, doi:10.1007/3-540-47979-1_7.
[7] K. Trohidis, G. Tsoumakas, G. Kalliris, I. Vlahavas, Multi-label classification of music by emotion, EURASIP J. Audio Speech Music Process. 2011 (1) (2011) 4, doi:10.1186/1687-4722-2011-426793.
[8] C.G.M. Snoek, M. Worring, J.C. van Gemert, J.-M. Geusebroek, A.W.M. Smeulders, The challenge problem for automated detection of 101 semantic concepts in multimedia, in: Proceedings of the 14th ACM International Conference on Multimedia - MM'06, ACM Press, Santa Barbara, CA, USA, 2006, pp. 421–430, doi:10.1145/1180639.1180727.
[9] A. Elisseeff, J. Weston, A kernel method for multi-labelled classification, in: T.G. Dietterich, S. Becker, Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems 14, volume 1, MIT Press, Cambridge, MA, USA, 2001, pp. 681–687.
[10] A. Rivolli, L.C. Parker, A.C.P.L.F. de Carvalho, Food truck recommendation using multi-label classification, in: F.M. Pires, S. Abreu (Eds.), Progress in Artificial Intelligence. EPIA 2017. Lecture Notes in Computer Science, volume 10423, Springer, Berlin, Heidelberg, 2017, pp. 585–596, doi:10.1007/978-3-319-65340-2_48.
[11] G. Tsoumakas, I. Katakis, Multi-label classification: An overview, Int. J. Data Warehous. Min. 3 (3) (2007) 1–13, doi:10.4018/jdwm.2007070101.
[12] G. Madjarov, D. Kocev, D. Gjorgjevikj, S. Džeroski, An extensive experimental comparison of methods for multi-label learning, Pattern Recogn. 45 (9) (2012) 3084–3104, doi:10.1016/j.patcog.2012.03.004.
[13] M.-L. Zhang, Z.-H. Zhou, A review on multi-label learning algorithms, IEEE Trans. Knowl. Data Eng. 26 (8) (2014) 1819–1837, doi:10.1109/TKDE.2013.39.
[14] G. Tsoumakas, I. Katakis, Multi-label classification, in: J. Wang (Ed.), Data Warehousing and Mining: Concepts, Methodologies, Tools and Applications, IGI Global, Hershey, PA, USA, 2008, pp. 64–74, doi:10.4018/978-1-59904-951-9.ch006.
[15] G. Tsoumakas, I. Vlahavas, Random k-labelsets: An ensemble method for multilabel classification, in: J.N. Kok, J. Koronacki, R.L. Mantaras, S. Matwin, D. Mladenič, A. Skowron (Eds.), Machine Learning: ECML 2007. Lecture Notes in Computer Science, volume 4701, Springer, Berlin, Heidelberg, 2007, pp. 406–417, doi:10.1007/978-3-540-74958-5_38.
[16] J. Read, B. Pfahringer, G. Holmes, E. Frank, Classifier chains for multi-label classification, in: W. Buntine, M. Grobelnik, D. Mladenić, J. Shawe-Taylor (Eds.), Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2009. Lecture Notes in Computer Science, volume 5782, Springer, Berlin, Heidelberg, 2009, pp. 254–269, doi:10.1007/978-3-642-04174-7_17.
[17] J.M. Moyano, E.L. Gibaja, K.J. Cios, S. Ventura, Review of ensembles of multi-label classifiers: Models, experimental study and prospects, Inf. Fus. 44 (2018) 33–45, doi:10.1016/j.inffus.2017.12.001.
[18] S.-H. Park, J. Fürnkranz, Efficient pairwise classification, in: J.N. Kok, J. Koronacki, R.L. Mantaras, S. Matwin, D. Mladenič, A. Skowron (Eds.), Machine Learning: ECML 2007. Lecture Notes in Computer Science, volume 4701, Springer, Berlin, Heidelberg, 2007, pp. 658–665, doi:10.1007/978-3-540-74958-5_65.
[19] E. Loza Mencía, S.-H. Park, J. Fürnkranz, Efficient voting prediction for pairwise multilabel classification, Neurocomputing 73 (7-9) (2010) 1164–1176, doi:10.1016/j.neucom.2009.11.024.
[20] J. Fürnkranz, E. Hüllermeier, E. Loza Mencía, K. Brinker, Multilabel classification via calibrated label ranking, Mach. Learn. 73 (2) (2008) 133–153, doi:10.1007/s10994-008-5064-8.
[21] E. Hüllermeier, J. Fürnkranz, W. Cheng, K. Brinker, Label ranking by learning pairwise preferences, Artif. Intell. 172 (16-17) (2008) 1897–1916, doi:10.1016/j.artint.2008.08.002.
[22] G. Tsoumakas, I. Katakis, I. Vlahavas, Random k-labelsets for multilabel classification, IEEE Trans. Knowl. Data Eng. 23 (7) (2011) 1079–1089, doi:10.1109/TKDE.2010.164.
[23] J. Read, B. Pfahringer, G. Holmes, E. Frank, Classifier chains for multi-label classification, Mach. Learn. 85 (3) (2011) 333–359, doi:10.1007/s10994-011-5256-5.
[24] L. Breiman, Bagging predictors, Mach. Learn. 24 (2) (1996) 123–140, doi:10.1007/BF00058655.
[25] R.E. Schapire, Y. Singer, BoosTexter: A boosting-based system for text categorization, Mach. Learn. 39 (2/3) (2000) 135–168, doi:10.1023/A:1007649029923.
[26] S. Godbole, S. Sarawagi, Discriminative methods for multi-labeled classification, in: H. Dai, R. Srikant, C. Zhang (Eds.), Advances in Knowledge Discovery and Data Mining. PAKDD 2004. Lecture Notes in Computer Science, volume 3056, Springer, Berlin, Heidelberg, 2004, pp. 22–30, doi:10.1007/978-3-540-24775-3_5.
[27] N. Ghamrawi, A. McCallum, Collective multi-label classification, in: Proceedings of the 14th ACM International Conference on Information and Knowledge Management - CIKM'05, ACM Press, Bremen, Germany, 2005, pp. 195–200, doi:10.1145/1099554.1099591.
[28] Y. Yang, An evaluation of statistical approaches to text categorization, Inf. Retr. 1 (1-2) (1999) 69–90, doi:10.1023/A:1009982220290.
[29] G. Tsoumakas, I. Katakis, I. Vlahavas, Mining multi-label data, in: O. Maimon, L. Rokach (Eds.), Data Mining and Knowledge Discovery Handbook, Springer, Boston, MA, USA, 2009, pp. 667–685, doi:10.1007/978-0-387-09823-4_34.
[30] G. Tsoumakas, E. Spyromitros-Xioufis, J. Vilcek, I. Vlahavas, MULAN: A Java library for multi-label learning, J. Mach. Learn. Res. 12 (2011) 2411–2414.
[31] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, The WEKA data mining software, ACM SIGKDD Explorations Newsletter 11 (1) (2009) 10–18, doi:10.1145/1656274.1656278.
[32] D.W. Aha, D. Kibler, M.K. Albert, Instance-based learning algorithms, Mach. Learn. 6 (1) (1991) 37–66, doi:10.1023/A:1022689900470.
[33] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, CA, USA, 1993.
[34] G.H. John, P. Langley, Estimating continuous distributions in Bayesian classifiers, in: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence - UAI'95, Morgan Kaufmann Publishers, Montréal, Qué, Canada, 1995, pp. 338–345.
[35] J.C. Platt, Fast training of support vector machines using sequential minimal optimization, in: B. Schölkopf, C.J.C. Burges, A.J. Smola (Eds.), Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, USA, 1999, pp. 185–208.
[36] J. Read, P. Reutemann, B. Pfahringer, G. Holmes, MEKA: A multi-label/multi-target extension to WEKA, J. Mach. Learn. Res. 17 (2016) 1–5.
[37] J. Read, F. Perez-Cruz, Deep learning for multi-label classification, 2014.
[38] M. Friedman, A comparison of alternative tests of significance for the problem of m rankings, Ann. Math. Stat. 11 (1) (1940) 86–92, doi:10.1214/aoms/1177731944.
[39] R.L. Iman, J.M. Davenport, Approximations of the critical region of the Fbietkan statistic, Commun. Stat. Theory Methods 9 (6) (1980) 571–595, doi:10.1080/03610928008827904.
[40] P.B. Nemenyi, Distribution-free Multiple Comparisons, Ph.D. thesis, Princeton University, 1963.
[41] E. Spyromitros, G. Tsoumakas, I. Vlahavas, An empirical study of lazy multilabel classification algorithms, in: J. Darzentas, G.A. Vouros, S. Vosinakis, A. Arnellos (Eds.), Artificial Intelligence: Theories, Models and Applications. SETN 2008. Lecture Notes in Computer Science, volume 5138, Springer, Berlin, Heidelberg, 2008, pp. 401–406, doi:10.1007/978-3-540-87881-0_40.

Dr Edward K. Y. Yapp is a Scientist at the Singapore Institute of Manufacturing Technology (SIMTech). He received his B.Eng. and B.Fin. degrees from The University of Adelaide in 2010, and his Ph.D. degree from the University of Cambridge in 2016. His research interests are in combustion and artificial intelligence.

Dr Xiang Li is currently a Senior Scientist and Team Lead at the Singapore Institute of Manufacturing Technology (SIMTech). She has more than 20 years of experience in research on machine learning, data mining and artificial intelligence. Her research interests include big data analytics, machine learning, deep learning, data mining, decision support systems, and knowledge-based systems.

Dr Wen Feng Lu is currently an Associate Professor in the Department of Mechanical Engineering at the National University of Singapore (NUS). He has about 30 years of research experience in intelligent manufacturing, including using machine learning in data analytics. His research interests include machine learning, data analytics, intelligent manufacturing, engineering design technology, and 3D printing.

Dr Puay Siew Tan leads the Manufacturing Control Tower™ (MCT™), responsible for the setup of Model Factory@SIMTech. Her research has been in the cross-field disciplines of Computer Science and Operations Research for cyber physical production system (CPPS) collaboration, in particular sustainable complex manufacturing and supply chain operations. To this end, she has been active in using context-aware and services techniques.