Can Machine Learning Assist Locating The Excitation of Snore Sound? A Review
Abstract—In the past three decades, snoring (affecting more than 30 % of adults of the UK population) has been increasingly studied in the transdisciplinary research community involving medicine and engineering. Early work demonstrated that the snore sound can carry important information about the status of the upper airway, which facilitates the development of non-invasive, acoustic based approaches for diagnosing and screening of obstructive sleep apnoea and other sleep disorders. Nonetheless, there are more demands from clinical practice on finding methods to localise the snore sound's excitation rather than only detecting sleep disorders. In order to further the relevant studies and attract more attention, we provide a comprehensive review on the state-of-the-art techniques from machine learning to automatically classify snore sounds. First, we introduce the background and definition of the problem. Second, we illustrate the current work in detail and explain potential applications. Finally, we discuss the limitations and challenges in the snore sound classification task. Overall, our review provides comprehensive guidance for researchers to contribute to this area.

Index Terms—Deep learning, machine learning, obstructive sleep apnoea, physiological signals, snore sound.

ABBREVIATIONS
AI Artificial Intelligence
BoAW Bag-of-Audio-Words
CNN Convolutional Neural Network
COMPARE Computational Paralinguistics Challenge
CSO Competitive Swarm Optimisation
DISE Drug-Induced Sleep Endoscopy
DL Deep Learning
ELM Extreme Learning Machine
EM Expectation-Maximization
EMDF Empirical Mode Decomposition based Features
ENT Ear, Nose, and Throat
ES Excitation Source
FNN Feedforward Neural Network
FV Fisher Vector
GAN Generative Adversarial Network
GMM Gaussian Mixture Model
GRU Gated Recurrent Unit
HMMs Hidden Markov Models
HNR Harmonics to Noise Ratio
HOG Histogram of Oriented Gradients
KELM Kernel based Extreme Learning Machine
KL Kullback-Leibler
LBP Local Binary Pattern
LDA Linear Discriminant Analysis
LLDs Low-Level Descriptors
LPC Linear Predictive Coding
LSTM Long Short-Term Memory
MAP Maximum A Posteriori
MFCCs Mel-frequency Cepstral Coefficients
ML Machine Learning
MLP Multi-Layer Perceptron
MPSSC Munich-Passau Snore Sound Corpus
MSV Margin Sampling Voting
MV Majority Voting
NB Naïve Bayes
OSA Obstructive Sleep Apnoea
PR800 Power Ratio at 800 Hz
RASTA Relative Spectral Transform

This work was supported in part by Zhejiang Lab's International Talent Fund for Young Professionals (Project HANAMI), P. R. China, in part by the JSPS Postdoctoral Fellowship for Research in Japan under Grant P19081 from the Japan Society for the Promotion of Science (JSPS), Japan, in part by Grants-in-Aid for Scientific Research under Grants 19F19081 and 17H00878 from the Ministry of Education, Culture, Sports, Science and Technology (MEXT), Japan, and in part by the EU's HORIZON 2020 under Grant 115902 (RADAR-CNS). (Corresponding author: Kun Qian.)
Kun Qian and Yoshiharu Yamamoto are with the Educational Physiology Laboratory, Graduate School of Education, The University of Tokyo, Tokyo 113-0033, Japan (e-mail: [email protected]; [email protected]).
Christoph Janott and Werner Hemmert are with the Munich School of Bioengineering, Technische Universität München, 85748 Garching, Germany (e-mail: [email protected]; [email protected]).
Maximilian Schmitt is with the Chair of Embedded Intelligence for Health Care & Wellbeing, Universität Augsburg, 86159 Augsburg, Germany (e-mail: [email protected]).
Zixing Zhang is with GLAM – Group on Language, Audio & Music, Imperial College London, London SW7 2AZ, U.K. (e-mail: [email protected]).
Clemens Heiser is with the Department of Otorhinolaryngology/Head and Neck Surgery, Technische Universität München, 81675 Munich, Germany (e-mail: [email protected]).
Björn W. Schuller is with GLAM – Group on Language, Audio & Music, Imperial College London, London SW7 2AZ, U.K., and also with the ZD.B Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, 86159 Augsburg, Germany (e-mail: [email protected]).
RASTA-PLP Representations Relative Spectra Perceptual Linear Prediction
RF Random Forest
RMSE Root Mean Square Energy
RNN Recurrent Neural Network
SCAT Deep Scattering Spectrum
SERs Subband Energy Ratios
SF Source Flow
SFD Source Flow Derivative
SFFS Spectral Frequency Features
SP Signal Processing
SVM Support Vector Machine
SnS Snore Sound
TL Transfer Learning
UA Upper Airway
UAR Unweighted Average Recall
UBM Universal Background Model
VOTE Velum, Oropharyngeal lateral walls, Tongue, and Epiglottis
VQ Vector Quantisation
WEF Wavelet Energy Features
WPTE Wavelet Packet Transform Energy
WTE Wavelet Transform Energy
XAI Explainable Artificial Intelligence
e2e end-to-end
k-NN k-Nearest Neighbour
scGANs semi-supervised conditional Generative Adversarial Networks

I. INTRODUCTION

Snoring is a prevalent disorder that affects more than 30 % of adults of the British population [1]. Due to the fast development in methodologies and applications of signal processing (SP) and machine learning (ML) during the past decades, snore sound (SnS) has been increasingly studied within a wide community which includes, but is not limited to, acoustic/audio SP, otorhinolaryngology, ML, and biomedical engineering [2]–[4]. It was found that, as a common symptom [5], SnS can be used to develop a non-invasive approach for automatically screening obstructive sleep apnoea (OSA) [6], a serious chronic sleep disorder affecting from 6 % to 17 % of the general adult population [7]. When untreated, OSA can not only result in morning headache and daytime sleepiness [8], but also be an independent risk factor for stroke, hypertension, myocardial infarction, and cardiovascular diseases, and even lead to diabetes and cause accidents [9], [10].

As indicated in a comprehensive review article by Roebuck et al. [3], an audio recording based method (mainly focused on SnS analysis) can be useful as an inexpensive method for monitoring sleep. However, most existing literature aimed to use SnS for the detection of OSA rather than for localising the snore site. On the one hand, there are more demands from clinical practice to determine the accurate location of the snore sound's excitation, because the surgical options can vary among different snore sites [11], [12], and this knowledge facilitates a targeted surgical plan for both OSA sufferers and primary snorers [13]. On the other hand, there is a demand for a low-cost, convenient, and non-invasive substitute for the increasingly used gold standard, drug-induced sleep endoscopy (DISE) [14]. Multichannel pressure measurement [15]–[17] is a pioneering method, which could be efficient and applicable for monitoring natural sleep, whereas it is still an invasive method that is not well tolerated by every subject. It is therefore reasonable to leverage ML technologies to develop an approach for automatic localisation of the snore site using only SnS. Relevant studies are extremely limited but are increasingly developing given the recent advances of artificial intelligence (AI) technologies. During the past three decades, SnS analysis has witnessed three main trends: First (from 1990 to 2012), simple acoustic features were calculated and analysed with statistical methods; second (from 2013 to 2016), human hand-crafted features were used for training conventional ML models; third (from 2017 to present), state-of-the-art deep learning (DL) techniques were applied to contribute to extracting higher level representations from SnS, or even leading to end-to-end learning from SnS raw data without any human expert knowledge.

In this work, we aim to provide a thorough and comprehensive review on ML methods applied to the SnS classification task. The main contributions of this review can be summarised as follows: First, to the best of our knowledge, this is the first review on ML based methods for localising the snore site. Second, we introduce the reader to the background (including history and definitions) of the relevant studies. In particular, we will indicate the motivation of this study and highlight its significance in clinical practice. Third, we introduce both the conventional ML methods and the advanced deep learning approaches that were successfully applied to overcome the challenges of the SnS classification task. Last but not least, we discuss the current limitations and provide perspectives on future work. We hope this review article can serve as a good guide for researchers who share the common interest, and can improve the understanding of cutting-edge technologies for other audiences in biomedical and health informatics.

The remainder of this review article is organised as follows: First, we give the definition of the problem we are focusing on in Section II. In Section III, the background and related work will be introduced. Then, we present methods and challenges in a comprehensive review of the existing literature in Section IV. Finally, we discuss the current work and provide an outlook in Section V before a conclusion is made in Section VI.

II. DEFINITION OF THE PROBLEM

In this section, we provide a brief introduction of the anatomy of the upper airways. Then, we explain and compare the different categories of the snore site.

A. Anatomy

The upper airways are defined as the area from the nostrils and the lips to the vocal cords. They consist of the nasal and oral cavities, the pharynx, and the upper section of the larynx. The pharynx is defined as the posterior section of the head and contains several anatomical landmarks, such as the soft palate (the velum), the palatine tonsils, the posterior part of the tongue
TABLE I
THE CATEGORIES OF THE VOTE CLASSIFICATION
of selected combinations of orientation and level of vibration derived from the original VOTE classification [25]. The classes of the so-called ACLTE-scheme are defined as:
- A: V level, anterior-posterior vibration;
- C: V or O level, concentric vibration;
- L: O level, lateral vibration;
- T: T level, any vibration orientation;
- E: E level, any vibration orientation.

The resulting ACLTE-corpus contains 1 115 SnS samples from 343 subjects, and the size of the classes is strongly imbalanced, with the A-class making up almost half of the samples, while the T-class is smallest with only 3 % of the samples. This reflects the frequency of occurrence of different snoring patterns in the real world, where velum snoring is relatively common, while isolated tongue-base snoring is a rare phenomenon [26].

III. BACKGROUND

Early work can be traced back to Schäfer and Pirsig [27], who involved five children suffering from sleep disorders and one adult who suffered from 'simple snoring' (n = 6). In that study, the authors claimed that the 'simple snoring' of the adult was due in large part to vibrations of the soft palate, while the 'apneic snoring' of the children had a pathomechanism of enlarged adenoids and tonsils, which resulted in an impeded movement of the soft palate [27]. Their conclusions were based on observations of the frequency spectrum of the SnS. Quinn et al. reported differences in the waveform and frequency between palatal and tongue base snoring [28]. However, the number of subjects (n = 6) involved in their study was limited, therefore their conclusions cannot be easily generalised. Miyazaki et al. investigated the fundamental frequency (F0) values in four types of snoring, i.e., the soft palate, the tonsil/tongue base, the combined position, and the larynx [29]. They indicated in their findings (n = 75) that the average value of the fundamental frequency for the aforementioned four types of snoring was 102.8 ± 34.9 Hz (soft palate type), 331.7 ± 144.8 Hz (tonsil/tongue base type), 115.7 ± 58.9 Hz (combined type), and around 250.0 Hz (larynx type), respectively [29]. Hill et al. studied and made a statistical comparison (n = 11) of the crest factor (ratio of peak to root mean square value in any given epoch) between palatal and non-palatal snoring [30]. They concluded that palatal SnS can have a higher crest factor than non-palatal SnS (p < .01, Student's t or Mann-Whitney tests). In another study by Hill et al. [31], the values of the crest factor extracted from SnS generated by patients (n = 5) in natural sleep showed that the snoring mechanism may change in some individuals during the night, which means that the snore site may also change. Agrawal et al. calculated peak frequency, centre frequency, and power ratio for their capacity to distinguish palatal, tongue-based, and mixed snoring [32]. In particular, they compared the snoring sound characteristics between induced and natural sleep (n = 11). They claimed that induced SnS contains higher frequency components than natural SnS. Saunders et al. indicated that centre frequency may be efficient to distinguish pure palatal from tongue base snoring (n = 35), but cannot be used to identify multisegmental snoring (the mixed snoring) [33]. A 2-means clustering method was used in [34] to discriminate palatal and non-palatal SnS. In that study, the authors used a combination of the statistical moment coefficients of skewness and kurtosis calculated from the snoring sounds of subjects (n = 15) who underwent sleep nasendoscopy evaluation (under anaesthetic condition). Ng et al. continuously reported their contributions in studying formants extracted from SnS [35], [36], which are considered to carry important information about the status of the upper airway (UA). The first three formant frequencies, i.e., F1, F2, and F3, were indicated to be associated with the degree of constriction in the pharynx, the degree of advancement of the tongue relative to its neutral position, and the degree of lip-rounding, respectively [35], [37]–[40].
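To make these classical descriptors concrete, the sketch below computes the crest factor, as defined above, and rough LPC based formant estimates for a single snore epoch. This is only an illustration under our own assumptions (16 kHz audio, an LPC order of 8, the use of the librosa library, and a hypothetical file name); it does not reproduce the exact settings of the cited studies.

```python
import numpy as np
import librosa

def crest_factor(epoch):
    """Ratio of peak to root mean square value in a given epoch [30]."""
    rms = np.sqrt(np.mean(epoch ** 2))
    return np.max(np.abs(epoch)) / rms

def lpc_formants(epoch, sr, order=8, n_formants=3):
    """Crude formant estimates from the roots of the LPC polynomial."""
    a = librosa.lpc(epoch, order=order)
    roots = [r for r in np.roots(a) if np.imag(r) > 0]  # upper half-plane only
    freqs = sorted(np.angle(r) * sr / (2 * np.pi) for r in roots)
    return freqs[:n_formants]                           # F1..F3 candidates

y, sr = librosa.load("snore_epoch.wav", sr=16000)       # hypothetical epoch
print(crest_factor(y), lpc_formants(y, sr))
```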
Nevertheless, the capacity of formants to localise the anatomical site of snoring was not shown in [35] (n = 40) or [36] (n = 40), while they were demonstrated to be efficient to differentiate apneic SnS from benign ones. Moreover, Ng et al. made efforts to analyse and model both the source flow (SF) and its derivative (SFD) of SnS via an iterative adaptive inverse filtering approach and a Gaussian probability density function [41]. In that study (n = 40) [41], the shapes of SF pulses differ between SnS and can be associated with the dynamic biomechanical properties (e. g., compliance and elasticity) of the SnS excitation source (ES). Particularly, palatal snoring (e. g., SnS from soft palate vibration) and pharyngeal snoring (e. g., SnS from pharyngeal wall vibration) can be explained mainly by the theory of flutter and the concept of static divergence, respectively [41]–[43]. Nevertheless, Ng et al. clarified in [41] that clinical experiments were not conducted to warrant the accuracy of the SFD model regarding its relation to the occurrence and development of physiological events, e. g., the closing, opening, and speed of ES vibration during snoring. Motivated by the capability of formants to represent the structure and status of the UA, Qian et al. and Wu et al. analysed the formants extracted from long duration SnS audio recordings by the K-means clustering method [44] and hidden Markov models (HMMs) [45], respectively. Their findings showed possible differences between the properties of formants extracted from different SnS related signals, which may reflect changes of the UA structure during the night, although accurate expert annotation was missing. Also, the number of subjects was extremely small (n = 1). Additionally, Qian et al. found that formants could also be used as an efficient marker to monitor changes of the UA by observing their tracks [46]. Xu et al. indicated in their study [47] (n = 30) that the first snoring sound after an obstructive apnoea of the upper level (above the free margin of the soft palate) may have more energy components in the lower subband than its counterpart of the lower level (below the free margin of the soft palate). Peng et al. claimed in their study [48] (n = 74) that F0 and F2 were found to be lower in palatal SnS than in non-palatal SnS.

Psychoacoustical properties combined with other acoustical features, i.e., sound pressure level ([dB], A-weighted), loudness (sone), sharpness (acum), roughness (cAsper), fluctuation strength (cVacil), and centre frequency (Hz) (mean values for each parameter), have been applied to SnS analysis in [49]. In that study, the authors summarised the statistical analysis
Fig. 3. Overview of conventional ML (top) and DL (bottom) based paradigms for the SnS classification task. In conventional ML paradigms, human hand-crafted features (low-level descriptors (LLDs) or higher representations) are extracted from the SnS audio signal via human expert domain knowledge. Then, a classifier makes predictions using its prior knowledge acquired via the training phase. In DL paradigms (except the DNN models trained on human hand-crafted features), DL models learn features by themselves without any human expert domain knowledge. Then, a classifier (or a fully connected layer combined with a softmax layer) makes the final predictions based on the outputs of the trained DNN models.
SnS annotated by performing the DISE (e. g., MPSSC), which means the achievements might not be directly applicable to the development of natural smart home devices. Last but not least, more attention and efforts should be contributed to this research (see Fig. 2).

In the following parts of this review, we will systematically introduce the problems, methods, and challenges. Moreover, we will discuss the current findings and limitations, and point out our perspectives for future work.

IV. METHODS

In this section, we present the methods applied to SnS classification. The ML techniques, including conventional methods and state-of-the-art DL approaches, will be illustrated and described in detail. Fig. 3 shows the general diagram of the conventional ML and the DL based paradigms for the SnS classification task.

A. Human Hand-crafted Low-Level Descriptors

In the conventional ML paradigm, features are designed by human experts with specific domain knowledge (e. g., medicine). Due to the similar characteristics of speech and SnS, early work on SnS classification tended to process the SnS data as speech. The low-level descriptors (LLDs) were first extracted from frame-based SnS signals. Those LLDs may have specific physiological meanings in SnS analysis, and can be seen as the raw representations extracted from short-time frames of the analysed SnS. Table II lists the main LLDs used in the published literature on the SnS classification task and their corresponding findings.

TABLE II
HUMAN HAND-CRAFTED LOW-LEVEL DESCRIPTORS (LLDS) FOR SNS CLASSIFICATION IN PUBLISHED LITERATURE. LPC: LINEAR PREDICTIVE CODING. SFFS: SPECTRAL FREQUENCY FEATURES. SERS: SUBBAND ENERGY RATIOS. EMDF: EMPIRICAL MODE DECOMPOSITION BASED FEATURES. WTE: WAVELET TRANSFORM ENERGY. WPTE: WAVELET PACKET TRANSFORM ENERGY. WEF: WAVELET ENERGY FEATURES. GMM: GAUSSIAN MIXTURE MODEL. RASTA-PLP: REPRESENTATIONS RELATIVE SPECTRA PERCEPTUAL LINEAR PREDICTION. SCAT: DEEP SCATTERING SPECTRUM. LBP: LOCAL BINARY PATTERN. HOG: HISTOGRAM OF ORIENTED GRADIENTS

Most of the studied LLDs are typical acoustical features (e. g., MFCCs, F0, formants), while some others were not originally designed for audio analysis (e. g., WEF, local binary patterns (LBP), histograms of oriented gradients (HOG)). Note that SnS has similar characteristics as speech, whereas it also has some properties belonging to physiological signals. These human hand-crafted LLDs carry important information about the snore site and can be interpreted in the time and the frequency domain of the SnS. A large scale acoustical feature set, i.e., COMPARE, and a simplified acoustical feature set, i.e., EGEMAPS, were investigated for the SnS classification task (summarised in Table II). Both feature sets can be extracted by our open-source toolkit, OPENSMILE [95], [96].

B. Higher Representations

The aforementioned LLDs can be used directly for dynamic ML models (e. g., HMMs [97] and Recurrent Neural Networks (RNNs) [98]), while higher representations (independent of the SnS audio clip length) containing the statistical information of the LLDs over a given time are needed for training static models (e. g., SVMs [61], or ELMs [65]). In this subsection, we will introduce the higher representations investigated in the literature that can be extracted from LLDs for the SnS classification task.

1) Statistical Functionals: The statistical functionals are calculated from the frame based LLDs over a given period of the audio signal; they include the arithmetic mean, standard deviation, extremes (minimum value, maximum value), and more [99]. Some more advanced functionals, e. g., moments, percentiles, kurtosis, skewness, and the slope and bias of the linear regression estimation of the LLDs, can also be applied in this method [99]. For details on the OPENSMILE LLDs (i.e., COMPARE and EGEMAPS), interested readers are referred to [99]. Qian et al. further investigated and compared nine functionals (maximum, minimum, and mean values, range, standard deviation, slope and bias of the linear regression estimation, skewness, and kurtosis) in [55], [58].
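As an illustration, the nine functionals above can be computed from a frame-wise LLD contour with a few lines of numpy/scipy. This is a minimal sketch; the exact openSMILE implementations [99] may differ in detail, and the contour below is a random placeholder.

```python
import numpy as np
from scipy import stats

def functionals(lld):
    """Map a variable-length LLD contour (n_frames,) to the nine
    fixed-length functionals compared in [55], [58]."""
    t = np.arange(len(lld))
    slope, bias = np.polyfit(t, lld, 1)   # linear regression of the contour
    return np.array([
        lld.max(), lld.min(), lld.mean(),
        lld.max() - lld.min(),            # range
        lld.std(),                        # standard deviation
        slope, bias,
        stats.skew(lld), stats.kurtosis(lld),
    ])

f0_contour = np.random.rand(312)          # placeholder F0 track of one event
static_vector = functionals(f0_contour)   # usable by static models (SVM, ELM)
```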
2) Bag-of-Audio-Words: The bag-of-audio-words (BoAW) approach originates from the bag-of-words (BoW, cf. [100]) approach, which had been demonstrated to be efficient in natural language processing [101] and computer vision [102], [103]. In the BoAW approach, the numerical LLDs, or alternatively higher level derived features extracted from the SnS data, first undergo a vector quantisation (VQ) step, which employs a codebook of template LLDs previously learnt from a certain amount of training data [74]. For generating the codebook, Schmitt et al. and their followers used the initialisation step of k-means++ clustering [104], which is comparable to an optimised random sampling of LLDs [105], instead of the traditional k-means clustering [106], [107]; this improves the computational speed and at the same time guarantees a comparable performance. To improve the robustness of this approach, each LLD is not assigned to only the most similar word in the codebook; instead, the Na (assignment number) words (i.e., LLDs) with the lowest Euclidean distance are considered. Finally, the term frequency histograms (logarithm with a bias of one) are used as the higher representations extracted from the SnS via the BoAW approach. The BoAW approach was first introduced by
Schmitt et al. to the SnS classification task in [57]. Qian et al. extended this study on wavelet-based features [55], [78] and extended the findings by involving the BoAW approach in their multi-resolution analysis for SnS classification in [75].
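A minimal sketch of the quantisation step described above is given below (the reference implementation is the openXBOW toolkit [153]). For brevity, the codebook is drawn by plain random sampling of training LLDs, which, as noted above, the k-means++ initialisation step is comparable to; the codebook size, the assignment number, and the toy data are our own assumptions.

```python
import numpy as np

def boaw_histogram(llds, codebook, n_assign=4):
    """Assign each LLD frame to its n_assign nearest codewords and
    return the term frequency histogram (logarithm with a bias of one)."""
    counts = np.zeros(len(codebook))
    for x in llds:                                   # llds: (n_frames, n_dims)
        d = np.linalg.norm(codebook - x, axis=1)     # Euclidean distances
        counts[np.argsort(d)[:n_assign]] += 1        # multi-assignment (Na)
    return np.log(counts + 1)

rng = np.random.default_rng(0)
train_llds = rng.normal(size=(5000, 13))             # e.g., pooled MFCC frames
codebook = train_llds[rng.choice(5000, size=256, replace=False)]
event_histogram = boaw_histogram(rng.normal(size=(310, 13)), codebook)
```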
TABLE IV
THE HUMAN HAND-CRAFTED LOW-LEVEL DESCRIPTORS (LLDS) IN THE EGEMAPS FEATURE SET. RASTA: RELATIVE SPECTRAL TRANSFORM; HNR: HARMONICS TO NOISE RATIO; RMSE: ROOT MEAN SQUARE ENERGY

3) GMM Supervectors: The GMM supervectors are generated by the GMM approach [108], [109], which was successfully applied to text-independent speaker recognition tasks. In essence, the GMM supervectors are the stacked mean vectors of the Gaussian mixture components [110]. In this paradigm, a universal background model (UBM) is first trained by the expectation-maximization (EM) algorithm [111] on a background dataset, which includes a wide range of corpora. Then, the GMM supervectors (usually the mean vectors) can be extracted from models adapted from the UBM via the maximum a posteriori (MAP) criterion [112]. This approach was used for the SnS classification task in [84], [91]. In particular, Nwe et al. extracted not only the first-order statistics
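A minimal scikit-learn sketch of this paradigm follows; the UBM size and the relevance factor r of the MAP mean adaptation are our own illustrative assumptions, not values taken from [84], [91].

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# UBM: a GMM fitted with the EM algorithm on pooled background data
background_llds = np.random.rand(10000, 13)        # placeholder corpora
ubm = GaussianMixture(n_components=64, covariance_type="diag",
                      random_state=0).fit(background_llds)

def gmm_supervector(llds, ubm, r=16.0):
    """MAP-adapt the UBM means to one recording and stack them [112]."""
    post = ubm.predict_proba(llds)                 # (n_frames, n_components)
    n_k = post.sum(axis=0)                         # soft counts per component
    ex_k = post.T @ llds / np.maximum(n_k, 1e-8)[:, None]  # 1st-order stats
    alpha = (n_k / (n_k + r))[:, None]             # data-dependent weights
    adapted_means = alpha * ex_k + (1 - alpha) * ubm.means_
    return adapted_means.ravel()                   # stacked mean supervector

sv = gmm_supervector(np.random.rand(300, 13), ubm)  # 64 x 13 = 832 dimensions
```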
C. Deep Learning

In the past decade, DL [116] has become a very popular subject in the ML community due to its continuous breakthroughs in speech recognition [117], image classification [118], and object detection [119]. With the help of a series of nonlinear transformations of the inputs, DL models can usually learn more robust and generalisable higher representations from big data compared with the classical ML models (shallow architectures). Specifically, DL models can facilitate technique development in the domain of biomedical and health informatics with the ever-increasing big data [120]–[122]. For the SnS classification task, DL was found efficient in several studies even with a limited size of data. In summary, among the DL based models for SnS classification, there are two typical paradigms: First, training the models with human hand-crafted features under a deep architecture (e. g., a multi-layer perceptron (MLP) with more hidden layers [71], [74], [78], [84], stacked autoencoders [71], [74], [78], or deep recurrent neural networks [123]); second, using a pre-trained deep convolutional neural network (CNN) [124] model to learn higher representations from the SnS data (its spectrograms), or learning the higher representations from the raw SnS data (its audio) via a CNN plus RNN structure (end-to-end). In the first paradigm, the human hand-crafted features are still needed, which restrains the strength of DL compared with the traditional ML models. Therefore, we will emphasise successful applications for the SnS classification task via transfer learning (TL, see the next subsection) [125] and end-to-end learning (e2e) [126]. In addition, a recent study using generative adversarial networks (GANs) [127] for addressing the data scarcity in SnS will be introduced.

1) Transfer Learning: This method was first introduced to SnS classification in [128], [129], where the authors used the TL paradigm to extract deep spectrum features from spectrograms of snoring. By leveraging pre-trained CNNs (AlexNet [118] and VGG 19 [130]), high level representations of the spectrograms can be extracted from the activations of the fully connected layers of the aforementioned deep models. It was demonstrated that these CNN descriptors can achieve excellent performance on SnS classification without any human expert domain knowledge. Moreover, to reduce the redundancy
of the learnt deep spectrum features, Freitag et al. [129] involved a feature selection phase by applying the competitive swarm optimisation (CSO) algorithm [131] to a wrapper based paradigm [132].
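The idea can be sketched in a few lines with torchvision, here assuming a pre-trained VGG19 and a spectrogram image already rendered to disk (the file name is hypothetical, and the preprocessing of the cited works may differ):

```python
import torch
from torchvision import models, transforms
from PIL import Image

vgg = models.vgg19(weights="IMAGENET1K_V1").eval()
# keep the classifier up to (and including) the second fully connected block
fc_extractor = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])

prep = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

img = Image.open("snore_spectrogram.png").convert("RGB")
with torch.no_grad():
    z = vgg.avgpool(vgg.features(prep(img)[None])).flatten(1)
    deep_spectrum = fc_extractor(z)    # 4096-d feature vector for a classifier
```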
2) End-to-End Learning: The e2e model was introduced in the baseline work of the INTERSPEECH COMPARE snoring sub-challenge [69]. As indicated by Schuller et al., one attractive characteristic of the e2e model is that the optimal features can be learnt automatically from the data at hand [69]. In other words, feature engineering work, which needs much of human experts' efforts (e. g., acoustic and medical knowledge for snoring), is excluded in that paradigm. In the baseline e2e model [69], a convolutional neural network was used to extract features from raw time representations of the SnS data, and a subsequent recurrent neural network (with long short-term memory (LSTM) cells [133]) was adopted to perform the final classification; this was similar to the model first applied successfully to a speech emotion recognition task [134]. A dual convolutional layer topology was proposed by Wang et al. in [135], by which the outputs of two separate convolutional layers (having different kernel dimensions on the frequency axis, but equal dimensions on the time axis) were merged via element-wise averaging. Subsequently, a channel slice model (instead of fully connected layers) and two recurrent layers (with a gated recurrent unit (GRU) cell [136], a simpler structure compared to the LSTM) were used to implement the classification capacity. Schmitt and Schuller made an in-depth investigation of different e2e topologies for SnS classification [137]. They claimed in their findings that a convolutional layer followed by a pooling step was superior to an LSTM layer.
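A minimal PyTorch sketch of such a CNN plus LSTM topology on the raw waveform is shown below; all layer sizes are our own illustrative assumptions and do not reproduce the baseline configuration of [69].

```python
import torch
import torch.nn as nn

class EndToEndSnore(nn.Module):
    """CNN front end on the raw waveform, LSTM back end, four VOTE classes."""
    def __init__(self, n_classes=4):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=80, stride=16), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=5), nn.ReLU(),
            nn.MaxPool1d(4),
        )
        self.lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)
        self.out = nn.Linear(128, n_classes)

    def forward(self, wave):                       # wave: (batch, n_samples)
        z = self.cnn(wave.unsqueeze(1))            # -> (batch, 64, n_frames)
        _, (h, _) = self.lstm(z.transpose(1, 2))   # last hidden state
        return self.out(h[-1])                     # class logits

logits = EndToEndSnore()(torch.randn(8, 16000))    # eight 1 s clips at 16 kHz
```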
3) Generative Adversarial Networks: Zhang et al. were the first group to introduce GANs [127] to the SnS classification task [123], providing a solution for addressing the data scarcity (specifically, annotated data) problem in almost all intelligent healthcare topics. The authors proposed semi-supervised conditional generative adversarial networks (scGANs), which can automatically generate data by mapping a random noise space to the original data distribution. In doing this, one can simulate an infinite number of training data through the generation process, without the need for an additional, exhausting human expert annotation process.

Further, by integration of the semi-supervised paradigm, scGANs require only one model to synthesise the different categories of SnS data. Moreover, an ensemble of scGANs is employed to overcome the mode collapse issue when generating the data.
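The scGANs of [123] are more involved (semi-supervised training and an ensemble of generators); the minimal conditional GAN sketch below only illustrates the core idea of mapping a noise vector plus a class label to a synthetic feature vector, with all dimensions being our own assumptions.

```python
import torch
import torch.nn as nn

N_CLASSES, NOISE_DIM, FEAT_DIM = 4, 64, 128        # illustrative sizes

class Generator(nn.Module):
    """Map (noise, class label) to a synthetic feature vector."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(N_CLASSES, N_CLASSES)
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM + N_CLASSES, 256), nn.ReLU(),
            nn.Linear(256, FEAT_DIM), nn.Tanh(),
        )

    def forward(self, z, y):
        return self.net(torch.cat([z, self.emb(y)], dim=1))

class Discriminator(nn.Module):
    """Judge (feature vector, class label) pairs as real or generated."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(N_CLASSES, N_CLASSES)
        self.net = nn.Sequential(
            nn.Linear(FEAT_DIM + N_CLASSES, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
        )

    def forward(self, x, y):
        return self.net(torch.cat([x, self.emb(y)], dim=1))

z = torch.randn(16, NOISE_DIM)
y = torch.randint(0, N_CLASSES, (16,))             # target snore classes
fake = Generator()(z, y)                           # synthetic minority samples
score = Discriminator()(fake, y)
```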
V. DISCUSSION

In this section, we discuss the findings showing the interesting scientific significance of the current studies. Also, the limitations of the work covered by this literature review will be given. In addition, we indicate some possible future directions, which may help attract more work to this topic.

A. Current Findings

Generally speaking, in the conventional ML paradigm, there is no huge gap between the performance of different ML models, while the features matter indeed [74]. As demonstrated in [75], very well designed features can be excellent representations for recognising SnS even with a simple classifier like Naïve Bayes (NB) [54]. Among the features, spectrum based descriptors (e. g., MFCCs) outperformed amplitude based representations (e. g., the crest factor). Qian et al. investigated the effects of frame sizes and overlap lengths of the analysed audio chunk for extracting LLDs from SnS [71]. They indicate that WEF may need a longer frame size (64 ms) than other feature sets (16 ms or 32 ms). In addition, the higher representation extraction methods (cf. Section IV-B) are essential for the final performance. However, a direct comparison between methods (e. g., BoAW vs FV) is still missing.

For the DL paradigm, the main limitation is the data size, which constrains the power of deep models to learn robust and generalisable representations from the SnS data. Encouragingly, DL has demonstrated that some efficient high level representations can be extracted automatically from the SnS without any human expert knowledge [128], [129], [135], [137]. In particular, CNN layers were found superior to RNN layers in extracting features for SnS classification [137]. In fact, directly using a CNN+LSTM architecture did not reach an excellent performance in an early study [69]. The RNN based models were found to be efficient when a data augmentation phase was involved [123]; they reached a UAR of 67.4 % on the development set, while the performance decreased on the test set (UAR of 54.4 %). But likely the main contribution of that work to the SnS literature was the proposed scGANs, which were successfully validated on both static acoustic data and sequential acoustic data [123], and were demonstrated to outperform other conventional data augmentation methods (e. g., the synthetic minority oversampling technique (SMOTE) [138], and a transformation method [139]).

One significant finding is that multi-resolution methods (e. g., wavelets) are very efficient for SnS classification. Qian et al. extensively validated their wavelet based approaches for SnS classification in [55], [58], [71], [74], [75], [78]. This finding was also supported by the work in [135], in which Wang et al. found that fusing the global and local frequency information by using different kernel sizes of CNN models can facilitate the extraction of deep representations from snoring.

Fig. 4 shows the UARs achieved by different models which reached better results than the MPSSC baseline in recent years. The current best result on the test set (p < .001 by a one-tailed z-test, compared to the baseline) was achieved by Demir et al. in [88]. They used LLDs extracted from spectrograms of SnS via image processing methods. However, we should note that there was a big gap between the performance on the development and test sets (37.8 % vs 72.6 % UAR) in their study. We can find this phenomenon in almost all other studies based on the MPSSC database. We think this could be due to the fact that MPSSC has different data collection environment conditions and acoustic property distributions among the partitions. One exception is the work done by Vesperini et al. [84], in which their model had an excellent performance on both the development and the test sets (67.1 % vs 67.7 % UAR). In their proposed method, a well designed MLP based deep model (with specifically tuned hyper-parameters) was used, which might need a large amount of effort from experienced AI experts.
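For reference, the UAR used throughout these comparisons is the recall averaged over classes with equal weight, which makes it insensitive to the strong class imbalance of MPSSC; with scikit-learn it is one line (toy labels below for illustration):

```python
from sklearn.metrics import recall_score

y_true = ["V", "V", "O", "T", "E", "O"]   # toy ground-truth VOTE labels
y_pred = ["V", "O", "O", "T", "E", "V"]   # toy predictions
uar = recall_score(y_true, y_pred, average="macro")  # UAR = macro recall
```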
and the dimensionality of the feature space into account. To this end, selecting efficient and robust features, or feature reduction methods, can be a good direction in future SnS classification tasks. Qian et al. systematically evaluated the contributions of each feature set to SnS classification, but their method involved human experts' efforts [58]. In future work, automatic feature selection approaches will be more telling. For late fusion, finding an efficient voting strategy is the key to a successful implementation. In a recent doctoral thesis [74], two popular voting strategies were compared, i.e., majority voting (MV) and margin sampling voting (MSV). The former is based on the majority prediction made by an ensemble of ML models, while the latter is based on the prediction made by the ML model which achieved the highest margin sampling value [148] (the difference between the first and the second highest posterior probability). In that study [74], MV outperformed MSV in the late fusion of multiple ML models for SnS classification. Future work could explore more generalised late fusion strategies, specifically for evaluating the confidence level of the trained ML models.
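A minimal sketch contrasting the two voting strategies, using toy posteriors for illustration:

```python
import numpy as np

def majority_vote(predictions):
    """MV: the label predicted by most models of the ensemble."""
    values, counts = np.unique(predictions, return_counts=True)
    return values[np.argmax(counts)]

def margin_sampling_vote(posteriors, labels):
    """MSV: trust the model whose top-two posterior gap is largest [148]."""
    margins = [np.sort(p)[-1] - np.sort(p)[-2] for p in posteriors]
    best = int(np.argmax(margins))
    return labels[int(np.argmax(posteriors[best]))]

labels = ["V", "O", "T", "E"]
posteriors = np.array([[0.50, 0.30, 0.15, 0.05],    # model 1, margin 0.20
                       [0.40, 0.38, 0.12, 0.10],    # model 2, margin 0.02
                       [0.25, 0.60, 0.10, 0.05]])   # model 3, margin 0.35
print(majority_vote(["V", "V", "O"]))               # -> 'V'
print(margin_sampling_vote(posteriors, labels))     # -> 'O' (via model 3)
```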
4) Data Enrichment: We need to face and address one serious challenge for almost all AI applications in medicine: data scarcity. It is relatively easy to collect a large amount of SnS, whereas the annotation work is expensive, time-consuming, and even not sufficiently accurate. In particular, for SnS, its naturally imbalanced characteristic [26] cannot be ignored. Taking the VOTE categories as an example, SnS belonging to the V and the O class occupy 84.5 % of MPSSC, while T and E type snoring samples only account for 4.7 % and 10.8 %, respectively [13]. To overcome this issue, Zhang et al. proposed the scGANs based system, which was demonstrated to be more efficient than other classical data augmentation methods. In future work, some other state-of-the-art methods like unsupervised learning [149], semi-supervised learning [150], active learning [151], and cooperative learning [152] are worth being explored for the SnS classification task.
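For instance, the SMOTE baseline mentioned in Section V-A can be applied in a few lines with the imbalanced-learn package; the data below are synthetic stand-ins for MPSSC feature vectors and VOTE labels:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_classes=4, n_informative=8,
                           weights=[0.60, 0.25, 0.05, 0.10], random_state=0)
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))   # minority classes upsampled
```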
5) Open Resources: Reproducibility is crucial for sustainable research. We encourage more researchers who share the same interests in SnS classification to contribute to open resources (e. g., databases, toolkits). Before MPSSC, there was no significant public SnS database available. We also released our toolkits like OPENSMILE [95], [96], OPENXBOW [153], AUDEEP [154], and END2YOU [155], which include both state-of-the-art conventional ML and DL paradigms. It will be very helpful to make a fair and efficient comparison of algorithms and systems for automatically localising SnS. Specifically, we hope SnS collected in natural sleep can be added to this field, which will significantly facilitate a real application in clinical or home based situations.
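As an example, both feature sets discussed in Section IV-A can be obtained through the Python wrapper of OPENSMILE; the snippet below assumes the current opensmile package API and a hypothetical recording:

```python
import opensmile

# ComParE 2016 functionals: one 6373-dimensional vector per audio file;
# opensmile.FeatureSet.eGeMAPSv02 would yield the smaller EGEMAPS set
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("snore_event.wav")
print(features.shape)
```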
VI. CONCLUSION

This article provided a comprehensive review of the research using audio data to localise snore sites. While the mechanism of snoring is clear, there are various definitions of the categories of the snore site. We also compared both traditional machine learning and state-of-the-art deep learning technologies, and gave a detailed analysis of how they can be used, and to what extent, for overcoming the challenges posed by SnS localisation. Compared to other applications in AI for healthcare, acoustical analysis of SnS is a younger field, which means that we still have insufficient fundamental knowledge about the acoustical properties of SnS. Moreover, the availability of publicly accessible databases is also extremely limited, which constrains the relevant studies. Deep learning methods are promising, but there is a long way to go to build a robust and explainable system for SnS analysis. In the discussion, we shared final insights and perspectives. We think that the combination of conventional solid knowledge in signal processing and machine learning together with the increasingly advanced deep learning methods can leverage the power of AI to finally provide a robust and accurate system for the non-invasive localisation of the snoring site via an audio based approach.

REFERENCES

[1] M. Lechner, C. E. Breeze, M. M. Ohayon, and B. Kotecha, "Snoring and breathing pauses during sleep: Interview survey of a United Kingdom population sample reveals a significant increase in the rates of sleep apnoea and obesity over the last 20 years: data from the U.K. sleep survey," Sleep Med., vol. 54, pp. 250–256, 2019.
[2] D. Pevernagie, R. M. Aarts, and M. De Meyer, "The acoustics of snoring," Sleep Med. Rev., vol. 14, no. 2, pp. 131–144, 2010.
[3] A. Roebuck et al., "A review of signals used in sleep analysis," Physiol. Meas., vol. 35, no. 1, pp. R1–R57, 2014.
[4] F. Mendonça, S. S. Mostafa, A. G. Ravelo-García, F. Morgado-Dias, and T. Penzel, "A review of obstructive sleep apnea detection approaches," IEEE J. Biomed. Health Inf., vol. 23, no. 2, pp. 825–837, 2019.
[5] D. J. Eckert and A. Malhotra, "Pathophysiology of adult obstructive sleep apnea," Proc. Amer. Thoracic Soc., vol. 5, no. 2, pp. 144–153, 2008.
[6] P. J. Strollo Jr and R. M. Rogers, "Obstructive sleep apnea," N. Eng. J. Med., vol. 334, no. 2, pp. 99–104, 1996.
[7] C. V. Senaratna et al., "Prevalence of obstructive sleep apnea in the general population: A systematic review," Sleep Med. Rev., vol. 34, pp. 70–81, 2017.
[8] P. Smith et al., "Indications and standards for use of nasal continuous positive airway pressure (CPAP) in sleep-apnea syndromes," Amer. J. Respir. Critical Care Med., vol. 150, no. 6, pp. 1738–1745, 1994.
[9] T. Young, M. Palta, J. Dempsey, J. Skatrud, S. Weber, and S. Badr, "The occurrence of sleep-disordered breathing among middle-aged adults," N. Eng. J. Med., vol. 328, no. 17, pp. 1230–1235, 1993.
[10] B. Mokhlesi, S. Ham, and D. Gozal, "The effect of sex and age on the comorbidity burden of OSA: An observational analysis from a large nationwide US health claims database," Eur. Respir. J., vol. 47, no. 4, pp. 1162–1169, 2016.
[11] K. K. Li, "Surgical therapy for adult obstructive sleep apnea," Sleep Med. Rev., vol. 9, no. 3, pp. 201–209, 2005.
[12] H.-C. Lin, M. Friedman, H.-W. Chang, and B. Gurpinar, "The efficacy of multilevel surgery of the upper airway in adults with obstructive sleep apnea/hypopnea syndrome," Laryngoscope, vol. 118, no. 5, pp. 902–908, 2008.
[13] C. Janott et al., "Snoring classified: The Munich-Passau snore sound corpus," Comput. Biol. Med., vol. 94, pp. 106–118, 2018.
[14] A. V. Vroegop et al., "Drug-induced sleep endoscopy in sleep-disordered breathing: Report on 1,249 cases," Laryngoscope, vol. 124, no. 3, pp. 797–802, 2014.
[15] M. Reda, G. J. Gibson, and J. A. Wilson, "Pharyngoesophageal pressure monitoring in sleep apnea syndrome," Otolaryngol.–Head Neck Surg., vol. 125, no. 4, pp. 324–331, 2001.
[16] H. Demin, Y. Jingying, W. J. Y. Qingwen, L. Yuhua, and W. Jiangyong, "Determining the site of airway obstruction in obstructive sleep apnea with airway pressure measurements during sleep," Laryngoscope, vol. 112, no. 11, pp. 2081–2085, 2002.
[17] B. A. Stuck and J. T. Maurer, "Airway evaluation in obstructive sleep apnea," Sleep Med. Rev., vol. 12, no. 6, pp. 411–436, 2008.
[18] F. Dalmasso and R. Prota, "Snoring: Analysis, measurement, clinical implications and applications," Eur. Respir. J., vol. 9, no. 1, pp. 146–159, 1996.
[19] M. Friedman, H. Ibrahim, and L. Bass, "Clinical staging for sleep-disordered breathing," Otolaryngol. Head Neck Surg., vol. 127, no. 1, pp. 13–21, 2002.
[20] K. Iwanaga et al., "Endoscopic examination of obstructive sleep apnea syndrome patients during drug-induced sleep," Acta Oto-Laryngol., no. 550, pp. 36–40, 2003.
[21] V. Abdullah, Y. Wing, and C. Van Hasselt, "Video sleep nasendoscopy: The Hong Kong experience," Otolaryng. Clin. North Amer., vol. 36, no. 3, pp. 461–471, 2003.
[22] C. Vicini et al., "The nose oropharynx hypopharynx and larynx (NOHL) classification: A new system of diagnostic standardized examination for OSAHS patients," Eur. Arch. Oto-Rhino-Laryngol., vol. 269, no. 4, pp. 1297–1300, 2012.
[23] J. Schaefer, "How can one recognize a velum snorer?" Laryngo-Rhino-Otologie, vol. 68, no. 5, pp. 290–294, May 1989.
[24] E. J. Kezirian, W. Hohenhorst, and N. de Vries, "Drug-induced sleep endoscopy: The VOTE classification," Eur. Arch. Oto-Rhino-Laryngol., vol. 268, no. 8, pp. 1233–1236, 2011.
[25] C. Janott et al., "VOTE versus ACLTE: Comparison of two snoring noise classifications using machine learning methods," HNO, vol. 67, no. 9, pp. 670–678, 2019.
[26] N. S. Hessel and N. de Vries, "Diagnostic work-up of socially unacceptable snoring," Eur. Arch. Oto-Rhino-Laryngol., vol. 259, no. 3, pp. 158–161, 2002.
[27] J. Schäfer and W. Pirsig, "Digital signal analysis of snoring sounds in children," Int. J. Pediatr. Otorhinolaryngol., vol. 20, no. 3, pp. 193–202, 1990.
[28] S. Quinn, L. Huang, P. Ellis, and J. F. Williams, "The differentiation of snoring mechanisms using sound analysis," Clin. Otolaryngol. Allied Sci., vol. 21, no. 2, pp. 119–123, 1996.
[29] S. Miyazaki, Y. Itasaka, K. Ishikawa, and K. Togawa, "Acoustic analysis of snoring and the site of airway obstruction in sleep related respiratory disorders," Acta Oto-Laryngol., vol. 118, no. 537, pp. 47–51, 1998.
[30] P. Hill, B. Lee, J. Osborne, and E. Osman, "Palatal snoring identified by acoustic crest factor analysis," Physiol. Meas., vol. 20, no. 2, pp. 167–174, 1999.
[31] P. Hill, E. Osman, J. Osborne, and B. Lee, "Changes in snoring during natural sleep identified by acoustic crest factor analysis at different times of night," Clin. Otolaryngol. Allied Sci., vol. 25, no. 6, pp. 507–510, 2000.
[32] S. Agrawal, P. Stone, K. McGuinness, J. Morris, and A. Camilleri, "Sound frequency analysis and the site of snoring in natural and induced sleep," Clin. Otolaryngol. Allied Sci., vol. 27, no. 3, pp. 162–166, 2002.
[33] N. Saunders, P. Tassone, G. Wood, A. Norris, M. Harries, and B. Kotecha, "Is acoustic analysis of snoring an alternative to sleep nasendoscopy?" Clin. Otolaryngol. Allied Sci., vol. 29, no. 3, pp. 242–246, 2004.
[34] R. J. Beeton, I. Wells, P. Ebden, H. Whittet, and J. Clarke, "Snore site discrimination using statistical moments of free field snoring sounds recorded during sleep nasendoscopy," Physiol. Meas., vol. 28, no. 10, pp. 1225–1236, 2007.
[35] A. K. Ng, T. San Koh, E. Baey, T. H. Lee, U. R. Abeyratne, and K. Puvanendran, "Could formant frequencies of snore signals be an alternative means for the diagnosis of obstructive sleep apnea?" Sleep Med., vol. 9, no. 8, pp. 894–898, 2008.
[36] A. K. Ng, T. San Koh, E. Baey, and K. Puvanendran, "Role of upper airway dimensions in snore production: Acoustical and perceptual findings," Ann. Biomed. Eng., vol. 37, no. 9, pp. 1807–1817, 2009.
[37] T. Murry and R. C. Bone, "Acoustic characteristics of speech following uvulopalatopharyngoplasty," Laryngoscope, vol. 99, no. 12, pp. 1217–1219, 1989.
[38] J. R. Deller Jr, J. H. L. Hansen, and J. G. Proakis, Discrete-Time Processing of Speech Signals. New York, NY, USA: Wiley-IEEE Press, 1999.
[39] A. Behrman, M. J. Shikowitz, and S. Dailey, "The effect of upper airway surgery on voice," Otolaryngol. Head Neck Surg., vol. 127, no. 1, pp. 36–42, 2002.
[40] G. Bertino, E. Matti, S. Migliazzi, F. Pagella, C. Tinelli, and M. Benazzo, "Acoustic changes in voice after surgery for snoring: Preliminary results," Acta Otorhinolaryngol. Italica, vol. 26, no. 2, p. 110, 2006.
[41] A. K. Ng and T. San Koh, "Analysis and modeling of snore source flow with its preliminary application to synthetic snore generation," IEEE Trans. Biomed. Eng., vol. 57, no. 3, pp. 552–560, Mar. 2010.
[42] L. Huang, S. J. Quinn, P. D. Ellis, and J. E. F. Williams, "Biomechanics of snoring," Endeavour, vol. 19, no. 3, pp. 96–100, 1995.
[43] L. Huang, "Mechanical modeling of palatal snoring," J. Acoust. Soc. Amer., vol. 97, no. 6, pp. 3642–3648, 1995.
[44] K. Qian, Y. Fang, Z. Xu, and H. Xu, "All night analysis of snoring signals by formant features," in Proc. Int. Conf. Comput. Sci. Electron. Eng., Hangzhou, P. R. China, 2013, pp. 984–987.
[45] Y. Wu, Z. Zhao, K. Qian, Z. Xu, and H. Xu, "Analysis of long duration snore related signals based on formant features," in Proc. ITA, Chengdu, P. R. China, 2013, pp. 91–95.
[46] K. Qian, Y. Fang, and H. Xu, "A method for monitoring the variations in the upper airway of individual OSAHS patients by observing two acoustic feature tracks," in Proc. Appl. Mech. Mater., vol. 380–384, Trans Tech Publications Ltd., Stafa-Zurich, Switzerland, 2013, pp. 971–974.
[47] H. Xu, W. Huang, L. Yu, and L. Chen, "Sound spectral analysis of snoring sound and site of obstruction in obstructive sleep apnea syndrome," Acta Oto-Laryngol., vol. 130, no. 10, pp. 1175–1179, 2010.
[48] H. Peng et al., "Acoustic analysis of snoring sounds originating from different sources determined by drug-induced sleep endoscopy," Acta Oto-Laryngol., vol. 137, no. 8, pp. 872–876, 2017.
[49] M. Herzog et al., "Evaluation of acoustic characteristics of snoring sounds obtained during drug-induced sleep endoscopy," Sleep Breathing, vol. 3, no. 19, pp. 1011–1019, 2014.
[50] K. Qian, Y. Fang, Z. Xu, and H. Xu, "Comparison of two acoustic features for classification of different snore signals," Chin. J. Electron Devices, vol. 36, no. 4, pp. 455–459, 2013.
[51] K. Qian, Z. Xu, H. Xu, and B. P. Ng, "Automatic detection of inspiration related snoring signals from original audio recording," in Proc. ChinaSIP, Xi'an, China, 2014, pp. 95–99.
[52] K. Qian, Z. Xu, H. Xu, Y. Wu, and Z. Zhao, "Automatic detection, segmentation and classification of snore related signals from overnight audio recording," IET Signal Process., vol. 9, no. 1, pp. 21–29, 2015.
[53] T. Cover and P. Hart, "Nearest neighbor pattern classification," IEEE Trans. Inf. Theory, vol. 13, no. 1, pp. 21–27, 1967.
[54] M. N. Murty and V. S. Devi, Pattern Recognition: An Algorithmic Approach. Dordrecht, Netherlands: Springer Science & Business Media, 2011.
[55] K. Qian, C. Janott, Z. Zhang, C. Heiser, and B. Schuller, "Wavelet features for classification of VOTE snore sounds," in Proc. Int. Conf. Acoust., Speech, Signal Process., Shanghai, P. R. China, 2016, pp. 221–225.
[56] B. Schuller, S. Steidl, and A. Batliner, "The INTERSPEECH 2009 emotion challenge," in Proc. INTERSPEECH, Brighton, U.K., 2009, pp. 312–315.
[57] M. Schmitt et al., "A bag-of-audio-words approach for snore sounds excitation localisation," in Proc. ITG Speech Commun., Paderborn, Germany, 2016, pp. 230–234.
[58] K. Qian et al., "Classification of the excitation location of snore sounds in the upper airway by acoustic multi-feature analysis," IEEE Trans. Biomed. Eng., vol. 64, no. 8, pp. 1731–1741, 2017.
[59] N. E. Huang et al., "The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis," Proc. Roy. Soc. London A: Math., Phys. Eng. Sci., vol. 454, no. 1971, pp. 903–995, 1998.
[60] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Ann. Eugenics, vol. 7, no. 2, pp. 179–188, 1936.
[61] C. Cortes and V. Vapnik, "Support-vector networks," Mach. Learn., vol. 20, no. 3, pp. 273–297, 1995.
[62] L. Breiman, "Random forests," Mach. Learn., vol. 45, no. 1, pp. 5–32, 2001.
[63] C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY, USA: Springer, 2006.
[64] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: A new learning scheme of feedforward neural networks," in Proc. IJCNN, Budapest, Hungary, 2004, pp. 985–990.
[65] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: Theory and applications," Neurocomputing, vol. 70, no. 1, pp. 489–501, 2006.
[66] G.-B. Huang, H. Zhou, X. Ding, and R. Zhang, "Extreme learning machine for regression and multiclass classification," IEEE Trans. Syst., Man, Cybern., Part B (Cybern.), vol. 42, no. 2, pp. 513–529, Apr. 2012.
[67] K. Kira and L. A. Rendell, "The feature selection problem: Traditional methods and a new algorithm," in Proc. AAAI, vol. 2, San Jose, CA, USA, 1992, pp. 129–134.
[68] I. Kononenko, E. Šimec, and M. Robnik-Šikonja, "Overcoming the myopia of inductive learning algorithms with RELIEFF," Appl. Intell., vol. 7, no. 1, pp. 39–55, 1997.
[69] B. Schuller et al., "The INTERSPEECH 2017 computational paralinguistics challenge: Addressee, cold & snoring," in Proc. INTERSPEECH, Stockholm, Sweden, 2017, pp. 3442–3446.
[70] T. Jones, M. Ho, J. Earis, A. Swift, and P. Charters, "Acoustic parameters of snoring sound to compare natural snores with snores during steady-state propofol sedation," Clin. Otolaryngol., vol. 31, no. 1, pp. 46–52, 2006.
[71] K. Qian et al., "Teaching machines on snoring: A benchmark on computer audition for snore sound excitation localisation," Archives Acoust., vol. 43, no. 3, pp. 465–475, 2018.
[72] S. McCandless, "An algorithm for automatic formant extraction using linear prediction spectra," IEEE Trans. Acoust., Speech, Signal Process., vol. 22, no. 2, pp. 135–141, Apr. 1974.
[73] R. C. Snell and F. Milinazzo, "Formant location from LPC analysis data," IEEE Trans. Speech Audio Process., vol. 1, no. 2, pp. 129–134, Apr. 1993.
[74] K. Qian, Automatic General Audio Signal Classification. Munich, Germany: Technical University of Munich, 2018, Doctoral Thesis.
[75] K. Qian et al., "A bag of wavelet features for snore sound classification," Ann. Biomed. Eng., vol. 47, no. 4, pp. 1000–1011, 2019.
[76] D. O'Shaughnessy, Speech Communication: Human and Machine. New York, NY, USA: Addison-Wesley, 1987.
[77] R. N. Khushaba, Application of Biosignal-driven Intelligent Systems for Multifunction Prosthesis Control. Sydney, Australia: University of Technology Sydney, 2010, Doctoral Thesis.
[78] K. Qian et al., "Snore sound recognition: On wavelets and classifiers from deep nets to kernels," in Proc. EMBC, Jeju Island, Korea, 2017, pp. 3737–3740.
[79] R. N. Khushaba, S. Kodagoda, S. Lal, and G. Dissanayake, "Driver drowsiness classification using fuzzy wavelet-packet-based feature-extraction algorithm," IEEE Trans. Biomed. Eng., vol. 58, no. 1, pp. 121–131, 2011.
[80] M. V. A. Rao, S. Yadav, and P. K. Ghosh, "A dual source-filter model of snore audio for snorer group classification," in Proc. INTERSPEECH, Stockholm, Sweden, 2017, pp. 3502–3506.
[81] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," J. Acoust. Soc. Amer., vol. 87, no. 4, pp. 1738–1752, 1990.
[82] H. Kaya and A. A. Karpov, "Introducing weighted kernel classifiers for handling imbalanced paralinguistic corpora: Snoring, addressee and cold," in Proc. INTERSPEECH, Stockholm, Sweden, 2017, pp. 3527–3531.
[83] H. Hermansky and N. Morgan, "RASTA processing of speech," IEEE Trans. Speech Audio Process., vol. 2, no. 4, pp. 578–589, Oct. 1994.
[84] F. Vesperini, A. Galli, L. Gabrielli, E. Principi, and S. Squartini, "Snore sounds excitation localization by using scattering transform and deep neural networks," in Proc. IJCNN, Rio de Janeiro, Brazil, 2018, pp. 1–8.
[85] J. Andén and S. Mallat, "Deep scattering spectrum," IEEE Trans. Signal Process., vol. 62, no. 16, pp. 4114–4128, 2014.
[86] S. Mallat, "Group invariant scattering," Commun. Pure Appl. Mathematics, vol. 65, no. 10, pp. 1331–1398, 2012.
[87] T. Ojala, M. Pietikainen, and T. Maenpaa, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 7, pp. 971–987, 2002.
[88] F. Demir, A. Sengur, N. Cummins, S. Amiriparian, and B. Schuller, "Low level texture features for snore sound discrimination," in Proc. EMBC, Honolulu, HI, USA, 2018, pp. 413–416.
[89] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. CVPR, vol. 1, San Diego, CA, USA, 2005, pp. 886–893.
[90] B. Schuller et al., "The INTERSPEECH 2013 computational paralinguistics challenge: Social signals, conflict, emotion, autism," in Proc. INTERSPEECH, Lyon, France, 2013, pp. 148–152.
[91] L. T. Nwe, D. H. Tran, T. Z. W. Ng, and B. Ma, "An integrated solution for snoring sound classification using Bhattacharyya distance based GMM supervectors with SVM, feature selection with random forest and spectrogram with CNN," in Proc. INTERSPEECH, Stockholm, Sweden, 2017, pp. 3467–3471.
[92] G. Gosztolya, R. Busa-Fekete, T. Grósz, and L. Tóth, "DNN-based feature extraction and classifier combination for child-directed speech, cold and snoring identification," in Proc. INTERSPEECH, Stockholm, Sweden, 2017, pp. 3522–3526.
[93] G. Gosztolya and R. Busa-Fekete, "Posterior calibration for multi-class paralinguistic classification," in Proc. SLT, Athens, Greece, 2018, pp. 119–125.
[94] F. Eyben et al., "The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing," IEEE Trans. Affective Comput., vol. 7, no. 2, pp. 190–202, 2015.
[95] F. Eyben, M. Wöllmer, and B. Schuller, "openSMILE: The Munich versatile and fast open-source audio feature extractor," in Proc. ACM MM, Florence, Italy, 2010, pp. 1459–1462.
[96] F. Eyben, F. Weninger, F. Gross, and B. Schuller, "Recent developments in openSMILE, the Munich open-source multimedia feature extractor," in Proc. ACM MM, Barcelona, Spain, 2013, pp. 835–838.
[97] L. Rabiner and B. Juang, "An introduction to hidden Markov models," IEEE ASSP Mag., vol. 3, no. 1, pp. 4–16, 1986.
[98] J. L. Elman, "Finding structure in time," Cognitive Sci., vol. 14, no. 2, pp. 179–211, 1990.
[99] F. Eyben, Real-time Speech and Music Classification by Large Audio Feature Space Extraction. Cham, Switzerland: Springer International Publishing, 2015, Doctoral Thesis.
[100] Z. S. Harris, "Distributional structure," Word, vol. 10, no. 2–3, pp. 146–162, 1954.
[101] F. Weninger, P. Staudt, and B. Schuller, "Words that fascinate the listener: Predicting affective ratings of on-line lectures," Int. J. Distance Edu. Technol., vol. 11, no. 2, pp. 110–123, 2013.
[102] J. Sivic and A. Zisserman, "Efficient visual search of videos cast as text retrieval," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 4, pp. 591–606, 2009.
[103] J. Wu, W.-C. Tan, and J. M. Rehg, "Efficient and effective visual codebook generation using additive kernels," J. Mach. Learn. Res., vol. 12, pp. 3097–3118, Nov. 2011.
[104] D. Arthur and S. Vassilvitskii, "k-means++: The advantages of careful seeding," in Proc. ACM-SIAM SODA, New Orleans, LA, USA, 2007, pp. 1027–1035.
[105] S. Rawat, P. F. Schulam, S. Burger, D. Ding, Y. Wang, and F. Metze, "Robust audio-codebooks for large-scale event detection in consumer videos," in Proc. INTERSPEECH, Lyon, France, 2013, pp. 2929–2933.
[106] C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY, USA: Springer, 2006.
[107] S. Pancoast and M. Akbacak, "Bag-of-audio-words approach for multimedia event classification," in Proc. INTERSPEECH, Portland, OR, USA, 2012, pp. 2105–2108.
[108] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Trans. Speech Audio Process., vol. 3, no. 1, pp. 72–83, 1995.
[109] B. L. Pellom and J. H. Hansen, "An efficient scoring algorithm for Gaussian mixture model based speaker identification," IEEE Signal Process. Lett., vol. 5, no. 11, pp. 281–284, 1998.
[110] C. H. You, K. A. Lee, and H. Li, "GMM-SVM kernel with a Bhattacharyya-based distance for speaker recognition," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 6, pp. 1300–1312, 2010.
[111] T. K. Moon, "The expectation-maximization algorithm," IEEE Signal Process. Mag., vol. 13, no. 6, pp. 47–60, 1996.
[112] J.-L. Gauvain and C.-H. Lee, "Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains," IEEE Trans. Speech Audio Process., vol. 2, no. 2, pp. 291–298, Apr. 1994.
[113] A. Bhattacharyya, "On a measure of divergence between two statistical populations defined by their probability distributions," Bulletin Calcutta Math. Soc., vol. 35, pp. 99–109, 1943.
[114] S. Kullback and R. A. Leibler, "On information and sufficiency," Ann. Math. Stat., vol. 22, no. 1, pp. 79–86, 1951.
[115] F. Perronnin and C. Dance, "Fisher kernels on visual vocabularies for image categorization," in Proc. CVPR, Minneapolis, MN, USA, 2007, pp. 1–8.
[116] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[117] G. Hinton et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, 2012.
[118] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. NIPS, Lake Tahoe, NV, USA, 2012, pp. 1097–1105.
[119] C. Szegedy, A. Toshev, and D. Erhan, "Deep neural networks for object detection," in Proc. NIPS, Stateline, NV, USA, 2013, pp. 2553–2561.
[120] D. Ravì et al., "Deep learning for health informatics," IEEE J. Biomed. Health Informat., vol. 21, no. 1, pp. 4–21, 2017.
[121] J. Andreu-Perez, C. C. Poon, R. D. Merrifield, S. T. Wong, and G.-Z. Yang, "Big data for health," IEEE J. Biomed. Health Inf., vol. 19, no. 4, pp. 1193–1208, 2015.
[122] M. Viceconti, P. Hunter, and R. Hose, "Big data, big knowledge: Big data for personalized healthcare," IEEE J. Biomed. Health Inf., vol. 19, no. 4, pp. 1209–1215, 2015.
[123] Z. Zhang, J. Han, K. Qian, C. Janott, Y. Guo, and B. Schuller, “Snore-GANs: Improving automatic snore sound classification with synthesized data,” IEEE J. Biomed. Health Inf., vol. 24, no. 1, pp. 300–310, 2020.
[124] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel, “Handwritten digit recognition with a back-propagation network,” in Proc. NIPS, Denver, CO, USA, 1989, pp. 396–404.
[125] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, 2010.
[126] S. Dieleman and B. Schrauwen, “End-to-end learning for music audio,” in Proc. ICASSP, Florence, Italy, 2014, pp. 6964–6968.
[127] I. Goodfellow et al., “Generative adversarial nets,” in Proc. NIPS, Montreal, Canada, 2014, pp. 2672–2680.
[128] S. Amiriparian et al., “Snore sound classification using image-based deep spectrum features,” in Proc. INTERSPEECH, Stockholm, Sweden, 2017, pp. 3512–3516.
[129] M. Freitag, S. Amiriparian, N. Cummins, M. Gerczuk, and B. Schuller, “An end-to-evolution hybrid approach for snore sound classification,” in Proc. INTERSPEECH, Stockholm, Sweden, 2017, pp. 3507–3511.
[130] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2014, arXiv:1409.1556.
[131] R. Cheng and Y. Jin, “A competitive swarm optimizer for large scale optimization,” IEEE Trans. Cybern., vol. 45, no. 2, pp. 191–204, Feb. 2015.
[132] S. Gu, R. Cheng, and Y. Jin, “Feature selection for high-dimensional classification using a competitive swarm optimizer,” Soft Comput., vol. 22, no. 3, pp. 811–822, 2018.
[133] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[134] G. Trigeorgis et al., “Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network,” in Proc. ICASSP, Shanghai, P. R. China, 2016, pp. 5200–5204.
[135] J. Wang, H. Strömfelt, and B. W. Schuller, “A CNN-GRU approach to capture time-frequency pattern interdependence for snore sound classification,” in Proc. EUSIPCO, Rome, Italy, 2018, pp. 997–1001.
[136] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” in Proc. NIPS Deep Learn. Represent. Learn. Workshop, Montreal, Canada, 2014, pp. 1–9.
[137] M. Schmitt and B. Schuller, “End-to-end audio classification with small datasets – Making it work,” in Proc. EUSIPCO, A Coruña, Spain, 2019, pp. 1–5.
[138] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” J. Artif. Intell. Res., vol. 16, pp. 321–357, 2002.
[139] D. Amodei et al., “Deep speech 2: End-to-end speech recognition in English and Mandarin,” in Proc. ICML, New York, NY, USA, 2016, pp. 173–182.
[140] A. Azarbarzin and Z. Moussavi, “Do anthropometric parameters change the characteristics of snoring sound?” in Proc. EMBC, Boston, MA, USA, 2011, pp. 1749–1752.
[141] A. Adadi and M. Berrada, “Peeking inside the black-box: A survey on explainable artificial intelligence (XAI),” IEEE Access, vol. 6, pp. 52138–52160, 2018.
[142] S. M. Lundberg et al., “Explainable machine-learning predictions for the prevention of hypoxaemia during surgery,” Nature Biomed. Eng., vol. 2, no. 10, pp. 749–760, 2018.
[143] H. Lee et al., “An explainable deep-learning algorithm for the detection of acute intracranial haemorrhage from small datasets,” Nature Biomed. Eng., vol. 3, no. 3, p. 173, 2019.
[144] S. Sarkar, T. Weyde, A. Garcez, G. G. Slabaugh, S. Dragicevic, and C. Percy, “Accuracy and interpretability trade-offs in machine learning applied to safer gambling,” in Proc. CEUR Workshop, vol. 1773, Barcelona, Spain, 2016, pp. 1–9.
[145] Z. Ren, Q. Kong, J. Han, M. Plumbley, and B. Schuller, “Attention-based atrous convolutional neural networks: Visualisation and understanding perspectives of acoustic scenes,” in Proc. ICASSP, Brighton, U.K., 2019, pp. 56–60.
[146] J. Han, Z. Zhang, Z. Ren, and B. Schuller, “Implicit fusion by joint audiovisual training for emotion recognition in mono modality,” in Proc. ICASSP, Brighton, U.K., 2019, pp. 5861–5865.
[147] M. A. Nicolaou, H. Gunes, and M. Pantic, “Continuous prediction of spontaneous affect from multiple cues and modalities in valence-arousal space,” IEEE Trans. Affective Comput., vol. 2, no. 2, pp. 92–105, 2011.
[148] T. Scheffer, C. Decomain, and S. Wrobel, “Active hidden Markov models for information extraction,” in Proc. IDA, Cascais, Portugal, 2001, pp. 309–318.
[149] M. Weber, M. Welling, and P. Perona, “Unsupervised learning of models for recognition,” in Proc. ECCV, Dublin, Ireland, 2000, pp. 18–32.
[150] X. Zhu and A. B. Goldberg, “Introduction to semi-supervised learning,” Synth. Lectures Artif. Intell. Mach. Learn., vol. 3, no. 1, pp. 1–130, 2009.
[151] B. Settles, “Active learning literature survey,” University of Wisconsin–Madison, Madison, WI, USA, Computer Sciences Technical Report 1648, 2009.
[152] Z. Zhang, E. Coutinho, J. Deng, and B. Schuller, “Cooperative learning and its application to emotion recognition from speech,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 1, pp. 115–126, 2015.
[153] M. Schmitt and B. W. Schuller, “openXBOW – Introducing the Passau open-source crossmodal bag-of-words toolkit,” J. Mach. Learn. Res., vol. 18, no. 96, pp. 1–5, 2017.
[154] M. Freitag, S. Amiriparian, S. Pugachevskiy, N. Cummins, and B. Schuller, “auDeep: Unsupervised learning of representations from audio with deep recurrent neural networks,” J. Mach. Learn. Res., vol. 18, no. 1, pp. 6340–6344, 2017.
[155] P. Tzirakis, S. Zafeiriou, and B. Schuller, “End2You – The Imperial toolkit for multimodal profiling by end-to-end learning,” 2018, arXiv:1802.01115.