Chord Detection Using Deep Learning
In this paper, we utilize deep learning to learn high-level features for audio chord detection. The learned features, obtained by a deep network with a bottleneck architecture, give promising results and outperform state-of-the-art systems. We present and evaluate the results for various methods and configurations, including input pre-processing, a bottleneck architecture, and SVMs vs. HMMs for chord classification.

1. INTRODUCTION

The goal of automatic chord detection is the automatic recognition of the chord progression in a music recording. It is an important task in the analysis of western music and music transcription in general, and it can contribute to applications such as key detection, structural segmentation, music similarity measures, and other semantic analysis tasks. Despite early successes in chord detection by using pitch chroma features [6] and Hidden Markov Models (HMMs) [26], recent attempts at further increasing the detection accuracy have met with only moderate success [4, 28].

In recent years, deep learning approaches have gained significant interest in the machine learning community as a way of building hierarchical representations from large amounts of data. Deep learning has been applied successfully in various fields; for instance, a system for speech recognition utilizing deep learning was able to outperform state-of-the-art systems not using deep learning [10]. Several studies indicate that deep learning methods can be very successful when applied to Music Information Retrieval (MIR) tasks, especially when used for feature learning [1,9,13,16]. Deep learning, with its potential to untangle complicated patterns in a large amount of data, should be well suited for the task of chord detection.

In this work, we investigate Deep Networks (DNs) for learning high-level and more representative features in the context of chord detection, effectively replacing the widely used pitch chroma intermediate representation. We present individual results for different pre-processing options such as time splicing and filtering (see Sect. 3.2), architectures (see Sect. 3.4), and output classifiers (see Sect. 4).

During the past decade, deep learning has been considered by the machine learning community to be one of the most interesting and intriguing research topics. Deep architectures promise to remove the necessity of custom-designed and manually selected features, as neural networks should be more powerful in disentangling interacting factors and thus be able to create meaningful high-level representations of the input data. Generally speaking, deep learning combines deep neural networks with an unsupervised learning model. Two major learning models are widely used for unsupervised learning: Restricted Boltzmann Machines (RBMs) [11] and Sparse Auto Encoders [24]. A deep architecture comprises multiple stacked layers based on one of these two models. These layers can be trained one by one, a process that is referred to as “pre-training” the network. In this work, we employ RBMs to pre-train the deep architecture in an unsupervised fashion; this is called a Deep Belief Network (DBN) [11]. DBNs, composed of a stack of RBMs, essentially share the same topology with general neural networks: DBNs are generative probabilistic models with one visible layer and several hidden layers.

Since Hinton et al. proposed a fast learning algorithm for DBNs [11], it has been widely used for initializing deep neural networks. In deep structures, each layer learns relationships between units in lower layers. The complexity of the system increases with an increasing number of RBM layers, making the structure, in theory, more powerful. An extra softmax output layer can be added to the top of the network (see Eqn (6)) [18]; its output can be interpreted as the likelihood of each class.

LeCun and Bengio introduced the idea of applying Convolutional Neural Networks (CNNs) to images, speech, and other time-series signals [15]. This approach makes it possible to deal with the variability in time and space to a certain degree, as CNNs can be seen as a special type of neural network in which the weights are shared across the input within a certain spatial or temporal area. The weights thus act as a kernel filter applied to the input. CNNs have been particularly successful in image analysis. For example, Norouzi et al. used Convolutional RBMs to learn shift-invariant features [22].

The results of a network depend largely on the network architecture. For example, Grezl et al. used a so-called bottleneck architecture neural network to obtain features for speech recognition and showed that these features improve the accuracy of the task [8]. The principle behind

© Xinquan Zhou, Alexander Lerch. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Xinquan Zhou, Alexander Lerch. “Chord Detection Using Deep Learning”, 16th International Society for Music Information Retrieval Conference, 2015.
Proceedings of the 16th ISMIR Conference, Málaga, Spain, October 26-30, 2015 53
\[ y_1(n) = \sum_{k=1}^{N} a^{k+1} \, x(n - N + k) \tag{3} \]

\[ y_2(n) = \sum_{k=1}^{N} a^{k+1} \, x(n + N - k) \tag{4} \]

The filter length is N and a is the exponential base. These two filters are no longer centered around the current frame but are shifted by N frames. Their impulse responses are symmetric to each other. One could interpret these filters as focusing on past and future frames, respectively. The presented filters will be referred to as “extension filters”.

The ideas of splicing and convolution can be combined, as exemplified in Fig. 3.

Furthermore, similar to the process in CNNs, a maximum pooling operation on the output of the spliced filters is optionally applied. The operation takes the maximum value among different filters per “bin”.

The network is implemented using the Kaldi package developed by Johns Hopkins University [25].

4. CLASSIFICATION

The output of the softmax layer can be interpreted as the likelihood of each chord class; simply taking the maximum will provide a class decision (this method will be referred to as Argmax). Alternatively, the output can be treated as an intermediate feature vector that can be used as an input to other classifiers for computing the final decision.

4.1 Support Vector Machine

Support Vector Machines (SVMs) are widely used classifiers with generally good performance. The SVM is trained using the output of the network as features, and the classification is carried out frame by frame. The classification is followed by a simple prediction smoothing.
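The Argmax decision and a subsequent prediction smoothing can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the smoothing method is not specified in the text, so a sliding majority vote is assumed here, and the function names, the window length, and the toy likelihoods are hypothetical.

```python
from collections import Counter


def argmax_decision(likelihoods):
    """Return the index of the most likely chord class for each frame.

    likelihoods: list of frames, each a list of per-class likelihoods
    (for example, the softmax outputs of the network).
    """
    return [max(range(len(frame)), key=frame.__getitem__) for frame in likelihoods]


def majority_smooth(labels, window=3):
    """Smooth frame-wise labels with a sliding majority vote.

    Assumed smoothing scheme (not taken from the paper): each frame is
    replaced by the most frequent label within a window centered on it.
    """
    half = window // 2
    smoothed = []
    for n in range(len(labels)):
        neighborhood = labels[max(0, n - half): n + half + 1]
        smoothed.append(Counter(neighborhood).most_common(1)[0][0])
    return smoothed


# Toy example: 7 frames, 3 chord classes; frame 2 is an isolated outlier.
likelihoods = [
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.20, 0.70, 0.10],
    [0.90, 0.05, 0.05],
    [0.60, 0.30, 0.10],
    [0.10, 0.10, 0.80],
    [0.20, 0.10, 0.70],
]
frame_labels = argmax_decision(likelihoods)
smoothed_labels = majority_smooth(frame_labels, window=3)
```

In this toy sequence, the isolated class-1 decision in the third frame is removed by the majority vote, while the sustained change to class 2 at the end is kept.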
improve the WCSR compared to the direct (Argmax) output of the network. Apparently, the SVM is not able to improve separability of the learned output features.

6.2 Pre-processing

As stated in Sect. 3.2, we are interested in the application of different filters in the pre-processing stage. In the first experiment (Filtering), an anti-causal single pole filter (see Eqn (2)) is evaluated with the parameter α set to 0.25, 0.5, and 0.75, respectively. The second experiment (Spliced Filters) splices these filter outputs with the outputs of the extension filters as introduced in Sect. 3.2. These experiments are carried out with the DN_{DBN-DNN} training scenario, a bottleneck architecture, and an HMM classifier. Table 2 lists the results of these pre-processing variants. It can be observed that the network trained with filtered inputs slightly outperforms the network without pre-processing; splicing the filtered input with the extension filter outputs increases the results drastically.

6.3 Architecture

6.3.1 Common vs. Bottleneck

The results of Grezl et al. indicate that a bottleneck architecture should be more suitable for learning high-level features than a common architecture and should reduce overfitting [8]. In order to verify these characteristics for our task, the performance of both architectures is compared. The results are listed in Table 3 for three pre-processing scenarios: no additional pre-processing (None), Spliced Filters, and spliced filters followed by a max pooling (Pooling). In order to allow conclusions about overfitting, the WCSR of both the test set and the training set is reported. All results are computed for the DN_{DBN-DNN} training scenario with HMM classifiers.

The results show that the bottleneck architecture gives significantly better results (p = 0.023) on the test set (WCSR). Note that this is not true for the training set (Training WCSR), for which the common architecture achieves results in the same range as or better than the bottleneck architecture. The difference between the results on the training set and the test set is thus much larger for the common architecture than for the bottleneck architecture. The bottleneck architecture is clearly advantageous to use in this task: it reduces complexity and thus the training workload, and it increases the classification performance significantly. Furthermore, the comparison of classifier performance between training and test set in Table 3 clearly indicates that the common architecture tends to fit more to the training data and is thus more prone to overfitting.

6.3.2 Single-Label vs. Multi-Label

As mentioned above, the pitch chroma is the standard feature representation for audio chord detection. Since we use the output of our deep network as a feature, it seems an intuitive choice to learn pitch class information (and thus, a pitch chroma) instead of the chord classes. By doing so, the number of outputs is reduced by a factor of two (or higher in the case of more chords), and there would also be a closer relation between the output and the input representation, the CQT. Therefore, the abstraction and complexity of the task might be decreased. It will, however, lead to another issue: the single-label output (one chord per output) will be changed into a multi-label output (multiple pitches per output). Therefore, the learning has to be modified to allow multiple simultaneous (pitch class) labels. The experiment is carried out with both Splicing and Filtering in the pre-processing, the DN_{DBN-DNN} training scenario, and HMM classifiers. Table 4 lists the results.

Boulanger-Lewandowski et al. combine chroma features with chord labels for their recurrent neural network and report a slightly improved result [2]. They do not, however, provide a detailed description of this combination. As can be seen from the table, the result for multi-label training is clearly lower than the result for single-label training.
7. CONCLUSION & FUTURE WORK

In this work, we presented a system which applies deep learning to the MIR task of automatic chord detection. Our model is able to learn high-level probabilistic representations for chords across various configurations. We have shown that the use of a bottleneck architecture is advantageous as it reduces overfitting and increases classifier performance, and that the choice of appropriate input filtering and splicing can significantly increase classifier performance. Learning a pitch class vector instead of chord likelihood

[8] Frantisek Grezl, Martin Karafiát, Stanislav Kontár, and J Cernocky. Probabilistic and bottle-neck features for LVCSR of meetings. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume 4, pages IV–757. IEEE, 2007.

[9] Philippe Hamel and Douglas Eck. Learning features from music audio with deep belief networks. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), pages 339–344. Utrecht, The Netherlands, 2010.
[10] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6):82–97, 2012.

[11] Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

[12] Eric J Humphrey and Juan Pablo Bello. Rethinking automatic chord recognition with convolutional neural networks. In Proceedings of the International Conference on Machine Learning and Applications (ICMLA), volume 2, pages 357–362. IEEE, 2012.

[13] Eric J Humphrey, Juan Pablo Bello, and Yann LeCun. Moving beyond feature design: Deep architectures and automatic feature learning in music informatics. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), pages 403–408, 2012.

[14] Eric J Humphrey, Taemin Cho, and Juan Pablo Bello. Learning a robust tonnetz-space transform for automatic chord recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 453–456. IEEE, 2012.

[15] Yann LeCun and Yoshua Bengio. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361:310, 1995.

[16] Honglak Lee, Peter Pham, Yan Largman, and Andrew Y Ng. Unsupervised feature learning for audio classification using convolutional deep belief networks. In Advances in Neural Information Processing Systems, 2009.

[17] Kyogu Lee and Malcolm Slaney. Acoustic chord transcription and key extraction from audio using key-dependent HMMs trained on synthesized audio. Audio, Speech, and Language Processing, IEEE Transactions on, 16(2):291–301, 2008.

[18] Thomas M Martinetz, Stanislav G Berkovich, and Klaus J Schulten. ‘Neural-gas’ network for vector quantization and its application to time-series prediction. Neural Networks, IEEE Transactions on, 4(4):558–569, 1993.

[19] Matthias Mauch, Chris Cannam, Matthew Davies, Simon Dixon, Christopher Harte, Sefki Kolozali, Dan Tidhar, and Mark Sandler. OMRAS2 metadata project 2009. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), 2009.

[20] Matthias Mauch and Simon Dixon. Approximate note transcription for the improved identification of difficult chords. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), pages 135–140, 2010.

[21] Yizhao Ni, Matt McVicar, Raul Santos-Rodriguez, and Tijl De Bie. Using hyper-genre training to explore genre information for automatic chord estimation. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), pages 109–114, 2012.

[22] Mohammad Norouzi, Mani Ranjbar, and Greg Mori. Stacks of convolutional restricted Boltzmann machines for shift-invariant feature learning. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), pages 2735–2742. IEEE, 2009.

[23] Hélene Papadopoulos and Geoffroy Peeters. Large-scale study of chord estimation algorithms based on chroma representation and HMM. In Proceedings of the International Workshop on Content-Based Multimedia Indexing (CBMI), pages 53–60. IEEE, 2007.

[24] Christopher Poultney, Sumit Chopra, Yann LeCun, et al. Efficient learning of sparse representations with an energy-based model. In Advances in Neural Information Processing Systems, pages 1137–1144, 2006.

[25] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely. The Kaldi speech recognition toolkit. In Proceedings of the Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, December 2011. IEEE Catalog No.: CFP11SRW-USB.

[26] Alexander Sheh and Daniel PW Ellis. Chord segmentation and recognition using EM-trained hidden Markov models. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), pages 185–191, 2003.

[27] J Sola and J Sevilla. Importance of input data normalization for the application of neural networks to complex industrial problems. Nuclear Science, IEEE Transactions on, 44(3):1464–1468, 1997.

[28] Yushi Ueda, Yuuki Uchiyama, Takuya Nishimoto, Nobutaka Ono, and Shigeki Sagayama. HMM-based approach for automatic chord detection using refined acoustic features. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5518–5521. IEEE, 2010.