Chord Detection Using Deep Learning
In this paper, we utilize deep learning to learn high-level features for audio chord detection. The learned features, obtained by a deep network in bottleneck architecture, give promising results and outperform state-of-the-art systems. We present and evaluate the results for various methods and configurations, including input pre-processing, a bottleneck architecture, and SVMs vs. HMMs for chord classification.

The goal of automatic chord detection is the automatic recognition of the chord progression in a music recording. It is an important task in the analysis of Western music and music transcription in general, and it can contribute to applications such as key detection, structural segmentation, music similarity measures, and other semantic analysis tasks. Despite early successes in chord detection by using pitch chroma features [6] and Hidden Markov Models (HMMs) [26], recent attempts at further increasing the detection accuracy have met with only moderate success [4, 28].

In recent years, deep learning approaches have gained significant interest in the machine learning community as a way of building hierarchical representations from large amounts of data. Deep learning has been applied successfully in various fields; for instance, a system for speech recognition utilizing deep learning was able to outperform state-of-the-art systems not using deep learning [10]. Several studies indicate that deep learning methods can be very successful when applied to Music Information Retrieval (MIR) tasks, especially when used for feature learning [1, 9, 13, 16]. Deep learning, with its potential to untangle complicated patterns in a large amount of data, should be well suited for the task of chord detection.

In this work, we investigate Deep Networks (DNs) for learning high-level and more representative features in the context of chord detection, effectively replacing the widely used pitch chroma intermediate representation. We present individual results for different pre-processing options such as time splicing and filtering (see Sect. 3.2), architectures (see Sect. 3.4), and output classifiers (see Sect. 4).

During the past decade, deep learning has been considered by the machine learning community to be one of the most interesting and intriguing research topics. Deep architectures promise to remove the necessity of custom-designed and manually selected features, as neural networks should be more powerful in disentangling interacting factors and thus be able to create meaningful high-level representations of the input data. Generally speaking, deep learning combines deep neural networks with an unsupervised learning model. Two major learning models are widely used for unsupervised learning: Restricted Boltzmann Machines (RBMs) [11] and Sparse Auto Encoders [24]. A deep architecture comprises multiple stacked layers based on one of these two models. These layers can be trained one by one, a process that is referred to as “pre-training” the network. In this work, we employ RBMs to pre-train the deep architecture in an unsupervised fashion; this is called a Deep Belief Network (DBN) [11]. DBNs, composed of a stack of RBMs, essentially share the same topology with general neural networks: DBNs are generative probabilistic models with one visible layer and several hidden layers.

Since Hinton et al. proposed a fast learning algorithm for DBNs [11], it has been widely used for initializing deep neural networks. In deep structures, each layer learns relationships between units in lower layers. The complexity of the system increases with an increasing number of RBM layers, making the structure, in theory, more powerful. An extra softmax output layer can be added to the top of the network (see Eqn (6)) [18]; its output can be interpreted as the likelihood of each class.

LeCun and Bengio introduced the idea of applying Convolutional Neural Networks (CNNs) to images, speech, and other time-series signals [15]. This approach makes it possible to deal with the variability in time and space to a certain degree, as CNNs can be seen as a special type of neural network in which the weights are shared across the input within a certain spatial or temporal area. The weights thus act as a kernel filter applied to the input. CNNs have been particularly successful in image analysis. For example, Norouzi et al. used Convolutional RBMs to learn shift-invariant features [22].

The results of a network depend largely on the network
\[ y_1(n) = \sum_{k} a^{(\cdot)} \, x(n - N + k) \qquad (3) \]
\[ y_2(n) = \sum_{k} a^{(\cdot)} \, x(n + N - k) \qquad (4) \]

The filter length is N and a is the exponential base. These two filters are no longer centered around the current frame but are shifted by N frames. Their impulse responses are symmetric to each other. One could interpret these filters as focusing on past and future frames, respectively. The presented filters will be referred to as “extension filters”.

The ideas of splicing and convolution can be combined, as exemplified in Fig. 3.

Furthermore, similar to the process in CNNs, a maximum pooling operation on the output of the spliced filters is optionally applied. The operation takes the maximum value among different filters per “bin”.

The network is implemented using the Kaldi package developed by Johns Hopkins University [25].

The output of the softmax layer can be interpreted as the likelihood of each chord class; simply taking the maximum will provide a class decision (this method will be referred to as Argmax). Alternatively, the output can be treated as an intermediate feature vector that can be used as an input to other classifiers for computing the final decision.

4.1 Support Vector Machine

Support Vector Machines (SVMs) are widely used classifiers with generally good performance. The SVM is trained using the output of the network as features, and the classification is carried out frame by frame. The classification is followed by a simple prediction smoothing.
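The frame-wise Argmax decision and a subsequent prediction smoothing can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the text does not specify the smoothing method, so the sliding-window majority vote, the function names, and the `half_width` parameter are all assumptions made for the example.

```python
from collections import Counter


def argmax_decisions(posteriors):
    """Return, per frame, the index of the largest class likelihood."""
    return [max(range(len(frame)), key=frame.__getitem__) for frame in posteriors]


def smooth_majority(labels, half_width=1):
    """Hypothetical smoothing: majority vote in a window around each frame.

    Ties keep the current label if it is among the most frequent ones;
    otherwise the first most frequent label in the window is used.
    """
    smoothed = []
    for i, current in enumerate(labels):
        lo = max(0, i - half_width)
        hi = min(len(labels), i + half_width + 1)
        window = labels[lo:hi]
        counts = Counter(window)
        best = max(counts.values())
        if counts[current] == best:
            smoothed.append(current)
        else:
            smoothed.append(next(l for l in window if counts[l] == best))
    return smoothed


# Toy softmax outputs for five frames and three chord classes.
posteriors = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],  # isolated outlier frame
    [0.8, 0.1, 0.1],
    [0.9, 0.05, 0.05],
]
raw = argmax_decisions(posteriors)
smoothed = smooth_majority(raw, half_width=1)
```

In this toy input, the isolated third frame is overruled by its neighbours. The same smoothing step could equally be applied to frame-wise SVM predictions.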
improve the WCSR compared to the direct (Argmax) output of the network. Apparently, the SVM is not able to improve separability of the learned output features.

6.2 Pre-processing

As stated in Sect. 3.2, we are interested in the application of different filters in the pre-processing stage. In the first experiment (Filtering), an anti-causal single pole filter (see Eqn (2)) is evaluated with the parameter α set to 0.25, 0.5, and 0.75, respectively. The second experiment (Spliced Filters) splices these filter outputs with the outputs of the extension filters as introduced in Sect. 3.2. These experiments are carried out with the DN_{DBN-DNN} training scenario, a bottleneck architecture, and an HMM classifier. Table 2 lists the results of these pre-processing variants. It can be observed that the network trained with filtered inputs slightly outperforms the network without pre-processing; splicing the filtered input with the extension filter outputs increases the results drastically.

6.3 Architecture

6.3.1 Common vs. Bottleneck

The results of Grezl et al. indicate that a bottleneck architecture should be more suitable for learning high-level features than a common architecture and should reduce overfitting [8]. In order to verify these characteristics for our task, the performance of both architectures is compared. The results are listed in Table 3 for three pre-processing scenarios: no additional pre-processing (None), Spliced Filters, and spliced filters followed by max pooling (Pooling). In order to allow conclusions about overfitting, the WCSR of both the test set and the training set is reported. All results are computed for the DN_{DBN-DNN} training scenario with HMM classifiers.

The results show that the bottleneck architecture gives significantly better results (p = 0.023) on the test set (WCSR). Note that this is not true for the training set (Training WCSR), for which the common architecture achieves results in the same range or better than the bottleneck architecture. The difference between the results on the training set and the test set is thus much larger for the common architecture than for the bottleneck architecture. The bottleneck architecture is clearly advantageous to use in this task: it reduces complexity and thus the training workload and increases the classification performance significantly. Furthermore, the comparison of classifier performance between training and test set in Table 3 clearly indicates that the common architecture tends to fit more to the training data, and is thus more prone to overfitting.

6.3.2 Single-Label vs. Multi-Label

As mentioned above, the pitch chroma is the standard feature representation for audio chord detection. Since we use the output of our deep network as a feature, it seems an intuitive choice to learn pitch class information (and thus, a pitch chroma) instead of the chord classes. By doing so, the number of outputs is reduced by a factor of two (or higher in the case of more chords), and there would also be a closer relation between the output and the input representation, the CQT. Therefore, the abstraction and complexity of the task might be decreased. It will, however, lead to another issue: the single-label output (one chord per output) will be changed into a multi-label output (multiple pitches per output). Therefore, the learning has to be modified to allow multiple simultaneous (pitch class) labels. The experiment is carried out with both Splicing and Filtering in the pre-processing, the DN_{DBN-DNN} training scenario, and HMM classifiers. Table 4 lists the results.

Boulanger-Lewandowski et al. report combining chroma features with chord labels for their recurrent neural network and report a slightly improved result [2]. They do not, however, provide a detailed description of this combination. As can be seen from the table, the result for multi-label training is clearly lower than the result for single-label training.
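To make the difference between the two target encodings concrete, the following sketch derives both a single-label class index and a multi-label pitch class target from a chord. It assumes a vocabulary of 24 major and minor triads, which is consistent with the stated reduction of the outputs by a factor of two but is not confirmed by the text; all names in the code are hypothetical.

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
QUALITIES = ("maj", "min")  # assumed chord vocabulary: 12 roots x 2 qualities
TRIAD_INTERVALS = {"maj": (0, 4, 7), "min": (0, 3, 7)}  # semitones above the root


def chord_to_single_label(root, quality):
    """Single-label target: one class index out of 24 (0..11 major, 12..23 minor)."""
    return PITCH_CLASSES.index(root) + 12 * QUALITIES.index(quality)


def chord_to_pitch_class_target(root, quality):
    """Multi-label target: a 12-dimensional binary vector with one entry per pitch class."""
    root_index = PITCH_CLASSES.index(root)
    target = [0] * 12
    for interval in TRIAD_INTERVALS[quality]:
        target[(root_index + interval) % 12] = 1
    return target


c_major_class = chord_to_single_label("C", "maj")
c_major_pitches = chord_to_pitch_class_target("C", "maj")
a_minor_class = chord_to_single_label("A", "min")
a_minor_pitches = chord_to_pitch_class_target("A", "min")
```

Under this assumption, each triad activates three of the twelve outputs at once, and different chords share active outputs (C major and A minor both contain C and E). This is why the training objective has to allow several simultaneous labels rather than a single softmax class.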
