Chord Detection Using Deep Learning

Xinquan Zhou, Alexander Lerch

Center for Music Technology, Georgia Institute of Technology
Proceedings of the 16th ISMIR Conference, Málaga, Spain, October 26-30, 2015

© Xinquan Zhou, Alexander Lerch. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Xinquan Zhou, Alexander Lerch. “Chord Detection Using Deep Learning”, 16th International Society for Music Information Retrieval Conference, 2015.

ABSTRACT

In this paper, we utilize deep learning to learn high-level features for audio chord detection. The learned features, obtained by a deep network in bottleneck architecture, give promising results and outperform state-of-the-art systems. We present and evaluate the results for various methods and configurations, including input pre-processing, a bottleneck architecture, and SVMs vs. HMMs for chord classification.

1. INTRODUCTION

The goal of automatic chord detection is the automatic recognition of the chord progression in a music recording. It is an important task in the analysis of Western music and music transcription in general, and it can contribute to applications such as key detection, structural segmentation, music similarity measures, and other semantic analysis tasks. Despite early successes in chord detection by using pitch chroma features [6] and Hidden Markov Models (HMMs) [26], recent attempts at further increasing the detection accuracy have met with only moderate success [4, 28].

In recent years, deep learning approaches have gained significant interest in the machine learning community as a way of building hierarchical representations from large amounts of data. Deep learning has been applied successfully in various fields; for instance, a system for speech recognition utilizing deep learning was able to outperform state-of-the-art systems not using deep learning [10]. Several studies indicate that deep learning methods can be very successful when applied to Music Information Retrieval (MIR) tasks, especially when used for feature learning [1, 9, 13, 16]. Deep learning, with its potential to untangle complicated patterns in a large amount of data, should be well suited for the task of chord detection.

In this work, we investigate Deep Networks (DNs) for learning high-level and more representative features in the context of chord detection, effectively replacing the widely used pitch chroma intermediate representation. We present individual results for different pre-processing options such as time splicing and filtering (see Sect. 3.2), architectures (see Sect. 3.4), and output classifiers (see Sect. 4).

2. RELATED WORK

During the past decade, deep learning has been considered by the machine learning community to be one of the most interesting and intriguing research topics. Deep architectures promise to remove the necessity of custom-designed and manually selected features, as neural networks should be more powerful in disentangling interacting factors and thus be able to create meaningful high-level representations of the input data. Generally speaking, deep learning combines deep neural networks with an unsupervised learning model. Two major learning models are widely used for unsupervised learning: Restricted Boltzmann Machines (RBMs) [11] and Sparse Auto Encoders [24]. A deep architecture comprises multiple stacked layers based on one of these two models. These layers can be trained one by one, a process that is referred to as “pre-training” the network. In this work, we employ RBMs to pre-train the deep architecture in an unsupervised fashion; this is called a Deep Belief Network (DBN) [11]. DBNs, composed of a stack of RBMs, essentially share the same topology with general neural networks: DBNs are generative probabilistic models with one visible layer and several hidden layers.

Since Hinton et al. proposed a fast learning algorithm for DBNs [11], it has been widely used for initializing deep neural networks. In deep structures, each layer learns relationships between units in lower layers. The complexity of the system increases with an increasing number of RBM layers, making the structure, in theory, more powerful. An extra softmax output layer can be added to the top of the network (see Eqn (6)) [18]; its output can be interpreted as the likelihood of each class.

LeCun and Bengio introduced the idea of applying Convolutional Neural Networks (CNNs) to images, speech, and other time-series signals [15]. This approach makes it possible to handle variability in time and space to a certain degree, as CNNs can be seen as a special type of neural network in which the weights are shared across the input within a certain spatial or temporal area. The weights thus act as a kernel filter applied to the input. CNNs have been particularly successful in image analysis. For example, Norouzi et al. used Convolutional RBMs to learn shift-invariant features [22].

The results of a network depend largely on the network architecture. For example, Grezl et al. used a so-called bottleneck architecture neural network to obtain features for speech recognition and showed that these features improve the accuracy of the task [8]. The principle behind
the bottleneck-shaped architecture is that the number of neurons in the middle layer is lower than in the other layers, as shown in Fig. 1. A network with a bottleneck can be structured in two sections: (i) Section 1, from the first layer to the bottleneck layer, with a gradual decrease of the number of neurons per layer, functions as an encoding or compression process which compacts relevant information and discards redundant information, and (ii) Section 2, from the bottleneck layer to the last layer, with a gradual increase in the number of neurons per layer. The function of this part can be interpreted as a decoding process. An additional benefit of bottleneck architectures is that they can reduce overfitting by decreasing the system complexity.

Figure 1. Visualization of a bottleneck architecture

Recently, more researchers have investigated deep learning in the context of MIR. Lee et al. pioneered the application of convolutional deep learning for audio feature learning [16]. Hamel et al. used the features learned from music with a DBN for both music genre classification and music auto-tagging [9]; their system was successful in MIREX 2011 with top-ranked results. Battenberg employed a conditional DBN to analyze drum patterns [1]. The use of deep architectures for chord detection, however, has not yet been explored, although modern neural networks have been employed in this field. For instance, Boulanger-Lewandowski et al. investigated recurrent neural networks [2] and Humphrey has explored CNNs [12, 14]. While they also used the concept of pre-training, their architectures have only two or three layers and thus cannot be called “deep”.

The basic building blocks of most modern approaches to chord detection can be traced back to two seminal publications: Fujishima introduced pitch chroma vectors extracted from the audio as input feature for chord detection [6], and Sheh et al. proposed to use HMMs for representing chords as hidden states and to model the transition probability of chords [26]. Since then, there have been many studies using chroma features and HMMs for chord detection [5, 23]. Examples of recent systems are Ni et al., using a genre-independent chord estimation method based on HMM and chroma features [21], and Cho and Bello, who used multi-band features and a multi-stream HMM for chord recognition [4]. Training HMMs with pitch chroma features arguably is the standard approach for this task, and the progress is marked less by major innovations than by optimizing and tuning specific components.

3. SYSTEM OVERVIEW

Figure 2 gives an overview of all components and processing steps of the presented system. The following sections discuss all of these steps in detail.

Figure 2. The overview of our system

3.1 Input Representation

The input audio is converted to a sample rate of 11.025 kHz. Then, a Constant Q transform (CQT) is applied. The CQT [3] is a perceptually inspired time-frequency transformation for audio. The resulting frequency bins are equally spaced on a logarithmic (“pitch”) scale. It has the advantage of providing a more musically and perceptually meaningful spectral representation than the DFT. We used an implementation of the CQT as a filterbank of Gabor filters, spaced at 36 bins per octave, i.e., 3 bins per semitone, yielding 180 bins representing a frequency range spanning from 110 Hz to 3.520 kHz. Finally, we used Principal Component Analysis (PCA) for decorrelation, and applied Z-Score normalization [27].

3.2 Pre-processing

Neighboring frames of the input representation can be expected to contain similar content, as chords will not change on a frame-by-frame basis. In order to take into account the relationship between the current frame and previous and future frames, we investigate the application of several pre-processing approaches.

3.2.1 Time Splicing

Time splicing is a simple way to extend the current frame with the data of neighboring frames by concatenating the frames into one larger superframe. In first order time splicing, we concatenate the current frame, the previous frame, and the following frame. Thus, each superframe consists of three neighboring frames. Since the same operation will be applied to all frames, there will be overlap introduced between neighboring superframes.
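As an illustration only, first order time splicing as described in Sect. 3.2.1 could be sketched in Python as follows. The function name `splice_frames`, the list-of-lists frame layout, and the handling of the first and last frame (the edge frame is repeated) are assumptions made for this sketch; the text does not specify how boundary frames are treated.

```python
def splice_frames(frames, order=1):
    """Concatenate every frame with its `order` previous and following frames.

    `frames` is a list of equally long feature vectors (one per time frame).
    Assumption: indices outside the signal are clamped, i.e. the first and
    last frame are repeated at the boundaries.
    """
    n = len(frames)
    superframes = []
    for i in range(n):
        spliced = []
        for offset in range(-order, order + 1):
            j = min(max(i + offset, 0), n - 1)  # clamp to the valid range
            spliced.extend(frames[j])
        superframes.append(spliced)
    return superframes


# Three toy frames with two bins each.
frames = [[1, 2], [3, 4], [5, 6]]
superframes = splice_frames(frames)
# The middle superframe holds the previous, current, and following frame.
print(superframes[1])  # [1, 2, 3, 4, 5, 6]
```

Neighboring superframes share two of their three frames, which is the overlap mentioned above.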
3.2.2 Convolution

CNNs are extensively used in tasks with highly correlated inputs (e.g., the recognition of hand-written digits). Many time series show similar properties so that CNNs seem to be an appropriate choice in the context of audio, too. Essentially, CNNs have one or more convolutional layers between the input and lower layers of the neural network. The function of a convolutional layer can be interpreted as the application of a linear filter plus a non-linear transformation, sometimes also combined with a pooling operation:

$$Y = \mathrm{pool}(\mathrm{sigm}(K \ast X + B)), \tag{1}$$

in which Y is the output of a convolutional layer, K is the linear kernel filter (i.e., the impulse response), X is the input, B is the bias, sigm() is a non-linear transform, and pool() is a down-sampling operation. The uniqueness of convolutional networks stems from the convolution operation applied to the input X. Since, unfortunately, we had no access to a deep learning toolbox with support for the convolution operation in the time domain, we opted to employ an optional pre-processing step inspired by CNNs, namely applying filters to the input of the network. However, instead of learning the filters, we evaluate several manually designed filters: a single-pole low pass filter and two FIR low pass filters with exponentially shaped impulse responses. The single-pole low pass filter produces the output y for an input x, given the parameter α:

$$y_n = (1 - \alpha)\, y_{n-1} + \alpha\, x_n \tag{2}$$

We apply anti-causal filtering and filter the signal in both directions so that the resulting overall filter has a zero-phase response.

The other two low pass filters have impulse responses shaped like an exponential decay. The difference equations are given in Eqn (3) and Eqn (4).

$$y_1(n) = \sum_{k=1}^{N} a^{k+1}\, x(n - N + k) \tag{3}$$

$$y_2(n) = \sum_{k=1}^{N} a^{k+1}\, x(n + N - k) \tag{4}$$

The filter length is N and a is the exponential base. These two filters are no longer centered around the current frame but shifted by N frames. Their impulse responses are symmetric to each other. One could interpret these filters as focusing on past and future frames, respectively. The presented filters will be referred to as “extension filters”.

The ideas of splicing and convolution can be combined, as exemplified in Fig. 3.

Figure 3. Splicing output of different filters

Furthermore, similar to the process in CNNs, a maximum pooling operation on the output of the spliced filters is optionally applied. The operation takes the maximum value among different filters per “bin”.

3.3 Training

It is impractical to train DNNs directly with back propagation using gradient descent due to their deep structure and the limited number of training samples. Therefore, the network is usually initialized by an unsupervised pre-training step. As our network consists of RBMs, Gibbs sampling can be used for training [11]. The objective is to retain as much information as possible between input and output. The computation for layer l can be represented as:

$$Y_l = \mathrm{sigm}(W_l X_l + B_l), \tag{5}$$

which is identical to many traditional neural networks. Thus, standard back propagation can be applied after pre-training to fine-tune the network in a supervised manner. The loss criterion we use in this work is cross-entropy.

3.4 Architecture

We investigate a deep network with 6 layers in two different architectures. The common architecture features the same number of neurons in every layer, in our case 1024. The bottleneck architecture has 256 neurons in the middle layer and 512 neurons in the layers neighboring the middle layer. The remaining layers consist of 1024 neurons each (compare [8]). A softmax output layer is stacked on top of both architectures as described by Eqn (6).

$$\mathrm{softmax}(Y_l) = \frac{\exp(Y_l)}{\sum_{k=1}^{N} \exp(Y_k)} \tag{6}$$

The network is implemented using the Kaldi package developed by Johns Hopkins University [25].

4. CLASSIFICATION

The output of the softmax layer can be interpreted as the likelihood of each chord class; simply taking the maximum will provide a class decision (this method will be referred to as Argmax). Alternatively, the output can be treated as an intermediate feature vector that can be used as an input to other classifiers for computing the final decision.

4.1 Support Vector Machine

Support Vector Machines (SVMs) are widely used classifiers with generally good performance. The SVM is trained using the output of the network as features, and the classification is carried out frame by frame. The classification is followed by a simple prediction smoothing.
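To make the frame-wise decision path of Sect. 4 concrete, the following Python sketch computes the softmax of Eqn (6) for one frame, takes the Argmax decision per frame, and then smooths the resulting label sequence. It is a sketch under stated assumptions, not the implementation used in this work: the function names are hypothetical, and since the text does not specify the "simple prediction smoothing", a sliding majority vote is assumed here.

```python
import math
from collections import Counter


def softmax(activations):
    """Eqn (6) for one frame; the maximum is subtracted for numerical stability."""
    peak = max(activations)
    exps = [math.exp(a - peak) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]


def argmax_decisions(posteriors):
    """Return the index of the most likely class for every frame (Argmax)."""
    return [max(range(len(p)), key=p.__getitem__) for p in posteriors]


def majority_smooth(labels, half_width=1):
    """Assumed smoothing: replace isolated labels by the window majority."""
    smoothed = []
    for i, current in enumerate(labels):
        window = labels[max(0, i - half_width): i + half_width + 1]
        counts = Counter(window)
        best = max(counts.values())
        if counts[current] == best:
            smoothed.append(current)  # keep the label if it is (co-)dominant
        else:
            smoothed.append(counts.most_common(1)[0][0])
    return smoothed


# Toy activations for five frames and three classes.
activations = [[2.0, 0.5, 0.1], [1.5, 0.7, 0.1], [0.2, 1.4, 0.1],
               [2.2, 0.1, 0.1], [2.5, 0.0, 0.0]]
posteriors = [softmax(a) for a in activations]
decisions = argmax_decisions(posteriors)
print(decisions)                   # [0, 0, 1, 0, 0]
print(majority_smooth(decisions))  # [0, 0, 0, 0, 0]
```

The single-frame excursion to class 1 is removed by the vote, which reflects the expectation that chords do not change on a frame-by-frame basis.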
4.2 Hidden Markov Model

HMMs are, as pointed out above, the standard classifier for automatic chord detection because the characteristics of the task fit the HMM approach well: chords are hidden states that can be estimated from observations (feature vectors extracted from the audio signal), and the likelihood of chord transitions can be modeled with transition probabilities. Modified HMMs such as ergodic HMMs and key-dependent HMMs have also been explored for this task [17, 23]. In this work we are mostly interested in the performance comparison between high-level features, so a simple first-order HMM is used. Given the probabilistic characteristic of the softmax output layer, its output can be used directly as emission probabilities for the HMM. Therefore, there is no need to train the HMM using, e.g., the commonly used Baum-Welch algorithm. Instead, the histogram of each class in our training set is used as initial probabilities, and the bigram of chord transitions is used to compute the transition probabilities. Finally, we employ the Viterbi decoding algorithm to find the globally optimal chord sequence.

5. EVALUATION PROCEDURE

5.1 Dataset

Our dataset is a combination of several different datasets, yielding a 317-piece collection. The data is composed of
• 180 songs from the Beatles dataset [19],
• 100 songs from the RWC Pop dataset [7],
• 18 songs from the Zweieck dataset [19], and
• 19 songs from the Queen dataset [19].
The pre-processing as described in Sect. 3.2 ensures identical input audio formats.

5.2 Methodology

The dataset is divided randomly into two parts: 80% for the training set and 20% for the test set. For training, we use a frame-based strategy, which means we divide each song into frames and treat each frame as an independent training sample. On average, each song is divided into about 1200 frames, resulting in approximately 300k training samples and approximately 76k test samples.

Within the training set, 10% of the data is used as a validation set. For the post-processing, all data in the training set will be used to train the post-classifier.

Time constraints and the workload requirements for training deep networks made a cross validation for evaluation impractical.

The chosen ground truth for classification consists of major and minor triads for every root note, resulting in a dictionary of 24 + 1 chord labels. Ground truth time-aligned chord symbols are mapped to this major/minor dictionary:

$$\mathrm{Chord}_{\mathrm{majmin}} \subset \{N\} \cup \{S \times \mathrm{maj}, \mathrm{min}\} \tag{7}$$

with S representing the 12 pitch classes (root notes) and N being the label for unknown chords. In the calculation of the detection accuracy, the following chord types are mapped to the corresponding major/minor in the dictionary: triad major/minor and seventh major/minor. Other chord types are treated as unknown chords. For instance, G:maj and G:maj7 are mapped to ‘G:maj’; G:dim and G:6 are all mapped to ‘N’. The histogram of chords in our dataset after such mapping is shown in Fig. 4.

Figure 4. Chords histogram

5.3 Evaluation Metric

The evaluation metric is the same as proposed in the audio chord detection task for MIREX 2013: the Weighted Chord Symbol Recall (WCSR). WCSR is defined as the total duration of segments with correct prediction relative to the total duration, as formulated in Eqn (8):

$$\mathrm{WCSR} = \frac{1}{N} \sum_{k=1}^{n} C_k, \tag{8}$$

in which n is the number of test samples (songs), N is the total number of frames in all test samples, and C_k is the number of frames that are correctly detected in the kth sample.

6. EXPERIMENTS

6.1 Post-classifiers

In this experiment, the network is initialized with pre-training, followed by fine tuning using back propagation. This configuration will be referred to as DN_{DBN-DNN}. No pre-processing is applied to the data; the input is simply the input representation (CQT followed by PCA) as described in Sect. 3.1. The chosen architecture is the bottleneck architecture. Three different classifiers are compared: the maximum of the softmax output (Argmax), an SVM, and an HMM.

The results listed in Table 1 are unambiguous and unsurprising: the HMM with Viterbi decoding outperforms the SVM; using HMMs with a model for transition probabilities is an appropriate approach to chord detection as it models the dynamic properties of chord progressions, which cannot be done with non-dynamic classifiers such as SVMs. One noteworthy result is that the SVM does not
improve the WCSR compared to the direct (Argmax) output of the network. Apparently, the SVM is not able to improve the separability of the learned output features.

Training Scenario   Classifier   WCSR
DN_{DBN-DNN}        Argmax       0.648
DN_{DBN-DNN}        SVM          0.645
DN_{DBN-DNN}        HMM          0.755

Table 1. Chord detection performance using different post-classifiers

6.2 Pre-processing

As stated in Sect. 3.2, we are interested in the application of different filters in the pre-processing stage. In the first experiment (Filtering), an anti-causal single-pole filter (see Eqn (2)) is evaluated with the parameter α set to 0.25, 0.5, and 0.75, respectively. The second experiment (Spliced Filters) splices these filter outputs with the outputs of the extension filters as introduced in Sect. 3.2. These experiments are carried out with the DN_{DBN-DNN} training scenario, a bottleneck architecture, and an HMM classifier. Table 2 lists the results of these pre-processing variants. It can be observed that the network trained with filtered inputs slightly outperforms the network without pre-processing; splicing the filtered input with the extension filter outputs improves the results considerably.

α      Pre-processing    WCSR
0.25   Filtering         0.758
0.25   Spliced Filters   0.912
0.5    Filtering         0.787
0.5    Spliced Filters   0.857
0.75   Filtering         0.798
0.75   Spliced Filters   0.919

Table 2. Chord detection performance using different filter parameters

6.3 Architecture

6.3.1 Common vs. Bottleneck

The results of Grezl et al. indicate that a bottleneck architecture should be more suitable to learn high-level features than a common architecture and should reduce overfitting [8]. In order to verify these characteristics for our task, the performance of both architectures is evaluated in comparison. The results are listed in Table 3 for three pre-processing scenarios: no additional pre-processing (None), Spliced Filters, and spliced filters followed by a max pooling (Pooling). In order to allow conclusions about overfitting, both the WCSR of the test set and the training set are reported. All results are computed for the DN_{DBN-DNN} training scenario with HMM classifiers.

Architecture   Pre-processing    Training WCSR   WCSR
Common         None              0.843           0.703
Bottleneck     None              0.855           0.755
Common         Spliced Filters   0.985           0.876
Bottleneck     Spliced Filters   0.936           0.919
Common         Pooling           0.965           0.875
Bottleneck     Pooling           0.960           0.916

Table 3. Chord detection performance for different architectures and pre-processing steps

The results show that the bottleneck architecture gives significantly better results (p = 0.023) on the test set (WCSR). Note that this is not true for the training set (Training WCSR), for which the common architecture achieves results in the same range as or better than the bottleneck architecture. The difference between the results on the training set and the test set is thus much larger for the common architecture than for the bottleneck architecture. The bottleneck architecture is clearly advantageous to use in this task: it reduces complexity and thus the training workload and increases the classification performance significantly. Furthermore, the comparison of classifier performance between training and test set in Table 3 clearly indicates that the common architecture tends to fit more to the training data, and is thus more prone to overfitting.

6.3.2 Single-Label vs. Multi-Label

As mentioned above, the pitch chroma is the standard feature representation for audio chord detection. Since we use the output of our deep network as feature, it seems an intuitive choice to learn pitch class information (and thus, a pitch chroma) instead of the chord classes. By doing so, the number of outputs is reduced by a factor of two (or higher in the case of more chords), and there would also be a closer relation between the output and the input representation, the CQT. Therefore, the abstraction and complexity of the task might be decreased. It will, however, lead to another issue: the single-label output (one chord per output) will be changed into a multi-label output (multiple pitches per output). Therefore, the learning has to be modified to allow multiple simultaneous (pitch class) labels. The experiment is carried out with both Splicing and Filtering in the pre-processing, the DN_{DBN-DNN} training scenario, and HMM classifiers. Table 4 lists the results.

Learning Targets                  WCSR
Single-Label: 25 Chord Classes    0.919
Multi-Label: 12 Pitch Classes     0.78

Table 4. Chord detection performance for single-label vs. multi-label learning

Boulanger-Lewandowski et al. report combining chroma features with chord labels for their recurrent neural network and report a slightly improved result [2]. They do not, however, provide a detailed description of this combination. As can be seen from the table, the result for multi-label training is clearly lower than the result for single-label training.
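The two kinds of learning target compared in Table 4 can be illustrated with a small Python sketch. It is not the implementation used in this work: the function names, the sharp-only root spelling, the ordering of the 25 class indices, and the reading of "seventh major/minor" as the labels `maj7` and `min7` are assumptions made for illustration. Only the mapping examples (G:maj7 to G:maj, G:dim and G:6 to N) are taken from Sect. 5.2.

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
TRIAD_INTERVALS = {"maj": (0, 4, 7), "min": (0, 3, 7)}  # semitones above the root


def map_to_majmin(label):
    """Map a chord label to the 24 + 1 major/minor dictionary of Eqn (7)."""
    if ":" not in label:
        return "N"
    root, quality = label.split(":", 1)
    if quality in ("maj", "maj7"):
        return root + ":maj"
    if quality in ("min", "min7"):
        return root + ":min"
    return "N"  # all other chord types are treated as unknown


def single_label_target(chord):
    """One-hot vector over 25 classes (assumed order: 12 maj, 12 min, then N)."""
    target = [0] * 25
    if chord == "N":
        target[24] = 1
    else:
        root, quality = chord.split(":")
        target[PITCH_CLASSES.index(root) + (0 if quality == "maj" else 12)] = 1
    return target


def multi_label_target(chord):
    """Binary vector over 12 pitch classes; several entries can be active."""
    target = [0] * 12
    if chord != "N":
        root, quality = chord.split(":")
        for interval in TRIAD_INTERVALS[quality]:
            target[(PITCH_CLASSES.index(root) + interval) % 12] = 1
    return target


chord = map_to_majmin("G:maj7")
print(chord)                            # G:maj
print(sum(single_label_target(chord)))  # 1 active output (one chord)
print(multi_label_target(chord))        # pitch classes D, G, and B are active
```

In the multi-label case, every active pitch class receives the same target value, which carries no information about how strongly each pitch actually sounds in the frame.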
Possible reasons for the lower performance include (i) difficulties with multi-target learning, since it increases the difficulty to train; furthermore, our implementation of multi-label training might be sub-optimal as the same posterior is assigned to each target without any information on the pitch class energy, and (ii) the issue that not all pitches always sound simultaneously in a chord (or might be missing altogether) might have a larger impact on the multi-label training than on the single-label training.

6.4 Results & Discussion

It is challenging to compare the results to previously published results due to varying evaluation methodologies, metrics, and datasets. It seems that the results of Cho and Bello [4], who reported a performance of about 76%, were computed with a comparable dataset. The recent MIREX results on Chord Detection generally show lower accuracy but use a different evaluation vocabulary. In order to provide a baseline result to put results into perspective, we present the results of Chordino [20] with the default settings, computed on our dataset. It should be pointed out that this comparison is unfair as Chordino is able to detect as many as 120 chords, compared to our 24. The label mapping strategies are another significant issue for Chordino. Our label mapping results in nearly one sixth of all labels being “N”, which might have a negative impact on the Chordino results. The Chordino results are mapped to major/minor the same way as the ground truth annotations. The results are shown in Table 5. In the table, the Best Configuration uses the bottleneck architecture, spliced filters (α = 0.75) as pre-processing, single-label learning targets, and Viterbi decoding as post-classifier. The Best Configuration with Max Pooling is the same as the best configuration except that another max pooling layer is applied after the spliced filters. The latter configuration has a much reduced computational workload. The presented results are clearly competitive with existing state-of-the-art systems.

Method                                 WCSR
Chordino                               0.625
Best Configuration                     0.919
Best Configuration with Max Pooling    0.916

Table 5. Comparison of the performance of the best configuration with Chordino

7. CONCLUSION & FUTURE WORK

In this work, we presented a system which applies deep learning to the MIR task of automatic chord detection. Our model is able to learn high-level probabilistic representations for chords across various configurations. We have shown that the use of a bottleneck architecture is advantageous as it reduces overfitting and increases classifier performance, and that the choice of appropriate input filtering and splicing can significantly increase classifier performance. Learning a pitch class vector instead of chord likelihood by incorporating multi-label learning proved to be less successful. The idea has, however, a certain appeal and would allow the number of output nodes to be independent of the number of chords to be detected. It is also conceivable to investigate a different option for the network output: instead of training chords or pitch classes we could, under the assumption that we are only after chords comprised of stacked third intervals, train the output with octave-independent third intervals in a multi-label scenario with 24 output nodes.

8. REFERENCES

[1] Eric Battenberg and David Wessel. Analyzing drum patterns using conditional deep belief networks. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), pages 37–42, 2012.

[2] Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Audio chord recognition with recurrent neural networks. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), pages 335–340, 2013.

[3] Judith C Brown. Calculation of a constant Q spectral transform. The Journal of the Acoustical Society of America, 89(1):425–434, 1991.

[4] Taemin Cho and Juan P Bello. MIREX 2013: Large vocabulary chord recognition system using multi-band features and a multi-stream HMM. Music Information Retrieval Evaluation eXchange (MIREX), 2013.

[5] Taemin Cho, Ron J Weiss, and Juan Pablo Bello. Exploring common variations in state of the art chord recognition systems. In Proceedings of the Sound and Music Computing Conference (SMC), pages 1–8, 2010.

[6] Takuya Fujishima. Realtime chord recognition of musical sound: A system using Common Lisp Music. In Proceedings of the International Computer Music Conference (ICMC), volume 1999, pages 464–467, 1999.

[7] Masataka Goto, Hiroki Hashiguchi, Takuichi Nishimura, and Ryuichi Oka. RWC music database: Popular, classical and jazz music databases. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), volume 2, pages 287–288, 2002.

[8] Frantisek Grezl, Martin Karafiát, Stanislav Kontár, and J Cernocky. Probabilistic and bottle-neck features for LVCSR of meetings. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume 4, pages IV–757. IEEE, 2007.

[9] Philippe Hamel and Douglas Eck. Learning features from music audio with deep belief networks. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), pages 339–344. Utrecht, The Netherlands, 2010.
[10] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6):82–97, 2012.

[11] Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

[12] Eric J Humphrey and Juan Pablo Bello. Rethinking automatic chord recognition with convolutional neural networks. In Proceedings of the International Conference on Machine Learning and Applications (ICMLA), volume 2, pages 357–362. IEEE, 2012.

[13] Eric J Humphrey, Juan Pablo Bello, and Yann LeCun. Moving beyond feature design: Deep architectures and automatic feature learning in music informatics. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), pages 403–408, 2012.

[14] Eric J Humphrey, Taemin Cho, and Juan Pablo Bello. Learning a robust tonnetz-space transform for automatic chord recognition. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 453–456. IEEE, 2012.

[15] Yann LeCun and Yoshua Bengio. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361:310, 1995.

[16] Honglak Lee, Peter Pham, Yan Largman, and Andrew Y Ng. Unsupervised feature learning for audio classification using convolutional deep belief networks. In Advances in Neural Information Processing Systems, 2009.

[17] Kyogu Lee and Malcolm Slaney. Acoustic chord transcription and key extraction from audio using key-dependent HMMs trained on synthesized audio. Audio, Speech, and Language Processing, IEEE Transactions on, 16(2):291–301, 2008.

[18] Thomas M Martinetz, Stanislav G Berkovich, and Klaus J Schulten. ‘Neural-gas’ network for vector quantization and its application to time-series prediction. Neural Networks, IEEE Transactions on, 4(4):558–569, 1993.

[19] Matthias Mauch, Chris Cannam, Matthew Davies, Simon Dixon, Christopher Harte, Sefki Kolozali, Dan Tidhar, and Mark Sandler. OMRAS2 metadata project 2009. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), 2009.

[20] Matthias Mauch and Simon Dixon. Approximate note transcription for the improved identification of difficult chords. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), pages 135–140, 2010.

[21] Yizhao Ni, Matt McVicar, Raul Santos-Rodriguez, and Tijl De Bie. Using hyper-genre training to explore genre information for automatic chord estimation. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), pages 109–114, 2012.

[22] Mohammad Norouzi, Mani Ranjbar, and Greg Mori. Stacks of convolutional restricted Boltzmann machines for shift-invariant feature learning. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), pages 2735–2742. IEEE, 2009.

[23] Hélène Papadopoulos and Geoffroy Peeters. Large-scale study of chord estimation algorithms based on chroma representation and HMM. In Proceedings of the International Workshop on Content-Based Multimedia Indexing (CBMI), pages 53–60. IEEE, 2007.

[24] Christopher Poultney, Sumit Chopra, Yann LeCun, et al. Efficient learning of sparse representations with an energy-based model. In Advances in Neural Information Processing Systems, pages 1137–1144, 2006.

[25] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely. The Kaldi speech recognition toolkit. In Proceedings of the Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, December 2011. IEEE Catalog No.: CFP11SRW-USB.

[26] Alexander Sheh and Daniel PW Ellis. Chord segmentation and recognition using EM-trained hidden Markov models. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), pages 185–191, 2003.

[27] J Sola and J Sevilla. Importance of input data normalization for the application of neural networks to complex industrial problems. Nuclear Science, IEEE Transactions on, 44(3):1464–1468, 1997.

[28] Yushi Ueda, Yuuki Uchiyama, Takuya Nishimoto, Nobutaka Ono, and Shigeki Sagayama. HMM-based approach for automatic chord detection using refined acoustic features. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5518–5521. IEEE, 2010.
