Algorithmic Composition of Melodies With Deep Recurrent Neural Networks
1 Introduction
The algorithmic formalization of musical creativity and composition, foreseen as early as the 19th century [1], has come to fruition in recent decades with the advent of modern computer algorithms [2].
Formally, a melody can be seen as a sample from a potentially very sophisticated probability distribution over sequences of notes [2–5]. For monophonic music, such probability distributions can be given by Markov chains, where the probability of the next note depends only on the current note and the $k$ last notes [4]. Markov chain models, however, do not capture the long-range temporal structure inherent in music. For example, even a simple melody such as Brother John is structured in four patterns, each repeated twice, with the first and last ones starting with the same notes (see Fig. 1). Taking only the last few notes into account is thus not enough to produce the sequence; rather, the progression on the long timescale of bars dictates the sequence of notes.
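To make the Markov-chain baseline concrete, the following minimal sketch (our illustration, not code from the paper; the toy corpus and note names are invented) estimates $k$-th order transition counts and samples a continuation:

```python
import random
from collections import defaultdict

def train_markov(melodies, k=2):
    """Count transitions from the k last notes to the next note."""
    counts = defaultdict(lambda: defaultdict(int))
    for melody in melodies:
        for i in range(k, len(melody)):
            context = tuple(melody[i - k:i])
            counts[context][melody[i]] += 1
    return counts

def sample_melody(counts, seed, length=16):
    """Extend a seed by repeatedly sampling the next note given the context."""
    melody = list(seed)
    k = len(seed)
    for _ in range(length):
        context = tuple(melody[-k:])
        options = counts.get(context)
        if not options:          # unseen context: stop early
            break
        notes, weights = zip(*options.items())
        melody.append(random.choices(notes, weights=weights)[0])
    return melody

# Toy corpus with repeated patterns, loosely in the spirit of Brother John
corpus = [["C", "D", "E", "C", "C", "D", "E", "C", "E", "F", "G"]]
model = train_markov(corpus, k=2)
print(sample_melody(model, seed=["C", "D"]))
```

Such a model reproduces local note-to-note statistics, but, as argued above, no choice of $k$ lets it represent the bar-level repetition structure.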
Such rich temporal structure can be captured by models that rely on recurrent neural networks (RNNs). Particularly well suited to capture these long-range dependencies are gated RNN variants such as the LSTM [6] and the GRU [9], which we introduce in Section 2.
2 Methods
[Fig. 1. Music representation of Brother John: the melody in 4/4 consists of four patterns a, b, c, d, each repeated twice; each pitch (G1, C2, D2, E2, F2, G2, A2) and the song-ending symbol || is encoded as a binary sequence over time.]
Artificial neural networks have a long history in machine learning, artificial intelligence and the cognitive sciences (see [21] for a textbook, [22] for recent advances, and [23] for an in-depth historical overview). Here we give a brief introduction for readers unfamiliar with the topic.
Artificial neural networks are non-linear functions $y = f_w(x)$, where the input $x$, the output $y$ and the parameters (weights) $w$ can be elements of a high-dimensional space. A simple example of an artificial neural network with 2-dimensional inputs and outputs is given by $y_1 = \tanh(w_{11} x_1 + w_{12} x_2)$ and $y_2 = \tanh(w_{21} x_1 + w_{22} x_2)$, which we write in short as $y = \tanh(w x)$. Characteristic of artificial neural networks is that the building blocks consist of a non-linear function $\sigma$ (like $\tanh$) applied to a linear function $w_{11} x_1 + w_{12} x_2$, which is an abstraction of
the operation of biological neurons. If these building blocks are nested, e.g. $y = \sigma_3\big(w_3\,\sigma_2(w_2\,\sigma_1(w_1 x))\big)$, one speaks of multi-layer (or deep) neural networks, with layer-specific weights $(w_1, w_2, \ldots)$ and non-linearities $(\sigma_1, \sigma_2, \ldots)$.
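As an illustration (ours, not the paper's), such a nested network can be written in a few lines of NumPy; layer sizes and weights are chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer-specific weights w1, w2, w3 (sizes chosen arbitrarily for illustration)
w1 = rng.normal(size=(4, 2))   # 2-dim input -> 4 hidden units
w2 = rng.normal(size=(4, 4))
w3 = rng.normal(size=(3, 4))   # -> 3-dim output

def deep_net(x):
    """y = sigma3(w3 sigma2(w2 sigma1(w1 x))) with tanh non-linearities."""
    return np.tanh(w3 @ np.tanh(w2 @ np.tanh(w1 @ x)))

print(deep_net(np.array([0.5, -1.0])))
```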
Deep neural networks are of particular interest for the approximation of high-dimensional and non-linear functions. For example, the function of recognizing object $i$ in photos can be approximated by adjusting the weights such that output $y_i$ is 1 if and only if an image $x$ of object $i$ is given [22]. Formally, the weights can be adjusted to minimize a cost function, like the averaged square loss between target and output, $L(w) = \frac{1}{S} \sum_{s=1}^{S} \| y_s - f_w(x_s) \|^2$, for some known input-output pairs $(x_s, y_s)$. Since artificial neural networks are differentiable in the parameters $w$, this cost function is also differentiable, and the parameters can be adjusted by changing them in the direction of the negative gradient of the cost function, $\Delta w \propto -\nabla_w L(w)$.
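A minimal sketch of this procedure, assuming a single tanh layer and the squared loss above, with the gradient worked out by the chain rule (our illustration with invented toy data):

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=(2, 2))           # weights of y = tanh(w x)

# Known input-output pairs (x_s, y_s): a toy mapping for illustration
xs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
ys = [np.array([0.5, -0.5]), np.array([-0.5, 0.5])]

eta = 0.1                             # learning rate
for step in range(500):
    grad = np.zeros_like(w)
    for x, y_target in zip(xs, ys):
        y = np.tanh(w @ x)
        # chain rule: dL/dw for L = ||y_target - y||^2
        delta = -2.0 * (y_target - y) * (1.0 - y**2)
        grad += np.outer(delta, x)
    grad /= len(xs)
    w -= eta * grad                   # step along the negative gradient

print(np.tanh(w @ xs[0]))             # close to [0.5, -0.5] after training
```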
In recurrent neural networks (RNNs) the inputs and outputs are sequences of arbitrary length and dimension. A simple example of a recurrent neural network with one hidden layer is given by $h[n] = \sigma(w_{xh} x[n] + w_{hh} h[n-1])$ and $y[n] = \sigma(w_{hy} h[n])$, where $x[n]$, $h[n]$, $y[n]$ are the $n$-th elements of the input, hidden and output sequences, respectively. This network is recurrent, since each hidden state $h[n]$ depends on the previous hidden state $h[n-1]$ and, therefore, on all previous input elements $x[1], x[2], \ldots, x[n]$. While such recurrent neural networks can in principle capture long-range temporal dependencies, they are difficult to fit to data by gradient descent, since the gradient involves the recurrent weights $w_{hh}$ raised to high powers, and therefore vanishes or explodes depending on the largest eigenvalue of $w_{hh}$ [6]. This problem can be avoided by a reparametrization of the recurrent neural network (LSTM [6], GRU [9], other variants [20]). In Equation 1 we give the update equations for the GRU used in this study.
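For concreteness, one step of a standard GRU [9] can be sketched as follows; the weight names and sizes here are illustrative, not the paper's notation:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, p):
    """One GRU update h[n-1] -> h[n] for input x[n] (standard form [9])."""
    z = sigmoid(p["wxz"] @ x + p["whz"] @ h_prev + p["bz"])   # update gate
    r = sigmoid(p["wxr"] @ x + p["whr"] @ h_prev + p["br"])   # reset gate
    h_cand = np.tanh(p["wxh"] @ x + p["whh"] @ (r * h_prev))  # candidate state
    return (1.0 - z) * h_prev + z * h_cand                    # gated mixture

# Illustrative sizes: 12-dim input (e.g. a one-hot pitch), 8 hidden units
rng = np.random.default_rng(2)
p = {k: rng.normal(size=s) for k, s in [
    ("wxz", (8, 12)), ("whz", (8, 8)), ("bz", (8,)),
    ("wxr", (8, 12)), ("whr", (8, 8)), ("br", (8,)),
    ("wxh", (8, 12)), ("whh", (8, 8)),
]}
h = np.zeros(8)
x = np.eye(12)[3]                     # one-hot input
h = gru_step(x, h, p)
```

Because the new state is a convex combination of the old state and the candidate, gradients can flow over many steps without being repeatedly multiplied by the recurrent weight matrix.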
2.3 Model
[Fig. 2. Model architecture (panels A and B): multi-layer RNNs (3 layers) receive the current pitch and duration as input and output distributions over the next pitch and duration.]
The update equations for the vector of layer activations $h^i[n]$, update gates $z^i[n]$ and reset gates $r^i[n]$ at note $n$ for layer $i \in \{1, 2, 3\}$ are given by Eqs. 1–5.
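Written out in the standard GRU form [9], with $y^i[n]$ denoting the input to layer $i$ (the note input for $i = 1$, the activation of the layer below otherwise), the updates read approximately as follows; this is a reconstruction in standard notation, and the exact parametrization of Eqs. 1–5 may differ in detail:

\begin{align}
  z^i[n] &= \sigma\big(w^{y^i z^i} y^i[n] + w^{h^i z^i} h^i[n-1] + b^i_z\big),\\
  r^i[n] &= \sigma\big(w^{y^i r^i} y^i[n] + w^{h^i r^i} h^i[n-1] + b^i_r\big),\\
  \bar{h}^i[n] &= \tanh\big(w^{y^i h^i} y^i[n] + r^i[n] \odot (w^{h^i h^i} h^i[n-1])\big),\\
  h^i[n] &= (1 - z^i[n]) \odot h^i[n-1] + z^i[n] \odot \bar{h}^i[n],
\end{align}

where $\sigma$ is the logistic sigmoid and $\odot$ denotes the element-wise product.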
During training, the log-likelihoods of the model parameters $\theta$ for the rhythm and melody networks are optimized separately by stochastic gradient ascent with adaptive learning rate [24] ($\alpha = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$). The log-likelihood of the model parameters $\theta$ given the training songs is

\[
  L(\theta) = \frac{1}{S} \sum_{s=1}^{S} \frac{1}{N_s - 1} \sum_{n=1}^{N_s - 1}
  \log \Pr\big(x^s_j[n+1] = 1 \mid \text{previous notes and } \theta\big), \tag{7}
\]
where $S$ is the total number of songs, $N_s$ the length of song $s$, $j$ the index of the component that is actually active at note $n+1$, and $x^s[n]$ is the duration vector $d^s[n]$ for the rhythm network and the pitch vector $p^s[n]$ for the melody network. The trainable model parameters $\theta$ are the connection matrices $w^{ab}$ for $a \in \{y^i, h^i\}$, $b \in \{h^i, z^i, r^i\}$ and $w^{y^o o}$, the gate biases $b^i_z$ and $b^i_r$, the output unit biases $b^o$ and the initial states of the hidden units $h^i[0]$.
Networks are trained on 80% of the songs of the musical corpus and tested on the remaining 20%. A training epoch consists of optimizing the model parameters on each of 200 randomly selected melodies from the training set, where parameters are updated after each song. After each epoch, the model performance is evaluated on a random sample of 200 melodies from the testing and the training set. The parameters that maximize the likelihood on unseen data from the testing set are saved and used as final parameters.
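Put together, the schedule can be sketched as follows; `model` with its `grad_log_likelihood`, `adam_update`, `log_likelihood` and `get_params` methods is a hypothetical interface, not the paper's code:

```python
import random
import numpy as np

def train(model, train_songs, test_songs, epochs=50):
    """Sketch: stochastic gradient ascent with Adam, early stopping."""
    best_test_ll, best_params = -np.inf, model.get_params()
    for epoch in range(epochs):
        # One epoch: 200 randomly selected training melodies,
        # with a parameter update after each song.
        for song in random.sample(train_songs, 200):
            grad = model.grad_log_likelihood(song)      # gradient of Eq. 7
            model.adam_update(grad, alpha=1e-3, beta1=0.9,
                              beta2=0.999, eps=1e-8)
        # Evaluate on 200 random melodies from the testing set.
        test_ll = np.mean([model.log_likelihood(s)
                           for s in random.sample(test_songs, 200)])
        if test_ll > best_test_ll:    # keep the best generalizing parameters
            best_test_ll, best_params = test_ll, model.get_params()
    return best_params
```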
Melody generation is achieved by closing the output-input loop of Eqs. 1–5: at each time step, a duration and a pitch are sampled from the respective output distributions and used as inputs in the next time step.
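A closed-loop sampling sketch (our illustration; `rhythm_net.step` and `melody_net.step` are hypothetical interfaces returning the output probability vectors):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate(rhythm_net, melody_net, first_notes, n_notes=64):
    """Generate a song as a list of (duration, pitch) index pairs."""
    song = list(first_notes)             # e.g. two manually set notes
    for _ in range(n_notes):
        duration, pitch = song[-1]
        p_dur = rhythm_net.step(duration, pitch)      # distribution over durations
        next_dur = rng.choice(len(p_dur), p=p_dur)    # sample the duration first
        # the pitch distribution is conditioned on the sampled duration
        p_pitch = melody_net.step(duration, pitch, next_dur)
        next_pitch = rng.choice(len(p_pitch), p=p_pitch)
        song.append((next_dur, next_pitch))           # feed back as next input
    return song
```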
3 Results
[Fig. 3. A Relation between Duration[n] and Duration[n+1]; B relation between Pitch[n] and Pitch[n+1]; values on a scale from 0.0 to 1.0.]
The analysis of the relation between the pitch and duration features revealed that they are dependent, as expected from music theory. Therefore, we explicitly modeled the distribution over upcoming pitches as depending on the upcoming duration (dashed line in Fig. 2A), effectively factorizing the joint distribution over note duration and pitch into a product of conditional probabilities.
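Written out (our notation, with $d$ the duration and $p$ the pitch), this factorization reads

\[
  \Pr\big(d[n+1], p[n+1] \mid \text{past}\big)
  = \Pr\big(d[n+1] \mid \text{past}\big)\,
    \Pr\big(p[n+1] \mid d[n+1], \text{past}\big),
\]

where the first factor is modeled by the rhythm network and the second by the melody network.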
To study song continuations, we present as input to the trained model the beginnings of previously unseen songs (the seed) and observe several continuations that our model produces (see Fig. 4). From a rhythmical point of view, it is interesting to notice that, even though the model had no built-in notion of bars, the metric structure was preserved in the generated continuations. Analyzing the continuations in Fig. 4A, we see that the A pentatonic scale that characterizes the seed is maintained everywhere except in continuation number 2. This is noteworthy, since the model is trained on a dataset that does not contain only music based on the pentatonic scale. Moreover, the rhythmic patterns of the seed are largely maintained in the continuations. Rhythmical patterns that are extraneous to the seed are also generated. For example, the pattern $n_1$, which is the inversion of pattern a, can be observed several times. Importantly, the alternating structure of the seed (abacabad) is not present in the generated continuations. This indicates that the model is not able to capture this level of hierarchy. While less interesting than the first example from the rhythmical point of view, the seed presented in Fig. 4B is clearly divided into a tonic area and a dominant area. The different continuations are coherent in the sense that they also alternate between these two areas, while avoiding the subdominant area almost everywhere.
Fig. 4. Example of melody continuation and analysis: A The first line (seed) is presented as input to the model. The next five lines are five different possible continuations generated by our model. Rhythmical patterns present in the seed are labeled a, b, c and d. The label $n_1$ points at a new rhythmical pattern generated by the model. Unlabeled bars show other novel rhythmical patterns that appear only once. B A second example of song continuation, with analysis of tonal areas.
Here, we assume that the model is able to learn the temporal dependencies in the musical examples over all timescales, and we use it to autonomously generate new pieces of music according to those dependencies. For the results presented here, we manually set the first two notes of the melody before generating notes by sampling from the output distributions. This step can be automated but is not pursued here.
Fig. 5. Example of an autonomously generated Irish tune: See text for details. In this example a coherent temporary modulation to A minor was generated (bar 26). Worth noting are the passages at bar 13 and bar 23 that are reminiscent of the beginning.
4 Discussion
The approach presented here allows the autonomous composition of new and complete musical sequences. The generated songs exhibit a coherent metrical structure, in some cases temporary modulations to related keys, and are in general pleasant to hear.
Using RNNs for algorithmic composition makes it possible to overcome the limitations of Markov chains in learning the long-range temporal dependencies of music [18]. A different class of models, artificial grammars, is naturally suited to generate such long-range dependencies due to its hierarchical structure. However, artificial grammars have proven to be much harder to learn from data than RNNs [25]. Therefore, researchers and composers usually define their own production rules in order to generate music [2]. Attempts have been made to infer context-free grammars [26], but applications to music are restricted to simple cases [27].
Evolutionary algorithms constitute another class of approaches that has been very popular in algorithmic composition [2]. They require the definition of a fitness function, i.e. a measure of the quality of musical compositions (the individuals, in the context of evolutionary algorithms). Based on the fitness function, the generative process corresponding to the best individuals is favoured and can undergo random mutations. This type of optimization is not convenient for generative models for which gradient information is available, like neural networks, but it can be used in rule-based generative models for which training is hard [28]. A common problem with evolutionary algorithms in music composition is that the definition of the fitness function is arbitrary and requires some kind of evaluation of musical quality. This has resulted in fitness functions that are often very specific to a certain style. However, similarly to neural networks, fitness functions can be defined based on the statistical similarity of the individuals to a target dataset of compositions [29].
Because of the ease of fitting these models to data, as well as their expressiveness, learning algorithmic composition with recurrent neural networks seems promising. However, further quantitative evaluations are desirable. We see two quite different approaches for further evaluation. First, following standard practice in machine learning, our model could be compared to other approaches in terms of generalization, measured as the conditional probability of held-out test data (songs) of similar style. Second, the generated songs could be evaluated by humans in a Turing test setting or with a questionnaire following the SPECS methodology [30]. Another interesting direction for future work would be to fit a model to a corpus of polyphonic music and to examine the influence of different representations.
References
1. Ada Lovelace. Notes on L. Menabrea's "Sketch of the Analytical Engine Invented by Charles Babbage, Esq.". Taylor's Scientific Memoirs, 3, 1843.