Garimella Thesis
by
Baltimore, Maryland
July, 2012
Abstract
Artificial Neural Networks (ANNs) have been widely used in a variety of
speech processing applications. They can be used either in a classification or regression mode. Proper regularization techniques are necessary when training these
networks, especially in scenarios where the amount of training data is limited or the
number of layers in a network is large. In this thesis, we explore alternative regularized feed-forward neural network architectures and propose learning algorithms for
speech processing applications such as phoneme recognition and speaker verification.
In a conventional hybrid phoneme recognition system, a multilayer perceptron (MLP) with a single hidden layer is trained on standard acoustic features to
provide the estimates of posterior probabilities of phonemes. These estimates are
used for decoding the underlying phoneme sequence. In this thesis, we introduce a
sparse multilayer perceptron (SMLP) which jointly learns an internal sparse feature
representation and nonlinear classifier boundaries to discriminate multiple phoneme
classes. This is achieved by adding a sparse regularization term to the original cross-entropy cost function. Instead of an MLP, the SMLP is used in a hybrid phoneme recognition
system. Experiments are conducted to test various feature representations, including the proposed data-driven discriminative spectro-temporal features. Significant
improvements are obtained using these techniques.
Another application where neural networks are used is in speaker verification. Auto-Associative Neural Network (AANN) is a fully connected feed-forward
neural network, trained to reconstruct its input at its output through a hidden compression layer. AANNs are used to model speakers in speaker verification, where a
speaker-specific AANN model is obtained by adapting (or retraining) the Universal
Background Model (UBM) AANN, an AANN trained on multiple held out speakers, using corresponding speaker data. When the amount of speaker data is limited,
this procedure may lead to overfitting, as all the parameters of the UBM-AANN are being adapted. To alleviate this problem, we regularize the parameters of the AANN by
developing subspace methods, namely weighted least squares (WLS) and factor analysis (FA). Experimental results show the effectiveness of the subspace methods over
directly adapting a UBM-AANN for speaker verification.
Thesis Committee
Hynek Hermansky, Sanjeev Khudanpur, Trac Tran, Daniel Povey, Nelson Morgan.
Acknowledgments
I owe my deepest gratitude to my supervisor Prof. Hynek Hermansky, whose
insight, encouragement and guidance made this work possible. It is an honor for me
to work with him.
I am indebted to Andrea Ridolfi, Sanjeev Khudanpur, Trac Tran, Donniell
Fishkind, Carey Priebe, Rene Vidal and Daniel Povey for offering graduate level
courses that enabled me to learn and appreciate mathematical rigour. I would like to
thank my internship hosts Pedro Moreno and Olivier Siohan at Google for providing me
the opportunity to gain valuable research experience.
I am grateful to Joel Pinto, Nima Mesgarani, Sridhar Krishna Nemala, Sriram Ganapathy, Balakrishnan Varadarajan and Samuel Thomas for the scientific
collaboration, and also several colleagues and friends for their support during the
PhD.
Finally, I would like to thank my mother and sisters for their infinite love
and support.
Dedication
This thesis is dedicated to my mother.
Contents

Abstract
Acknowledgments
List of Tables
List of Figures

1 Introduction
  1.1 Multilayer Perceptron
  1.2
  1.3 Contributions
  1.4

2
  2.1 Chapter Outline
  2.2 Background
  2.3
    2.3.1
    2.3.2 Feature Extraction
  2.4
    2.4.1 Theory of SMLP (Cost Function; Gradient of L w.r.t. w_ij; Update Equations)
    2.4.2
  2.5 System Description
    2.5.1 Database
    2.5.2 Feature Streams
    2.5.3 Hierarchical SMLP
    2.5.4 Dempster-Shafer Combination
    2.5.5
  2.6 Experimental Results
  2.7 Analysis

3
  3.1 Chapter Outline
  3.2
  3.3
    3.3.1 Feature Extraction
    3.3.2 UBM-AANN
    3.3.3 Speaker-Specific AANN
    3.3.4 Score Computation
  3.4
  3.5
    3.5.1
    3.5.2 T-matrix Training
    3.5.3 i-vectors
    3.5.4 PLDA training
    3.5.5 Hypothesis Testing
  3.6 Experimental Results

4 Factor Analysis of Auto-Associative Neural Networks for Speaker Verification
  4.1 Chapter Outline
  4.2
  4.3
    4.3.1 Statistics
    4.3.2 T-matrix Training
    4.3.3 i-vectors
  4.4 Experimental Results
    4.4.1

5 Conclusions
  5.1 Conclusions
  5.2 Future Work

Appendix A

Bibliography
List of Tables

2.1 PER (in %) on TIMIT test set for various acoustic feature streams using a hierarchy of multilayer perceptrons. The last column indicates the results of feature stream combination at the hierarchical posterior level using the Dempster-Shafer theory of evidence.
2.2 Average measure of sparsity of the first hidden layer outputs of SMLP and four layer MLP for various feature streams.
2.3 PER (in %) on TIMIT test set for various acoustic feature streams when the 3-state phoneme posteriors are obtained using a single layer perceptron trained on the first hidden layer outputs of SMLP or MLP classifier.
2.4 PER (in %) on TIMIT test set using a single multilayer perceptron (without hierarchy).
3.1 Description of various telephone conditions of NIST-08.
3.2 EER in % and minDCF x 10^3 (shown in brackets) on conditions C6, C7 and C8 of NIST-08.
4.1 EER in % and minDCF x 10^3 (shown in brackets) on conditions C6, C7 and C8 of NIST-08.
4.2 Comparison with the state-of-the-art GMM based i-vector/PLDA system. EER in % and minDCF x 10^3 (shown in brackets) on conditions C6, C7 and C8 of NIST-08.
List of Figures

1.1
1.2
2.1
2.2
2.3
2.4 Block diagram of hierarchical SMLP.
3.1 Block schematic of the AANN based speaker verification system.
3.2
3.3
4.1 Block schematic of the proposed FA based AANN speaker verification system.
Chapter 1
Introduction
Artificial Neural Networks (ANNs) are used in many speech processing applications
such as speech activity detection (SAD) [1], keyword spotting (KS) [2, 3], automatic speech
recognition (ASR) [4-8], speaker verification (SV) [9-11] and language identification (LID)
[12]. Most of these applications use a feed-forward multilayer perceptron (MLP), which
is described in section 1.1.
1.1
Multilayer Perceptron
An MLP is a fully connected feed-forward neural network with multiple hidden layers,
depicted in Fig. 1.1. Each node in any layer (except the output layer) is connected to every
node in the subsequent layer using a set of weights. The output of any node is obtained by
applying a specified transformation (linear or non-linear) to the weighted combination of the previous layer outputs plus a node-specific bias value. Thus an MLP transforms
inputs to outputs using a set of weights and biases (parameters).
Figure 1.1: A fully connected multilayer perceptron with an input layer, multiple hidden layers, and an output layer.
Let $\{W_1, W_2, \ldots, W_{m-1}\}$ and $\{b_1, b_2, \ldots, b_{m-1}\}$ be the set of weight matrices and bias vectors respectively representing the parameters of an MLP, where $W_i$ indicates the weights connecting the $i$th layer and the $(i+1)$th layer, and $b_i$ indicates the bias vector of the $(i+1)$th layer. Let the (element-wise) non-linearity applied at the $i$th layer be $\phi_i$. Typically, a sigmoid non-linearity is used at the hidden layers, and a linear or softmax non-linearity is applied at the output layer. For an input vector $f$, the MLP output vector $o(f)$ can be expressed as

$$o(f) = \phi_m\Big(W_{m-1} \ldots \phi_3\big(W_2\, \phi_2(W_1 f + b_1) + b_2\big) \ldots + b_{m-1}\Big).$$
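To make the composition above concrete, the following sketch (NumPy, with hypothetical layer sizes and random parameters) implements the forward pass of an $m$-layer MLP with sigmoid hidden layers and a softmax output, mirroring the expression for $o(f)$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))          # subtract max for numerical stability
    return e / e.sum()

def mlp_forward(f, weights, biases):
    """Forward pass o(f) = phi_m(W_{m-1}(... phi_2(W_1 f + b_1) ...) + b_{m-1}).

    weights[i], biases[i] connect layer i+1 to layer i+2 (0-based);
    sigmoid is used at the hidden layers and softmax at the output layer.
    """
    y = f
    for W, b in zip(weights[:-1], biases[:-1]):
        y = sigmoid(W @ y + b)                        # hidden layers
    return softmax(weights[-1] @ y + biases[-1])      # output layer

# Example with arbitrary sizes: 39-dim input, two hidden layers, 10 classes.
rng = np.random.default_rng(0)
sizes = [39, 100, 100, 10]
Ws = [0.1 * rng.standard_normal((n_out, n_in))
      for n_in, n_out in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(n_out) for n_out in sizes[1:]]
o = mlp_forward(rng.standard_normal(39), Ws, bs)
print(o.shape, o.sum())    # (10,) and outputs summing to 1
```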
This output must be close to the desired vector $d(f)$, which depends on whether
the task is classification or regression. In either case, the MLP is trained to minimize the
discrepancy between its output vector $o(f)$ and the desired vector $d(f)$ over the training
data. Due to the highly non-linear dependency between input and output, the MLP is trained
using stochastic gradient descent. The gradient can be computed efficiently using the
error back-propagation algorithm. However, proper regularization of the parameters of an
MLP is desirable when the amount of training data is limited or the number of layers in
a network is large. Such alternative regularized neural network architectures require
learning algorithms different from existing methods. The focus of this thesis is to develop
such algorithms and apply them to applications such as speech and speaker recognition.
1.2
applicable to other applications as well. For instance, an MLP can be trained to extract data-driven discriminative features. These features are used in ASR as described in the TANDEM
framework [5]. The subspace methods developed in this thesis may form the basis for
adapting MLPs using limited amounts of adaptation data. The underlying principles seem
to be attractive for speaker-specific adaptation of neural network based acoustic models for
ASR. Additionally, neural network speaker verification systems based on subspace methods
yield performance comparable to the state-of-the-art systems. This may facilitate further
research on the application of neural networks in speaker verification.
1.3
Contributions
The contributions of this thesis are :
1.4
The block diagram depicting the organization of this thesis is shown in Fig. 1.2. The core of
this thesis is to develop learning algorithms for training various regularized neural networks
(MLPs). A brief description summarizing the focus of various chapters is provided below.
Figure 1.2: Organization of the thesis. Chapter 2: phoneme recognition with an SMLP classifier; Chapter 3: speaker verification with WLS adaptation of AANN regression models; Chapter 4: speaker verification with FA adaptation of AANN regression models.
In chapter 2, a phoneme recognition system built using the SMLP and various feature representations is shown to perform
significantly better than the MLP counterpart.
In chapter 4, all adaptation parameters of AANNs are regularized by restricting them to a common low-dimensional subspace using factor analysis (FA) technique. A
speaker verification system based on FA yields better results than the WLS based AANN
speaker verification system developed in chapter 3. Our experiments also show that this
technique yields an order of magnitude better performance than the existing neural network
based speaker verification system.
Finally, the conclusions of this thesis, along with future work, are provided in
chapter 5.
Chapter 2
2.1 Chapter Outline
In a conventional hybrid phoneme recognition system, shown in Fig. 2.1, a multi-
layer perceptron (MLP) with a single hidden layer is trained on standard acoustic features to
estimate the posterior probabilities of phonemes [4]. These estimates are used for decoding
the underlying phoneme sequence. First, we propose to derive a spectro-temporal feature
representation by applying multiple linear discriminant analysis (MLDA) technique. Second, we introduce a sparse multilayer perceptron (SMLP) which jointly learns an internal
sparse feature representation and nonlinear classifier boundaries to estimate the phoneme
posterior probabilities. Experimental results show that the proposed techniques improve
the phoneme error rate (PER).
Figure 2.1: Block diagram of a hybrid phoneme recognition system: feature extraction from speech, MLP estimation of phoneme posteriors from acoustic features, and hybrid HMM decoding of the phoneme sequence.
2.2
Background
The task of phoneme recognition is to convert a speech waveform into a sequence
of underlying sound units known as phonemes. The block diagram of a hybrid phoneme
recognition system consisting of feature extraction, MLP and hybrid hidden Markov model
(HMM) decoding is shown in Fig. 2.1.

The purpose of feature extraction is to discard information irrelevant for performing the task. Features are usually extracted from a two-dimensional representation of
speech such as the spectrogram. Depending on the manner in which features are derived, they
can be broadly classified into spectral [13, 14], temporal [15-17] or spectro-temporal [18-20]
features.
These features are used as input to train an MLP. It has been shown [21] that
an MLP, trained to minimize the cross-entropy cost between its outputs and hard targets1 using
a sufficient amount of data, estimates the posterior probabilities of output classes conditioned
on the input feature vector in a discriminative manner. This has led to an extensive use of
MLPs in speech recognition systems.

1 A hard target vector consists of all zeros except a one at the index corresponding to the phoneme to
which the current input feature vector belongs.
2.3
increased research effort in deriving features that explicitly capture these dynamics [18-20, 28]. Such an approach is primarily motivated by the spectro-temporal receptive
field (STRF) model for predicting the response of a cortical neuron to the input speech,
where the STRF describes the two-dimensional spectro-temporal pattern to which the neuron
is most responsive [29].
Most of the works so far have used parametric two-dimensional Gabor filters
for extracting features. The parameters of the Gabor functions are optimized to improve
the recognition accuracy [18], [28] or grouped into low and high frequency modulations to
form various streams of information [19]. Even though multiple spectro-temporal feature
streams were formed and combined using MLPs in [19], it is difficult to interpret what
each feature stream is trying to achieve. We propose to extract features using a set of two-dimensional filters to discriminate each phoneme from the rest of the phonemes, as described
below [20, 30].
Figure 2.2: Steps in deriving the spectro-temporal representation of speech: short-time Fourier transform (STFT), Bark frequency warping, and logarithmic compression.
2.3.1
for both learning the two-dimensional filter shapes and extracting the features. Fig. 2.2
shows the steps involved in extracting such a representation. This representation is obtained
by first performing a Short Time Fourier Transform (STFT) on the speech signal with an
analysis window of length 25 ms and a frame shift of 10 ms. The magnitude square values
of the STFT output in each window are then projected on to a set of overlapping positive
weight vectors in such a way that their centers are equally spaced on the Bark frequency
scale to obtain the spectral energies in various critical bands. Finally, the spectro-temporal
representation is obtained by applying a logarithm on critical band energies.
In order to derive 2-D filter shapes, we ask the question: what are the directions
(patterns) along which the spectro-temporal patterns of a phoneme are well separated from
those of the rest of the phonemes? Fisher Linear Discriminant Analysis (FLDA) gives only
one optimal discriminating pattern for each phoneme (two-class problem) due to the rank
limitation of its between-class scatter matrix. The resultant low-dimensional feature space
(projections of a spectro-temporal pattern on these discriminating patterns) may hinder
classification performance.
We apply MLDA to learn multiple discriminating patterns that discriminate the spectro-temporal patterns of a phoneme from those of the rest of the phonemes. MLDA features
are obtained by projecting a spectro-temporal patch onto these discriminating patterns as
described in the section below. For instance, 13 discriminating patterns per phoneme would
yield a feature vector of length 13 × 39 = 507. Discriminating patterns can be interpreted
as spectro-temporal filters by flipping them in time around the center.
2.3.2
Feature Extraction
If $S(n, k)$ denotes the spectro-temporal representation of the speech, and $h(n, k)$ characterizes a discriminating pattern, then the corresponding feature $f(n)$ at a particular time frame $n$ is given by

$$f(n) = \sum_{i=-N}^{N} \sum_{k=1}^{K} h(i, k)\, S(n + i, k), \qquad (2.1)$$

where $i$ and $k$ denote the discrete time index (due to the 10 ms shift) and the critical band
index respectively. $K$ represents the total number of critical bands, while the temporal
extent (context) of the 2-D filter is given by $2N + 1$.
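As an illustration of (2.1), the sketch below (NumPy, with placeholder filter and spectrogram values) computes one MLDA feature trajectory by correlating a discriminating pattern $h(i, k)$ with the log critical-band spectrogram $S(n, k)$.

```python
import numpy as np

def mlda_feature(S, h):
    """Compute f(n) = sum_{i=-N..N} sum_{k=1..K} h(i, k) S(n + i, k)  (Eq. 2.1).

    S: (T, K) log critical-band energies (T frames, K critical bands).
    h: (2N+1, K) discriminating pattern covering a 2N+1 frame context.
    Returns a length-T array; frames whose context falls outside S are left at zero.
    """
    T, K = S.shape
    ctx, Kh = h.shape
    assert Kh == K and ctx % 2 == 1
    N = ctx // 2
    f = np.zeros(T)
    for n in range(N, T - N):
        f[n] = np.sum(h * S[n - N:n + N + 1, :])
    return f

# Toy example: 100 frames, 15 critical bands, a 9-frame (~90 ms) filter.
rng = np.random.default_rng(1)
S = rng.standard_normal((100, 15))
h = rng.standard_normal((9, 15))
print(mlda_feature(S, h)[:5])
```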
2.4
cortex [34, 35], and recently, many pattern classification applications have made use of
sparse signal representations [36-42]. Most of these methods treat the sparse representation
as features and train an additional classifier for making decisions. However, in only a few
instances have sparse representations been optimized in conjunction with the classifier for
discriminative classification. Some of the previous works have attempted to address this
issue. For example, a two-class classification problem with a linear or bilinear classifier has
been considered in [39]. In a different work, Fisher's linear discrimination criterion with
sparsity is used [36].
We propose to jointly learn both sparse features and nonlinear classifier boundaries that best discriminate multiple output classes. Specifically, we propose to learn sparse
features at the output of a hidden layer of an MLP trained to discriminate multiple output
classes. This is achieved by adding a sparse regularization term to the conventional cross-entropy cost between the target values and their predicted values at the output layer. The
parameters of the MLP are learned to minimize the joint cost using the standard back-propagation algorithm, which takes the additional sparse regularization term into consideration. The resultant model is referred to as the sparse multilayer perceptron (SMLP). Further, under certain conditions, described in section 2.4.2, the SMLP estimates the Bayesian
a posteriori probabilities of the output classes conditioned on the sparse representation.
2.4.1
Theory of SMLP
The notations used are as follows.
The goal of an SMLP classifier is to jointly learn sparse features at the output of its
pth layer and estimate posterior probabilities of multiple classes at its output layer. In the
case of MLP, estimates of the posterior probabilities are typically obtained by minimizing
the cross-entropy cost between the output layer values (after the softmax) and the hard
targets. We modify this cost function for SMLP as follows.
Cost Function

The two objectives of the SMLP are to (i) minimize the cross-entropy cost between the output layer values and the hard targets, and (ii) force the outputs of the $p$th layer to be sparse for a particular $p \in \{2, 3, \ldots, m-1\}$.

The instantaneous cross-entropy cost is

$$L = -\sum_{j=1}^{N_m} d_j \log y_j^m. \qquad (2.2)$$

The SMLP cost function augments (2.2) with a sparse regularization term on the $p$th layer outputs:

$$\tilde{L} = L + \frac{\lambda}{2} \sum_{j=1}^{N_p} \log\big(1 + (y_j^p)^2\big), \qquad (2.3)$$

where $\lambda$ is a positive scalar controlling the trade-off between the sparsity and the cross-entropy cost. The function $\sum_{j=1}^{N_p} \log\big(1 + (y_j^p)^2\big)$, which is continuous and differentiable everywhere, was successfully used in previous works to obtain a sparse representation [34, 37, 42]. The weights of the SMLP are adjusted to minimize (2.3), as discussed below.
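The modified cost in (2.3) can be evaluated as in the minimal sketch below (NumPy; the layer outputs, targets and the value of lam are placeholders, and the 0.5 factor follows the reconstruction of (2.3) given above).

```python
import numpy as np

def smlp_cost(y_out, d, y_p, lam):
    """Instantaneous SMLP cost (2.3): cross-entropy plus sparse penalty.

    y_out: softmax outputs of the last layer, d: hard (one-hot) targets,
    y_p:   outputs of the pth layer (the layer forced to be sparse),
    lam:   trade-off between sparsity and the cross-entropy cost.
    """
    cross_entropy = -np.sum(d * np.log(y_out + 1e-12))        # Eq. (2.2)
    sparse_penalty = 0.5 * lam * np.sum(np.log1p(y_p ** 2))   # regularizer in (2.3)
    return cross_entropy + sparse_penalty

d = np.array([0.0, 1.0, 0.0])
y_out = np.array([0.2, 0.7, 0.1])
y_p = np.array([0.01, 1.3, -0.02, 0.4])
print(smlp_cost(y_out, d, y_p, lam=0.1))
```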
For training the SMLP, the error back-propagation algorithm must be modified in order to accommodate
the additional sparse regularization term.
We derive update equations for training the SMLP by minimizing the cost function (2.3) with respect to the weights4 over the training data. Since the learning is based on
stochastic gradient descent, the key is to determine the gradient of the cost function (2.3)
with respect to the weights.
For the layers above the $p$th layer, the sparse regularization term does not contribute, so

$$\frac{\partial \tilde{L}}{\partial y_j^l} = \frac{\partial L}{\partial y_j^l}, \quad l \in \{p+1, \ldots, m\}, \qquad (2.4)$$

while at the $p$th layer the sparse term contributes an additional term:

$$\frac{\partial \tilde{L}}{\partial y_j^p} = \frac{\partial L}{\partial y_j^p} + \lambda\, \frac{y_j^p}{1 + (y_j^p)^2}. \qquad (2.5)$$

Using (2.3) and the chain rule of calculus, for $(l-1) \in \{2, 3, \ldots, p-1\}$ and $i \in \{1, 2, \ldots, N_{l-1}\}$,

$$\frac{\partial \tilde{L}}{\partial y_i^{l-1}} = \sum_{j=1}^{N_l} \frac{\partial \tilde{L}}{\partial y_j^l}\, \frac{\partial y_j^l}{\partial x_j^l}\, \frac{\partial x_j^l}{\partial y_i^{l-1}} = \sum_{j=1}^{N_l} \frac{\partial \tilde{L}}{\partial y_j^l}\, \phi_l'(x_j^l)\, w_{ij}^{l-1}. \qquad (2.6)$$
The above equations (2.4), (2.5) and (2.6) indicate that the gradients of $\tilde{L}$ w.r.t. the outputs of a layer
can be computed from the gradients of $\tilde{L}$ w.r.t. the outputs of the subsequent layer. Specifically, we need the gradients of $\tilde{L}$
w.r.t. $y_j^l$, $l \in \{p, p+1, \ldots, m\}$, $j \in \{1, 2, \ldots, N_l\}$, in order to compute the gradients of $\tilde{L}$ w.r.t. the weights.

4 The bias values at any layer can be interpreted as weights connecting an imaginary node in the previous
layer, with its output being unity, and all the nodes in the current layer.
Gradient of $\tilde{L}$ w.r.t. $w_{ij}^{l-1}$

By definition, $w_{ij}^{l-1}$ denotes the weight connecting the $i$th neuron in the $(l-1)$th layer and the $j$th neuron in the $l$th layer. Its gradient is

$$\frac{\partial \tilde{L}}{\partial w_{ij}^{l-1}} = \frac{\partial \tilde{L}}{\partial y_j^l}\, \frac{\partial y_j^l}{\partial x_j^l}\, \frac{\partial x_j^l}{\partial w_{ij}^{l-1}} = \frac{\partial \tilde{L}}{\partial y_j^l}\, \phi_l'(x_j^l)\, y_i^{l-1}. \qquad (2.7)$$
Update Equations

SMLP weights are updated using stochastic gradient descent. The gradient (2.7) of the cost function with respect to a particular weight is accumulated over several input patterns (known as the bunch size), and then the weight is updated using

$$w_{ij}^{l-1} \leftarrow w_{ij}^{l-1} - \eta\, \Big\langle \frac{\partial \tilde{L}}{\partial w_{ij}^{l-1}} \Big\rangle, \qquad (2.8)$$

where $\eta$ is the learning rate and $\langle \cdot \rangle$ denotes the gradient accumulated over the bunch.
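In code, the only change to standard back-propagation is the extra term $\lambda\, y/(1+y^2)$ added to the error signal at the $p$th layer, after which the usual chain rule and the update (2.8) apply. A minimal sketch follows (NumPy; shapes, the averaging over the bunch, and the value of lam are illustrative assumptions).

```python
import numpy as np

def add_sparse_term(dL_dyp, y_p, lam):
    """Gradient of the cost (2.3) w.r.t. the pth layer outputs (Eq. 2.5):
    the cross-entropy gradient plus lam * y / (1 + y^2) from the sparse term."""
    return dL_dyp + lam * y_p / (1.0 + y_p ** 2)

def sgd_update(W, grad_accum, bunch_size, eta):
    """Weight update (2.8): step against the gradient accumulated over a bunch
    (averaged here; plain accumulation only rescales the learning rate)."""
    return W - eta * grad_accum / bunch_size

# Toy example.
y_p = np.array([0.9, -0.1, 0.0, 2.0])
dL_dyp = np.array([0.3, -0.2, 0.1, 0.0])
print(add_sparse_term(dL_dyp, y_p, lam=0.1))
```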
2.4.2

The number of nodes in the input and output layers of the SMLP are set equal to the dimensionality of its input acoustic feature vector and the number of output phoneme classes,
respectively. A softmax nonlinearity is used at its output layer, and the weights are adjusted to
minimize (2.3) when the hard targets are being used. Note from equations (2.4), (2.5),
(2.6) and (2.7) that the sparse regularization term affects the update of only those weights
lying between the input layer and the $p$th layer; the weights of the layers above the $p$th layer can
be adjusted to minimize the cross-entropy term of (2.3) without affecting the sparse regularization term. If $p < m-1$ and one of the hidden layers between the $p$th and $m$th layers
is sigmoidal (nonlinear), then the $p$th layer outputs can be nonlinearly transformed to the
SMLP outputs. Therefore, in such a case, the SMLP estimates the posterior probabilities of
output classes conditioned on the $p$th layer outputs (sparse representation). This follows
from the fact that an MLP with a single nonlinear hidden layer estimates the posterior
probabilities of output classes conditioned on the input features [21], and the SMLP outputs
are completely determined by the outputs of the $p$th (hidden) layer.

Figure 2.3: Block diagram of the phoneme recognition system: PLP, FDLP and MLDA feature streams are each processed by a hierarchical SMLP, the resulting posteriors are combined using the Dempster-Shafer rule, and the combined posteriors are decoded using a hybrid HMM.
2.5
System Description
The block diagram of the phoneme recognition system used in our experiments is
shown in Fig. 2.3. The various components of the system are described below.
2.5.1
Database
Phoneme recognition experiments are conducted on the TIMIT database [31]. It consists
of 630 speakers with 10 utterances per speaker sampled at 16 kHz. The two SA dialect sentences per speaker are excluded from the setup as they are identical across all the speakers.
The original TIMIT train and test sets consist of 462 and 168 speakers respectively [31].
We further divide the original train set into training and validation sets having 425 and
37 speakers, and keep the original test set unchanged. Thus in all our experiments, the
training, validation and test sets consist of 3400, 296 and 1344 utterances from 425, 37 and
168 speakers, respectively.
2.5.2
Feature Streams
To test the proposed SMLP classifier, we developed systems using three different
feature streams, namely PLP cepstral coefficients [13], FDLP temporal features [17] and
MLDA spectro-temporal features [20]. These features are extracted for every 10 ms of
speech, and they are normalized for speaker-specific mean and variance. A detailed description of each feature stream is provided below.

The resultant spectral envelopes are smoothed by twelfth order linear prediction analysis [13]. The top 13 cepstral coefficients are concatenated with the corresponding delta and
delta-delta features to obtain a 39 dimensional feature vector. A nine frame context of these
vectors is used as the input PLP feature stream.
2.5.3
Hierarchical SMLP
Hierarchies of multilayer perceptron (MLP) classifiers have been shown to be useful
for acoustic modeling in speech recognition [4345], model adaptation [46] and language
identification [47]. A hierarchical MLP consists of two MLPs in series which are sequentially
trained. The first MLP uses standard acoustic feature vectors to estimate the posterior
probabilities of various output classes such as phonemes. The second MLP is then trained
on the same targets using long temporal spans of posterior probabilities estimated by the
first MLP as inputs.
Block diagram of the hierarchical SMLP is shown in Fig. 2.4. Initially, a four layer
SMLP is trained to estimate the 3-state phoneme posterior probabilities. Subsequently, another three layer MLP is trained on a long temporal span of these posteriors to estimate the
single state phoneme posterior probabilities. Both these networks are initialized randomly
using uniform noise and trained using back-propagation. We have modified the Quicknet
package [48] (software for MLP training) to perform SMLP training.
Figure 2.4: Block diagram of the hierarchical SMLP. A four layer SMLP maps the input acoustic features, through a sparse hidden layer, to 3-state phoneme posteriors; a subsequent three layer MLP takes a 230 ms (23 frame) temporal context of these posteriors as input and estimates the single state phoneme posterior probabilities. Though both networks are fully connected, only a portion of the connections are shown for clarity.
As shown in Fig. 2.4, the SMLP used for estimating the 3-state phone posterior
probabilities consists of four layers (m = 4): an input layer to receive a given feature
stream, two hidden layers with a sigmoid nonlinearity, and an output layer with a softmax
nonlinearity. The number of nodes in the input and output layers is set to be equal to the
dimensionality of the input feature vector and the number of phoneme states (i.e., 49 × 3
= 147) respectively. The outputs of the first hidden layer (p = 2) are forced to be sparse,
with the number of nodes in it being the same as that of the input layer. The number of nodes in
the second hidden layer is chosen to be 1000. For each feature stream, the value of λ in
the SMLP cost function (2.3) is chosen to minimize the phoneme error rate (PER) on the
validation data.
In the first pass of SMLP classifier training, 3-state hard phoneme targets are
obtained by segmenting each phoneme in the training data equally into three states i.e.,
start, middle and end. This classifier is retrained in a second pass using the hard targets
corresponding to the best state alignment obtained by applying the Viterbi algorithm on
21
3-state posterior probability estimates of the first pass. Frame classification accuracy on
the validation set is used to control the learning rate and to terminate training.
2.5.4
Dempster-Shafer Combination
2.5.5
The 49 phoneme classes are mapped to 39 phoneme classes for decoding5 [32]. The posterior
probabilities of phoneme classes are converted to the scaled likelihoods by dividing them
by the corresponding prior probabilities of phonemes obtained from the training data. A
3-state HMM (connected from left to right) with equal self-transition and state transition
probabilities is used to model each phoneme. The emission likelihood of each state is
set to be the scaled likelihood. A bigram phonotactic language model is used in all the
experiments. Finally, the Viterbi algorithm is applied for decoding the phoneme sequence.
The PER is obtained by comparing the decoded phoneme sequence against the reference
sequence. While evaluating the performance on the test set, the language model scaling
factor is chosen to minimize the PER of the validation data.
2.6
Experimental Results
Table 2.1 shows the PER of the proposed SMLP based hierarchical hybrid system
and the baseline MLP based hierarchical hybrid systems for various feature streams on the
TIMIT test set. As described earlier in section 2.5.3, proposed and baseline systems differ
only in the way 3-state phoneme posteriors are estimated. Results indicate that four layer
MLP based system performs better than conventional three layer MLP based system for
each feature stream. Moreover, the SMLP based system outperforms the baseline four layer
MLP based system for each feature stream. This improved performance can be attributed
to the sparse regularization term.
5
The appropriate subsets of 49 phoneme posterior probability estimates are summed to get 39 phoneme
probability estimates.
Table 2.1: PER (in %) on TIMIT test set for various acoustic feature streams using a hierarchy of multilayer perceptrons. The last column indicates the results of feature stream combination at the hierarchical posterior level using the Dempster-Shafer theory of evidence.

System            PLP    FDLP   MLDA   PLP+FDLP+MLDA
MLP (3 layers)    22.9   23.2   22.8   20.5
MLP (4 layers)    22.6   22.8   22.4   20.1
SMLP (4 layers)   21.9   22.1   21.9   19.6
To exploit the complementary nature of these acoustic feature streams, they are
combined at the hierarchical posterior probability level as described in section 2.5.4. The
system combination results are shown in the last column of Table 2.1. It can be observed
that the combination of SMLP based systems yields a PER of 19.6%, a relative improvement
of 2.5% over the combination of four layer MLP (i.e., = 0) based systems. These results
are statistically significant with a p-value of less than 0.0004. Further, the information
transferred6 for SMLP and MLP (4 layers) based systems is 3.46 and 3.42 bits respectively.
On the TIMIT core test set consisting of 192 utterances (a subset of the test set provided
by LDC [31]), we obtain a PER of 20.7% using the combination of SMLP based systems.
This performance compares well with the existing state-of-the-art systems7 .
2.7
Analysis
First, we quantify the sparsity of the $p$th hidden layer outputs using the following
measure $\eta$ [54]:

$$\eta = \frac{\sqrt{N_2} \;-\; \Big(\sum_{i=1}^{N_2} |y_i^p|\Big) \Big/ \sqrt{\sum_{i=1}^{N_2} (y_i^p)^2}}{\sqrt{N_2} - 1}. \qquad (2.9)$$

It is to be noted that $0 \le \eta \le 1$, and the value of $\eta$ is one for maximally sparse and close to zero for minimally
sparse representations. Table 2.2 lists the average $\eta$ value of the first hidden layer outputs
over the validation data for various phoneme recognition systems. As expected, SMLP based
systems have significantly higher $\eta$ values than four layer MLP systems, which indicates the
effectiveness of the sparse regularization term.
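A direct implementation of the measure (2.9) is sketched below (NumPy; the input vectors are placeholders).

```python
import numpy as np

def sparsity_measure(y):
    """Sparsity measure (2.9) of a layer-output vector y [54]:
    1 for a maximally sparse vector, near 0 for a dense one."""
    n = y.size
    l1 = np.sum(np.abs(y))
    l2 = np.sqrt(np.sum(y ** 2))
    return (np.sqrt(n) - l1 / (l2 + 1e-12)) / (np.sqrt(n) - 1.0)

print(sparsity_measure(np.array([0.0, 0.0, 5.0, 0.0])))   # 1.0 (one active node)
print(sparsity_measure(np.ones(4)))                        # 0.0 (all nodes equally active)
```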
Table 2.2: Average measure of sparsity of the first hidden layer outputs of SMLP and four
layer MLP for various feature streams.

System            PLP     FDLP    MLDA
MLP (4 layers)    0.275   0.278   0.282
SMLP (4 layers)   0.496   0.552   0.540
Table 2.3 shows the PER of the resulting hierarchical system for various acoustic feature
streams. The linear classifier is able to model the SMLP features better than it models the
MLP features.
Table 2.3: PER (in %) on TIMIT test set for various acoustic feature streams when the
3-state phoneme posteriors are obtained using a single layer perceptron trained on the first
hidden layer outputs of SMLP or MLP classifier.

System            PLP    FDLP   MLDA
MLP (4 layers)    26.9   27.0   26.5
SMLP (4 layers)   25.0   25.5   24.6
Finally, results using only a single multilayer perceptron (without hierarchy) are
analyzed. A single multilayer perceptron (SMLP or MLP) is trained directly to estimate the single state phoneme posterior probabilities, which are decoded as described in
section 2.5.5. The value of λ in the SMLP cost function is optimized on the validation data.
Table 2.4 summarizes the PER of various feature streams. It can be observed from this table
that the SMLP system consistently outperforms the corresponding baseline MLP systems.
Table 2.4: PER (in %) on TIMIT test set using a single multilayer perceptron (without
hierarchy).

System            PLP    FDLP   MLDA
MLP (3 layers)    27.2   26.3   27.1
MLP (4 layers)    27.3   27.5   27.0
SMLP (4 layers)   26.6   25.8   26.0
Chapter 3
3.1 Chapter Outline
An Auto-Associative Neural Network (AANN) [55] is a fully connected feed-forward
neural network trained to reconstruct the input at its output through a hidden bottleneck
layer. Existing AANN based speaker verification systems [9, 10] use the difference in reconstruction error
computed under the universal background model (UBM) AANN and the speaker-specific AANN models as a score for making a decision, as shown in Fig. 3.1. The
UBM-AANN is obtained by training an AANN on multiple held out speakers, whereas
the speaker-specific AANN is obtained by adapting (or retraining) the UBM-AANN using
corresponding speaker data.

Figure 3.1: Block schematic of the AANN based speaker verification system.
The remainder of this chapter is organized as follows. AANNs are introduced in the
next section. Section 3.3 describes the previously proposed speaker verification system. The
proposed WLS formulation of AANNs is presented in section 3.4. The speaker verification
system based on WLS of AANNs is described in section 3.5. Finally, experimental results
are shown in section 3.6.

Figure 3.2: Schematic of an AANN with a linear input layer, three nonlinear hidden layers, and a linear output layer; the layer sizes used here are 39, 20, 6, 39 and 39 nodes respectively.
3.2
for modeling the distribution of data [9]. Some of its advantages are that it relaxes the
assumption that feature vectors are locally normal, and that it can capture higher order moments.

For an input vector $f$, the network produces an output $\hat{f}(f, \Theta)$ which depends both
on the input $f$ and the parameters $\Theta$ of the network (the set of weights and biases). For
simplicity, we denote the network output as $\hat{f}$. While training the network, the parameters
of the network are adjusted to minimize, typically, the average squared error cost between the
input $f$ and the output $\hat{f}$ over the training data, as in (3.1). The network is trained using
stochastic gradient descent, where the gradient is computed using the error back-propagation
algorithm.

$$\min_{\Theta}\; E\big[\, \| f - \hat{f} \|^2 \,\big]. \qquad (3.1)$$

Once this network is well trained, the average reconstruction error of input vectors that are
drawn from the distribution of the training data will be small compared to vectors drawn
from a different distribution [9].
3.3
The goal of speaker verification is to verify whether a given utterance belongs
to a claimed speaker or not, based on a sample utterance from the claimed speaker. In other
words, the task is to verify whether the two utterances of a speaker verification trial
belong to the same speaker or not. The block diagram of a previously proposed AANN
based speaker verification system [9, 10] is shown in Fig. 3.1. The various components of the system are described below.
3.3.1
Feature Extraction
The acoustic features used in our experiments are 39 dimensional frequency domain linear prediction (FDLP) features [5659]. In this technique, sub-band temporal envelopes of
speech are first estimated in narrow sub-bands (96 linear bands). These sub-band envelopes
are then gain normalized to remove reverberation and channel artifacts. After normalization, the frequency axis is warped to 37 Mel bands in the frequency range of 125-3800 Hz to
derive a gain normalized mel scale energy representation of speech. These mel band energies are converted to cepstral coefficients by applying a log and Discrete Cosine Transform
(DCT). The top 13 cepstral coefficients along with derivative and acceleration components
are used as features, yielding 39 dimensional feature vectors. Finally, a subset of these
feature vectors corresponding to speech are selected based on the voice activity detection
information provided by NIST.
3.3.2
UBM-AANN
The concept of UBM is introduced in [60], where a GMM trained on data from
multiple speakers is used as a UBM. In our work, UBMs are obtained by training AANNs
on development data consisting of multiple speakers (described below) [9].
Gender-specific AANN based UBMs are trained on a telephone development data
set consisting of audio from the NIST 2004 speaker recognition database, the Switchboard-2
Phase III corpora and the NIST 2005 speaker recognition database. We use only 400 male
and 400 female utterances each corresponding to about 17 hours of speech.
AANN based UBMs are trained using the FDLP features (see Section 3.3.1) to
minimize the reconstruction error loss function described in section 3.2. Each UBM has a
linear input layer and a linear output layer, along with three nonlinear (tanh nonlinearity) hidden
layers. Both input and output layers have 39 nodes corresponding to the dimensionality of
the input FDLP features. The first, second and third hidden layers have 20, 6 and 39 nodes
respectively. A schematic of an AANN with this architecture is shown in Fig. 3.2. We
have modified the Quicknet package for training the AANNs [48].
3.3.3
Speaker-Specific AANN
A speaker-specific AANN model is obtained by retraining the entire UBM-AANN using the
corresponding speaker data [9]. However, we have observed that the performance can be
improved by retraining only the weight matrix connecting the third hidden layer and the output
layer. This may be due to the limited amount of speaker-specific data. Thus, an improved
baseline is used in our experiments, where a speaker-specific AANN is obtained by adapting
only the UBM-AANN weights that impinge on the output layer.
3.3.4
Score Computation
During the test phase, the average reconstruction error of the test data is computed under
both the UBM-AANN and the claimed speaker AANN models. The final score of a trial is
computed as the difference between these average reconstruction errors. In the ideal case,
if the claim is true, the average reconstruction error of the test data is larger under the
UBM-AANN model than under the claimed speaker AANN model, and vice versa.
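A sketch of this scoring rule is given below (NumPy; the two forward-pass functions are stand-ins for the actual trained UBM and speaker AANNs).

```python
import numpy as np

def aann_score(frames, ubm_forward, spk_forward):
    """Verification score: average reconstruction error under the UBM-AANN
    minus that under the claimed-speaker AANN (a larger score favours the claim).

    frames: (T, 39) test feature vectors;
    ubm_forward, spk_forward: functions mapping a frame to its reconstruction.
    """
    err_ubm = np.mean([np.sum((f - ubm_forward(f)) ** 2) for f in frames])
    err_spk = np.mean([np.sum((f - spk_forward(f)) ** 2) for f in frames])
    return err_ubm - err_spk

# Toy stand-ins: the "speaker model" reconstructs slightly better than the UBM.
rng = np.random.default_rng(2)
frames = rng.standard_normal((50, 39))
print(aann_score(frames, lambda f: 0.8 * f, lambda f: 0.9 * f))
```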
3.4
UBM and its maximum a posteriori (MAP) adapted speaker-specific model for making a decision [60]. More recently proposed GMM factor analysis techniques use low-dimensional
subspace(s) in part of the GMM parameter space to model the speaker and channel variabilities [61-65], and extract coordinates, known as an i-vector [63], corresponding to a given
utterance of a speaker in the low-dimensional subspace. The i-vectors are treated as features while training the probabilistic linear discriminant analysis (PLDA) model for hypothesis
testing [66, 67].
Let $w_{l,s}$ denote the vectorized speaker- and session-specific weights for the $l$th session of the $s$th speaker, and let $n_{l,s}$ be the corresponding number of frames. Their weighted mean and covariance over the development data are

$$\bar{w} = \left( \sum_{s=1}^{m} \sum_{l=1}^{l(s)} n_{l,s} \right)^{-1} \sum_{s=1}^{m} \sum_{l=1}^{l(s)} n_{l,s}\, w_{l,s},$$

$$\Lambda = \left( \sum_{s=1}^{m} \sum_{l=1}^{l(s)} n_{l,s} \right)^{-1} \sum_{s=1}^{m} \sum_{l=1}^{l(s)} n_{l,s}\, (w_{l,s} - \bar{w})(w_{l,s} - \bar{w})^{T}.$$
We model $w_{l,s}$ using a low-dimensional affine subspace parameterized by a matrix
$T$, i.e., $w_{l,s} \approx \bar{w} + T q_{l,s}$, where $q_{l,s}$ represents the unknown i-vector associated with the
$l$th session of the $s$th speaker. To find $T$, the following weighted least squares cost function is
minimized with respect to its arguments:
$$L\big(T, q_{1,1}, \ldots, q_{l(m),m}\big) = \sum_{s=1}^{m} \sum_{l=1}^{l(s)} n_{l,s}\, \big\| w_{l,s} - (\bar{w} + T q_{l,s}) \big\|_{\Lambda^{-1}}^{2} \;+\; \underbrace{\zeta \sum_{s=1}^{m} \sum_{l=1}^{l(s)} \| q_{l,s} \|^{2}}_{\text{regularization term}}, \qquad (3.2)$$

where $\|\cdot\|_A$ denotes a norm given by $\|x\|_A^2 = x^T A x$, and $\zeta$ is a small positive constant.
Differentiating (3.2) with respect to $q_{l,s}$ and setting it equal to zero yields

$$\frac{\partial L}{\partial q_{l,s}} = 0 \;\Longrightarrow\; -T^T \Lambda^{-1} n_{l,s} \big[ w_{l,s} - (\bar{w} + T q_{l,s}) \big] + \zeta\, q_{l,s} = 0$$
$$\Longrightarrow\; q_{l,s} = \big( \zeta I + T^T \Lambda^{-1} n_{l,s} T \big)^{-1} T^T \Lambda^{-1} n_{l,s}\, (w_{l,s} - \bar{w}). \qquad (3.3)$$

The corresponding update equation for $T$ is

$$\sum_{s=1}^{m} \sum_{l=1}^{l(s)} n_{l,s}\, T \big( \zeta I + q_{l,s} q_{l,s}^T \big) = \sum_{s=1}^{m} \sum_{l=1}^{l(s)} \Lambda^{-1} n_{l,s}\, (w_{l,s} - \bar{w})\, q_{l,s}^T. \qquad (3.4)$$

The solution for $T$ is obtained by applying coordinate gradient descent, i.e., (3.3)
and (3.4) are solved iteratively. In other words, for a given $T$, we first find
the i-vectors $\{q_{1,1}, \ldots, q_{l(m),m}\}$ using (3.3). In the next step, we solve for $T$ in (3.4) using
the i-vectors $\{q_{1,1}, \ldots, q_{l(m),m}\}$ found in the previous step. This procedure is repeated
until convergence.

The above update equations can be compared with the total variability space
training of GMMs [63]. Note that (3.3) and (3.4) resemble the maximum likelihood (ML)
update equations in [63], except for the $\zeta I$ term in (3.4).
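A compact sketch of the alternating updates, assuming the reconstructed forms of (3.3) and (3.4) above, is given below (NumPy; the dimensions, iteration count and the value of zeta are illustrative).

```python
import numpy as np

def wls_train_T(W, wbar, Lam_inv, n, R, zeta, iters=10, seed=0):
    """Alternate between the i-vector update (3.3) and the T update (3.4).

    W:       (S, D) rows are the adapted weight vectors w_{l,s} (one per utterance),
    wbar:    (D,) weighted mean, Lam_inv: (D, D) inverse covariance,
    n:       (S,) frame counts, R: subspace dimension.
    Returns T (D, R) and the i-vectors Q (S, R).
    """
    rng = np.random.default_rng(seed)
    S, D = W.shape
    T = 0.01 * rng.standard_normal((D, R))
    for _ in range(iters):
        # (3.3): q = (zeta I + n T' Lam^-1 T)^-1 n T' Lam^-1 (w - wbar)
        Q = np.zeros((S, R))
        for s in range(S):
            A = zeta * np.eye(R) + n[s] * T.T @ Lam_inv @ T
            Q[s] = np.linalg.solve(A, n[s] * T.T @ Lam_inv @ (W[s] - wbar))
        # (3.4): T sum_s n_s (zeta I + q q') = sum_s n_s Lam^-1 (w - wbar) q'
        lhs = sum(n[s] * (zeta * np.eye(R) + np.outer(Q[s], Q[s])) for s in range(S))
        rhs = sum(n[s] * Lam_inv @ np.outer(W[s] - wbar, Q[s]) for s in range(S))
        T = rhs @ np.linalg.inv(lhs)
    return T, Q

# Toy example: 20 utterances of 50-dimensional weight vectors, 5-dim subspace.
rng = np.random.default_rng(3)
W = rng.standard_normal((20, 50))
T, Q = wls_train_T(W, W.mean(0), np.eye(50), np.full(20, 100.0), R=5, zeta=1e-2)
print(T.shape, Q.shape)
```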
3.5
The block diagram of the WLS based AANN speaker verification system is shown in
Fig. 3.3. This system uses the FDLP features described in section 3.3.1. The description of
the UBM-AANN can be found in section 3.3.2. The remaining components of the system are
described below.
Figure 3.3: Block schematic of the WLS based AANN speaker verification system: FDLP feature extraction, UBM-AANN training, adaptation of the UBM-AANN, T-matrix training, i-vector extraction, PLDA training, and hypothesis testing to produce scores.
3.5.1
The weight matrix connecting the third hidden layer and the output layer of the UBM-AANN is adapted for each utterance to obtain a speaker-specific model. It is possible to derive a
closed-form solution for updating the speaker-specific weight matrix $W_{l,s}$ connecting the third
hidden and output layers. The output bias vector $b$ of the UBM-AANN is not adapted.

Let $f_{i,l,s}$ be the $i$th feature vector (frame) of an utterance corresponding to the $l$th
session of the $s$th speaker, and $n(l, s)$ be the number of such frames in that utterance. The
third hidden layer output vector of the UBM-AANN corresponding to this input is denoted
by $h_{i,l,s}$. The loss function (3.5) is minimized to obtain the speaker-specific weight matrix
$W_{l,s}$ corresponding to the $l$th session of the $s$th speaker, where $\beta$ is non-negative and controls the
amount of regularization:
$$L(W_{l,s}) = \sum_{i=1}^{n(l,s)} \| f_{i,l,s} - b - W_{l,s}\, h_{i,l,s} \|_2^2 \;+\; \beta\, \mathrm{tr}\big( W_{l,s} W_{l,s}^T \big). \qquad (3.5)$$

Differentiating the expression above with respect to $W_{l,s}$ and setting it to zero yields

$$\frac{\partial L(W_{l,s})}{\partial W_{l,s}} = 0 \;\Longrightarrow\; W_{l,s} \left( \sum_{i=1}^{n(l,s)} h_{i,l,s} h_{i,l,s}^T + \beta I \right) - \sum_{i=1}^{n(l,s)} (f_{i,l,s} - b)\, h_{i,l,s}^T = 0,$$

$$W_{l,s} = \left( \sum_{i=1}^{n(l,s)} (f_{i,l,s} - b)\, h_{i,l,s}^T \right) \left( \sum_{i=1}^{n(l,s)} h_{i,l,s} h_{i,l,s}^T + \beta I \right)^{-1}. \qquad (3.6)$$
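The closed-form update (3.6) is a ridge-regression-style solve; a minimal sketch follows (NumPy; frame counts, dimensions and the value of beta are placeholders).

```python
import numpy as np

def adapt_output_weights(F, H, b, beta):
    """Speaker-specific weight matrix from Eq. (3.6):
    W = (sum_i (f_i - b) h_i^T) (sum_i h_i h_i^T + beta I)^{-1}.

    F: (T, 39) frames f_i, H: (T, 39) third-hidden-layer outputs h_i,
    b: (39,) UBM output bias, beta: regularization constant.
    """
    num = (F - b).T @ H                        # sum_i (f_i - b) h_i^T
    den = H.T @ H + beta * np.eye(H.shape[1])  # sum_i h_i h_i^T + beta I
    return num @ np.linalg.inv(den)

rng = np.random.default_rng(4)
F, H, b = rng.standard_normal((200, 39)), rng.standard_normal((200, 39)), np.zeros(39)
W = adapt_output_weights(F, H, b, beta=0.005)
print(W.shape)   # (39, 39)
```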
3.5.2
T-matrix Training
3.5.3
i-vectors
3.5.4
PLDA training
PLDA is a generative model for observations [66, 67], in our case i-vectors. The i-vectors
are assumed to be generated as

$$q_{l,s} = \mu + \Phi\, \beta_s + \epsilon_{l,s}, \qquad (3.7)$$

where $\mu$ is a global offset, the columns of $\Phi$ span the speaker subspace, $\beta_s$ is a latent speaker factor, and $\epsilon_{l,s}$ is a residual term with covariance $\Sigma$.

Gender-specific PLDA models are trained using the same development data that
is used for training the T matrices (see Section 3.5.2). The maximum likelihood estimates of
the model parameters $\{\mu, \Phi, \Sigma\}$ are obtained using an Expectation Maximization (EM)
algorithm [66].
3.5.5
Hypothesis Testing
Given two i-vectors q1 , q2 of a speaker verification trial, we need to test whether they
belong to the same speaker (Hs ) or different speakers (Hd ). For the Gaussian PLDA above,
the log-likelihood ratio can be computed in closed form as

$$\text{score} = \log \frac{p(q_1, q_2 \mid H_s)}{p(q_1 \mid H_d)\, p(q_2 \mid H_d)} \qquad (3.8)$$
$$= \log \frac{\mathcal{N}\!\left( \begin{bmatrix} q_1 \\ q_2 \end{bmatrix};\, \begin{bmatrix} \mu \\ \mu \end{bmatrix},\, \begin{bmatrix} \Phi\Phi^T + \Sigma & \Phi\Phi^T \\ \Phi\Phi^T & \Phi\Phi^T + \Sigma \end{bmatrix} \right)}{\mathcal{N}\!\left( \begin{bmatrix} q_1 \\ q_2 \end{bmatrix};\, \begin{bmatrix} \mu \\ \mu \end{bmatrix},\, \begin{bmatrix} \Phi\Phi^T + \Sigma & 0 \\ 0 & \Phi\Phi^T + \Sigma \end{bmatrix} \right)},$$

where $\mathcal{N}(\cdot\,; \mu, \Sigma)$ is a multivariate Gaussian density with mean $\mu$ and covariance $\Sigma$. The
above score can be computed efficiently as described in [64, 68].
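A direct (unoptimized) evaluation of the ratio in (3.8) is sketched below, assuming the Gaussian PLDA of (3.7) with parameters mu, Phi and Sigma (NumPy; all values are toy placeholders).

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log density of a multivariate Gaussian."""
    d = x.size
    diff = x - mean
    sign, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(cov, diff))

def plda_llr(q1, q2, mu, Phi, Sigma):
    """Log-likelihood ratio (3.8): same-speaker vs. different-speaker hypotheses."""
    B = Phi @ Phi.T                 # between-speaker covariance
    W = Sigma                       # within-speaker (residual) covariance
    q = np.concatenate([q1, q2])
    m = np.concatenate([mu, mu])
    same = np.block([[B + W, B], [B, B + W]])     # q1, q2 share the speaker factor
    diff = np.block([[B + W, np.zeros_like(B)],
                     [np.zeros_like(B), B + W]])  # independent speaker factors
    return gaussian_logpdf(q, m, same) - gaussian_logpdf(q, m, diff)

rng = np.random.default_rng(5)
D, R = 240, 150
mu, Phi, Sigma = np.zeros(D), 0.1 * rng.standard_normal((D, R)), np.eye(D)
q1, q2 = rng.standard_normal(D), rng.standard_normal(D)
print(plda_llr(q1, q2, mu, Phi, Sigma))
```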
3.6
Experimental Results
Speaker verification systems are tested on the telephone conditions of the NIST-2008
speaker recognition evaluation (SRE). Table 3.1 shows the description of the various telephone
conditions. Table 3.2 lists the EER and minimum detection cost function (minDCF) on
NIST-2008 for the baseline AANN (see section 3.3) and WLS based AANN (see section 3.5)
speaker verification systems. These neural network based systems use the same UBM-AANN of size (39, 20, 6, 39, 39), where each number indicates the number of nodes in the
corresponding layer.

Table 3.2: EER in % and minDCF x 10^3 (shown in brackets) on conditions C6, C7 and C8
of NIST-08.

System                     C6            C7            C8
Baseline AANN              19.5 (87.3)   15.8 (73.0)   15.2 (74.8)
PCA of AANNs, β = 0        13.4 (67.6)   8.7 (41.4)    7.9 (39.4)
WLS of AANNs, β = 0        12.2 (66.2)   7.2 (38.3)    6.4 (35.6)
WLS of AANNs, β = 0.005    10.7 (59.6)   5.5 (28.3)    4.4 (24.1)

The error rates of the baseline AANN system are indicated in the first row of Table
3.2. In this case, the difference between average reconstruction errors of a UBM and a given
speaker-specific model is used as a score for making decision [11]. The second row of the table
lists the error rates when PCA is applied instead of WLS to reduce the dimensionality of
the speaker-specific weights. The resultant PCA based i-vectors are modeled using PLDA.
The next two rows of Table 3.2 show the error rates of the WLS based AANN speaker
verification system that uses gender-dependent 150 dimensional (number of columns of $\Phi$)
subspace PLDA models in a 240 dimensional i-vector space. The speaker-specific weights
$W_{l,s}$ are derived using (3.6) for different values of $\beta$.

These results suggest that the proposed AANN based i-vector/PLDA framework
outperforms the baseline AANN speaker verification system. Additionally, the proposed
WLS formulation for obtaining i-vectors yields better results than simple PCA. Moreover, regularized speaker-specific weights ($\beta = 0.005$) yield much better results. It can be
observed that a relative improvement of 59.2% in EER and 52.3% in minDCF is obtained
using the WLS based AANN system over the baseline AANN system.
Chapter 4
Factor Analysis of
Auto-Associative Neural Networks
for Speaker Verification
4.1
Chapter Outline
The main disadvantage of the WLS of AANNs approach is that it requires speaker-specific adaptation of the full weight matrix connecting the third hidden and output layers, even though a much
lower-dimensional representation (the i-vector) is used (see section 3.4) for the subsequent
modeling. In this chapter, we introduce and develop the factor analysis (FA) theory of
AANNs to alleviate this problem. This is achieved by regularizing each speaker-specific
weight matrix by restricting it to a common low-dimensional subspace during adaptation.
The subspace is learned using large amounts of development data, and is held fixed during
adaptation. Thus, only the coordinates in the subspace, also known as the i-vector, need to be
estimated using the speaker-specific data. Unlike the WLS of AANNs approach, we adapt
the weight matrix directly in a low-dimensional common subspace. The update equations
are derived for learning both the common low-dimensional subspace and the i-vectors corresponding to speakers in the subspace. The resultant i-vector representation is used as
features for the subsequent PLDA model. The proposed system shows promising results on the
NIST-08 SRE, and yields a 12% relative improvement in EER over the WLS based AANN
speaker verification system described in section 3.5.
The remainder of the chapter is organized as follows. In section 4.2, the factor
analysis (FA) of AANNs is developed. The FA based AANN speaker verification system is
described in section 4.3. Experimental results are provided in section 4.4.
4.2
In the FA approach, the weight matrix connecting the third hidden layer and output layer of each speaker-specific AANN model is constrained to lie on a common low-dimensional
subspace such that it minimizes the overall loss function over the entire development data.
In this process, the rest of the parameters of each AANN are held fixed at the values learned
during speaker independent training, such as the UBM (see section 3.3.2). The proposed loss
function is different from the one used in section 3.5. The notations used in this section are
summarized below.
m - number of speakers
In the AANN loss function below, we first vectorize $W_{l,s}$ so that a subspace
structure can be imposed on it. The loss function with speaker and session specific weights
is given by

$$L = \sum_{s=1}^{m} \sum_{l=1}^{l(s)} \sum_{i=1}^{n(l,s)} \| f_{i,l,s} - b - W_{l,s}\, h_{i,l,s} \|_2^2, \qquad (4.1)$$

where we define

$$w_{l,s} = \text{Row ordered}(W_{l,s}), \qquad H_{i,l,s} = I_d \otimes (h_{i,l,s})^T.$$

The dimensionality of $W_{l,s}$ is $d \times d'$ and that of $w_{l,s}$ is $dd' \times 1$. The vector $w_{l,s}$ is obtained
by arranging the rows of $W_{l,s}$ as columns one after the other. The matrix $H_{i,l,s}$, of size $d \times dd'$, is equal to the
Kronecker product of $I_d$ (a $d \times d$ identity matrix) and $(h_{i,l,s})^T$; it is block-diagonal with $(h_{i,l,s})^T$ in each block, so that $W_{l,s} h_{i,l,s} = H_{i,l,s} w_{l,s}$. The loss function can therefore be rewritten as

$$L = \sum_{s=1}^{m} \sum_{l=1}^{l(s)} \sum_{i=1}^{n(l,s)} \| f_{i,l,s} - b - H_{i,l,s}\, w_{l,s} \|_2^2. \qquad (4.2)$$
The factor analysis model (or subspace constraint) for the vectorized weights $w_{l,s}$
is

$$w_{l,s} \approx w_{\mathrm{ubm}} + T\, q_{l,s},$$

where $w_{\mathrm{ubm}}$ represents the speaker independent (UBM) vector of weights connecting the third
hidden layer and the output layer, $T$ is a matrix having fewer columns than rows representing
the common low-dimensional subspace, and $q_{l,s}$ is a vector of coordinates in the subspace,
or an i-vector, associated with the $l$th session of the $s$th speaker. Substituting this factor
analysis model in (4.2) (note that the loss function depends only on $(T, \{q_{l,s}\})$ as we know
$w_{\mathrm{ubm}}$) gives
$$L\big(T, q_{1,1}, \ldots, q_{l(m),m}\big) = \sum_{s=1}^{m} \sum_{l=1}^{l(s)} \sum_{i=1}^{n(l,s)} \big[ f_{i,l,s} - b - H_{i,l,s}(w_{\mathrm{ubm}} + T q_{l,s}) \big]^T \big[ f_{i,l,s} - b - H_{i,l,s}(w_{\mathrm{ubm}} + T q_{l,s}) \big]. \qquad (4.3)$$

Defining the residual

$$e_{i,l,s} = f_{i,l,s} - b - H_{i,l,s}\, w_{\mathrm{ubm}}, \qquad (4.4)$$

the loss function becomes

$$L\big(T, q_{1,1}, \ldots, q_{l(m),m}\big) = \sum_{s=1}^{m} \sum_{l=1}^{l(s)} \sum_{i=1}^{n(l,s)} \big[ e_{i,l,s} - H_{i,l,s} T q_{l,s} \big]^T \big[ e_{i,l,s} - H_{i,l,s} T q_{l,s} \big]. \qquad (4.5)$$

Defining the per-utterance statistics

$$F_1(l, s) = \sum_{i=1}^{n(l,s)} H_{i,l,s}^T\, e_{i,l,s}, \qquad (4.6)$$

$$F_2(l, s) = \sum_{i=1}^{n(l,s)} H_{i,l,s}^T\, H_{i,l,s}, \qquad (4.7)$$

the loss function can be expressed as

$$L\big(T, q_{1,1}, \ldots, q_{l(m),m}\big) = \sum_{s=1}^{m} \sum_{l=1}^{l(s)} \left[ \sum_{i=1}^{n(l,s)} e_{i,l,s}^T e_{i,l,s} - 2\, q_{l,s}^T T^T F_1(l, s) + q_{l,s}^T T^T F_2(l, s)\, T q_{l,s} \right]. \qquad (4.8)$$
Figure 4.1: Block schematic of the proposed FA based AANN speaker verification system.
The loss function (4.8) is minimized alternately with respect to $\{q_{l,s}\}$ and $T$, holding the other fixed at each step, and therefore the optimum is found by setting the gradient of the loss function with
respect to the corresponding variable to zero. These steps are repeated until convergence.
$$\frac{\partial L}{\partial q_{l,s}} = 0 \;\Longrightarrow\; -2\, T^T F_1(l, s) + 2\, T^T F_2(l, s)\, T q_{l,s} = 0 \;\Longrightarrow\; q_{l,s} = \big( T^T F_2(l, s)\, T \big)^{-1} T^T F_1(l, s), \qquad (4.9)$$

$$\frac{\partial L}{\partial T} = 0 \;\Longrightarrow\; \sum_{s=1}^{m} \sum_{l=1}^{l(s)} \big[ -2\, F_1(l, s)\, q_{l,s}^T + 2\, F_2(l, s)\, T\, q_{l,s} q_{l,s}^T \big] = 0, \qquad (4.10)$$

where we solve for $T$ by solving a set of linear equations involving the entries of the matrix $T$.
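The statistics (4.6)-(4.7) and the i-vector update (4.9) can be written compactly using the Kronecker structure of $H_{i,l,s}$; a sketch under the reconstructed equations above is given below (NumPy; the sizes are toy values, whereas the thesis uses d = d' = 39).

```python
import numpy as np

def fa_statistics(F, H, b, w_ubm, d):
    """Per-utterance statistics F1 (4.6) and F2 (4.7).

    F: (T, d) frames, H: (T, d') third-hidden-layer outputs, b: (d,) output bias,
    w_ubm: (d*d',) row-ordered UBM weights. Uses H_i = I_d kron h_i^T.
    """
    T, d0 = H.shape
    F1 = np.zeros(d * d0)
    F2 = np.zeros((d * d0, d * d0))
    for i in range(T):
        Hi = np.kron(np.eye(d), H[i][None, :])   # (d, d*d') block-diagonal matrix
        e = F[i] - b - Hi @ w_ubm                # residual e_{i,l,s}
        F1 += Hi.T @ e
        F2 += Hi.T @ Hi
    return F1, F2

def extract_ivector(F1, F2, T_mat):
    """i-vector from Eq. (4.9): q = (T^T F2 T)^{-1} T^T F1."""
    return np.linalg.solve(T_mat.T @ F2 @ T_mat, T_mat.T @ F1)

rng = np.random.default_rng(6)
d = d0 = 5
frames, hidden = rng.standard_normal((40, d)), rng.standard_normal((40, d0))
b, w_ubm = np.zeros(d), rng.standard_normal(d * d0)
T_mat = 0.1 * rng.standard_normal((d * d0, 3))
F1, F2 = fa_statistics(frames, hidden, b, w_ubm, d)
print(extract_ivector(F1, F2, T_mat))
```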
4.3
The block diagram of the proposed FA based AANN speaker verification system
is shown in Fig. 4.1. This system resembles the WLS based AANN speaker verification
system described in section 3.5. These systems differ in the way the UBM-AANN is adapted. In
the FA approach, the weight matrix connecting the third hidden and output layers of the UBM-AANN
is adapted directly in the common low-dimensional subspace.
4.3.1
Statistics
The statistics in (4.6) and (4.7) are precomputed for each utterance corresponding to a
particular speaker and session. The appropriate gender-specific UBM is used for computing
the statistics. Note that we need to compute only a few entries of $F_2(l, s)$, as its entries
are redundant. These statistics are sufficient for training the T matrix and extracting the
i-vectors.
4.3.2
T-matrix Training
Gender dependent low-dimensional subspaces (T matrices) are trained as described in Section 4.2. The development data for training the subspaces consists of Switchboard-2, Phases
II and III; Switchboard Cellular, Parts 1 and 2 and NIST 2004-2005 SRE [62]. The total
number of male and female utterances is 12266 and 14936 respectively. We initialize the
T matrix with a Gaussian noise and learn the subspace as described in Section 4.2 using
coordinate gradient descent.
Table 4.1: EER in % and minDCF x 10^3 (shown in brackets) on conditions C6, C7 and C8
of NIST-08.

System                     C6            C7            C8
Baseline AANN              19.5 (87.3)   15.8 (73.0)   15.2 (74.8)
WLS of AANNs (β = 0.005)   10.7 (59.6)   5.5 (28.3)    4.4 (24.1)
FA of AANNs                9.6 (55.1)    4.7 (25.3)    3.8 (20.1)
4.3.3
i-vectors
4.4
Experimental Results
Speaker verification systems are tested on the telephone conditions of the NIST-2008
speaker recognition evaluation (SRE). The details of the baseline AANN and WLS based AANN
speaker verification systems can be found in section 3.6. All neural network based systems
use the same UBM-AANN of size (39, 20, 6, 39, 39), where each number indicates the number of nodes in the corresponding layer1. Table 3.1 shows the description of the various telephone
conditions. Table 4.1 lists the EER and minimum detection cost function (minDCF) on
NIST-2008 for all the systems. The last row of the table shows the error rates of the FA
based AANN system, which uses gender-dependent 150 dimensional (number of columns
of $\Phi$) subspace PLDA models in a 240 dimensional i-vector space. Results indicate that the
FA based AANN speaker verification system outperforms the baseline AANN and WLS based
AANN speaker verification systems in all conditions. It can also be observed that the
proposed FA approach yields a 12.1% relative improvement in EER and a 10.2% relative improvement in minDCF over the best WLS based AANN system.

1 The number of nodes in each hidden layer is optimized one at a time on NIST-08 to obtain the best
performance of the FA based AANN system, whereas the number of nodes in the input and output layers is
fixed by the dimensionality of the feature vectors.

Table 4.2: Comparison with the state-of-the-art GMM based i-vector/PLDA system. EER in % and minDCF x 10^3 (shown in brackets) on conditions C6, C7 and C8 of NIST-08.

System                    C6           C7           C8
FA of AANNs               9.6 (55.1)   4.7 (25.3)   3.8 (20.1)
GMM based i-vector/PLDA   7.0 (41.3)   2.8 (14.8)   2.1 (10.8)
4.4.1

The GMM based i-vector/PLDA system used for comparison is similar to the proposed system, except the UBM is based on a GMM. Each GMM based UBM consists of 1024 mixture components with diagonal covariances. The male and female UBMs are trained using
FDLP features extracted from 4324 and 5461 utterances of development data respectively.
Gender-specific 400 dimensional total variability space (T matrix) is trained as described
in [63]. The i-vectors of this space are length normalized and subsequently used for training
a gender-dependent PLDA system with a 250 dimensional subspace. Note that the development data used for training the T matrices and the PLDA models is the same as that of the
FA based AANN speaker verification system (see section 4.3).
A gender-specific GMM based i-vector/PLDA system is also trained for comparison. The results are shown in Table 4.2. Although significant improvements are achieved
using the proposed FA of AANNs, GMM based i-vector/PLDA system continues to perform
better. However, further work on neural network based systems might close the existing
performance gap, and bring forward possible advantages of this alternative nonlinear neural
network based modeling in speaker verification.
Chapter 5
Conclusions
5.1
Conclusions
Discriminative MLDA features were proposed for phoneme recognition in section
2.3. The SMLP classifier was proposed in section 2.4.1. It has been shown that one of
its hidden layer outputs can be forced to be sparse by adding a sparse regularization term
to the cross-entropy cost function. Update equations were derived for training the SMLP.
Finally, a multi-stream phoneme recognition system based on SMLP has been shown to
outperform its MLP counterpart.
The closed-form expression for adapting the AANN weight matrix connecting the
third and output layers with regularization was derived in chapter 3. It was further regularized by projecting it onto a low-dimensional subspace, which was learned to preserve most
of the variability of the adapted weight matrices in a WLS sense. Each speaker was modeled
using a projection in the subspace (an i-vector). The resultant speaker verification system
based on i-vectors achieved an order of magnitude better performance than the existing
AANN based speaker verification system.
The theory of FA of AANNs was introduced to directly adapt the AANN weight
matrix connecting the third and output layers in a low-dimensional subspace, that captures
variability of weight matrices (see section 4.2). This particular way of regularizing adaptation parameters of AANNs has been shown to yield better performance than the WLS
approach described in section 3.4.
5.2
Future Work
Future work includes applying the SMLP to extract data-driven features for ASR
in the TANDEM framework. We also plan to replace the MLP with the SMLP in other pattern classification
applications.
Appendix A
The softmax nonlinearity at the output layer is

$$y_j^m = \frac{e^{x_j^m}}{\sum_{k=1}^{N_m} e^{x_k^m}}, \qquad (A.1)$$

and its derivative with respect to the inputs of the output layer is

$$\frac{\partial y_j^m}{\partial x_i^m} = y_j^m \big( I_{ij} - y_i^m \big), \qquad (A.2)$$

where

$$I_{ij} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}$$

Using (A.2), the gradient of the cross-entropy cost with respect to the output layer inputs is

$$\frac{\partial L}{\partial x_i^m} = \sum_{j=1}^{N_m} \frac{\partial L}{\partial y_j^m}\, \frac{\partial y_j^m}{\partial x_i^m} = y_i^m - d_i, \qquad (A.3)$$

and the gradient with respect to the previous layer outputs is

$$\frac{\partial L}{\partial y_j^{m-1}} = \sum_{i=1}^{N_m} \frac{\partial L}{\partial x_i^m}\, \frac{\partial x_i^m}{\partial y_j^{m-1}} = \sum_{i=1}^{N_m} \big( y_i^m - d_i \big)\, w_{ji}^{m-1}. \qquad (A.4)$$

For a hidden layer $l$ with a sigmoid nonlinearity, the chain rule gives

$$\frac{\partial L}{\partial y_i^{l-1}} = \sum_{j=1}^{N_l} \frac{\partial L}{\partial y_j^l}\, \frac{\partial y_j^l}{\partial x_j^l}\, \frac{\partial x_j^l}{\partial y_i^{l-1}} = \sum_{j=1}^{N_l} \frac{\partial L}{\partial y_j^l}\, \phi_l'(x_j^l)\, w_{ij}^{l-1} = \sum_{j=1}^{N_l} \frac{\partial L}{\partial y_j^l}\, y_j^l \big( 1 - y_j^l \big)\, w_{ij}^{l-1},$$

where

$$y_j^l = \phi_l(x_j^l) = \frac{1}{1 + \exp(-x_j^l)}.$$
Bibliography
[1] J. Dines, J. Vepa, and T. Hain, The segmentation of multi-channel meeting recordings
for automatic speech recognition, in INTERSPEECH, 2006.
[2] M. Lehtonen, P. Fousek, and H. Hermansky, Hierarchical approach for spotting keywords, IDIAP Research Report, no. 0541, 2005.
[3] J. Pinto, I. Szoke, S. Prasanna, and H. Hermansky, Fast approximate spoken term
detection from sequence of phonemes, in Proceedings of the ACM SIGIR Workshop
on Searching Spontaneous Conversational Speech, 2008, pp. 0845.
[4] H. Bourlard and N. Morgan, Connectionist speech recognition: a hybrid approach.
Kluwer Academic Pub, 1994.
[5] H. Hermansky, D. Ellis, and S. Sharma, Tandem connectionist feature extraction for
conventional hmm systems, in Proc. of International Conference on Acoustics, Speech,
and Signal Processing (ICASSP), 2000.
[6] B. Chen, Q. Zhu, and N. Morgan, Learning long-term temporal features in lvcsr using
neural networks, in INTERSPEECH, 2004.
[15] B. Kingsbury, N. Morgan, and S. Greenberg, Robust speech recognition using the
modulation spectrogram, Speech Communication, vol. 25, no. 1, pp. 117-132, 1998.
[16] H. Hermansky and P. Fousek, Multi-resolution RASTA filtering for TANDEM-based
ASR, in INTERSPEECH, 2005.
[17] S. Ganapathy, S. Thomas, and H. Hermansky, Modulation frequency features for
phoneme recognition in noisy speech, The Journal of the Acoustical Society of America
- Express Letters, vol. 125, no. 1, pp. 8-12, 2009.
[18] M. Kleinschmidt and D. Gelbart, Improving word accuracy with Gabor feature extraction, in Proc. of ICSLP.
USA, 2002.
[19] S. Zhao and N. Morgan, Multi-stream spectro-temporal features for robust speech
recognition, in INTERSPEECH. Brisbane, Australia, 2008.
[20] N. Mesgarani, G. Sivaram, S. K. Nemala, M. Elhilali, and H. Hermansky, Discriminant
Spectrotemporal Features for Phoneme Recognition, in INTERSPEECH. Brighton,
2009.
[21] M. Richard and R. Lippmann, Neural network classifiers estimate Bayesian a posteriori probabilities, Neural Computation, vol. 3, no. 4, pp. 461-483, 1991.
[22] N. Morgan, Q. Zhu, A. Stolcke, K. Sonmez, S. Sivadas, T. Shinozaki, M. Ostendorf,
P. Jain, H. Hermansky, D. Ellis et al., Pushing the envelope-aside [speech recognition], IEEE Signal Processing Magazine, vol. 22, no. 5, pp. 81-88, 2005.
[23] V. Balakrishnan, G. Sivaram, and S. Khudanpur, Dirichlet mixture models of neural
learning, Advances in Neural Information Processing Systems, vol. 21, pp. 1033-1040,
2008.
[40] T. Sainath, A. Carmi, D. Kanevsky, and B. Ramabhadran, Bayesian compressive
sensing for phonetic classification, in Proc. of International Conference on Acoustics,
Speech, and Signal Processing (ICASSP), 2010.
[41] J. Gemmeke and T. Virtanen, Noise robust exemplar-based connected digit recognition, in Proc. of International Conference on Acoustics, Speech, and Signal Processing
(ICASSP), 2010.
[42] G. Sivaram, S. Nemala, M. Elhilali, T. Tran, and H. Hermansky, Sparse coding for
speech recognition, in Proc. of International Conference on Acoustics, Speech, and
Signal Processing (ICASSP), 2010.
[43] J. Pinto, B. Yegnanarayana, H. Hermansky, and M. Magimai.-Doss, Exploiting contextual information for improved phoneme recognition, Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2008.
[44] J. Pinto, G. Sivaram, M. Magimai.-Doss, H. Hermansky, and H. Bourlard, Analyzing
MLP Based Hierarchical Phoneme Posterior Probability Estimator, IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 2, pp. 225-241, 2011.
[45] H. Ketabdar and H. Bourlard, Enhanced phone posteriors for improving speech
recognition systems, IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 6, pp. 1094-1106, 2010.
[46] J. Pinto, M. Magimai-Doss, and H. Bourlard, MLP based hierarchical system for task adaptation in ASR, in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2009, pp. 365-370.
[47] D. Imseng, M. Doss, and H. Bourlard, Hierarchical multilayer perceptron based language identification, in INTERSPEECH, 2010.
[48] The ICSI Quicknet Software Package. Available: http://www.icsi.berkeley.edu/Speech/qn.html.
[49] F. Valente, Multi-stream speech recognition based on Dempster-Shafer combination
rule, Speech Communication, vol. 52, no. 3, pp. 213-222, 2010.
[50] G. Shafer, A mathematical theory of evidence. Princeton University Press, 1976, vol. 1.
[51] G. Dahl, M. Ranzato, A. Mohamed, and G. Hinton, Phone recognition with the mean-covariance restricted Boltzmann machine, Advances in Neural Information Processing Systems, vol. 23, pp. 469-477, 2010.
[52] A. Mohamed and G. Hinton, Phone recognition using restricted boltzmann machines, in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2010.
[53] P. Schwarz, P. Matejka, and J. Cernocky, Hierarchical structures of neural networks
for phoneme recognition, in Proc. of International Conference on Acoustics, Speech,
and Signal Processing (ICASSP), 2006.
[54] P. Hoyer, Non-negative matrix factorization with sparseness constraints, The Journal
of Machine Learning Research, vol. 5, pp. 1457-1469, 2004.
[55] M. Kramer, Nonlinear principal component analysis using autoassociative neural networks, AIChE Journal, vol. 37, no. 2, pp. 233-243, 1991.
[56] R. Kumaresan and A. Rao, Model-based approach to envelope and positive instantaneous frequency estimation of signals with speech applications, The Journal of the Acoustical Society of America, vol. 105, pp. 1912-1924, 1999.
[57] M. Athineos, H. Hermansky, and D. Ellis, Plp2 autoregressive modeling of auditorylike 2-d spectrotemporal patterns, in INTERSPEECH, 2004.
[58] M. Athineos and D. Ellis, Autoregressive modeling of temporal envelopes, IEEE
Transactions on Signal Processing, vol. 55, no. 11, pp. 5237-5245, 2007.
[59] S. Ganapathy, J. Pelecanos, and M. Omar, Feature normalization for speaker verification in room reverberation, in Proc. of International Conference on Acoustics, Speech,
and Signal Processing (ICASSP), 2011.
[60] D. Reynolds, T. Quatieri, and R. Dunn, Speaker verification using adapted Gaussian mixture models, Digital Signal Processing, vol. 10, no. 1-3, pp. 19-41, 2000.
[61] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, Joint factor analysis versus eigenchannels in speaker recognition, IEEE Transactions on Audio, Speech, and
Language Processing, vol. 15, no. 4, pp. 1435-1447, 2007.
[62] O. Glembek, L. Burget, N. Dehak, N. Brummer, and P. Kenny, Comparison of scoring
methods used in speaker recognition with joint factor analysis, in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2009.
[63] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, Front-end factor
analysis for speaker verification, IEEE Transactions on Audio, Speech, and Language
Processing, vol. 19, no. 4, pp. 788-798, 2011.
[64] N. Brümmer and E. de Villiers, The speaker partitioning problem, in Proceedings
of the Odyssey Speaker and Language Recognition Workshop, Brno, Czech Republic,
2010.
[65] D. Garcia-Romero and C. Espy-Wilson, Joint factor analysis for speaker recognition reinterpreted as signal coding using overcomplete dictionaries, in Proc. Odyssey
Speaker and Language Recognition Workshop, 2010.
[66] S. Prince and J. Elder, Probabilistic linear discriminant analysis for inferences about
identity, in IEEE 11th International Conference on Computer Vision, 2007, pp. 1-8.
[67] P. Kenny, Bayesian speaker verification with heavy-tailed priors, in Proceedings of the
Odyssey Speaker and Language Recognition Workshop, Brno, Czech Republic, 2010.
[68] D. Garcia-Romero and C. Espy-Wilson, Analysis of i-vector length normalization in
speaker recognition systems, in INTERSPEECH, 2011.