
Alternative Regularized Neural Network Architectures for Speech and Speaker Recognition

by

Sri Venkata Surya Garimella

A dissertation submitted to The Johns Hopkins University in conformity with the requirements for the degree of Doctor of Philosophy.

Baltimore, Maryland
July, 2012

© Sri Venkata Surya Garimella 2012

All rights reserved

Abstract
Artificial Neural Networks (ANNs) have been widely used in a variety of
speech processing applications. They can be used either in a classification or regression mode. Proper regularization techniques are necessary when training these
networks, especially in scenarios where the amount of training data is limited or the
number of layers in a network is large. In this thesis, we explore alternative regularized feed-forward neural network architectures and propose learning algorithms for
speech processing applications such as phoneme recognition and speaker verification.
In a conventional hybrid phoneme recognition system, a multilayer perceptron (MLP) with a single hidden layer is trained on standard acoustic features to
provide the estimates of posterior probabilities of phonemes. These estimates are
used for decoding the underlying phoneme sequence. In this thesis, we introduce a
sparse multilayer perceptron (SMLP) which jointly learns an internal sparse feature
representation and nonlinear classifier boundaries to discriminate multiple phoneme
classes. This is achieved by adding a sparse regularization term to the original cross-entropy cost function. Instead of an MLP, the SMLP is used in a hybrid phoneme recognition
system. Experiments are conducted to test various feature representations, including the proposed data-driven discriminative spectro-temporal features. Significant
improvements are obtained using these techniques.
Another application where neural networks are used is speaker verification. An Auto-Associative Neural Network (AANN) is a fully connected feed-forward neural network, trained to reconstruct its input at its output through a hidden compression layer. AANNs are used to model speakers in speaker verification, where a speaker-specific AANN model is obtained by adapting (or retraining) the Universal Background Model (UBM) AANN, an AANN trained on multiple held-out speakers, using the corresponding speaker data. When the amount of speaker data is limited, this procedure may lead to overfitting, as all the parameters of the UBM-AANN are being adapted. To alleviate this problem, we regularize the parameters of the AANN by developing subspace methods, namely weighted least squares (WLS) and factor analysis (FA). Experimental results show the effectiveness of the subspace methods over directly adapting a UBM-AANN for speaker verification.
Thesis Committee
Hynek Hermansky, Sanjeev Khudanpur, Trac Tran, Daniel Povey, Nelson Morgan.


Acknowledgments
I owe my deepest gratitude to my supervisor Prof. Hynek Hermansky, whose
insight, encouragement and guidance made this work possible. It is an honor for me
to work with him.
I am indebted to Andrea Ridolfi, Sanjeev Khudanpur, Trac Tran, Donniell
Fishkind, Carey Priebe, Rene Vidal and Daniel Povey for offering graduate level
courses that enabled me to learn and appreciate mathematical rigour. I would like to
thank my internship hosts Pedro Moreno and Olivier Siohan at Google for providing me with the opportunity to gain valuable research experience.
I am grateful to Joel Pinto, Nima Mesgarani, Sridhar Krishna Nemala, Sriram Ganapathy, Balakrishnan Varadarajan and Samuel Thomas for the scientific
collaboration, and also several colleagues and friends for their support during the
PhD.
Finally, I would like to thank my mother and sisters for their infinite love
and support.


Dedication
This thesis is dedicated to my mother.

Contents

Abstract
Acknowledgments
List of Tables
List of Figures

1 Introduction
   1.1 Multilayer Perceptron
   1.2 Scope of the Work
   1.3 Contributions
   1.4 Organization of the Thesis

2 Improved Hybrid Phoneme Recognition System
   2.1 Chapter Outline
   2.2 Background
   2.3 Discriminative Spectro-Temporal Features
       2.3.1 Estimation of Spectro-Temporal Filters
       2.3.2 Feature Extraction
   2.4 Sparse Multilayer Perceptron (SMLP)
       2.4.1 Theory of SMLP
             Cost Function
             Error Back-Propagation Training
             Gradient of \tilde{L} w.r.t. y_j^l
             Gradient of \tilde{L} w.r.t. w_{ij}^{l-1}
             Update Equations
       2.4.2 SMLP as a Posterior Probability Estimator
   2.5 System Description
       2.5.1 Database
       2.5.2 Feature Streams
             PLP Cepstral Coefficients
             FDLP Temporal Features
             MLDA Spectro-Temporal Features
       2.5.3 Hierarchical SMLP
             Estimation of 3-state Phoneme Posteriors
             Hierarchical Estimation of Posterior Probabilities
       2.5.4 Dempster-Shafer Combination
       2.5.5 Hybrid HMM Decoding
   2.6 Experimental Results
   2.7 Analysis

3 Weighted Least Squares based Auto-Associative Neural Networks for Speaker Verification
   3.1 Chapter Outline
   3.2 Auto-Associative Neural Networks
   3.3 Speaker Verification using AANNs
       3.3.1 Feature Extraction
       3.3.2 UBM-AANN
       3.3.3 Speaker-Specific AANN
       3.3.4 Score Computation
   3.4 Weighted Least Squares of AANNs
   3.5 WLS based AANN Speaker Verification System
       3.5.1 Closed-form Expression for Adapting UBM-AANN
       3.5.2 T-matrix Training
       3.5.3 i-vectors
       3.5.4 PLDA training
       3.5.5 Hypothesis Testing
   3.6 Experimental Results

4 Factor Analysis of Auto-Associative Neural Networks for Speaker Verification
   4.1 Chapter Outline
   4.2 Factor analysis of AANNs
   4.3 FA based AANN Speaker Verification System
       4.3.1 Statistics
       4.3.2 T-matrix Training
       4.3.3 i-vectors
   4.4 Experimental Results
       4.4.1 Comparison with GMM based i-vector/PLDA System

5 Conclusions
   5.1 Conclusions
   5.2 Future Work

Appendix A: Standard Error Back-Propagation

Bibliography

List of Tables

2.1 PER (in %) on the TIMIT test set for various acoustic feature streams using a hierarchy of multilayer perceptrons. The last column gives the results of feature stream combination at the hierarchical posterior level using the Dempster-Shafer theory of evidence.
2.2 Average measure of sparsity of the first hidden layer outputs of SMLP and four layer MLP for various feature streams.
2.3 PER (in %) on the TIMIT test set for various acoustic feature streams when the 3-state phoneme posteriors are obtained using a single layer perceptron trained on the first hidden layer outputs of the SMLP or MLP classifier.
2.4 PER (in %) on the TIMIT test set using a single multilayer perceptron (without hierarchy).
3.1 Description of various telephone conditions of NIST-08.
3.2 EER in % and minDCF x 10^3 (shown in brackets) on conditions C6, C7 and C8 of NIST-08.
4.1 EER in % and minDCF x 10^3 (shown in brackets) on conditions C6, C7 and C8 of NIST-08.
4.2 Comparison with the state-of-the-art GMM based i-vector/PLDA system. EER in % and minDCF x 10^3 (shown in brackets) on conditions C6, C7 and C8 of NIST-08.

List of Figures

1.1 Feed-forward multilayer perceptron.
1.2 Block diagram depicting the thesis organization.
2.1 Block diagram of a hybrid phoneme recognition system.
2.2 Steps involved in extracting the spectro-temporal representation.
2.3 Block diagram of the phoneme recognition system.
2.4 Block diagram of hierarchical SMLP. Though both the networks are fully connected, only a portion of the connections are shown for clarity.
3.1 Block schematic of the AANN based speaker verification system.
3.2 Auto-Associative Neural Network.
3.3 WLS based AANN speaker verification system.
4.1 Block schematic of the proposed FA based AANN speaker verification system.

Chapter 1

Introduction
Artificial Neural Networks (ANNs) are used in many speech processing applications such as speech activity detection (SAD) [1], keyword spotting (KS) [2, 3], automatic speech recognition (ASR) [4-8], speaker verification (SV) [9-11] and language identification (LID) [12]. Most of these applications use a feed-forward multilayer perceptron (MLP), which is described in section 1.1.

1.1 Multilayer Perceptron

An MLP is a fully connected feed-forward neural network with multiple hidden layers, as depicted in Fig. 1.1. Each node in any layer (except the output layer) is connected to every node in the subsequent layer through a set of weights. The output of a node is obtained by applying a specified transformation (linear or non-linear) to the weighted sum of the previous layer's outputs plus the node-specific bias. Thus an MLP transforms inputs to outputs using a set of weights and biases (its parameters).

Figure 1.1: Feed-forward multilayer perceptron.

Let {W_1, W_2, ..., W_{m-1}} and {b_1, b_2, ..., b_{m-1}} be the sets of weight matrices and bias vectors, respectively, representing the parameters of an MLP, where W_i denotes the weights connecting the i-th layer and the (i+1)-th layer, and b_i denotes the bias vector of the (i+1)-th layer. Let the (element-wise) non-linearity applied at the i-th layer be \phi_i. Typically, a sigmoid nonlinearity is used at the hidden layers, and a linear or softmax nonlinearity is applied at the output layer. For an input vector f, the MLP output vector o(f) can be expressed as

    o(f) = \phi_m( W_{m-1} \phi_{m-1}( ... \phi_3( W_2 \phi_2( W_1 f + b_1 ) + b_2 ) ... ) + b_{m-1} ).
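As a rough illustration of this forward pass, here is a minimal NumPy sketch; the layer sizes, random weights and the sigmoid/softmax choices below are placeholders rather than any configuration used later in the thesis.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / e.sum()

    def mlp_forward(f, weights, biases):
        """Compute o(f) for an MLP with sigmoid hidden layers and a softmax output."""
        y = f
        for W, b in zip(weights[:-1], biases[:-1]):
            y = sigmoid(W @ y + b)                      # hidden layers: phi = sigmoid
        return softmax(weights[-1] @ y + biases[-1])    # output layer: softmax

    # toy usage: 3 inputs -> 5 hidden -> 4 outputs
    rng = np.random.default_rng(0)
    Ws = [rng.standard_normal((5, 3)), rng.standard_normal((4, 5))]
    bs = [np.zeros(5), np.zeros(4)]
    print(mlp_forward(rng.standard_normal(3), Ws, bs))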


This output must be close to the desired vector d(f), which depends on whether the task is classification or regression. In either case, the MLP is trained to minimize the discrepancy between its output vector o(f) and the desired vector d(f) over the training data. Due to the highly non-linear dependency between input and output, the MLP is trained using stochastic gradient descent, and the gradient can be computed efficiently using the error back-propagation algorithm. However, proper regularization of the parameters of an MLP is desirable when the amount of training data is limited or the number of layers in the network is large. Such alternative regularized neural network architectures require learning algorithms different from the existing methods. The focus of this thesis is to develop such algorithms and apply them to applications such as speech and speaker recognition.

1.2 Scope of the Work

The learning algorithms for training regularized neural networks are generic, and applicable to other applications as well. For instance, an MLP can be trained to extract data-driven discriminative features. These features are used in ASR as described in the TANDEM framework [5]. The subspace methods developed in this thesis may form the basis for adapting MLPs using a limited amount of adaptation data. The underlying principles seem attractive for speaker-specific adaptation of neural network based acoustic models for ASR. Additionally, the neural network speaker verification systems based on subspace methods yield performance comparable to state-of-the-art systems. This may facilitate further research on the application of neural networks in speaker verification.

1.3 Contributions

The contributions of this thesis are:

1. Proposed MLDA features for recognizing phonemes.


2. Introduced SMLP classifier, which encourages its hidden layer outputs to be sparse.
Further, SMLP is used to build a phoneme recognition system.
3. Developed regularized WLS based AANN speaker verification system.
4. Developed FA for adapting the parameters of an AANN in a low-dimensional subspace,
and applied it in speaker verification.

1.4 Organization of the Thesis

The block diagram depicting the organization of this thesis is shown in Fig. 1.2. The core of this thesis is to develop learning algorithms for training various regularized neural networks (MLPs). A brief description summarizing the focus of the various chapters is provided below.

In chapter 2, an MLP with multiple hidden layers is used as a classifier to classify phonemes. Specifically, a sparse multilayer perceptron (SMLP) is developed which encourages the outputs of a particular hidden layer to be sparse. This is achieved by adding a sparse regularization term to the original cross-entropy cost function. This chapter also proposes to use multiple linear discriminant analysis (MLDA) based features.



Figure 1.2: Block diagram depicting the thesis organization.

A phoneme recognition system built using the SMLP and various feature representations is shown to perform significantly better than its MLP counterpart.

Chapters 3 and 4 use an MLP (specifically, an Auto-Associative Neural Network (AANN)) as a regression module. The problem of adapting AANNs using a limited amount of data is addressed in these chapters. In chapter 3, a closed-form expression is derived for the adaptation parameters of an AANN with regularization. These parameters are further regularized by projecting them onto a subspace in a weighted least squares (WLS) sense. These techniques are shown to improve the performance of a speaker verification system.

In chapter 4, all adaptation parameters of AANNs are regularized by restricting them to a common low-dimensional subspace using the factor analysis (FA) technique. A


speaker verification system based on FA yields better results than the WLS based AANN
speaker verification system developed in chapter 3. Our experiments also show that this
technique yields an order of magnitude better performance than the existing neural network
based speaker verification system.

Finally, the conclusions of this thesis, along with directions for future work, are provided in chapter 5.

Chapter 2

Improved Hybrid Phoneme Recognition System

2.1 Chapter Outline

In a conventional hybrid phoneme recognition system, shown in Fig. 2.1, a multilayer perceptron (MLP) with a single hidden layer is trained on standard acoustic features to estimate the posterior probabilities of phonemes [4]. These estimates are used for decoding the underlying phoneme sequence. First, we propose to derive a spectro-temporal feature representation by applying the multiple linear discriminant analysis (MLDA) technique. Second, we introduce a sparse multilayer perceptron (SMLP) which jointly learns an internal sparse feature representation and nonlinear classifier boundaries to estimate the phoneme posterior probabilities. Experimental results show that the proposed techniques improve the phoneme error rate (PER).


Figure 2.1: Block diagram of a hybrid phoneme recognition system.

2.2 Background

The task of phoneme recognition is to convert a speech waveform into a sequence of underlying sound units known as phonemes. The block diagram of a hybrid phoneme recognition system, consisting of feature extraction, an MLP and hybrid hidden Markov model (HMM) decoding, is shown in Fig. 2.1.

The purpose of feature extraction is to discard information irrelevant to the task. Features are usually extracted from a two-dimensional representation of speech such as the spectrogram. Depending on the manner in which they are derived, features can be broadly classified into spectral [13, 14], temporal [15-17] or spectro-temporal [18-20] features.

These features are used as input to train an MLP. It has been shown [21] that an MLP, trained to minimize the cross-entropy cost between its outputs and hard targets (a hard target vector consists of all zeros except a one at the index corresponding to the phoneme to which the current input feature vector belongs) using a sufficient amount of data, estimates the posterior probabilities of the output classes conditioned on the input feature vector in a discriminative manner. This has led to an extensive use of


MLPs in state-of-the-art automatic speech recognition systems [4-6, 22-26].

In order to decode the underlying phoneme sequence, each phoneme is modeled using a left-to-right HMM. The emission likelihood of each HMM state is computed from the MLP posteriors, and the Viterbi algorithm is applied to decode the phoneme sequence.

2.3 Discriminative Spectro-Temporal Features

It is well known [27] that the information about speech sounds, such as phonemes, is encoded in the spectro-temporal dynamics of speech. Recently, there has been an increased research effort in deriving features that explicitly capture these dynamics [18-20, 28]. Such an approach is primarily motivated by the spectro-temporal receptive field (STRF) model for predicting the response of a cortical neuron to input speech, where the STRF describes the two-dimensional spectro-temporal pattern to which the neuron is most responsive [29].

Most of the works so far have used parametric two-dimensional Gabor filters for extracting features. The parameters of the Gabor functions are optimized to improve the recognition accuracy [18], [28] or grouped into low and high frequency modulations to form various streams of information [19]. Even though multiple spectro-temporal feature streams were formed and combined using MLPs in [19], it is difficult to interpret what each feature stream is trying to achieve. We propose to extract features using a set of two-dimensional filters designed to discriminate each phoneme from the rest of the phonemes, as described below [20, 30].


Figure 2.2: Steps involved in extracting spectro-temporal representation.

2.3.1 Estimation of Spectro-Temporal Filters


Speech is represented in the spectro-temporal domain (log critical band energies)

for both learning the two-dimensional filter shapes and extracting the features. Fig. 2.2
shows the steps involved in extracting such a representation. This representation is obtained
by first performing a Short Time Fourier Transform (STFT) on the speech signal with an
analysis window of length 25 ms and a frame shift of 10 ms. The magnitude square values
of the STFT output in each window are then projected on to a set of overlapping positive
weight vectors in such a way that their centers are equally spaced on the Bark frequency
scale to obtain the spectral energies in various critical bands. Finally, the spectro-temporal
representation is obtained by applying a logarithm on critical band energies.

We use the TIMIT database [31] to obtain spectro-temporal patterns corresponding to each phoneme (the 61 hand-labeled phonemes are mapped to a standard set of 39 phonemes [32]). These patterns for a particular phoneme are derived from the spectro-temporal representation of the training utterances by taking a context of 2N+1 frames centered on every frame that is labeled as this phoneme. In our experiments, each spectro-temporal pattern has 19 critical bands (K) and 21 frames (2N+1).

In order to derive 2-D filter shapes, we ask the question: what are the directions


(patterns) along which the spectro-temporal patterns of a phoneme are well separated from those of the rest of the phonemes? Fisher Linear Discriminant Analysis (FLDA) gives only one optimal discriminating pattern for each phoneme (a two-class problem) due to the rank limitation of its between-class scatter matrix. The resultant low-dimensional feature space (projections of a spectro-temporal pattern onto these discriminating patterns) may hinder classification performance.

Modified Linear Discriminant Analysis (MLDA) is a generalization of FLDA that overcomes this limitation by modifying the between-class scatter matrix [33]. It defines the between-class scatter matrix as the weighted sum of average sample-to-sample scatter. This modification yields multiple solutions (or discriminating patterns) to the generalized eigenvector problem which arises in the conventional FLDA.

We apply MLDA to learn multiple discriminating patterns that separate the spectro-temporal patterns of a phoneme from those of the rest of the phonemes. MLDA features are obtained by projecting a spectro-temporal patch onto these discriminating patterns, as described in the section below. For instance, 13 discriminating patterns per phoneme would yield a feature vector of length 13 x 39 = 507. Discriminating patterns can be interpreted as spectro-temporal filters by flipping them in time around the center.

2.3.2 Feature Extraction

If S(n, k) denotes the spectro-temporal representation of the speech and h(n, k) characterizes a discriminating pattern, then the corresponding feature f(n) at a particular time n is extracted using (2.1):

    f(n) = \sum_{i=-N}^{N} \sum_{k=1}^{K} h(i, k) \, S(n + i, k),     (2.1)

where i and k denote the discrete time index (due to the 10 ms shift) and the critical band index, respectively. K represents the total number of critical bands, while the temporal extent (context) of the 2-D filter is given by 2N + 1.
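As a rough illustration of (2.1), a NumPy sketch of this projection is given below; the array names and shapes are hypothetical, and frames too close to the utterance boundaries are ignored for brevity.

    import numpy as np

    def mlda_feature(S, h, n):
        """Project the spectro-temporal patch centred at frame n onto one
        discriminating pattern h, following equation (2.1).

        S : (num_frames, K) log critical band energies
        h : (2N+1, K) discriminating pattern (2-D filter)
        n : centre frame index (assumed to have full context)
        """
        N = (h.shape[0] - 1) // 2
        patch = S[n - N:n + N + 1, :]        # 2N+1 frames x K critical bands
        return float(np.sum(h * patch))      # f(n) as in (2.1)

    # toy usage with K = 19 bands and a 21-frame context
    rng = np.random.default_rng(0)
    S = rng.standard_normal((300, 19))
    h = rng.standard_normal((21, 19))
    print(mlda_feature(S, h, n=100))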

2.4 Sparse Multilayer Perceptron (SMLP)

Sparse representations were first observed in the visual area of the mammalian cortex [34, 35], and recently many pattern classification applications have made use of sparse signal representations [36-42]. Most of these methods treat the sparse representation as features and train an additional classifier for making decisions. However, in only a few instances have sparse representations been optimized in conjunction with the classifier for discriminative classification. Some previous works have attempted to address this issue. For example, a two-class classification problem with a linear or bilinear classifier has been considered in [39]. In a different work, Fisher's linear discrimination criterion with sparsity is used [36].

We propose to jointly learn both sparse features and nonlinear classifier boundaries that best discriminate multiple output classes. Specifically, we propose to learn sparse features at the output of a hidden layer of an MLP trained to discriminate multiple output classes. This is achieved by adding a sparse regularization term to the conventional cross-entropy cost between the target values and their predicted values at the output layer. The parameters of the MLP are learned to minimize the joint cost using the standard back-propagation algorithm, which takes the additional sparse regularization term into consideration. The resultant model is referred to as the sparse multilayer perceptron (SMLP). Further, under certain conditions, described in section 2.4.2, the SMLP estimates the Bayesian a posteriori probabilities of the output classes conditioned on the sparse representation.

2.4.1 Theory of SMLP

The notation used is as follows.

m - number of layers (including the input and output layers)
N_l - number of neurons (or nodes) in the l-th layer
\phi_l - output nonlinearity at the l-th layer
x_j^l - input to the j-th neuron in the l-th layer
\phi_l(x_j^l) = y_j^l - output of the j-th neuron in the l-th layer
w_{ij}^{l-1} - weight connecting the i-th neuron in the (l-1)-th layer and the j-th neuron in the l-th layer
d_j - target of the j-th neuron in the output layer
e_j := d_j - y_j^m - error of the j-th neuron in the output layer

The goal of an SMLP classifier is to jointly learn sparse features at the output of its p-th layer and estimate the posterior probabilities of multiple classes at its output layer. In the case of an MLP, estimates of the posterior probabilities are typically obtained by minimizing the cross-entropy cost between the output layer values (after the softmax) and the hard targets. We modify this cost function for the SMLP as follows.

Cost Function

The two objectives of the SMLP are to (i) minimize the cross-entropy cost between the output layer values and the hard targets, and (ii) force the outputs of the p-th layer to be sparse, for a particular p \in {2, 3, ..., m-1}. The instantaneous (i.e., corresponding to a single input pattern) cross-entropy cost is

    L = - \sum_{j=1}^{N_m} d_j \log(y_j^m).     (2.2)

To obtain the SMLP instantaneous cost function, we add a sparse regularization term to the cross-entropy cost (2.2), yielding

    \tilde{L} = L + \frac{\lambda}{2} \sum_{j=1}^{N_p} \log(1 + (y_j^p)^2),     (2.3)

where \lambda is a positive scalar controlling the trade-off between the sparsity and the cross-entropy cost. The function \sum_{j=1}^{N_p} \log(1 + (y_j^p)^2), which is continuous and differentiable everywhere, has been used successfully in previous works to obtain a sparse representation [34, 37, 42]. The weights of the SMLP are adjusted to minimize (2.3), as discussed below.

Error Back-Propagation Training

Stochastic gradient descent is applied to update the SMLP weights. The conventional error back-propagation training algorithm is the result of applying the chain rule of calculus to compute the gradient of the cross-entropy cost (2.2) with respect to the weights. For training the SMLP, the error back-propagation must be modified in order to accommodate the additional sparse regularization term.

We derive update equations for training the SMLP by minimizing the cost function (2.3) with respect to the weights over the training data. (The bias values at any layer can be interpreted as weights connecting an imaginary node in the previous layer, whose output is unity, to all the nodes in the current layer.) Since the learning is based on stochastic gradient descent, the key is to determine the gradient of the cost function (2.3) with respect to the weights.

Gradient of \tilde{L} w.r.t. y_j^l

From (2.2) and (2.3), for l \in {p+1, p+2, ..., m} and j \in {1, 2, ..., N_l},

    \frac{\partial \tilde{L}}{\partial y_j^l} = \frac{\partial L}{\partial y_j^l}.     (2.4)

Using (2.3), for layer p and j \in {1, 2, ..., N_p},

    \frac{\partial \tilde{L}}{\partial y_j^p} = \frac{\partial L}{\partial y_j^p} + \lambda \frac{y_j^p}{1 + (y_j^p)^2}.     (2.5)

Using (2.3) and the chain rule of calculus, for (l-1) \in {2, 3, ..., p-1} and i \in {1, 2, ..., N_{l-1}},

    \frac{\partial \tilde{L}}{\partial y_i^{l-1}} = \sum_{j=1}^{N_l} \frac{\partial \tilde{L}}{\partial y_j^l} \frac{\partial y_j^l}{\partial x_j^l} \frac{\partial x_j^l}{\partial y_i^{l-1}} = \sum_{j=1}^{N_l} \frac{\partial \tilde{L}}{\partial y_j^l} \, \phi_l'(x_j^l) \, w_{ij}^{l-1}.     (2.6)

Equations (2.4), (2.5) and (2.6) indicate that the gradients of \tilde{L} w.r.t. y_j^l can be computed from the gradients of L w.r.t. y_j^l. Specifically, we need the gradients of L w.r.t. y_j^l for l \in {p, p+1, ..., m} and j \in {1, 2, ..., N_l} in order to compute the gradients of \tilde{L} w.r.t. y_j^l for l \in {2, 3, ..., m} and j \in {1, 2, ..., N_l}. The computation of these gradients is described in Appendix A.

Gradient of \tilde{L} w.r.t. w_{ij}^{l-1}

By definition, w_{ij}^{l-1} denotes the weight connecting the i-th neuron in the (l-1)-th layer and the j-th neuron in the l-th layer. Thus, by using the chain rule,

    \frac{\partial \tilde{L}}{\partial w_{ij}^{l-1}} = \frac{\partial \tilde{L}}{\partial y_j^l} \frac{\partial y_j^l}{\partial x_j^l} \frac{\partial x_j^l}{\partial w_{ij}^{l-1}} = \frac{\partial \tilde{L}}{\partial y_j^l} \, \phi_l'(x_j^l) \, y_i^{l-1}.     (2.7)

Update Equations

SMLP weights are updated using stochastic gradient descent. The gradient (2.7) of the cost function with respect to a particular weight is accumulated over several input patterns (known as the bunch size), and then the weight is updated using

    w_{ij}^{l-1} \leftarrow w_{ij}^{l-1} - \eta \, h\!\left[ \frac{\partial \tilde{L}}{\partial w_{ij}^{l-1}} \right],     (2.8)

where \eta is a small positive learning rate, and h[\partial \tilde{L} / \partial w_{ij}^{l-1}] is the accumulated value of the gradient.
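The sketch below illustrates, again in NumPy and with hypothetical variable names, how the sparse term modifies the back-propagated gradient at layer p as in (2.5), and how a weight matrix is then updated as in (2.8).

    import numpy as np

    def add_sparse_term(dL_dy_p, y_p, lam):
        """Eq. (2.5): gradient of the regularized cost w.r.t. the p-th layer outputs."""
        return dL_dy_p + lam * y_p / (1.0 + y_p ** 2)

    def sgd_update(W, grad_accum, eta):
        """Eq. (2.8): update a weight matrix using the accumulated gradient."""
        return W - eta * grad_accum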

2.4.2 SMLP as a Posterior Probability Estimator


The number of input and output nodes of the SMLP is set equal to the dimensionality of its input acoustic feature vector and the number of output phoneme classes, respectively. A softmax nonlinearity is used at its output layer, and the weights are adjusted to minimize (2.3) when the hard targets are used. Note from equations (2.4), (2.5), (2.6) and (2.7) that the sparse regularization term affects the update of only those weights w_{ij}^l with l \in {1, 2, ..., p-1}.


Figure 2.3: Block diagram of the phoneme recognition system.


This implies that the weights w_{ij}^l with l \in {p, p+1, ..., m-1} can be adjusted to minimize the cross-entropy term of (2.3) without affecting the sparse regularization term. If p < m-1 and one of the hidden layers between the p-th and m-th layers is sigmoidal (nonlinear), then the p-th layer outputs can be nonlinearly transformed to the SMLP outputs. Therefore, in such a case, the SMLP estimates the posterior probabilities of the output classes conditioned on the p-th layer outputs (the sparse representation). This follows from the fact that an MLP with a single nonlinear hidden layer estimates the posterior probabilities of the output classes conditioned on its input features [21], and the SMLP outputs are completely determined by the outputs of the p-th (hidden) layer.

2.5 System Description

The block diagram of the phoneme recognition system used in our experiments is shown in Fig. 2.3. The various components of the system are described below.

2.5.1 Database

Phoneme recognition experiments are conducted on the TIMIT database [31]. It consists
of 630 speakers with 10 utterances per speaker sampled at 16 kHz. The two SA dialect sentences per speaker are excluded from the setup as they are identical across all the speakers.
The original TIMIT train and test sets consist of 462 and 168 speakers respectively [31].
We further divide the original train set into training and validation sets having 425 and
37 speakers, and keep the original test set unchanged. Thus in all our experiments, the
training, validation and test sets consist of 3400, 296 and 1344 utterances from 425, 37 and
168 speakers, respectively.

2.5.2 Feature Streams

To test the proposed SMLP classifier, we developed systems using three different feature streams, namely PLP cepstral coefficients [13], FDLP temporal features [17] and MLDA spectro-temporal features [20]. These features are extracted for every 10 ms of speech, and they are normalized for speaker-specific mean and variance. A detailed description of each feature stream is provided below.

PLP Cepstral Coefficients


Short Time Fourier Transform (STFT) is applied on the speech signal with an analysis
window of length 25 ms and a frame shift of 10 ms. The squared magnitude values of the
STFT output are then projected on a set of frequency weights which are equally spaced on
the Bark frequency scale to obtain the spectral energies in various critical bands. Transformations such as equal loudness and cubic root are applied for reducing the dynamic range.
The resultant spectral envelopes are smoothed by twelfth order linear prediction analysis [13]. The top 13 cepstral coefficients are concatenated with the corresponding delta and delta-delta features to obtain a 39 dimensional feature vector. A nine frame context of these vectors is used as the input PLP feature stream.

FDLP Temporal Features


Speech is transformed to the frequency domain by applying discrete cosine transform (DCT)
on the full utterance. The full band DCT signal is divided into multiple critical band DCT
signals by multiplying with a set of Gaussian windows centered on critical bands. Linear
prediction analysis is performed on each critical band DCT signal to obtain the smooth
sub-band temporal envelopes. These temporal envelopes are passed through nonlinearities
such as logarithmic and adaptive compression loops. The resultant compressed sub-band
envelopes are divided into 200 ms segments with a shift of 10 ms. DCT is applied on each
segment to derive the features. The first 14 DCT coefficients are concatenated from each
sub-band to form the FDLP temporal feature stream [17].

MLDA Spectro-Temporal Features


The description of MLDA features can be found in section 2.3. A set of spectro-temporal
discriminative patterns are designed using the modified linear discriminant analysis (MLDA)
to discriminate each phoneme from the rest of the phonemes. The number of discriminating
patterns per phoneme is chosen to be 13 to maximize the phoneme recognition accuracy
on validation data. Projections of a given spectro-temporal patch on these discriminative
patterns are concatenated to form the MLDA feature stream [20, 30].

2.5.3 Hierarchical SMLP
Hierarchies of multilayer perceptron (MLP) classifiers have been shown to be useful for acoustic modeling in speech recognition [43-45], model adaptation [46] and language identification [47]. A hierarchical MLP consists of two MLPs in series which are sequentially trained. The first MLP uses standard acoustic feature vectors to estimate the posterior probabilities of various output classes such as phonemes. The second MLP is then trained on the same targets using long temporal spans of posterior probabilities estimated by the first MLP as inputs.

Block diagram of the hierarchical SMLP is shown in Fig. 2.4. Initially, a four layer
SMLP is trained to estimate the 3-state phoneme posterior probabilities. Subsequently, another three layer MLP is trained on a long temporal span of these posteriors to estimate the
single state phoneme posterior probabilities. Both these networks are initialized randomly
using uniform noise and trained using back-propagation. We have modified the Quicknet
package [48] (software for MLP training) to perform SMLP training.

Estimation of 3-state Phoneme Posteriors


The 61 hand-labeled TIMIT phone symbols are mapped to 49 phoneme classes
by treating each of the following set of phonemes as a single class: {/tcl/, /pcl/, /kcl/},
{/gcl/, /dcl/, /bcl/}, {/h#/, /pau/}, {/eng/, /ng/}, {/axr/, /er/}, {/axh/, /ah/}, {/ux/, /uw/},
{/nx/, /n/}, {/hv/, /hh/}, and {/em/, /m/}.


Figure 2.4: Block diagram of hierarchical SMLP. Though both the networks are fully connected, only a portion of the connections are shown for clarity.
As shown in Fig. 2.4, the SMLP used for estimating the 3-state phone posterior probabilities consists of four layers (m = 4): an input layer to receive a given feature stream, two hidden layers with a sigmoid nonlinearity, and an output layer with a softmax nonlinearity. The number of nodes in the input and output layers is set equal to the dimensionality of the input feature vector and the number of phoneme states (i.e., 49 x 3 = 147), respectively. The outputs of the first hidden layer (p = 2) are forced to be sparse, with the number of nodes in it being the same as that of the input layer. The number of nodes in the second hidden layer is chosen to be 1000. For each feature stream, the value of \lambda in the SMLP cost function (2.3) is chosen to minimize the phoneme error rate (PER) on the validation data.

In the first pass of SMLP classifier training, 3-state hard phoneme targets are obtained by segmenting each phoneme in the training data equally into three states, i.e., start, middle and end. This classifier is retrained in a second pass using the hard targets corresponding to the best state alignment obtained by applying the Viterbi algorithm on the

3-state posterior probability estimates of the first pass. Frame classification accuracy on
the validation set is used to control the learning rate and to terminate training.

In order to gauge the effect of the sparse regularization term, an identically configured four layer MLP with \lambda = 0 is also trained to estimate the 3-state phoneme posteriors. For an additional comparison, we also estimate the 3-state phoneme posteriors using a conventional three layer MLP which has a sigmoid nonlinearity at the hidden layer and a softmax nonlinearity at the output layer. The number of hidden layer nodes in this system is chosen such that the total number of parameters approximately matches that of the SMLP.

Hierarchical Estimation of Posterior Probabilities


As shown in Fig. 2.4, the 3-state phoneme posterior probability estimates are mapped to single state phoneme posterior probability estimates by training an MLP which operates on a context of 230 ms, i.e., 23 posterior probability vectors. Its hidden layer consists of 3500 nodes with a sigmoid nonlinearity, and its output layer consists of 49 nodes with a softmax nonlinearity.

2.5.4 Dempster-Shafer Combination

Hierarchically estimated posterior probabilities corresponding to various feature streams are


combined as described in [49] using the Dempster-Shafer (DS) theory of evidence [50]. First
we combine posteriors of PLP and FDLP feature streams, and then the resultant posteriors
are combined with posteriors of MLDA feature stream.

2.5.5 Hybrid HMM Decoding

The 49 phoneme classes are mapped to 39 phoneme classes for decoding [32]; the appropriate subsets of the 49 phoneme posterior probability estimates are summed to obtain the 39 phoneme probability estimates. The posterior probabilities of the phoneme classes are converted to scaled likelihoods by dividing them by the corresponding prior probabilities of the phonemes obtained from the training data. A 3-state HMM (connected from left to right) with equal self-transition and state transition probabilities is used to model each phoneme. The emission likelihood of each state is set to the scaled likelihood. A bigram phonotactic language model is used in all the experiments. Finally, the Viterbi algorithm is applied for decoding the phoneme sequence. The PER is obtained by comparing the decoded phoneme sequence against the reference sequence. While evaluating the performance on the test set, the language model scaling factor is chosen to minimize the PER on the validation data.
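For illustration, a one-line NumPy sketch of the posterior-to-scaled-likelihood conversion described above (the variable names are ours):

    import numpy as np

    def scaled_likelihoods(posteriors, priors):
        """Divide frame-level phoneme posteriors by the training-data priors.

        posteriors : (num_frames, num_phonemes) MLP/SMLP outputs
        priors     : (num_phonemes,) phoneme priors estimated on the training data
        """
        return posteriors / priors    # broadcasts the priors across frames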

2.6 Experimental Results
Table 2.1 shows the PER of the proposed SMLP based hierarchical hybrid system and the baseline MLP based hierarchical hybrid systems for various feature streams on the TIMIT test set. As described earlier in section 2.5.3, the proposed and baseline systems differ only in the way the 3-state phoneme posteriors are estimated. The results indicate that the four layer MLP based system performs better than the conventional three layer MLP based system for each feature stream. Moreover, the SMLP based system outperforms the baseline four layer MLP based system for each feature stream. This improved performance can be attributed to the sparse regularization term.


Table 2.1: PER (in %) on the TIMIT test set for various acoustic feature streams using a hierarchy of multilayer perceptrons. The last column gives the results of feature stream combination at the hierarchical posterior level using the Dempster-Shafer theory of evidence.

                       PLP    FDLP   MLDA   PLP+FDLP+MLDA
    MLP (3 layers)     22.9   23.2   22.8   20.5
    MLP (4 layers)     22.6   22.8   22.4   20.1
    SMLP (4 layers)    21.9   22.1   21.9   19.6

To exploit the complementary nature of these acoustic feature streams, they are combined at the hierarchical posterior probability level as described in section 2.5.4. The system combination results are shown in the last column of Table 2.1. It can be observed that the combination of SMLP based systems yields a PER of 19.6%, a relative improvement of 2.5% over the combination of the four layer MLP (i.e., \lambda = 0) based systems. These results are statistically significant with a p-value of less than 0.0004. Further, the information transferred (computed from the confusion matrix of the test set) for the SMLP and MLP (4 layers) based systems is 3.46 and 3.42 bits, respectively. On the TIMIT core test set consisting of 192 utterances (a subset of the test set provided by LDC [31]), we obtain a PER of 20.7% using the combination of SMLP based systems. This performance compares well with existing state-of-the-art systems. Note that some TIMIT phoneme recognition systems use part of the original test set as a validation set [40, 51, 52]; however, as mentioned earlier, we kept the original test set unchanged, as in [53].

2.7 Analysis
First, we quantify the sparsity of the p-th hidden layer outputs using the following measure \eta [54]:

    \eta = \frac{\sqrt{N_2} - \left( \sum_{i=1}^{N_2} |y_i^p| \right) \Big/ \sqrt{\sum_{i=1}^{N_2} (y_i^p)^2}}{\sqrt{N_2} - 1},     (2.9)

where y_i^p represents the output of node i in layer p of an SMLP and N_2 is the number of nodes in that layer. Note that 0 <= \eta <= 1, and the value of \eta is one for maximally sparse representations and close to zero for minimally sparse representations. Table 2.2 lists the average \eta value of the first hidden layer outputs over the validation data for various phoneme recognition systems. As expected, the SMLP based systems have significantly higher \eta values than the four layer MLP systems, which indicates the effectiveness of the sparse regularization term.
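A direct NumPy transcription of (2.9), handy for reproducing the kind of values reported in Table 2.2 on one's own activations (the function name is ours):

    import numpy as np

    def sparsity_measure(y):
        """Sparsity of a vector of layer outputs y, following eq. (2.9) [54]."""
        n = y.size
        ratio = np.sum(np.abs(y)) / np.sqrt(np.sum(y ** 2))
        return (np.sqrt(n) - ratio) / (np.sqrt(n) - 1.0)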
Table 2.2: Average measure of sparsity of the first hidden layer outputs of the SMLP and the four layer MLP for various feature streams.

                       PLP     FDLP    MLDA
    MLP (4 layers)     0.275   0.278   0.282
    SMLP (4 layers)    0.496   0.552   0.540

Second, we experimentally verify whether sparse features tend to be more linearly separable than their non-sparse counterparts. After training the hierarchical system for each feature stream, the first hidden layer outputs of the SMLP or four layer MLP classifier are used as input features for training a single layer perceptron (linear classifier) to estimate the 3-state phoneme posterior probabilities. The second MLP in the hierarchy remains unchanged.

Table 2.3 shows the PER of the resulting hierarchical system for various acoustic feature
streams. The linear classifier is able to model the SMLP features better than it models the
MLP features.
Table 2.3: PER (in %) on the TIMIT test set for various acoustic feature streams when the 3-state phoneme posteriors are obtained using a single layer perceptron trained on the first hidden layer outputs of the SMLP or MLP classifier.

                                                  PLP    FDLP   MLDA
    Hidden layer features from MLP (4 layers)     26.9   27.0   26.5
    Hidden layer features from SMLP (4 layers)    25.0   25.5   24.6

Finally, results using only a single multilayer perceptron (without hierarchy) are analyzed. A single multilayer perceptron (SMLP or MLP) is trained directly to estimate the single state phoneme posterior probabilities, which are decoded as described in section 2.5.5. The value of \lambda in the SMLP cost function is optimized on the validation data. Table 2.4 summarizes the PER for the various feature streams. It can be observed from this table that the SMLP system consistently outperforms the corresponding baseline MLP systems.

Table 2.4: PER (in %) on the TIMIT test set using a single multilayer perceptron (without hierarchy).

                       PLP    FDLP   MLDA
    MLP (3 layers)     27.2   26.3   27.1
    MLP (4 layers)     27.3   27.5   27.0
    SMLP (4 layers)    26.6   25.8   26.0

Chapter 3

Weighted Least Squares based Auto-Associative Neural Networks for Speaker Verification

3.1 Chapter Outline
An Auto-Associative Neural Network (AANN) [55] is a fully connected feed-forward neural network trained to reconstruct its input at its output through a hidden bottleneck layer. Existing AANN based speaker verification systems [9, 10] use the difference between the reconstruction errors computed under the universal background model (UBM) AANN and the speaker-specific AANN models as the score for making decisions, as shown in Fig. 3.1. The UBM-AANN is obtained by training an AANN on multiple held-out speakers, whereas the speaker-specific AANN is obtained by adapting (or retraining) the UBM-AANN using the corresponding speaker data.


Figure 3.1: Block schematic of the AANN based speaker verification system.

In this chapter, we propose to project the speaker-specific AANN parameters onto a low-dimensional subspace, and to build a probabilistic linear discriminant analysis (PLDA) model on the resultant vectors in the subspace to perform hypothesis testing. The low-dimensional subspace is learned using large amounts of development data to preserve most of the variability of the speaker-specific AANN parameters in a weighted least squares (WLS) sense. Experimental results show that the proposed WLS based AANN speaker verification system outperforms the existing AANN speaker verification system on the NIST-08 speaker recognition evaluation.

The remainder of this chapter is organized as follows. AANNs are introduced in the next section. Section 3.3 describes the previously proposed speaker verification system.


Figure 3.2: Auto-Associative Neural Network.

The proposed WLS formulation of AANNs is presented in section 3.4, and the speaker verification system based on WLS of AANNs is described in section 3.5. Finally, experimental results are presented in section 3.6.

3.2 Auto-Associative Neural Networks

An AANN is a fully connected feed-forward neural network with a hidden compression layer, shown in Fig. 3.2, trained for the auto-encoding task [55]. Five layer AANNs are used in all our experiments. This architecture consists of three non-linear hidden layers between the linear input and output layers. The second hidden layer contains fewer nodes than the input layer, and is known as the compression layer. The AANN is used as an alternative to a GMM

for modeling the distribution of data [9]; among its advantages, it relaxes the assumption that the feature vectors are locally normal, and it can capture higher order moments.
For an input vector f, the network produces an output \hat{f}(f, \Theta) which depends both on the input f and on the parameters \Theta of the network (the set of weights and biases). For simplicity, we denote the network output as \hat{f}. While training the network, its parameters are adjusted to minimize, typically, the average squared error cost between the input f and the output \hat{f} over the training data, as in (3.1). The network is trained using stochastic gradient descent, where the gradient is computed using the error back-propagation algorithm.

    \min_{\Theta} \; E\!\left[ \| f - \hat{f} \|^2 \right].     (3.1)

Once this network is well trained, the average reconstruction error of input vectors drawn from the distribution of the training data will be small compared to that of vectors drawn from a different distribution [9].

3.3 Speaker Verification using AANNs

The goal of speaker verification is to verify whether a given utterance belongs to a claimed speaker or not, based on a sample utterance from the claimed speaker. In other words, the task is to verify whether the two utterances of a speaker verification trial belong to the same speaker or not. The block diagram of the previously proposed AANN based speaker verification system [9, 10] is shown in Fig. 3.1. The various components


of this system are described below.

3.3.1 Feature Extraction

The acoustic features used in our experiments are 39 dimensional frequency domain linear prediction (FDLP) features [56-59]. In this technique, sub-band temporal envelopes of speech are first estimated in narrow sub-bands (96 linear bands). These sub-band envelopes are then gain normalized to remove reverberation and channel artifacts. After normalization, the frequency axis is warped to 37 Mel bands in the frequency range of 125-3800 Hz to derive a gain normalized mel scale energy representation of speech. These mel band energies are converted to cepstral coefficients by applying a logarithm and a Discrete Cosine Transform (DCT). The top 13 cepstral coefficients along with derivative and acceleration components are used as features, yielding 39 dimensional feature vectors. Finally, the subset of these feature vectors corresponding to speech is selected based on the voice activity detection information provided by NIST.

3.3.2 UBM-AANN

The concept of a UBM was introduced in [60], where a GMM trained on data from multiple speakers is used as the UBM. In our work, UBMs are obtained by training AANNs on development data consisting of multiple speakers (described below) [9].

Gender-specific AANN based UBMs are trained on a telephone development data set consisting of audio from the NIST 2004 speaker recognition database, the Switchboard-2 Phase III corpora and the NIST 2005 speaker recognition database. We use only 400 male and 400 female utterances, each set corresponding to about 17 hours of speech.

The AANN based UBMs are trained using the FDLP features (see section 3.3.1) to minimize the reconstruction error loss function described in section 3.2. Each UBM has a linear input layer and a linear output layer, along with three nonlinear (tanh nonlinearity) hidden layers. Both the input and output layers have 39 nodes, corresponding to the dimensionality of the input FDLP features. The first, second and third hidden layers have 20, 6 and 39 nodes, respectively. A schematic of an AANN with this architecture is shown in Fig. 3.2. We have modified the Quicknet package for training the AANNs [48].
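For concreteness, a small PyTorch sketch of an AANN with this 39-20-6-39-39 architecture is given below. This is only our illustration; the thesis systems are trained with a modified Quicknet, and the module and optimizer choices here are assumptions.

    import torch
    import torch.nn as nn

    class AANN(nn.Module):
        """Five layer auto-associative network: linear input/output, tanh hidden layers."""
        def __init__(self, dim=39):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(dim, 20), nn.Tanh(),   # first hidden layer
                nn.Linear(20, 6), nn.Tanh(),     # compression layer
                nn.Linear(6, 39), nn.Tanh(),     # third hidden layer
                nn.Linear(39, dim),              # linear output layer
            )

        def forward(self, f):
            return self.net(f)

    # one training step: minimize the average squared reconstruction error, as in (3.1)
    model = AANN()
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    f = torch.randn(32, 39)                      # a batch of 39-dimensional feature vectors
    loss = ((model(f) - f) ** 2).sum(dim=1).mean()
    loss.backward()
    opt.step()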

3.3.3 Speaker-Specific AANN

A speaker-specific AANN model can be obtained by retraining the entire UBM-AANN using the corresponding speaker data [9]. However, we have observed that the performance can be improved by retraining only the weight matrix connecting the third hidden layer and the output layer. This may be due to the limited amount of speaker-specific data. Thus, an improved baseline is used in our experiments, where a speaker-specific AANN is obtained by adapting only the UBM-AANN weights that impinge on the output layer.

3.3.4 Score Computation

During the test phase, the average reconstruction error of the test data is computed under both the UBM-AANN and the claimed speaker's AANN model. The final score of a trial is computed as the difference between these average reconstruction errors. Ideally, if the claim is true, the average reconstruction error of the test data is larger under the UBM-AANN model than under the claimed speaker's AANN model, and vice versa.
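A minimal NumPy sketch of this scoring rule (function and variable names are ours; ubm_forward and spk_forward stand for the two trained networks):

    import numpy as np

    def verification_score(test_feats, ubm_forward, spk_forward):
        """Average reconstruction error under the UBM-AANN minus that under the
        claimed speaker's AANN; larger scores favour the claimed identity."""
        err_ubm = np.mean(np.sum((ubm_forward(test_feats) - test_feats) ** 2, axis=1))
        err_spk = np.mean(np.sum((spk_forward(test_feats) - test_feats) ** 2, axis=1))
        return err_ubm - err_spk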

3.4 Weighted Least Squares of AANNs

Conventional speaker verification systems use the likelihood ratio between a GMM based UBM and its maximum a posteriori (MAP) adapted speaker-specific model for making decisions [60]. More recently proposed GMM factor analysis techniques use one or more low-dimensional subspaces in part of the GMM parameter space to model the speaker and channel variabilities [61-65], and extract coordinates, known as an i-vector [63], for a given utterance of a speaker in the low-dimensional subspace. The i-vectors are treated as features while training a probabilistic linear discriminant analysis (PLDA) model for hypothesis testing [66, 67].

In this section, we propose to learn a low-dimensional subspace which preserves most of the variability of the adapted weight matrices of speaker-specific AANNs in a weighted least squares (WLS) sense. The resultant low-dimensional representation of an utterance (also known as an i-vector) is obtained by projecting the corresponding adapted weight matrix onto the subspace.

Subspace Modeling of AANNs


The subspace modeling of adapted weight matrices is formulated as follows. The
development data consists of m speakers with l(s) sessions for sth speaker. The weight
matrix connecting third hidden layer and output layer of UBM-AANN is adapted for each
session of a speaker to obtain speaker and session specific AANN model. We denote the
adapted vectorized weights (a closed-form expression for adaptation is derived in section
3.5.1) corresponding to lth session and sth speaker as wl,s , and number of frames with nl,s .
33

CHAPTER 3. WEIGHTED LEAST SQUARES BASED AUTO-ASSOCIATIVE


NEURAL NETWORKS FOR SPEAKER VERIFICATION
Let us define mean w and covariance1 of wl,s as

1
l(s)
l(s)
m X
m X
X
X

w =
nl,s
nl,s wl,s ,
s=1 l=1

s=1 l=1

s=1 l=1

s=1 l=1

1
l(s)
l(s)
m X
m X
X
X

=
nl,s
nl,s (wl,s w) (wl,s w)T .
We model w_{l,s} using a low-dimensional affine subspace parameterized by a matrix
T, i.e., w_{l,s} \approx \bar{w} + T q_{l,s}, where q_{l,s} represents the unknown i-vector associated with the
l-th session of the s-th speaker. To find T, the following weighted least squares cost function is
minimized with respect to its arguments:

L(T, q_{1,1}, \ldots, q_{l(m),m}) = \sum_{s=1}^{m} \sum_{l=1}^{l(s)} \left\| w_{l,s} - (\bar{w} + T q_{l,s}) \right\|^2_{\Sigma^{-1} n_{l,s}} + \underbrace{\lambda \sum_{s=1}^{m} \sum_{l=1}^{l(s)} \left[ \mathrm{tr}\!\left( T^T \Sigma^{-1} n_{l,s} T \right) + q_{l,s}^T q_{l,s} \right]}_{\text{regularization term}},   (3.2)

where \|\cdot\|_A denotes a norm given by \|x\|^2_A = x^T A x, and \lambda is a small positive constant.
Differentiating (3.2) with respect to q_{l,s} and setting it equal to zero yields

\frac{\partial L}{\partial q_{l,s}} = 0 \;\Rightarrow\; -T^T \Sigma^{-1} n_{l,s} \left[ w_{l,s} - (\bar{w} + T q_{l,s}) \right] + \lambda q_{l,s} = 0

\;\Rightarrow\; q_{l,s} = \left( \lambda I + T^T \Sigma^{-1} n_{l,s} T \right)^{-1} T^T \Sigma^{-1} n_{l,s} (w_{l,s} - \bar{w}).   (3.3)


Differentiating (3.2) with respect to T and setting it equal to zero yields

\frac{\partial L}{\partial T} = 0 \;\Rightarrow\; \sum_{s=1}^{m} \sum_{l=1}^{l(s)} \left[ \Sigma^{-1} n_{l,s} T q_{l,s} q_{l,s}^T - \Sigma^{-1} n_{l,s} (w_{l,s} - \bar{w}) q_{l,s}^T + \lambda \Sigma^{-1} n_{l,s} T \right] = 0

\;\Rightarrow\; \sum_{s=1}^{m} \sum_{l=1}^{l(s)} \Sigma^{-1} n_{l,s}\, T \left( \lambda I + q_{l,s} q_{l,s}^T \right) = \sum_{s=1}^{m} \sum_{l=1}^{l(s)} \Sigma^{-1} n_{l,s} (w_{l,s} - \bar{w})\, q_{l,s}^T.   (3.4)

The solution for T is obtained by applying coordinate descent, i.e., (3.3)
and (3.4) are solved iteratively. In other words, for a given T, we first find
the i-vectors {q_{1,1}, \ldots, q_{l(m),m}} using (3.3). In the next step, we solve for T in (3.4) using
the i-vectors found in the previous step. This procedure is repeated
until convergence.
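As an illustration only, the alternating updates (3.3) and (3.4) can be sketched as follows; the sketch assumes a diagonal \Sigma (see the footnote above), a random initialization of T, and hypothetical variable names, and is not the thesis implementation.

```python
import numpy as np

def wls_train_T(W, N, sigma_diag, R=240, lam=0.005, n_iters=10, seed=0):
    """Coordinate-descent sketch of (3.3)-(3.4).

    W          : (K, D) adapted, vectorized weight matrices w_{l,s}
    N          : (K,)   frame counts n_{l,s}
    sigma_diag : (D,)   diagonal of the weighted covariance Sigma
    Returns the subspace T (D, R) and the i-vectors Q (K, R).
    """
    rng = np.random.default_rng(seed)
    w_bar = np.average(W, axis=0, weights=N)           # weighted mean of the weights
    Wc = W - w_bar                                     # centered weights
    inv_sigma = 1.0 / sigma_diag
    T = rng.standard_normal((W.shape[1], R)) * 0.01    # random initialization
    for _ in range(n_iters):
        # step 1, (3.3): q = (lam I + n T' S^-1 T)^-1 T' S^-1 n (w - w_bar)
        TtS = T.T * inv_sigma                          # T^T Sigma^{-1}, shape (R, D)
        TtST = TtS @ T                                 # (R, R)
        Q = np.stack([
            np.linalg.solve(lam * np.eye(R) + n * TtST, n * (TtS @ wc))
            for wc, n in zip(Wc, N)
        ])
        # step 2, (3.4): Sigma^{-1} multiplies both sides and cancels, leaving
        # T [sum_k n_k (lam I + q_k q_k')] = sum_k n_k (w_k - w_bar) q_k'
        A = sum(n * (lam * np.eye(R) + np.outer(q, q)) for q, n in zip(Q, N))
        B = (Wc * N[:, None]).T @ Q
        T = B @ np.linalg.inv(A)
    return T, Q
```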

The above update equations can be compared with the total variability space
training of GMMs [63]. Note that (3.3) and (3.4) resemble the maximum likelihood (ML)
update equations in [63], except for the \lambda I term in (3.4).

3.5 WLS based AANN Speaker Verification System


The block diagram of the WLS based AANN speaker verification system is shown in
Fig. 3.3. This system uses the FDLP features described in section 3.3.1. The description of
the UBM-AANN can be found in section 3.3.2. The remaining components of the system are
described below.


Figure 3.3: WLS based AANN speaker verification system. (Block diagram: FDLP features feed UBM-AANN training; the UBM-AANN is adapted per utterance; the adapted weights are used for T matrix training and i-vector extraction; the i-vectors are used for PLDA training and hypothesis testing, which produces the scores.)

3.5.1 Closed-form Expression for Adapting UBM-AANN


The weight matrix connecting the third hidden layer and the output layer of the UBM-AANN
is adapted for each utterance to obtain a speaker-specific model. It is possible to derive a
closed-form solution for updating the speaker-specific weight matrix W_{l,s} connecting the third
hidden and output layers. The output bias vector b of the UBM-AANN is not adapted.

Let f_{i,l,s} be the i-th feature vector (frame) of an utterance corresponding to the l-th
session of the s-th speaker, and n(l,s) be the number of such frames in that utterance. The
third hidden layer output vector of the UBM-AANN corresponding to this input is denoted
by h_{i,l,s}. The loss function (3.5) is minimized to obtain the speaker-specific weight matrix
W_{l,s} corresponding to the l-th session of the s-th speaker, where \lambda is non-negative and controls the
amount of regularization.
L(W_{l,s}) = \sum_{i=1}^{n(l,s)} \left[ \| f_{i,l,s} - b - W_{l,s} h_{i,l,s} \|_2^2 + \lambda\, \mathrm{tr}\!\left( W_{l,s} W_{l,s}^T \right) \right].   (3.5)

Differentiating the expression above with respect to W_{l,s} and setting it to zero yields

\frac{\partial L(W_{l,s})}{\partial W_{l,s}} = 0 \;\Rightarrow\; \sum_{i=1}^{n(l,s)} \left[ \left( h_{i,l,s} h_{i,l,s}^T + \lambda I \right) W_{l,s}^T - h_{i,l,s} (f_{i,l,s} - b)^T \right] = 0,

W_{l,s} = \left[ \sum_{i=1}^{n(l,s)} (f_{i,l,s} - b)\, h_{i,l,s}^T \right] \left[ \sum_{i=1}^{n(l,s)} \left( h_{i,l,s} h_{i,l,s}^T + \lambda I \right) \right]^{-1}.   (3.6)
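A compact way to evaluate (3.6) for a single utterance is sketched below; the array shapes and the function name are assumptions made for illustration.

```python
import numpy as np

def adapt_output_weights(F, H, b, lam=0.005):
    """Closed-form adaptation (3.6) of the output weight matrix for one utterance.

    F : (n, d)  feature frames f_{i,l,s}
    H : (n, d') third-hidden-layer outputs h_{i,l,s} of the UBM-AANN
    b : (d,)    output bias of the UBM-AANN (kept fixed)
    """
    n, d_prime = H.shape
    A = H.T @ H + n * lam * np.eye(d_prime)   # sum_i (h h^T + lam I)
    B = (F - b).T @ H                         # sum_i (f - b) h^T
    return B @ np.linalg.inv(A)               # W_{l,s}, shape (d, d')
```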



3.5.2 T-matrix Training

Gender-dependent low-dimensional subspaces (T matrices) are trained to capture most
of the variability of the adapted weights in a WLS sense, as described in section 3.4. The
development data for training the subspaces consists of Switchboard-2, Phases II and III;
Switchboard Cellular, Parts 1 and 2; and the NIST 2004-2005 SRE [62]. The total number of
male and female utterances is 12266 and 14936, respectively.

3.5.3 i-vectors

Each utterance is converted to an i-vector using (3.3) with the appropriate gender-specific
T matrix. All i-vectors are normalized to have unit length to reduce the mismatch between
training and testing [68].

3.5.4 PLDA training

PLDA is a generative model for observations [66, 67], in our case i-vectors. The i-vectors
are assumed to be generated as

q_{l,s} = \mu + \Phi \beta_s + \epsilon_{l,s},   (3.7)

where \mu is an offset; \Phi is a matrix with fewer columns than rows, representing a low-dimensional
subspace in the i-vector space; \beta_s is a latent identity variable having a normal distribution with
zero mean and identity covariance matrix; and \epsilon_{l,s} is a residual noise term assumed to be
Gaussian with zero mean and full covariance matrix \Sigma. Additionally, these variables are
assumed to be independent.
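As an illustration of (3.7), one can draw synthetic i-vectors for a single speaker; the sketch below simply mirrors the symbols above and is not part of the actual system.

```python
import numpy as np

def sample_plda_ivectors(mu, Phi, Sigma, n_sessions, seed=0):
    """Draw n_sessions i-vectors for one speaker from the PLDA model (3.7)."""
    rng = np.random.default_rng(seed)
    beta = rng.standard_normal(Phi.shape[1])                       # latent identity variable
    eps = rng.multivariate_normal(np.zeros(len(mu)), Sigma, size=n_sessions)
    return mu + Phi @ beta + eps                                   # shape (n_sessions, dim)
```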

Gender-specific PLDA models are trained using the same development data that

is used for training the T matrices (see Section 3.5.2). The maximum likelihood estimates of
the model parameters {\mu, \Phi, \Sigma} are obtained using an Expectation Maximization (EM)
algorithm [66].

3.5.5 Hypothesis Testing

Given two i-vectors q1 , q2 of a speaker verification trial, we need to test whether they
belong to the same speaker (Hs ) or different speakers (Hd ). For the Gaussian PLDA above,
the log-likelihood ratio can be computed in a closed-form as
\mathrm{score} = \log \frac{p(q_1, q_2 \,|\, H_s)}{p(q_1 \,|\, H_d)\, p(q_2 \,|\, H_d)}   (3.8)

= \log \frac{ \mathcal{N}\!\left( \begin{bmatrix} q_1 \\ q_2 \end{bmatrix};\; \begin{bmatrix} \mu \\ \mu \end{bmatrix},\; \begin{bmatrix} \Phi\Phi^T + \Sigma & \Phi\Phi^T \\ \Phi\Phi^T & \Phi\Phi^T + \Sigma \end{bmatrix} \right) }{ \mathcal{N}\!\left( \begin{bmatrix} q_1 \\ q_2 \end{bmatrix};\; \begin{bmatrix} \mu \\ \mu \end{bmatrix},\; \begin{bmatrix} \Phi\Phi^T + \Sigma & 0 \\ 0 & \Phi\Phi^T + \Sigma \end{bmatrix} \right) },

where \mathcal{N}(\cdot;\, \mu, \Sigma) is a multivariate Gaussian density with mean \mu and covariance \Sigma. The
above score can be computed efficiently as described in [64, 68].
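A direct, unoptimized evaluation of (3.8) can be written as follows; it is a sketch using the symbols above and does not implement the efficient formulation of [64, 68].

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def plda_llr_score(q1, q2, mu, Phi, Sigma):
    """Log-likelihood ratio (3.8) for a trial of two i-vectors (unoptimized sketch)."""
    B = Phi @ Phi.T                               # speaker (between-class) covariance
    W = B + Sigma                                 # covariance of a single i-vector
    q = np.concatenate([q1, q2])
    m = np.concatenate([mu, mu])
    Z = np.zeros_like(B)
    cov_same = np.block([[W, B], [B, W]])         # H_s: shared latent identity variable
    cov_diff = np.block([[W, Z], [Z, W]])         # H_d: independent speakers
    return mvn.logpdf(q, mean=m, cov=cov_same) - mvn.logpdf(q, mean=m, cov=cov_diff)
```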

3.6 Experimental Results
Speaker verification systems are tested on the telephone conditions of the NIST-2008
speaker recognition evaluation (SRE). Table 3.1 describes the various telephone conditions.
Table 3.2 lists the EER and minimum detection cost function (minDCF) on NIST-2008 for
the baseline AANN (see section 3.3) and WLS based AANN (see section 3.5) speaker
verification systems. These neural network based systems use the same UBM-AANN of size
(39, 20, 6, 39, 39), where each number indicates the number of nodes in the corresponding layer.


Table 3.1: Description of various telephone conditions of NIST-08.

C6: Telephone speech in training and test
C7: English language telephone speech in training and test
C8: English language telephone speech spoken by a native U.S. English speaker in training and test

Table 3.2: EER in % and minDCF x 10^3 (shown in brackets) on conditions C6, C7 and C8 of NIST-08.

System                                             C6            C7           C8
Baseline AANN                                      19.5 (87.3)   15.8 (73.0)  15.2 (74.8)
PCA of AANNs, \lambda = 0, 240 dim. i-vector       13.4 (67.6)    8.7 (41.4)   7.9 (39.4)
WLS of AANNs, \lambda = 0, 240 dim. i-vector       12.2 (66.2)    7.2 (38.3)   6.4 (35.6)
WLS of AANNs, \lambda = 0.005, 240 dim. i-vector   10.7 (59.6)    5.5 (28.3)   4.4 (24.1)

The error rates of the baseline AANN system are indicated in the first row of Table
3.2. In this case, the difference between the average reconstruction errors of the UBM and a given

speaker-specific model is used as the score for making decisions [11]. The second row of the table
lists the error rates when PCA is applied instead of WLS to reduce the dimensionality of
the speaker-specific weights. The resultant PCA based i-vectors are modeled using PLDA.
The next two rows of Table 3.2 show the error rates of the WLS based AANN speaker
verification system that uses gender-dependent 150 dimensional (number of columns of \Phi)
subspace PLDA models in a 240 dimensional i-vector space. The speaker-specific weights
W_{l,s} are derived using (3.6) for different values of \lambda.

These results suggest that the proposed AANN based i-vector/PLDA framework
outperforms the baseline AANN speaker verification system. Additionally, the proposed
WLS formulation for obtaining i-vectors yields better results than simple PCA. Moreover,
regularized speaker-specific weights (\lambda = 0.005) yield much better results. A relative
improvement of 59.2% in EER and 52.3% in minDCF is obtained using the WLS based
AANN system over the baseline AANN system.


Chapter 4

Factor Analysis of Auto-Associative Neural Networks for Speaker Verification

4.1 Chapter Outline
The main disadvantage of WLS of AANNs is that it requires speaker-specific adaptation
of the weight matrix connecting the third hidden and output layers, even though a much
lower-dimensional representation (the i-vector) is used for the subsequent modeling (see
section 3.4). In this chapter, we introduce and develop the factor analysis (FA) theory of
AANNs to alleviate this problem. This is achieved by regularizing each speaker-specific
weight matrix by restricting it to a common low-dimensional subspace during adaptation.
The subspace is learned using large amounts of development data, and is held fixed during

adaptation. Thus, only the coordinates in the subspace, also known as the i-vector, need to be
estimated using the speaker-specific data. Unlike the WLS of AANNs approach, we adapt
the weight matrix directly in a low-dimensional common subspace. The update equations
are derived for learning both the common low-dimensional subspace and the i-vectors corresponding
to speakers in the subspace. The resultant i-vector representation is used as
features for the subsequent PLDA model. The proposed system shows promising results on the
NIST-08 SRE, and yields a 12% relative improvement in EER over the WLS based AANN
speaker verification system described in section 3.5.

The remainder of the chapter is organized as follows. In section 4.2, the factor
analysis (FA) of AANNs is developed. The FA based AANN speaker verification system is
described in section 4.3. Experimental results are provided in section 4.4.

4.2 Factor analysis of AANNs


The idea of FA is to constrain the weight matrix connecting the third hidden layer
and output layer of each speaker-specific AANN model to lie in a common low-dimensional
subspace such that it minimizes the overall loss function over the entire development data.
In this process, the rest of the parameters of each AANN are held fixed at the values learned
during speaker-independent training, i.e., the UBM (see section 3.3.2). The proposed loss
function is different from the one used in section 3.5. The notation used in this section is
summarized below.

m - number of speakers
l(s) - number of sessions of the s-th speaker
n(l,s) - number of frames in the l-th session of the s-th speaker
f_{i,l,s} - i-th acoustic feature vector of the l-th session, s-th speaker
d - dimensionality of f_{i,l,s}
h_{i,l,s} - fourth layer output of the UBM-AANN for input f_{i,l,s}
d' - dimensionality of h_{i,l,s}
W_{l,s} - weight matrix connecting the third hidden layer and output layer of the AANN specific to the l-th session of the s-th speaker
b - output bias vector of the UBM-AANN
e_{i,l,s} - error vector of the UBM-AANN for input f_{i,l,s}

In the AANN loss function below, we first vectorize W_{l,s} so that a subspace
structure can be imposed on it. The loss function with speaker- and session-specific weights
is given by

L = \sum_{s=1}^{m} \sum_{l=1}^{l(s)} \sum_{i=1}^{n(l,s)} \| f_{i,l,s} - b - W_{l,s} h_{i,l,s} \|_2^2 = \sum_{s=1}^{m} \sum_{l=1}^{l(s)} \sum_{i=1}^{n(l,s)} \| f_{i,l,s} - b - H_{i,l,s} w_{l,s} \|_2^2,   (4.1)


where

w_{l,s} = \mathrm{RowOrdered}(W_{l,s}),

H_{i,l,s} = I_d \otimes (h_{i,l,s})^T = \begin{bmatrix} (h_{i,l,s})^T & & 0 \\ & \ddots & \\ 0 & & (h_{i,l,s})^T \end{bmatrix}_{d \times dd'}.

The dimensionality of W_{l,s} is d \times d' and that of w_{l,s} is dd' \times 1. The vector w_{l,s} is obtained
by arranging the rows of W_{l,s} as columns one after the other. The matrix H_{i,l,s} is equal to the
Kronecker product of I_d (a d \times d identity matrix) and (h_{i,l,s})^T.
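The identity W_{l,s} h_{i,l,s} = H_{i,l,s} w_{l,s} underlying this rewriting can be verified numerically with a small sketch (the dimensions below are arbitrary):

```python
import numpy as np

# Numerical check of the identity W h = H w used in (4.1):
# w stacks the rows of W, and H = I_d (Kronecker) h^T.
d, d_prime = 4, 3
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d_prime))
h = rng.standard_normal(d_prime)
w = W.reshape(-1)                       # row-ordered vectorization of W
H = np.kron(np.eye(d), h[None, :])      # shape d x (d * d_prime)
assert np.allclose(W @ h, H @ w)
```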

We can rewrite (4.1) as

L(w_{1,1}, \ldots, w_{l(m),m}) = \sum_{s=1}^{m} \sum_{l=1}^{l(s)} \sum_{i=1}^{n(l,s)} \left[ f_{i,l,s} - b - H_{i,l,s} w_{l,s} \right]^T \left[ f_{i,l,s} - b - H_{i,l,s} w_{l,s} \right].   (4.2)

The factor analysis model (or subspace constraint) for the vectorized weights w_{l,s}
is

w_{l,s} \approx w_{\mathrm{ubm}} + T q_{l,s},

where w_{\mathrm{ubm}} represents the speaker-independent (UBM) vector of weights connecting the third
hidden layer and output layer, T is a matrix having fewer columns than rows representing
the common low-dimensional subspace, and q_{l,s} is a vector of coordinates in the subspace,
or the i-vector, associated with the l-th session of the s-th speaker. By substituting this factor

analysis model in (4.2) (note that the loss function depends only on (T, {q_{l,s}}) since w_{\mathrm{ubm}} is known),

L(T, q_{1,1}, \ldots, q_{l(m),m}) = \sum_{s=1}^{m} \sum_{l=1}^{l(s)} \sum_{i=1}^{n(l,s)} \left[ f_{i,l,s} - b - H_{i,l,s}(w_{\mathrm{ubm}} + T q_{l,s}) \right]^T \left[ f_{i,l,s} - b - H_{i,l,s}(w_{\mathrm{ubm}} + T q_{l,s}) \right].   (4.3)

The reconstruction error vector of the UBM is defined as

e_{i,l,s} \triangleq f_{i,l,s} - b - H_{i,l,s} w_{\mathrm{ubm}}.   (4.4)

By substituting the expression above in (4.3),

L(T, q_{1,1}, \ldots, q_{l(m),m}) = \sum_{s=1}^{m} \sum_{l=1}^{l(s)} \sum_{i=1}^{n(l,s)} \left[ e_{i,l,s} - H_{i,l,s} T q_{l,s} \right]^T \left[ e_{i,l,s} - H_{i,l,s} T q_{l,s} \right].   (4.5)

Let us define the statistics of the l-th session of the s-th speaker as

F_1(l,s) \triangleq \sum_{i=1}^{n(l,s)} H_{i,l,s}^T e_{i,l,s},   (4.6)

F_2(l,s) \triangleq \sum_{i=1}^{n(l,s)} H_{i,l,s}^T H_{i,l,s}.   (4.7)

We can rewrite (4.5) using (4.6) and (4.7) as

L(T, q_{1,1}, \ldots, q_{l(m),m}) = \sum_{s=1}^{m} \sum_{l=1}^{l(s)} \left[ \sum_{i=1}^{n(l,s)} e_{i,l,s}^T e_{i,l,s} - 2\, q_{l,s}^T T^T F_1(l,s) + q_{l,s}^T T^T F_2(l,s)\, T q_{l,s} \right].   (4.8)

The low-dimensional subspace T can be learned by minimizing the loss function in
(4.8) using coordinate descent. In the first step, (4.8) is minimized with respect to {q_{l,s}}
while keeping T fixed. In the second step, (4.8) is minimized with respect to T while keeping
{q_{l,s}} fixed at the values found in step one. The loss function is convex in each
step, and therefore the optimum is found by setting the gradient of the loss function with
respect to the corresponding variable to zero. These steps are repeated until convergence.

Figure 4.1: Block schematic of the proposed FA based AANN speaker verification system. (Block diagram: FDLP features feed UBM-AANN training and statistics extraction; the statistics are used for T matrix training and i-vector extraction; the i-vectors are used for PLDA training and hypothesis testing, which produces the scores.)

Differentiating (4.8) with respect to q_{l,s} and setting it to zero yields

\frac{\partial L}{\partial q_{l,s}} = 0 \;\Rightarrow\; -2\, T^T F_1(l,s) + 2\, T^T F_2(l,s)\, T q_{l,s} = 0 \;\Rightarrow\; q_{l,s} = \left( T^T F_2(l,s)\, T \right)^{-1} T^T F_1(l,s).   (4.9)

Differentiating (4.8) with respect to T and setting it equal to zero yields

\frac{\partial L}{\partial T} = 0 \;\Rightarrow\; \sum_{s=1}^{m} \sum_{l=1}^{l(s)} \left[ -2\, F_1(l,s)\, q_{l,s}^T + 2\, F_2(l,s)\, T q_{l,s} q_{l,s}^T \right] = 0,   (4.10)

where we solve for T by solving a set of linear equations involving the entries of the matrix T.
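For illustration, the alternating updates (4.9) and (4.10) can be sketched as follows. The direct linear solve for T is written out only to expose the structure of (4.10); for the dimensionalities used later in this thesis a more memory-efficient solver would be required, and all names and the initialization are assumptions.

```python
import numpy as np

def fa_train_T(F1_list, F2_list, R=240, n_iters=10, seed=0):
    """Coordinate-descent sketch of (4.9)-(4.10).

    F1_list : per-utterance statistics F1(l,s), each of shape (D,), with D = d * d'
    F2_list : per-utterance statistics F2(l,s), each of shape (D, D)
    Returns the subspace T (D, R) and the i-vectors Q (K, R).
    Note: the explicit Kronecker solve below is only practical for small D and R.
    """
    rng = np.random.default_rng(seed)
    D = F1_list[0].shape[0]
    T = rng.standard_normal((D, R)) * 0.01
    for _ in range(n_iters):
        # (4.9): q = (T' F2 T)^-1 T' F1, per utterance
        Q = [np.linalg.solve(T.T @ F2 @ T, T.T @ F1)
             for F1, F2 in zip(F1_list, F2_list)]
        # (4.10): sum_k F2_k T q_k q_k' = sum_k F1_k q_k'
        # using column-major vec: vec(F2 T qq') = (qq' kron F2) vec(T)
        A = sum(np.kron(np.outer(q, q), F2) for q, F2 in zip(Q, F2_list))
        b = sum(np.outer(F1, q) for F1, q in zip(F1_list, Q)).reshape(-1, order='F')
        T = np.linalg.solve(A, b).reshape(D, R, order='F')
    return T, np.stack(Q)
```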

4.3 FA based AANN Speaker Verification System

The block diagram of the proposed FA based AANN speaker verification system
is shown in Fig. 4.1. This system resembles the WLS based AANN speaker verification
system described in section 3.5. These systems differ in the way the UBM-AANN is adapted. In
the FA approach, the matrix connecting the third hidden and output layers of the UBM-AANN

is adapted in a common low-dimensional subspace to obtain an i-vector, whereas in the WLS
based approach the entire matrix is adapted first using (3.6) and then projected onto
a low-dimensional subspace to obtain an i-vector. The remainder of this section describes
the stages that differ from the WLS based AANN speaker verification system shown
in Fig. 3.3.

4.3.1 Statistics

The statistics in (4.6) and (4.7) are precomputed for each utterance, i.e., for each particular
speaker and session. The appropriate gender-specific UBM is used for computing the statistics.
Note that only a few entries of F_2(l,s) need to be computed, as its entries are redundant.
These statistics are sufficient for training the T matrix and extracting the i-vectors.
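Because H_{i,l,s} = I_d \otimes (h_{i,l,s})^T, the statistics reduce to small accumulators, which is why only a few entries of F_2(l,s) need to be computed. A sketch (the array layout is an assumption):

```python
import numpy as np

def fa_statistics(E, H):
    """Per-utterance statistics (4.6)-(4.7), exploiting H_i = I_d kron h_i^T.

    E : (n, d)  UBM reconstruction errors e_{i,l,s}
    H : (n, d') third-hidden-layer outputs h_{i,l,s}
    """
    d = E.shape[1]
    F1 = (E.T @ H).reshape(-1)       # row-ordered vec of sum_i e_i h_i^T
    S = H.T @ H                      # sum_i h_i h_i^T, the only block that must be accumulated
    F2 = np.kron(np.eye(d), S)       # F2 = I_d kron S; all other entries are zero/redundant
    return F1, F2
```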

4.3.2 T-matrix Training

Gender-dependent low-dimensional subspaces (T matrices) are trained as described in Section 4.2.
The development data for training the subspaces consists of Switchboard-2, Phases
II and III; Switchboard Cellular, Parts 1 and 2; and the NIST 2004-2005 SRE [62]. The total
number of male and female utterances is 12266 and 14936, respectively. We initialize the
T matrix with Gaussian noise and learn the subspace as described in Section 4.2 using
coordinate descent.

Table 4.1: EER in % and minDCF x 10^3 (shown in brackets) on conditions C6, C7 and C8 of NIST-08.

System                                                                         C6            C7           C8
Baseline AANN                                                                  19.5 (87.3)   15.8 (73.0)  15.2 (74.8)
WLS of AANNs, 240 dim. i-vector (closed-form \lambda = 0.005, from Table 3.2)  10.7 (59.6)    5.5 (28.3)   4.4 (24.1)
FA of AANNs, 240 dim. i-vector                                                  9.6 (55.1)    4.7 (25.3)   3.8 (20.1)

4.3.3 i-vectors

Each utterance is converted to an i-vector using (4.9) with the appropriate gender-specific
T matrix. All i-vectors are normalized to have unit length to reduce the mismatch between
training and testing [68].

4.4 Experimental Results
Speaker verification systems are tested on the telephone conditions of the NIST-2008
speaker recognition evaluation (SRE). The details of the baseline AANN and WLS based AANN
speaker verification systems can be found in section 3.6. All neural network based systems
use the same UBM-AANN of size (39, 20, 6, 39, 39), where each number indicates the number
of nodes in a corresponding layer¹. Table 3.1 describes the various telephone conditions.
¹ The number of nodes in each hidden layer is optimized one at a time on NIST-08 to obtain the best
performance of the FA based AANN system, whereas the number of nodes in the input and output layers
is fixed by the dimensionality of the feature vectors.

Table 4.2: Comparison with the state-of-the-art GMM based i-vector/PLDA system. EER in % and minDCF x 10^3 (shown in brackets) on conditions C6, C7 and C8 of NIST-08.

System                           C6           C7          C8
FA of AANNs, 240 dim. i-vector   9.6 (55.1)   4.7 (25.3)  3.8 (20.1)
GMM, 400 dim. i-vector           7.0 (41.3)   2.8 (14.8)  2.1 (10.8)

Table 4.1 lists the EER and minimum detection cost function (minDCF) on
NIST-2008 for all the systems. The last row of the table shows the error rates of the FA
based AANN system, which uses gender-dependent 150 dimensional (number of columns
of \Phi) subspace PLDA models in a 240 dimensional i-vector space. The results indicate that the
FA based AANN speaker verification system outperforms the baseline AANN and WLS based
AANN speaker verification systems in all conditions. It can also be observed that the
proposed FA approach yields a 12.1% relative improvement in EER and a 10.2% relative
improvement in minDCF over the best WLS based AANN system.

4.4.1 Comparison with GMM based i-vector/PLDA System


The block diagram in Fig. 4.1 is also applicable to the GMM based i-vector/PLDA
system, except that the UBM is a GMM. Each GMM based UBM consists of 1024 mixture
components with diagonal covariances. The male and female UBMs are trained using
FDLP features extracted from 4324 and 5461 utterances of the development data, respectively.
Gender-specific 400 dimensional total variability spaces (T matrices) are trained as described


in [63]. The i-vectors of this space are length normalized and subsequently used for training
a gender-dependent PLDA system with a 250 dimensional subspace. Note that the development
data used for training the T matrices and the PLDA models is the same as that of the
FA based AANN speaker verification system (see section 4.3).

A gender-specific GMM based i-vector/PLDA system is thus trained for comparison. The results are shown in Table 4.2. Although significant improvements are achieved
using the proposed FA of AANNs, the GMM based i-vector/PLDA system continues to perform
better. However, further work on neural network based systems might close the existing
performance gap, and bring forward possible advantages of this alternative nonlinear neural
network based modeling in speaker verification.


Chapter 5

Conclusions

5.1 Conclusions
Discriminative MLDA features were proposed for phoneme recognition in section

2.3. The SMLP classifier was proposed in section 2.4.1. It has been shown that one of
its hidden layer outputs can be forced to be sparse by adding a sparse regularization term
to the cross-entropy cost function. Update equations were derived for training the SMLP.
Finally, a multi-stream phoneme recognition system based on SMLP has been shown to
outperform its MLP counterpart.

The closed-form expression for adapting the AANN weight matrix connecting the
third hidden and output layers with regularization was derived in chapter 3. This matrix was further regularized by projecting it onto a low-dimensional subspace, which was learned to preserve most
of the variability of the adapted weight matrices in a WLS sense. Each speaker was modeled
using a projection in the subspace (an i-vector). The resultant speaker verification system
based on i-vectors achieved an order of magnitude better performance than the existing
AANN based speaker verification system.

The theory of FA of AANNs was introduced to directly adapt the AANN weight
matrix connecting the third hidden and output layers in a low-dimensional subspace that captures
the variability of the weight matrices (see section 4.2). This particular way of regularizing the
adaptation parameters of AANNs has been shown to yield better performance than the WLS
approach described in section 3.4.

5.2 Future Work

Future work includes applying the SMLP to extract data-driven features for ASR
in the TANDEM framework, and replacing the MLP with the SMLP in other pattern classification
applications.

An interesting extension of FA of AANNs would be to add additional terms to the
cost function (4.5) such that within-class i-vectors become close to each other and between-class
i-vectors tend to be far apart. A further direction is to develop an FA formulation for
MLP based acoustic modeling to perform speaker adaptation in ASR.


Appendix A

Standard Error Back-Propagation


Gradients of L w.r.t. y_j^m can be computed from equation (2.2) as follows, \forall j \in \{1, 2, \ldots, N_m\},

\frac{\partial L}{\partial y_j^m} = -\frac{d_j}{y_j^m}.   (A.1)

For the softmax non-linearity, \forall i \in \{1, 2, \ldots, N_m\},

y_j^m = \frac{e^{x_j^m}}{\sum_{k=1}^{N_m} e^{x_k^m}}, \qquad \frac{\partial y_j^m}{\partial x_i^m} = y_j^m \left( I_{ij} - y_i^m \right),   (A.2)

where

I_{ij} = 1 if i = j, and 0 otherwise.

Thus, using (A.1) and (A.2), the gradients of L w.r.t. x_i^m, \forall i \in \{1, 2, \ldots, N_m\}, are given
by

\frac{\partial L}{\partial x_i^m} = \sum_{j=1}^{N_m} \frac{\partial L}{\partial y_j^m} \frac{\partial y_j^m}{\partial x_i^m} = y_i^m - d_i.   (A.3)

The standard error back-propagation algorithm expresses the gradients of L w.r.t. y_j^{m-1}
in terms of the previously computed gradients in (A.3):

\frac{\partial L}{\partial y_j^{m-1}} = \sum_{i=1}^{N_m} \frac{\partial L}{\partial x_i^m} \frac{\partial x_i^m}{\partial y_j^{m-1}} = \sum_{i=1}^{N_m} \left( y_i^m - d_i \right) w_{ji}^{m-1}.   (A.4)

In general, for l \leq m-1, given the gradients of L w.r.t. y_j^l, j \in \{1, 2, \ldots, N_l\}, the gradient
of L w.r.t. y_i^{l-1} for any i \in \{1, 2, \ldots, N_{l-1}\} is given by (similar to (2.6))

\frac{\partial L}{\partial y_i^{l-1}} = \sum_{j=1}^{N_l} \frac{\partial L}{\partial y_j^l} \frac{\partial y_j^l}{\partial x_j^l} \frac{\partial x_j^l}{\partial y_i^{l-1}} = \sum_{j=1}^{N_l} \frac{\partial L}{\partial y_j^l}\, \sigma_l'\!\left( x_j^l \right) w_{ij}^{l-1} = \sum_{j=1}^{N_l} \frac{\partial L}{\partial y_j^l}\, y_j^l \left( 1 - y_j^l \right) w_{ij}^{l-1},

where

y_j^l = \sigma_l\!\left( x_j^l \right) = \frac{1}{1 + \exp(-x_j^l)}.
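The recursion (A.3)-(A.4) can be summarized in a few lines; the container layout and the function name below are assumptions made for illustration.

```python
import numpy as np

def backprop_output_grads(y_layers, W_list, d):
    """Gradient recursion of (A.3)-(A.4): softmax output layer, sigmoid hidden layers.

    y_layers : list [y^0, ..., y^m] of layer output vectors; y^m is the softmax output
    W_list   : W_list[l] is the (N_l x N_{l+1}) weight matrix connecting layer l to l+1
    d        : target vector (e.g. a one-hot phoneme label), shape (N_m,)
    Returns {l: dL/dy^l} for l = m-1, ..., 1.
    """
    m = len(y_layers) - 1
    dL_dx = y_layers[m] - d                 # (A.3): combined softmax/cross-entropy gradient
    grads = {}
    for l in range(m - 1, 0, -1):
        dL_dy = W_list[l] @ dL_dx           # (A.4): back-propagate through the weights
        grads[l] = dL_dy
        y = y_layers[l]
        dL_dx = dL_dy * y * (1.0 - y)       # sigmoid derivative y^l (1 - y^l)
    return grads
```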

Bibliography
[1] J. Dines, J. Vepa, and T. Hain, The segmentation of multi-channel meeting recordings
for automatic speech recognition, in INTERSPEECH, 2006.
[2] M. Lehtonen, P. Fousek, and H. Hermansky, Hierarchical approach for spotting keywords, IDIAP Research Report, no. 0541, 2005.
[3] J. Pinto, I. Szoke, S. Prasanna, and H. Hermansky, Fast approximate spoken term
detection from sequence of phonemes, in Proceedings of the ACM SIGIR Workshop
on Searching Spontaneous Conversational Speech, 2008, pp. 0845.
[4] H. Bourlard and N. Morgan, Connectionist speech recognition: a hybrid approach.
Kluwer Academic Pub, 1994.
[5] H. Hermansky, D. Ellis, and S. Sharma, Tandem connectionist feature extraction for
conventional hmm systems, in Proc. of International Conference on Acoustics, Speech,
and Signal Processing (ICASSP), 2000.
[6] B. Chen, Q. Zhu, and N. Morgan, Learning long-term temporal features in lvcsr using
neural networks, in INTERSPEECH, 2004.


[7] F. Grezl, M. Karafiat, S. Kontar, and J. Cernocky, Probabilistic and bottle-neck features for lvcsr of meetings, in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2007.
[8] N. Morgan, Deep and wide: Multiple layers in automatic speech recognition, IEEE
Transactions on Audio, Speech, and Language Processing, no. 99, 2011.
[9] B. Yegnanarayana and S. Kishore, Aann: an alternative to gmm for pattern recognition, Neural Networks, vol. 15, no. 3, pp. 459-469, 2002.
[10] K. Murty and B. Yegnanarayana, Combining evidence from residual phase and mfcc features for speaker recognition, IEEE Signal Processing Letters, vol. 13, no. 1, pp. 52-55, 2006.
[11] G. Sivaram, S. Thomas, and H. Hermansky, Mixture of auto-associative neural networks for speaker verification, in INTERSPEECH, 2011.
[12] P. Matejka, P. Schwarz, J. Cernocky, and P. Chytil, Phonotactic language identification using high quality phoneme recognition, in Ninth European Conference on Speech Communication and Technology, 2005.
[13] H. Hermansky, Perceptual linear predictive (PLP) analysis of speech, The Journal of the Acoustical Society of America, vol. 87, pp. 1738-1752, 1990.
[14] S. Davis and P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 28, no. 4, pp. 357-366, 1980.


[15] B. Kingsbury, N. Morgan, and S. Greenberg, Robust speech recognition using the modulation spectrogram, Speech Communication, vol. 25, no. 1, pp. 117-132, 1998.
[16] H. Hermansky and P. Fousek, Multi-resolution RASTA filtering for TANDEM-based ASR, in INTERSPEECH, 2005.
[17] S. Ganapathy, S. Thomas, and H. Hermansky, Modulation frequency features for phoneme recognition in noisy speech, The Journal of the Acoustical Society of America - Express Letters, vol. 125, no. 1, pp. 8-12, 2009.
[18] M. Kleinschmidt and D. Gelbart, Improving word accuracy with Gabor feature extraction, in Proc. of ICSLP, USA, 2002.

[19] S. Zhao and N. Morgan, Multi-stream spectro-temporal features for robust speech
recognition, in INTERSPEECH. Brisbane, Australia, 2008.
[20] N. Mesgarani, G. Sivaram, S. K. Nemala, M. Elhilali, and H. Hermansky, Discriminant
Spectrotemporal Features for Phoneme Recognition, in INTERSPEECH. Brighton,
2009.
[21] M. Richard and R. Lippmann, Neural network classifiers estimate Bayesian a posteriori probabilities, Neural computation, vol. 3, no. 4, pp. 461-483, 1991.
[22] N. Morgan, Q. Zhu, A. Stolcke, K. Sonmez, S. Sivadas, T. Shinozaki, M. Ostendorf, P. Jain, H. Hermansky, D. Ellis et al., Pushing the envelope-aside [speech recognition], IEEE Signal Processing Magazine, vol. 22, no. 5, pp. 81-88, 2005.
[23] V. Balakrishnan, G. Sivaram, and S. Khudanpur, Dirichlet mixture models of neural net posteriors for hmm-based speech recognition, in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2011.
[24] F. Grezl, M. Karafiat, S. Kontar, and J. Cernocky, Probabilistic and bottle-neck features for lvcsr of meetings, in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2007.
[25] Q. Zhu, B. Chen, N. Morgan, and A. Stolcke, On using mlp features in lvcsr, in
INTERSPEECH, 2004.
[26] J. Park, F. Diehl, M. Gales, M. Tomalin, and P. Woodland, Training and adapting
mlp features for arabic speech recognition, in Proc. of International Conference on
Acoustics, Speech, and Signal Processing (ICASSP), 2009.
[27] T. Elliott and F. Theunissen, The modulation transfer function for speech intelligibility, PLoS computational biology, vol. 5, no. 3, 2009.
[28] B. Meyer and B. Kollmeier, Optimization and evaluation of Gabor feature sets for
ASR, in INTERSPEECH.

Brisbane, Australia, 2008.

[29] D. Depireux, J. Simon, D. Klein, and S. Shamma, Spectro-temporal response field characterization with dynamic ripples in ferret primary auditory cortex, Journal of Neurophysiology, vol. 85, no. 3, pp. 1220-1234, 2001.
[30] G. Sivaram, S. Nemala, N. Mesgarani, and H. Hermansky, Data-driven and feedback based spectro-temporal features for speech recognition, IEEE Signal Processing Letters, vol. 17, no. 11, pp. 957-960, 2010.


[31] TIMIT database, Available: http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=L


[32] K. Lee and H. Hon, Speaker-independent phone recognition using hidden Markov models, IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 37, no. 11, pp. 1641-1648, 1989.
[33] S. Chen and D. Li, Modified linear discriminant analysis, Pattern Recognition, vol. 38, no. 3, pp. 441-443, 2005.
[34] B. Olshausen and D. Field, Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision research, vol. 37, no. 23, pp. 3311-3325, 1997.
[35] H. Lee, C. Ekanadham, and A. Ng, Sparse deep belief net model for visual area v2, Advances in neural information processing systems, vol. 20, pp. 873-880, 2008.
[36] K. Huang and S. Aviyente, Sparse representation for signal classification, Advances in neural information processing systems, vol. 19, pp. 609-616, 2007.
[37] M. Ranzato, Y. Boureau, and Y. LeCun, Sparse feature learning for deep belief networks, Advances in neural information processing systems, vol. 20, pp. 1185-1192, 2007.
[38] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma, Robust face recognition via sparse representation, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210-227, 2009.
[39] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, Supervised dictionary learning, Advances in neural information processing systems, vol. 21, pp. 1033-1040, 2008.
[40] T. Sainath, A. Carmi, D. Kanevsky, and B. Ramabhadran, Bayesian compressive
sensing for phonetic classification, in Proc. of International Conference on Acoustics,
Speech, and Signal Processing (ICASSP), 2010.
[41] J. Gemmeke and T. Virtanen, Noise robust exemplar-based connected digit recognition, in Proc. of International Conference on Acoustics, Speech, and Signal Processing
(ICASSP), 2010.
[42] G. Sivaram, S. Nemala, M. Elhilali, T. Tran, and H. Hermansky, Sparse coding for
speech recognition, in Proc. of International Conference on Acoustics, Speech, and
Signal Processing (ICASSP), 2010.
[43] J. Pinto, B. Yegnanarayana, H. Hermansky, and M. Magimai.-Doss, Exploiting contextual information for improved phoneme recognition, Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2008.
[44] J. Pinto, G. Sivaram, M. Magimai.-Doss, H. Hermansky, and H. Bourlard, Analyzing MLP Based Hierarchical Phoneme Posterior Probability Estimator, IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 2, pp. 225-241, 2011.
[45] H. Ketabdar and H. Bourlard, Enhanced phone posteriors for improving speech recognition systems, IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 6, pp. 1094-1106, 2010.


[46] J. Pinto, M. Magimai-Doss, and H. Bourlard, Mlp based hierarchical system for task adaptation in asr, in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2009, pp. 365-370.
[47] D. Imseng, M. Doss, and H. Bourlard, Hierarchical multilayer perceptron based language identification, in INTERSPEECH, 2010.
[48] The ICSI Quicknet Software Package, Available: http://www.icsi.berkeley.edu/Speech/qn.html.
[49] F. Valente, Multi-stream speech recognition based on Dempster-Shafer combination rule, Speech Communication, vol. 52, no. 3, pp. 213-222, 2010.
[50] G. Shafer, A mathematical theory of evidence. Princeton University Press, Princeton, 1976, vol. 1.
[51] G. Dahl, M. Ranzato, A. Mohamed, and G. Hinton, Phone recognition with the mean-covariance restricted boltzmann machine, Advances in Neural Information Processing Systems, vol. 23, pp. 469-477, 2010.
[52] A. Mohamed and G. Hinton, Phone recognition using restricted boltzmann machines, in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2010.
[53] P. Schwarz, P. Matejka, and J. Cernocky, Hierarchical structures of neural networks
for phoneme recognition, in Proc. of International Conference on Acoustics, Speech,
and Signal Processing (ICASSP), 2006.


[54] P. Hoyer, Non-negative matrix factorization with sparseness constraints, The Journal of Machine Learning Research, vol. 5, pp. 1457-1469, 2004.
[55] M. Kramer, Nonlinear principal component analysis using autoassociative neural networks, AIChE journal, vol. 37, no. 2, pp. 233-243, 1991.
[56] R. Kumerasan and A. Rao, Model-based approach to envelope and positive instantaneous frequency estimation of signals with speech applications, The Journal of the Acoustical Society of America, vol. 105, pp. 1912-1924, 1999.
[57] M. Athineos, H. Hermansky, and D. Ellis, Plp2 autoregressive modeling of auditory-like 2-d spectrotemporal patterns, in INTERSPEECH, 2004.
[58] M. Athineos and D. Ellis, Autoregressive modeling of temporal envelopes, IEEE Transactions on Signal Processing, vol. 55, no. 11, pp. 5237-5245, 2007.
[59] S. Ganapathy, J. Pelecanos, and M. Omar, Feature normalization for speaker verification in room reverberation, in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2011.
[60] D. Reynolds, T. Quatieri, and R. Dunn, Speaker verification using adapted gaussian mixture models, Digital signal processing, vol. 10, no. 1-3, pp. 19-41, 2000.
[61] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, Joint factor analysis versus eigenchannels in speaker recognition, IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1435-1447, 2007.
[62] O. Glembek, L. Burget, N. Dehak, N. Brummer, and P. Kenny, Comparison of scoring methods used in speaker recognition with joint factor analysis, in Proc. of International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2009.
[63] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, Front-end factor analysis for speaker verification, IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, 2011.
[64] N. Brummer and E. de Villiers, The speaker partitioning problem, in Proceedings of the Odyssey Speaker and Language Recognition Workshop, Brno, Czech Republic, 2010.
[65] D. Garcia-Romero and C. Espy-Wilson, Joint factor analysis for speaker recognition reinterpreted as signal coding using overcomplete dictionaries, in Proc. Odyssey
Speaker and Language Recognition Workshop, 2010.
[66] S. Prince and J. Elder, Probabilistic linear discriminant analysis for inferences about identity, in IEEE 11th International Conference on Computer Vision, 2007, pp. 1-8.
[67] P. Kenny, Bayesian speaker verification with heavy-tailed priors, in Proceedings of the Odyssey Speaker and Language Recognition Workshop, Brno, Czech Republic, 2010.
[68] D. Garcia-Romero and C. Espy-Wilson, Analysis of i-vector length normalization in speaker recognition systems, in INTERSPEECH, 2011.
