Automatic Speech Recognition



How do humans do it?

Articulation produces sound waves, which the ear conveys to the brain for processing.

How might computers do it?


Acoustic waveform -> acoustic signal
Digitization
Acoustic analysis of the speech signal
Linguistic interpretation

Speech recognition

Speech features
Representation uses features to develop models:
Vocal tract: a time-varying linear filter
Glottal pulse or noise generator: the signal sources
The time-varying character of the speech process is captured by performing
spectral analysis over a short-time window and repeating the analysis
periodically.

Mel-frequency cepstral
coefficients (MFCCs)

MFCC is based on human hearing perception, which does not resolve
frequencies linearly above 1 kHz.
The MFCC filter bank uses two spacings:
linear spacing at low frequencies, below 1000 Hz,
and logarithmic spacing above 1000 Hz.
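This linear-then-logarithmic spacing falls out of the mel scale itself. A minimal sketch of the conversion (the 2595/700 constants are the common O'Shaughnessy formulation, not stated in these slides):

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert frequency in Hz to the mel scale (O'Shaughnessy formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse mel-scale conversion."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Filter-bank centre frequencies spaced evenly on the mel scale come out
# roughly linear below ~1 kHz and logarithmic above it.
low, high, n_filters = 0.0, 8000.0, 10
step = (hz_to_mel(high) - hz_to_mel(low)) / (n_filters + 1)
centres_hz = [mel_to_hz(hz_to_mel(low) + i * step)
              for i in range(1, n_filters + 1)]
```

Note that 1000 Hz maps to roughly 1000 mel, which is why the knee of the scale sits near 1 kHz.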

Suitability of features
Features from MFCCs are well suited to density
function models such as mixture models or hidden
Markov models (HMMs).
Cepstra are low-order Fourier coefficients.
Variations induced by the pulsed nature of vocal
excitation have minimal effect.

Speaker Models

Neural network
Support vector machine (SVM)
Gaussian mixture model (GMM)
Hidden Markov model (HMM)

The choice of model depends on the circumstances and the
specific application.
The amount of data available and the nature of the speaker
verification problem also influence the choice of model.

Gaussian mixture model

GMM - Example

[Figure: two Gaussian component densities (Component 1, Component 2) and the resulting mixture model density p(x), plotted for several component configurations.]
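A mixture density like the one plotted is simply a weighted sum of component Gaussians; a small sketch (the weights, means, and standard deviations are illustrative, not read off the figure):

```python
import math

def gaussian_pdf(x: float, mean: float, std: float) -> float:
    """Density of a univariate Gaussian N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, weights, means, stds):
    """p(x) = sum_k w_k * N(x; mu_k, sigma_k^2)."""
    return sum(w * gaussian_pdf(x, m, s)
               for w, m, s in zip(weights, means, stds))

# Two-component mixture like the one in the figure (illustrative values).
p = mixture_pdf(0.0, weights=[0.4, 0.6], means=[-1.0, 2.0], stds=[1.0, 1.5])
```

The weights must sum to one so that p(x) remains a valid density.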

Gaussian Mixture Example:

[Figure: EM fitting of a Gaussian mixture, shown at the start and after the 1st, 2nd, 3rd, 4th, 5th, 6th, and 20th iterations.]
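The iterative fitting shown in the figure is the EM algorithm; a minimal 1-D sketch (the initialisation and fixed iteration count are illustrative choices, not from the slides):

```python
import math

def em_gmm_1d(data, k=2, iters=20):
    """Fit a k-component 1-D Gaussian mixture to data with EM (sketch)."""
    lo, hi = min(data), max(data)
    # Spread the initial means evenly over the data range.
    means = [lo + (j + 0.5) * (hi - lo) / k for j in range(k)]
    stds = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [w * math.exp(-0.5 * ((x - m) / s) ** 2)
                    / (s * math.sqrt(2.0 * math.pi))
                    for w, m, s in zip(weights, means, stds)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2
                      for r, x in zip(resp, data)) / nj
            stds[j] = math.sqrt(max(var, 1e-6))
    return weights, means, stds
```

On well-separated data the means settle near the cluster centres within a handful of iterations, matching the behaviour the slide figure depicts.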

Model training
Depending on the amount of data available:
Class conditional training
Adaptive training
Discriminative training

Class conditional modeling: maximum likelihood training

CCM-ML (contd.)

Class conditional modeling: discriminative training


ML training works well when the training data is sufficient.
When training data is inadequate, CCM with discriminative
training can be used with no loss in performance.
This training centres on an important question:
how should the parameters of the model (or models) be selected
so as to maximise performance on the goal of speaker
separability?

CCM-DT (contd.)
To meet this goal, we move away from the ML criterion.
The performance of an SV system is measured by a
receiver operating characteristic (ROC) or
detection error tradeoff (DET) curve.
Model parameters are trained directly to improve this
performance, even when working with GMMs or HMMs.

CCM via adaptation

The process of creating speaker models via adaptation
starts with a generic model and uses the data collected
from the target speaker to tailor the generic model to
that speaker.
The generic model acts as a prior probability distribution.
The generic model, often referred to as a universal
background model (UBM), is typically a GMM trained on
data pooled from many speakers.
Other class conditional models
Inherently discriminative approaches
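The adaptation step can be sketched as a relevance-MAP update of the UBM component means (the Reynolds-style mean-only update; the function name and relevance factor value are illustrative, not taken from these slides):

```python
def map_adapt_means(ubm_means, frame_sums, counts, relevance=16.0):
    """Relevance-MAP adaptation of UBM component means (sketch).

    ubm_means[j]  : mean of UBM component j
    frame_sums[j] : sum of speaker frames softly assigned to component j
    counts[j]     : soft count n_j of speaker frames for component j
    """
    adapted = []
    for mu, sx, n in zip(ubm_means, frame_sums, counts):
        alpha = n / (n + relevance)        # data-dependent mixing weight
        ex = sx / n if n > 0 else mu       # speaker-data mean for component j
        # Components with little speaker data stay close to the UBM prior.
        adapted.append(alpha * ex + (1.0 - alpha) * mu)
    return adapted
```

The relevance factor controls how much speaker data is needed before a component moves away from the UBM prior, which is exactly the prior-plus-target-data picture described above.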

Managing variability
Channel normalisation
Mitigating the effect of the channel based on its
inherent characteristics.
The channel can be modeled as a linear time-invariant filter,
which adds a term to the cepstral coefficients:

C_r,n = C_s,n + C_ch,n

(received cepstrum = speech cepstrum + channel cepstrum)

Cepstral mean subtraction (CMS), one method of
normalisation, is quite effective.
But it removes some of the speaker information.
Channel variation can shift features in feature space;
CMS removes this shift, producing new features that are
less sensitive to channel variability.
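CMS itself is simple: since a fixed channel adds the same cepstral offset C_ch to every frame, subtracting the per-utterance mean removes that term. A minimal sketch (function name assumed):

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-dimension mean over all frames from each frame.

    frames: list of cepstral vectors, one per analysis frame.
    A constant channel offset shifts every frame equally, so removing
    the utterance mean removes the channel term as well.
    """
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]
```

The cost noted above is visible here: any constant speaker-specific component of the cepstra is subtracted along with the channel offset.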

Channel normalisation: other methods
(approach 1)
Are there features that are highly robust to channel
variation?
Such features must be highly robust to channel variations
and also carry a significant amount of speaker-dependent
information.
Formant-related features: a formant is the location of a peak
in the short-time spectrum, i.e. a resonant frequency of the
vocal tract.
Variations in glottal pulse timings and fundamental
frequency do carry speaker information and are channel
invariant.
These features are difficult to extract and model (Murthy
et al.).

Channel normalisation: other methods
(approach 2)
Treat the channel as a random disturbance and
integrate its variability directly into the speaker
scoring process.
Estimating a probabilistic model for the channel is a
challenge.
A Gaussian random vector can be used to model this
effect.

Channel normalisation: other methods
(approach 3)
Construct a channel detector.
Compare the channel with the one used at the time of
enrollment.
When modeling the speaker, maximum a posteriori
(MAP) adaptation methods are employed
(Teunen et al.).
This approach faces the mismatched-channel problem, but is
more effective than the normalisation technique.

Score normalisation

Normalising the scores generated once scoring of the test utterance is complete.

Constraining the text

Measuring performance
The performance of an SV system is measured by its probability of
false dismissal versus its probability of false acceptance at a given
threshold value.
By varying the threshold over a collection of data, a DET curve is
obtained.
The DET for a single user is discontinuous.
The National Institute of Standards and Technology (NIST), in its
annual evaluation, collects data from many speakers and
normalises it to create a composite DET.
For a system that uses speaker- or channel-dependent thresholds,
this normalisation can lead to an inappropriate DET.
With speaker-dependent thresholds (no channel detection), a
more appropriate approach is to combine scores.

Measuring performance (contd.)

In such cases, for a given P_FA the average of all P_FM values is found
and the DET is plotted.
The DET or receiver operating characteristic (ROC) curve provides a
great deal of information about an SV system; the equal error rate
(EER) quantifies it as a single number.
Another numerical value, the detection cost function (DCF),
involves assigning costs to the two different types of errors.
Many factors affect the performance of the system:

Amount of training data
Duration of the evaluation
The variability
Types of channels employed
Protocols employed
Constraints on the text
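The EER mentioned above is the point on the DET curve where the two error rates meet; a minimal threshold-sweep sketch (the function name and score lists are illustrative):

```python
def eer_from_scores(target_scores, impostor_scores):
    """Estimate the equal error rate from genuine and impostor scores.

    Sweeps the decision threshold over all observed scores and returns
    the operating point where the false-dismissal rate P_FM and the
    false-acceptance rate P_FA are closest.
    """
    best_gap, eer = float("inf"), None
    for thr in sorted(set(target_scores + impostor_scores)):
        p_fm = sum(s < thr for s in target_scores) / len(target_scores)
        p_fa = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        if abs(p_fm - p_fa) < best_gap:
            best_gap = abs(p_fm - p_fa)
            eer = (p_fm + p_fa) / 2.0
    return eer
```

As the slides note, a curve built from one speaker's handful of scores is discontinuous, which is why composite DETs are pooled over many speakers.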

Alternative approaches
1. Speech recognition approaches
Cepstral features extracted from speech were used, without
exploiting higher-level (phonetic) information.
These rely primarily on the individual's vocal production
mechanism.
In the verification process the discriminative gain was high;
however, extracting the features over a communication channel
reduced the accuracy.
With improvements in both speech and phonetic recognisers, this approach gained importance.

Alternative approaches (contd.)

The Dragon system focused on the recogniser's ability to
provide higher acoustic knowledge about speakers.
For training, the Baum-Welch HMM procedure is followed.
The Dragon approach fared well in all aspects but fell short of
the state-of-the-art GMM, for lack of test data.
Kimball et al. gave a text-dependent recognition system
using the maximum likelihood linear regression (MLLR)
method.
It proved to be effective even on different channels.

Alternative approaches (contd.)

2. Words (and phonetic units) count

Gauvain et al. proposed speaker recognition based on a phone
recogniser.
Acoustic models for the target speaker were trained by
adaptation and modeled with HMMs.
Transitions are permitted between the phones of the user's language.
Doddington demonstrated that language patterns (captured by
frequently occurring bigrams) contain a great deal of speaker-specific
information.
Andrew et al. demonstrated a word-based recognition system by
combining language recognition with acoustic information.
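Doddington-style idiolect modeling reduces to counting frequent word patterns in recognised transcripts and comparing those profiles between enrollment and test; a toy sketch (function name illustrative):

```python
from collections import Counter

def bigram_profile(words):
    """Count word bigrams in a recognised transcript.

    The resulting frequency profile captures habitual word patterns
    (e.g. a speaker's favourite two-word phrases), which carry
    speaker-specific information.
    """
    return Counter(zip(words, words[1:]))

profile = bigram_profile("you know i mean you know".split())
```

A real system would compare such profiles with a likelihood-ratio or similarity score rather than inspect counts directly.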

Alternative approaches (contd.)

3. Models exploring the shape of the feature space
Statistical modeling through likelihood ratios for each user.
The speaker model is characterised by a single Gaussian pdf.
This model ignores the fine structure of the speech, i.e. the shape
of the feature space.
Gish introduced the eigenvalues of covariance matrices.
Zilca et al. considered measures of shape and T-norm
scaling of scores, in conjunction with channel detection, for
the cellular phone environment.
This system requires less computation than a GMM.
