Automatic Speech Recognition

How do humans do it?
Articulation produces sound waves, which the ear conveys to the brain for processing.
How might computers do it?

Acoustic waveform
Acoustic signal
Digitization
Acoustic analysis of the speech signal
Linguistic interpretation
Speech recognition
Speech features
Features represent the speech signal and are used to develop models.
Vocal tract: a time-varying linear filter.
Glottal pulse or noise generator: the signal sources.
The time-varying character of the speech process is captured by short-time spectral analysis, with the analysis repeated periodically.
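The short-time analysis described above can be sketched as follows. This is a minimal illustration, not the method used in the slides: the frame length (400 samples), hop size (160 samples), Hamming window, and 16 kHz test tone are all assumed values.

```python
import numpy as np

def short_time_spectra(signal, frame_len=400, hop=160):
    """Magnitude spectra of overlapping, Hamming-windowed frames.

    frame_len and hop are assumed values (25 ms and 10 ms at 16 kHz).
    """
    signal = np.asarray(signal, dtype=float)
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Cut the signal into frames that start every `hop` samples.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] for i in range(n_frames)]
    )
    # One spectrum per frame: the analysis is repeated periodically.
    return np.abs(np.fft.rfft(frames * window, axis=1))

# Assumed test input: one second of a 1 kHz tone sampled at 16 kHz.
fs = 16000
t = np.arange(fs) / fs
spectra = short_time_spectra(np.sin(2 * np.pi * 1000 * t))
peak_bin = int(spectra[0].argmax())
peak_hz = peak_bin * fs / 400  # bin width is fs / frame_len
```

Each row of `spectra` is the spectrum of one frame, so a signal whose spectrum changes over time shows up as rows that differ from one another.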
Mel-frequency cepstral coefficients (MFCCs)
MFCCs are based on human hearing perception, whose frequency resolution is approximately linear below 1 kHz and logarithmic above it.
The MFCC filter bank therefore uses two kinds of filter spacing: linear at low frequencies (below 1000 Hz) and logarithmic above 1000 Hz.
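The linear-then-logarithmic spacing can be illustrated with a mel-scale conversion. The slides do not give a formula; the sketch below assumes the commonly used mapping mel = 2595 log10(1 + f/700), and the helper names are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    # Assumed mel formula (one common variant; not stated in the slides).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_center_frequencies(n_filters, f_low, f_high):
    """Filter centre frequencies in Hz, equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]

# Assumed example: 10 filters between 0 Hz and 8 kHz.
centers = mel_center_frequencies(10, 0.0, 8000.0)
```

Because the centres are equally spaced in mel, their spacing in Hz grows with frequency: neighbouring filters are close together at the low end and far apart at the high end.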
Suitability of features
Features from MFCCs are well suited to density function models such as mixture models or hidden Markov models (HMMs).
Cepstra are the lower-order Fourier coefficients of the log-magnitude spectrum.
Variations induced by the pulsed nature of vocal excitation therefore have minimal effect.
Speaker Models
Neural network
Support vector machine (SVM)
Gaussian mixture model (GMM)
Hidden Markov model (HMM)
The choice of model depends on the circumstances and the specific application.
The amount of data available and the nature of the speaker verification problem influence the choice of model.
Gaussian mixture model
GMM - Example
[Figure: densities p(x) of two component Gaussians (Component 1, Component 2) and of the resulting mixture model.]
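A two-component mixture of the kind plotted in the example can be evaluated as below. The weights, means, and standard deviations are assumed values chosen for illustration; they are not read off the original figure.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def mixture_pdf(x, weights, means, stds):
    """Weighted sum of component densities; the weights should sum to 1."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds))

# Assumed parameters: two equally weighted components on the range -5..10.
x = np.linspace(-5.0, 10.0, 1501)
p = mixture_pdf(x, weights=[0.5, 0.5], means=[0.0, 5.0], stds=[1.0, 1.0])

# Riemann-sum check that the mixture is (approximately) a valid density.
area = float(np.sum(p) * (x[1] - x[0]))
```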
Gaussian Mixture Example:
[Figure: mixture fit at the start and after the 1st, 2nd, 3rd, 4th, 5th, 6th and 20th iterations.]
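The slides do not name the iterative procedure behind this example. Assuming it is expectation-maximisation (EM), the usual way to fit a GMM by maximum likelihood, a one-dimensional sketch looks like this. The synthetic data and starting values are made up for illustration.

```python
import numpy as np

def em_gmm_1d(x, means, stds, weights, n_iter=20):
    """Fit a 1-D Gaussian mixture with EM (assumed algorithm)."""
    x = np.asarray(x, dtype=float)
    means = np.array(means, dtype=float)
    stds = np.array(stds, dtype=float)
    weights = np.array(weights, dtype=float)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        z = (x[:, None] - means) / stds
        dens = weights * np.exp(-0.5 * z ** 2) / (stds * np.sqrt(2.0 * np.pi))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and standard deviations.
        n_k = resp.sum(axis=0)
        weights = n_k / len(x)
        means = (resp * x[:, None]).sum(axis=0) / n_k
        stds = np.sqrt((resp * (x[:, None] - means) ** 2).sum(axis=0) / n_k)
    return weights, means, stds

# Assumed synthetic data: two clusters centred at 0 and 5.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(5.0, 1.0, 500)])
weights, means, stds = em_gmm_1d(
    data, means=[1.0, 4.0], stds=[1.0, 1.0], weights=[0.5, 0.5]
)
```

Each pass through the loop corresponds to one "iteration" panel: the component means drift toward the cluster centres and the fit stops changing once the parameters have converged.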
Model - training
Depending on the amount of data available:
Class conditional training
Adaptive training
Discriminative training
Class conditional modeling: maximum likelihood training
CCM-ML (contd.)
Class conditional modeling: discriminative training

ML training works well if the training data is sufficient.
When the training data is inadequate, use discriminative training with CCM to avoid a loss in performance.
Discriminative training addresses an important question: how should the parameters of the model or models be selected so as to maximize performance on the goal of speaker separability?
CCM-DT (contd.)
To meet this goal, we move away from the ML criterion.
The performance of an SV system is measured by the receiver operating characteristic (ROC) or the detection error tradeoff (DET) curve.
Model parameters are trained directly to improve this performance, even when working with a GMM or HMM.
CCM via Adaptation

Creating speaker models via adaptation starts with a generic model for a speaker and uses the data collected from the target speaker to tailor the generic model to that speaker.
The generic model serves as a prior probability distribution.
The generic model, often referred to as the universal background model (UBM), is typically a GMM trained on data from many speakers.
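One way to tailor a UBM to a target speaker is MAP adaptation of the component means. The slides do not specify the update rule, so the sketch below assumes a relevance-factor form: each mean is pulled from the UBM value toward the speaker's data in proportion to how much data that component sees. The relevance factor of 16 and all numbers in the example are assumptions.

```python
import numpy as np

def map_adapt_means(ubm_weights, ubm_means, ubm_stds, data, relevance=16.0):
    """Adapt 1-D GMM means toward speaker data (assumed update rule)."""
    w = np.asarray(ubm_weights, dtype=float)
    mu = np.asarray(ubm_means, dtype=float)
    sd = np.asarray(ubm_stds, dtype=float)
    x = np.asarray(data, dtype=float)[:, None]
    # Responsibility of each UBM component for each speaker sample.
    dens = w * np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))
    resp = dens / dens.sum(axis=1, keepdims=True)
    n_k = resp.sum(axis=0)
    data_means = (resp * x).sum(axis=0) / np.maximum(n_k, 1e-10)
    # Components that see little data stay close to the UBM (the prior).
    alpha = n_k / (n_k + relevance)
    return alpha * data_means + (1.0 - alpha) * mu

# Assumed toy UBM with two components, and 16 speaker samples near x = 1.
adapted = map_adapt_means([0.5, 0.5], [0.0, 5.0], [1.0, 1.0], [1.0] * 16)
```

In this toy case the first component receives about 16 samples, so its mean moves roughly halfway from 0 toward 1, while the second component sees almost no data and stays near 5.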
Other class conditional models
Inherently discriminative approaches
Managing variability
Channel Normalisation
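Mitigates the effect of the channel based on its inherent characteristics.
The channel can be modeled as a linear time-invariant filter, so the cepstral coefficients of the received signal (r) are the sum of those of the speech (s) and of the channel (ch):
$C_{r,n} = C_{s,n} + C_{ch,n}$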
Cepstral mean subtraction (CMS) is one method of normalisation and is quite effective.
However, it also removes some of the speaker information.
The channel shifts the features in feature space; CMS removes this shift, producing new features that are less sensitive to channel variability.
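Cepstral mean subtraction can be sketched in a few lines. The example assumes the additive channel model above, with a channel term that is constant over the utterance; the feature dimensions and random data are made up for illustration.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Subtract the per-coefficient mean over all frames of an utterance.

    cepstra has shape (n_frames, n_coefficients).
    """
    cepstra = np.asarray(cepstra, dtype=float)
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# Assumed toy data: 200 frames of 13 coefficients plus a fixed channel offset.
rng = np.random.default_rng(0)
speech = rng.normal(size=(200, 13))
channel = rng.normal(size=13)
received = speech + channel  # additive in the cepstral domain

normalised = cepstral_mean_subtraction(received)
```

The constant channel term cancels, so `normalised` matches what CMS would give on the clean speech. The long-term mean of the speech itself is removed as well, which is the loss of speaker information noted above.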
Channel normalisation: other methods

(approach 1)
Are there features that are highly robust to channel variation?
Such features must be highly robust to channel variations and also carry a significant amount of speaker-dependent information.
Formant-related features: a formant is the location of a peak in the short-time spectrum, i.e. a resonant frequency of the vocal tract.
Variations in glottal pulse timing and fundamental frequency do carry speaker information and are channel invariant.
These features are difficult to extract and model (Murthy et al.).

(approach 2)
Treat the channel as a random disturbance and integrate its variability directly into the scoring process for the speaker.
Estimating a probabilistic model for the channel is a challenge.
A Gaussian random vector can be used to model this effect.

(approach 3)
Construct a channel detector.
Compare the detected channel with the channel used at the time of enrollment.
While modeling the speaker, maximum a posteriori (MAP) adaptation methods are employed (Teunen et al.).
This approach still faces the mismatched-channel problem, but is more effective than normalisation techniques.
Score normalisation
Normalising the scores generated after transmission is over.
Constraining the text
Measuring performance
The performance of an SV system is measured by its probability of false dismissal versus its probability of false acceptance at a given threshold value.
Changing the threshold over a collection of data yields the DET curve.
The DET curve for a single user is discontinuous.
In its annual evaluations, the National Institute of Standards and Technology (NIST) collected data from different speakers and normalised it to create a composite DET curve.
If a system uses speaker- or channel-dependent thresholds, this normalisation could lead to an inappropriate DET curve.
With speaker-dependent thresholds (no channel detection), a more appropriate way is to combine scores.
Measuring performance (contd.)

In such cases, for a given PFA the average of all PFM values is found and the DET curve is plotted.
The DET or receiver operating characteristic (ROC) curve provides a great deal of information about an SV system. The equal error rate (EER) summarises it as a single number.
Another numerical measure is the detection cost function (DCF), which assigns a cost to each of the two types of error.
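The measures above can be computed from two lists of scores, one from true-speaker trials and one from impostor trials. This is a rough sketch: the scores, the convention "accept when score >= threshold", and the DCF costs and target prior are all assumed values, not taken from the slides.

```python
import numpy as np

def det_points(target_scores, impostor_scores):
    """Miss and false-acceptance rates at every observed score threshold."""
    target = np.asarray(target_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, impostor]))
    # A trial is accepted when its score is >= the threshold (assumed rule).
    p_miss = np.array([(target < t).mean() for t in thresholds])
    p_fa = np.array([(impostor >= t).mean() for t in thresholds])
    return thresholds, p_miss, p_fa

def equal_error_rate(p_miss, p_fa):
    """Error rate at the threshold where the two error rates are closest."""
    i = int(np.argmin(np.abs(p_miss - p_fa)))
    return (p_miss[i] + p_fa[i]) / 2.0

def detection_cost(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Weighted sum of the two error types (costs and prior are assumed)."""
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa

# Assumed toy scores.
thresholds, p_miss, p_fa = det_points([2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 2.5, 1.5])
eer = equal_error_rate(p_miss, p_fa)
min_dcf = float(detection_cost(p_miss, p_fa).min())
```

Plotting `p_miss` against `p_fa` gives the DET curve. With only a handful of trials the curve moves in steps, which is why the DET for a single user is discontinuous.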
Many factors affect the performance of the system:
Amount of training data
Duration of the evaluation
The variability
Types of channels employed
Protocols employed
Constraints on the text
Alternative approaches
1. Speech recognition approaches
Cepstral features extracted from speech were used, and higher-level information (phonetics) was not exploited.
These approaches rely primarily on the individual's vocal production mechanism.
In the verification process the discriminative gain was high.
However, extracting the features from a communication channel reduced the accuracy.
With improvements in both speech and phonetic recognisers, this approach gained importance.
Alternative approaches (contd.)

The Dragon system focused on the recogniser's ability to provide higher acoustic knowledge about speakers.
For training, the Baum-Welch HMM procedure is followed.
The Dragon approach fared well in all aspects but fell short of the state-of-the-art GMM (lack of test data).
Kimball et al. gave a text-dependent recognition system that uses the maximum likelihood linear regression (MLLR) method.
It proved to be effective even on different channels.

2. Words (and phonetic units) count
Gauvain et al. proposed speaker recognition based on a phone recogniser.
Acoustic models for the target speaker were trained by adaptation, using HMMs.
Transitions are permitted between phones and the user's language.
Doddington demonstrated that language patterns (captured by frequently occurring bigrams) contain a great deal of speaker-specific information.
Andrew et al. demonstrated a word-based recognition system that combines language recognition with acoustic information.
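As a small illustration of the bigram idea, the sketch below counts word pairs in a transcript. The transcript is invented, and the function names are hypothetical. Turning such counts into a speaker score (for example, by comparing them with counts from a background population) is not shown, since the slides do not describe that step.

```python
from collections import Counter

def bigram_counts(words):
    """Count adjacent word pairs in a list of words."""
    return Counter(zip(words, words[1:]))

def top_bigrams(words, n=3):
    """The n most frequent bigrams with their counts."""
    return bigram_counts(words).most_common(n)

# Assumed toy transcript for one speaker.
transcript = "you know i mean you know it is you know".split()
counts = bigram_counts(transcript)
most_frequent = top_bigrams(transcript, 1)[0]
```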

3. Models exploring the shape of feature space
Statistical modeling through the use of likelihood ratios.
The speaker model is characterised by a single Gaussian pdf.
This model ignores the fine structure of the speech and relies on its overall shape in feature space.
Gish introduced eigenvalues of covariance matrices.
Zilca et al. consider measures of shape and T-norm scaling of scores, in conjunction with channel detection, to produce - cellular phone environment.
This system requires less computation than a GMM.
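T-norm scaling, mentioned above, normalises a trial score using the scores that the same test utterance obtains against a cohort of impostor models. A minimal sketch, with invented scores and a hypothetical function name:

```python
import numpy as np

def t_norm(raw_score, cohort_scores):
    """Normalise a score by the mean and standard deviation of cohort scores.

    cohort_scores are the scores of the same test utterance against a set
    of impostor (cohort) models.
    """
    cohort = np.asarray(cohort_scores, dtype=float)
    return (raw_score - cohort.mean()) / cohort.std()

# Assumed toy values: one claimed-speaker score and three cohort scores.
normalised_score = t_norm(2.0, [-1.0, 0.0, 1.0])
shifted_score = t_norm(5.0, [2.0, 3.0, 4.0])  # same trial with a +3 offset
```

Adding the same offset to every score leaves the normalised value unchanged, so a single decision threshold can be shared across test conditions that shift all scores together.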
