Automatic Speech Recognition
Articulation produces sound waves, which the ear conveys to the brain for processing.
Acoustic signal
Digitization
Acoustic analysis of the speech signal
Linguistic interpretation
Speech recognition
Speech features
Speech is represented by features, which are used to develop models.
Vocal tract: a time-varying linear filter.
Glottal pulses or a noise generator: the signal sources.
The time-varying character of the speech process is captured by spectral analysis.
The analysis is performed over a short time window and repeated periodically.
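The short-time analysis described above can be sketched as follows. This is a minimal illustration, not the method of any particular system: the frame length (400 samples), hop (160 samples), Hamming window, and the synthetic two-tone test signal are all assumptions chosen for the example.

```python
import numpy as np

def short_time_spectra(signal, frame_len=400, hop=160):
    """Magnitude spectra of overlapping, Hamming-windowed frames.

    frame_len and hop are illustrative values (25 ms and 10 ms at 16 kHz).
    """
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # One spectrum per frame: the analysis is repeated every `hop` samples.
    return np.abs(np.fft.rfft(frames, axis=1))

# Hypothetical test signal: 1 s at 16 kHz, a tone that jumps from 400 Hz to 1600 Hz.
fs = 16000
t = np.arange(fs) / fs
signal = np.where(t < 0.5, np.sin(2 * np.pi * 400 * t), np.sin(2 * np.pi * 1600 * t))

spectra = short_time_spectra(signal)
# Frequency of the strongest bin in each frame (bin spacing is fs / frame_len).
peak_hz = spectra.argmax(axis=1) * fs / 400
```

The per-frame peak frequency follows the change in the signal over time, which a single spectrum of the whole recording would not show.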
Mel-frequency cepstral coefficients (MFCCs)
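A rough sketch of how MFCCs can be computed for one frame is given here. It follows the common recipe (power spectrum, triangular mel filter bank, logarithm, discrete cosine transform). The mel formula used (2595 log10(1 + f/700)), the filter count, the number of coefficients, and all function names are assumptions for illustration, not details taken from the slides.

```python
import numpy as np

def hz_to_mel(f):
    # One widely used mel-scale formula (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2), n_filters + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    bank = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, mid, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mfcc(frame, fs, n_filters=26, n_ceps=13):
    """MFCCs of a single frame (illustrative defaults)."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2
    # Small constant avoids log(0) for empty filters.
    log_energy = np.log(mel_filterbank(n_filters, n_fft, fs) @ power + 1e-10)
    # Unnormalised DCT-II; only the lowest n_ceps coefficients are kept.
    k = np.arange(n_ceps)[:, None]
    n = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * k * (n + 0.5) / n_filters)
    return dct @ log_energy

# Usage on a hypothetical 25 ms frame of noise at 16 kHz.
rng = np.random.default_rng(0)
coeffs = mfcc(rng.normal(size=400), fs=16000)
```

Keeping only the low-order DCT outputs retains the smooth spectral envelope and discards fine spectral detail.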
Suitability of features
MFCC features are well suited to density function models such as mixture models or hidden Markov models (HMMs).
The cepstra used are the lower-order Fourier coefficients of the log spectrum.
Variations induced by the pulsed nature of vocal excitation therefore have minimal effect on them.
Speaker Models
Neural network
Support vector machine (SVM)
Gaussian mixture model (GMM)
Hidden Markov model (HMM)
GMM - Example
[Figure: density p(x) of two Gaussian components (Component 1, Component 2) and of the resulting mixture model.]
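The GMM example can be reproduced numerically with a short sketch such as the one below. The weights, means and standard deviations are hypothetical values chosen for illustration. They are not read from the original figure.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def mixture_pdf(x, weights, means, stds):
    """Weighted sum of component densities; the weights should sum to 1."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds))

# Hypothetical two-component mixture.
weights, means, stds = [0.5, 0.5], [0.0, 4.0], [1.0, 1.5]

x = np.linspace(-5.0, 10.0, 1501)
p = mixture_pdf(x, weights, means, stds)

# Numerical check that the mixture is (approximately) a valid density on this range.
area = np.sum(p) * (x[1] - x[0])
```

Because the weights sum to one and each component integrates to one, the mixture also integrates to one, up to the small tail mass outside the plotted range.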
Model - training
Depending on the amount of data available
Class conditional training
Adaptive training
Discriminative training
CCM-ML (contd.)
CCM-DT (contd.)
To meet this, we move away from the maximum-likelihood (ML) criterion.
The performance of a speaker verification (SV) system is measured by its receiver operating characteristic (ROC) or detection error tradeoff (DET) curve.
Model parameters are trained directly to improve this performance, even when working with a GMM or HMM.
Managing variability
Channel Normalisation
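Mitigates the effect of the channel based on its inherent characteristics.
The channel can be modelled as a linear time-invariant filter, so its cepstral coefficients add to those of the speech:
$c_{r,n} = c_{s,n} + c_{ch,n}$
(the subscripts r, s and ch are read here as the received signal, the speech and the channel).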
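One common way to exploit this additive model is cepstral mean subtraction, sketched below. It is named here as an illustrative technique, not as the method the slides have in mind. It assumes the channel term is constant over the utterance, and the array shapes and random data are hypothetical.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Subtract the per-coefficient mean over time.

    cepstra: array of shape (n_frames, n_coeffs).
    A constant additive channel term is removed along with the speech mean.
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
speech = rng.normal(size=(200, 13))    # hypothetical clean cepstra (speech term)
channel = rng.normal(size=(1, 13))     # constant channel term, same for every frame
received = speech + channel            # received = speech + channel

normalised_received = cepstral_mean_subtraction(received)
normalised_speech = cepstral_mean_subtraction(speech)
```

After subtraction the received and clean features coincide, which shows that a time-invariant channel offset no longer affects the features. The long-term mean of the speech itself is removed as well.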
Score normalisation
Measuring performance
The performance of an SV system is measured by its probability of false dismissal versus its probability of false acceptance at a given threshold value.
By varying the threshold over a collection of data, the DET curve is obtained.
The DET curve for a single user is discontinuous.
In its annual evaluations, the National Institute of Standards and Technology (NIST) collected data from different speakers and normalised it to create a composite DET curve.
For a system that uses speaker- or channel-dependent thresholds, such normalisation could lead to an inappropriate DET curve.
With speaker-dependent thresholds (no channel detection), a more appropriate way is to combine scores
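The threshold sweep behind a DET curve can be sketched as follows. The scores are made-up values, and the convention that a trial is accepted when its score is at or above the threshold is an assumption of this example.

```python
import numpy as np

def det_points(target_scores, impostor_scores):
    """False-dismissal and false-acceptance rates at every observed score.

    A trial is accepted when its score is >= the threshold (assumed convention).
    """
    target = np.asarray(target_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, impostor]))
    p_miss = np.array([(target < t).mean() for t in thresholds])    # false dismissal
    p_fa = np.array([(impostor >= t).mean() for t in thresholds])   # false acceptance
    return thresholds, p_miss, p_fa

# Hypothetical verification scores for true-speaker and impostor trials.
target = [2.1, 1.7, 0.9, 2.5, 1.2]
impostor = [0.3, -0.5, 1.0, 0.1, -1.2]

thresholds, p_miss, p_fa = det_points(target, impostor)
```

Plotting p_miss against p_fa gives the tradeoff curve. With only a handful of trials the rates change in visible steps, which is why the curve for a single user is discontinuous.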
Alternative approaches
1. Speech recognition approaches
Cepstral features extracted from speech were used, and higher-level information (phonetics) was not exploited.
These features rely primarily on the individual's vocal production mechanism.
In the verification process the discriminative gain was high.
However, extracting them over a communication channel reduced the accuracy.
With improvements in both speech and phonetic recognisers, this approach gained importance.