Text and Speech CCS369-UNIT 5
Reshma/AP/CSE
SPEECH RECOGNITION
Speech recognition, also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text, is a capability that enables a program to process human speech into a written format. While it is commonly confused with voice recognition, speech recognition focuses on the translation of speech from a verbal format to a text one, whereas voice recognition only seeks to identify an individual user's voice.
Speech recognition, or speech-to-text, is the ability of a machine or program to identify words spoken aloud and
convert them into readable text. Rudimentary speech recognition software has a limited vocabulary and may only
identify words and phrases when spoken clearly. More sophisticated software can handle natural speech, different
accents and various languages.
Speech recognition uses a broad array of research in computer science, linguistics and computer engineering.
Many modern devices and text-focused programs have speech recognition functions in them to allow for easier
or hands-free use of a device.
Speech recognition and voice recognition are two different technologies and should not be confused:
• Speech recognition is used to identify words in spoken language.
• Voice recognition is a biometric technology for identifying an individual's voice.
What are examples of speech recognition applications?
Speech recognition can also be found in word processing applications like Microsoft Word, where users can dictate words to be turned into text.
Education. Speech recognition software is used in language instruction. The software hears the user's speech and
offers help with pronunciation.
Customer service. Automated voice assistants listen to customer queries and provide helpful resources.
Healthcare applications. Doctors can use speech recognition software to transcribe notes in real time into
healthcare records.
Disability assistance. Speech recognition software can translate spoken words into text using closed captions to
enable a person with hearing loss to understand what others are saying. Speech recognition can also enable those
with limited use of their hands to work with computers, using voice commands instead of typing.
Court reporting. Software can be used to transcribe courtroom proceedings, precluding the need for human
transcribers.
Emotion recognition. This technology can analyze certain vocal characteristics to determine what emotion the
speaker is feeling. Paired with sentiment analysis, this can reveal how someone feels about a product or service.
Hands-free communication. Drivers use voice control for hands-free communication, controlling phones, radios
and global positioning systems, for instance.
The following features are common in speech recognition software:
• Language weighting. This feature tells the algorithm to give special attention to certain words, such as those spoken frequently or that are unique to the conversation or subject. For example, the software can be trained to listen for specific product references.
• Acoustic training. The software tunes out ambient noise that pollutes spoken audio. Software
programs with acoustic training can distinguish speaking style, pace and volume amid the din of many
people speaking in an office.
• Speaker labeling. This capability enables a program to label individual participants and identify their
specific contributions to a conversation.
• Profanity filtering. Here, the software filters out undesirable words and language.
What are the different speech recognition algorithms?
The power behind speech recognition features comes from a set of algorithms and technologies. They include the
following:
• Hidden Markov model. HMMs are used in autonomous systems where a state is partially observable
or when all of the information necessary to make a decision is not immediately available to the sensor
(in speech recognition's case, a microphone). An example of this is in acoustic modeling, where a
program must match linguistic units to audio signals using statistical probability.
• Natural language processing. NLP eases and accelerates the speech recognition process.
• N-grams. This simple approach to language models creates a probability distribution for a sequence.
An example would be an algorithm that looks at the last few words spoken, approximates the history
of the sample of speech and uses that to determine the probability of the next word or phrase that will
be spoken.
• Artificial intelligence. AI and machine learning methods like deep learning and neural networks are
common in advanced speech recognition software. These systems use grammar, structure, syntax and
composition of audio and voice signals to process speech. Machine learning systems gain knowledge
with each use, making them well suited for nuances like accents.
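To make the N-gram idea above concrete, here is a minimal bigram language model sketch in Python. The toy corpus and function name are purely illustrative; a real recognizer would estimate these probabilities from very large text collections and combine them with smoothing.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count bigrams and estimate P(next_word | previous_word) by relative frequency."""
    bigram_counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            bigram_counts[prev][nxt] += 1
    # Normalize the counts into conditional probabilities.
    model = {}
    for prev, counter in bigram_counts.items():
        total = sum(counter.values())
        model[prev] = {word: count / total for word, count in counter.items()}
    return model

# Toy corpus: in a real system the counts come from large text collections.
corpus = ["please call the office", "please call me back", "call the office now"]
model = train_bigram_model(corpus)
print(model["call"])   # e.g. {'the': 0.666..., 'me': 0.333...}
```

The resulting table gives, for each previous word, a probability distribution over possible next words, which the recognizer can use to rank competing hypotheses.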
What are the advantages of speech recognition?
There are several advantages to using speech recognition software, including the following:
• Machine-to-human communication. The technology enables electronic devices to communicate with
humans in natural language or conversational speech.
• Readily accessible. This software is frequently installed on computers and mobile devices, making it readily available.
• Easy to use. Well-designed software is straightforward to operate and often runs in the background.
• Continuous, automatic improvement. Speech recognition systems that incorporate AI become more
effective and easier to use over time. As systems complete speech recognition tasks, they generate more
data about human speech and get better at what they do.
What are the disadvantages of speech recognition?
While convenient, speech recognition technology still has a few issues to work through. Limitations include:
• Inconsistent performance. The systems may be unable to capture words accurately because of variations
in pronunciation, lack of support for some languages and inability to sort through background noise.
Ambient noise can be especially challenging. Acoustic training can help filter it out, but these programs
aren't perfect. Sometimes it's impossible to isolate the human voice.
• Speed. Some speech recognition programs take time to deploy and master. The speech processing may
feel relatively slow.
• Source file issues. Speech recognition success depends on the recording equipment used, not just the
software.
Acoustic Modelling
Acoustic modelling of speech typically refers to the process of establishing statistical representations for the
feature vector sequences computed from the speech waveform. The Hidden Markov Model (HMM) is one of the most common types of acoustic models. Modern speech recognition systems use both an acoustic model and
a language model to represent the statistical properties of speech. The acoustic model models the relationship
between the audio signal and the phonetic units in the language. The language model is responsible for modeling
the word sequences in the language. These two models are combined to get the top-ranked word sequences
corresponding to a given audio segment.
Audio can be encoded at different sampling rates (i.e. samples per second – the most common being: 8, 16, 32,
44.1, 48, and 96 kHz), and different bits per sample (the most common being: 8-bits, 16-bits, 24-bits or 32-bits).
Speech recognition engines work best if the acoustic model they use was trained with speech audio which was
recorded at the same sampling rate/bits per sample as the speech being recognized.
The limiting factor for telephony based speech recognition is the bandwidth at which speech can be transmitted.
For example, a standard land-line telephone only has a bandwidth of 64 kbit/s at a sampling rate of 8 kHz and 8-
bits per sample (8000 samples per second * 8-bits per sample = 64000 bit/s). Therefore, for telephony based
speech recognition, acoustic models should be trained with 8 kHz/8-bit speech audio files.
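The 64 kbit/s figure quoted above follows directly from the sampling parameters; a quick sanity check in Python:

```python
# Telephony-grade audio: 8 kHz sampling rate, 8 bits per sample, one channel.
sample_rate_hz = 8000
bits_per_sample = 8
channels = 1

bit_rate = sample_rate_hz * bits_per_sample * channels
print(bit_rate)          # 64000 bit/s
print(bit_rate / 1000)   # 64.0 kbit/s, matching the land-line figure above
```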
In the case of Voice over IP, the codec determines the sampling rate/bits per sample of speech transmission.
Codecs with a higher sampling rate/bits per sample for speech transmission (which improve the sound quality)
necessitate acoustic models trained with audio data that matches that sampling rate/bits per sample.
For speech recognition on a standard desktop PC, the limiting factor is the sound card. Most sound cards today can record at sampling rates between 16 kHz and 48 kHz, with bit depths of 8 to 16 bits per sample, and play back at up to 96 kHz.
As a general rule, a speech recognition engine works better with acoustic models trained on speech audio data recorded at higher sampling rates/bits per sample. But using audio with too high a sampling rate/bits per sample can slow the recognition engine down, so a compromise is needed. Thus, for desktop speech recognition, the current standard is acoustic models trained with speech audio data recorded at a sampling rate of 16 kHz and 16 bits per sample.
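Matching the audio format to the acoustic model can be handled at load time. Below is a minimal sketch using the librosa and soundfile libraries, with a hypothetical input file, that resamples a recording to the 16 kHz/16-bit desktop standard mentioned above.

```python
import librosa
import soundfile as sf

# Load an audio file and resample it to 16 kHz mono, the common desktop ASR format.
# "utterance.wav" is a placeholder path used only for illustration.
audio, sr = librosa.load("utterance.wav", sr=16000, mono=True)
print(audio.shape, sr)   # e.g. (48000,) 16000 for a 3-second clip

# Write the resampled audio back out as 16-bit PCM so it matches the acoustic model.
sf.write("utterance_16k.wav", audio, sr, subtype="PCM_16")
```

Feeding an 8 kHz telephone recording to a 16 kHz model (or vice versa) without such a conversion typically degrades recognition accuracy noticeably.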
Acoustic modeling is a crucial component in the field of automatic speech recognition (ASR) and various other
applications involving spoken language processing. It is the process of creating a statistical representation of the
relationship between acoustic features and phonemes, words, or other linguistic units in a spoken language.
Acoustic models play a central role in converting spoken language into text and are a key part of the larger ASR
system. Here's how acoustic modeling works:
1. **Feature Extraction:** The process starts with capturing audio input, which is typically sampled at a high
rate. Feature extraction is performed to convert this raw audio into a more compact and informative representation.
Common acoustic features include Mel-frequency cepstral coefficients (MFCCs) or filterbank energies. These
features capture the spectral characteristics of the audio signal over time.
2. **Training Data:** Acoustic modeling requires a significant amount of training data, typically consisting of
transcribed audio recordings. This data is used to establish statistical patterns between acoustic features and the
corresponding linguistic units (e.g., phonemes, words).
3. **Phoneme or State Modeling:** In traditional Hidden Markov Models (HMMs), which have been widely
used in ASR, the acoustic modeling process involves modeling phonemes or states. An HMM represents a
sequence of states, each associated with a specific acoustic observation probability distribution. These states
correspond to phonemes or sub-phonetic units.
4. **Building Gaussian Mixture Models (GMMs):** For each state or phoneme, a Gaussian Mixture Model
(GMM) is constructed. GMMs are a set of Gaussian distributions that model the likelihood of observing specific
acoustic features given a phoneme or state. These GMMs capture the variation in acoustic features associated
with each phoneme.
5. **Training the Models:** During training, the GMM parameters are estimated to maximize the likelihood of
the observed acoustic features given the transcribed training data. This training process adjusts the means and
covariances of the Gaussian components to fit the observed acoustic data.
6. **Decoding:** When transcribing new, unseen audio, the acoustic model is used in combination with language
and pronunciation models. The ASR system uses these models to search for the most likely sequence of phonemes
or words that best matches the observed acoustic features. Decoding algorithms like the Viterbi algorithm are
commonly used for this task.
7. **Integration:** The output of the acoustic model is combined with language and pronunciation models to
generate a final transcription or understanding of the spoken input.
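As a rough illustration of steps 3-5 (state modeling, GMMs, and training), the sketch below fits a small GMM-HMM with the hmmlearn library. The feature matrix, the number of states, and the number of mixture components are illustrative assumptions rather than values from the text; in a real system one such model would typically be trained per phoneme on aligned MFCC features.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

# Illustrative stand-in for real MFCC features: 200 frames of 13-dimensional vectors.
# In practice these would come from the feature-extraction step on transcribed audio.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 13))
lengths = [100, 100]          # two utterances of 100 frames each

# A single 3-state HMM with 2 Gaussian mixture components per state, trained with EM
# (the Baum-Welch procedure adjusts transition, mean and covariance parameters).
model = GMMHMM(n_components=3, n_mix=2, covariance_type="diag",
               n_iter=20, random_state=0)
model.fit(features, lengths)

# Log-likelihood of new observations under the trained model (used during decoding).
print(model.score(features[:100]))
```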
Modern ASR systems have evolved beyond HMM-based approaches, with deep learning techniques, such as deep
neural networks (DNNs) and recurrent neural networks (RNNs), becoming more prevalent in acoustic modeling.
Deep learning models can directly map acoustic features to phonemes or words, bypassing the need for GMMs
and HMMs. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are often used for this
purpose. These deep learning models have significantly improved the accuracy of ASR systems, making them
more robust to various accents, noise, and speaking styles.
In summary, acoustic modeling is a crucial step in automatic speech recognition, responsible for establishing the
statistical relationship between acoustic features and linguistic units. This process enables the conversion of
spoken language into text, and advances in deep learning techniques have greatly improved the accuracy and
efficiency of acoustic models in ASR systems.
Feature Extraction
Feature extraction is a fundamental step in acoustic modeling for tasks like automatic speech recognition (ASR)
and speaker identification. Its primary goal is to convert the raw audio signal into a more compact and informative
representation that captures relevant acoustic characteristics. The choice of acoustic features greatly impacts the
performance of the acoustic model. Here are some common techniques for feature extraction in acoustic
modeling:
1. **Mel-Frequency Cepstral Coefficients (MFCCs):** MFCCs are one of the most widely used acoustic
features in ASR. They mimic the human auditory system's sensitivity to different frequencies. The MFCC
extraction process typically involves the following steps:
- Pre-emphasis: Boosts high-frequency components to compensate for the natural attenuation of high frequencies in the speech signal (spectral tilt).
- Framing: The audio signal is divided into short overlapping frames, often around 20-30 milliseconds in
duration.
- Windowing: Each frame is multiplied by a windowing function (e.g., Hamming window) to reduce spectral
leakage.
- Fast Fourier Transform (FFT): The power spectrum of each frame is computed using the FFT.
- Mel-filterbank: A set of triangular filters on the Mel-scale is applied to the power spectrum. The resulting
filterbank energies capture the distribution of energy in different frequency bands.
- Logarithm: The logarithm of filterbank energies is taken to simulate the human perception of loudness.
- Discrete Cosine Transform (DCT): DCT is applied to decorrelate the log filterbank energies and produce a set
of MFCC coefficients.
2. **Filterbank Energies:** These are similar to the intermediate step of MFCC computation but without the
logarithm and DCT steps. Filterbank energies are a set of values that represent the energy in different frequency
bands over time. They are often used in conjunction with MFCCs or as a simpler alternative when the benefits of
MFCCs are not required.
3. **Spectrogram:** The spectrogram is a visual representation of the spectrum of frequencies in the audio
signal over time. It is often used as a feature for tasks that benefit from a time-frequency representation, such as
music genre classification and environmental sound recognition.
4. **Pitch and Fundamental Frequency (F0):** Extracting pitch information can be important for certain applications. Pitch is the perceived frequency of a sound and is often associated with prosody and intonation in speech.
5. **Linear Predictive Coding (LPC):** LPC analysis models the speech signal as the output of an all-pole
filter and extracts coefficients that represent the vocal tract's resonances. LPC features are used in speech coding
and sometimes ASR.
6. **Perceptual Linear Prediction (PLP) Cepstral Coefficients:** PLP is an alternative to MFCCs that
incorporates psychoacoustic principles, modeling the human auditory system's response more closely.
7. **Deep Learning-Based Features:** In recent years, deep neural networks have been used to learn features
directly from the raw waveform. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs)
can be used to capture high-level representations from audio data.
8. **Gammatone Filters:** These are designed to more closely mimic the response of the human auditory
system to different frequencies.
The choice of feature extraction method depends on the specific task and the characteristics of the data. For ASR,
MFCCs and filterbank energies are the most commonly used features. However, as deep learning techniques
become more prevalent in acoustic modeling, end-to-end systems that operate directly on raw audio data are
gaining popularity, and feature extraction is becoming integrated into the model architecture.
In signal processing, a filter bank (or filterbank) is an array of bandpass filters that separates the input signal into
multiple components, each one carrying a single frequency sub-band of the original signal.
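To make the MFCC and filterbank computations described in this section concrete, here is a minimal sketch using the librosa library; the file path and frame settings (25 ms frames, 10 ms hop, 40 Mel filters, 13 coefficients) are common choices assumed only for illustration.

```python
import numpy as np
import librosa

# "utterance.wav" is a placeholder path; sr=16000 resamples the audio to 16 kHz on load.
y, sr = librosa.load("utterance.wav", sr=16000)

# Mel filterbank energies: 25 ms frames (400 samples) with a 10 ms hop, 40 triangular Mel filters.
mel_energies = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40
)
log_mel = np.log(mel_energies + 1e-10)   # logarithm step, as in the MFCC pipeline

# 13 MFCCs per frame (librosa applies the Mel filterbank, log and DCT internally).
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_fft=400, hop_length=160, n_mfcc=13)
print(log_mel.shape, mfccs.shape)   # (40, num_frames) and (13, num_frames)
```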
HMM
A Hidden Markov Model (HMM) is a statistical model used for modeling sequential data, where the underlying
system is assumed to be a Markov process with hidden states. HMMs are widely used in various fields,
including speech recognition, natural language processing, bioinformatics, and more.
1. Markov Process:
• A Markov process, also known as a Markov chain, is a stochastic model that describes a system's
transitions from one state to another over discrete time steps.
• In a simple Markov chain, the future state of the system depends only on the current state and is
independent of all previous states. This property is called the Markov property.
2. Hidden States:
• In an HMM, there are two sets of states: hidden states and observable states.
• Hidden states represent the unobservable underlying structure of the system. They are
responsible for generating the observable data.
3. Observable Data:
• Observable states are the data that can be directly measured or observed.
• For example, in speech recognition, the hidden states might represent phonemes, while the
observable data are the audio signals.
4. State Transitions:
• An HMM defines the probabilities of transitioning from one hidden state to another. These
transition probabilities are often represented by a transition matrix.
• Transition probabilities can be time-dependent (time-inhomogeneous) or time-independent
(time-homogeneous).
5. Emission Probabilities:
• Emission probabilities specify the likelihood of emitting observable data from a particular hidden
state.
• In the context of speech recognition, these probabilities represent the likelihood of generating a
certain audio signal given the hidden state (e.g., a phoneme).
6. Initialization Probabilities:
• An HMM typically includes initial probabilities, which represent the probability distribution over
the initial hidden states at the start of the sequence.
7. Observations and Inference:
• Given a sequence of observations (observable data), the goal is to infer the most likely sequence
of hidden states.
• This is typically done using algorithms like the Viterbi algorithm, which finds the most probable
sequence of hidden states that generated the observations.
8. Learning HMM Parameters:
• Training an HMM involves estimating its parameters, including transition probabilities, emission
probabilities, and initial state probabilities.
• This can be done using methods like the Baum-Welch algorithm, which is a variant of the
Expectation-Maximization (EM) algorithm.
9. Applications:
• HMMs have a wide range of applications, such as speech recognition, where they can model
phonemes, natural language processing for part-of-speech tagging, bioinformatics for gene
prediction, and more.
10. Limitations:
• HMMs assume that the system is a first-order Markov process, which means it depends only on
the current state. More complex dependencies might require more advanced models.
• HMMs are also sensitive to their initial parameter estimates and might get stuck in local optima
during training.
In summary, Hidden Markov Models are a powerful tool for modeling and analyzing sequential data with
hidden structure. They are used in a variety of fields to uncover underlying patterns and make predictions based
on observed data.
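As a concrete illustration of the inference step described in item 7 above, here is a minimal NumPy sketch of the Viterbi algorithm; the two-state model and its probabilities are toy values chosen only for illustration.

```python
import numpy as np

def viterbi(start_prob, trans_prob, emit_prob, observations):
    """Return the most likely hidden-state sequence for the observed symbols."""
    n_states = len(start_prob)
    T = len(observations)
    log_delta = np.zeros((T, n_states))           # best log-probability ending in each state
    backpointer = np.zeros((T, n_states), dtype=int)

    log_delta[0] = np.log(start_prob) + np.log(emit_prob[:, observations[0]])
    for t in range(1, T):
        for s in range(n_states):
            scores = log_delta[t - 1] + np.log(trans_prob[:, s])
            backpointer[t, s] = np.argmax(scores)
            log_delta[t, s] = scores[backpointer[t, s]] + np.log(emit_prob[s, observations[t]])

    # Trace back from the best final state.
    path = [int(np.argmax(log_delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(backpointer[t, path[-1]]))
    return path[::-1]

# Toy 2-state HMM with 3 observable symbols (values are illustrative only).
start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
emit = np.array([[0.5, 0.4, 0.1],
                 [0.1, 0.3, 0.6]])
print(viterbi(start, trans, emit, [0, 1, 2]))   # [0, 0, 1]
```

The same dynamic-programming idea, applied to phoneme states and acoustic emission scores, is what an ASR decoder uses to recover the most likely word sequence.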
HMM-DNN
The hybrid HMM-DNN approach to speech recognition combines the strong learning power of the DNN with the sequential modelling capability of the HMM. Because a DNN accepts only fixed-size inputs, it cannot easily handle speech signals on its own, since they are variable-length, time-varying signals. In this approach, the HMM therefore models the dynamic (temporal) characteristics of the speech signal, while the DNN is responsible for the observation probabilities. Given the acoustic observations, each output neuron of the DNN is trained to estimate the posterior probability of a continuous-density HMM state. A DNN trained in the usual, purely supervised manner does not always produce good results and can be difficult to drive to an optimal point. When training data is supplied, emphasis should be placed on the variety of the data rather than simply its quantity, because a varied training set later supports better classification.
DNN-HMM systems, also known as Deep Neural Network-Hidden Markov Model systems, are a type of
technology used in automatic speech recognition (ASR) and other sequential data modeling tasks. These systems
combine deep neural networks (DNNs) with Hidden Markov Models (HMMs) to improve the accuracy and
robustness of speech recognition and other related applications. Here's a detailed explanation of DNN-HMM
systems:
1. **Acoustic Modeling**:
- Acoustic modeling in ASR is the process of estimating the likelihood of observing a given acoustic feature
(e.g., a frame of audio) given a particular state in the HMM.
- In DNN-HMM systems, DNNs are used to model this likelihood. They take acoustic features as input and
produce the probability distribution over the set of states.
2. **Training**:
- DNN-HMM systems are trained using large datasets of transcribed speech. The DNNs are trained to minimize
the error between their predicted state probabilities and the true state labels in the training data.
- DNNs can be trained using supervised learning techniques, and backpropagation is used to update the model's
weights.
3. **Decoding**:
- During the decoding phase, DNN-HMM systems use algorithms like the Viterbi algorithm to find the most
likely sequence of hidden states (phonemes or subword units) that best explain the observed acoustic features.
4. **Benefits**:
- DNN-HMM systems have significantly improved ASR accuracy, especially in challenging environments with
background noise and variations in speech.
- They capture complex acoustic patterns and can model a wide range of speakers and accents effectively.
5. **Challenges**:
- Training deep neural networks requires large amounts of labeled data and significant computational resources.
- DNN-HMM systems can be complex to design and optimize, and there's a risk of overfitting the model to the
training data.
DNN-HMM systems have been a major breakthrough in ASR technology and have significantly improved the
accuracy of speech recognition systems, making them more practical for real-world applications, including voice
assistants, transcription services, and more.
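As a rough sketch of the DNN component described above, the code below trains a small feed-forward network that maps a window of acoustic feature frames to a posterior distribution over HMM states, using PyTorch. The feature dimension, the number of states, and the random stand-in data are illustrative assumptions; in practice the frame-level state labels come from a forced alignment of the transcripts against an initial model.

```python
import torch
import torch.nn as nn

# Illustrative dimensions: 11 stacked frames of 13 MFCCs as input, 120 HMM states as output.
input_dim, num_states = 11 * 13, 120

# A small feed-forward acoustic model producing per-frame state scores.
dnn = nn.Sequential(
    nn.Linear(input_dim, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, num_states),        # logits; softmax is applied inside the loss
)

optimizer = torch.optim.Adam(dnn.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Stand-in batch: random features and random frame-level state labels (illustration only).
features = torch.randn(32, input_dim)
state_labels = torch.randint(0, num_states, (32,))

optimizer.zero_grad()
logits = dnn(features)
loss = loss_fn(logits, state_labels)
loss.backward()                         # backpropagation computes the gradients
optimizer.step()                        # weight update

# At decode time, softmax posteriors are converted to scaled likelihoods for the HMM search.
posteriors = torch.softmax(logits, dim=-1)
print(posteriors.shape)                 # torch.Size([32, 120])
```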