CHAPTER 1 INTRODUCTION
INTRODUCTION
Since the beginning of the 21st century, revolutionary developments have taken place in the technical and communication fields. These developments have affected several areas such as medicine, industry, and even translation. This generation has faced many problems and difficulties, and people have searched for new ideas that save time, effort, and money. In translation, many computer programs, electronic tools, and websites have been developed to serve this area: for instance, online translation sites, translating programs, electronic dictionaries, corpora, and so on. One area of interest is speech recognition.
Speech recognition is the process of capturing spoken words using a microphone or telephone and converting them into a digitally stored set of words. Speech recognition reduces the overhead caused by alternate communication methods. Speech has not been used much in the field of electronics and computers due to the complexity and variety of speech signals and sounds. However, with modern processes, algorithms, and methods we can process speech signals easily and recognize the text.
1.1 OBJECTIVE
This seminar presents an on-line speech-to-text engine implemented as a system-on-a-programmable-chip (SOPC) solution. The system acquires speech at run time through a microphone and processes the sampled speech to recognize the uttered text. A hidden Markov model (HMM) is used for speech recognition, converting the speech to text. The recognized text can be stored in a file on a PC that connects to an FPGA on a development board using a standard RS-232 serial cable. Our speech-to-text system directly acquires and converts speech to text. It can supplement other, larger systems, giving users a different choice for data entry. A speech-to-text system can also improve system accessibility by providing data entry options for blind, deaf, or physically handicapped users.
Pulse code modulation (PCM) is an extension of pulse amplitude modulation (PAM), wherein each analogue sample value is quantized into a discrete value for representation as a digital code word. Thus, as shown below, a PAM system can be converted into a PCM system by adding a suitable analogue-to-digital (A/D) converter at the source and a digital-to-analogue (D/A) converter at the destination.
[Fig 2.1 PCM modulation and demodulation: block diagram in which the analogue input is sampled, binary-coded (A/D conversion), and passed through a parallel-to-serial converter and digital pulse generator to form the PCM output; the receiver recovers the analogue signal through a D/A converter and low-pass filter (LPF).]

In PCM the speech signal is converted from analogue to digital form. PCM is standardised for telephony by the ITU-T (International Telecommunication Union Telecommunication Standardization Sector, a branch of the UN) in a series of recommendations called the G series. For example, the ITU-T recommendations for out-of-band signal rejection in PCM voice coders require that 14 dB of attenuation is provided at 4 kHz. Also, the ITU-T transmission quality specification for telephony terminals requires that the frequency response of the handset microphone has a sharp roll-off from 3.4 kHz.
In quantization the levels are assigned a binary code word. All sample values falling between two quantization levels are considered to be located at the centre of the quantization interval. In this manner the quantization process introduces a certain amount of error or distortion into the signal samples. This error, known as quantization noise, is minimised by establishing a large number of small quantization intervals. Of course, as the number of quantization intervals increases, so must the number of bits increase to uniquely identify the quantization intervals. For example, if an analogue voltage level is to be converted to a digital system with 8 discrete levels, or quantization steps, three bits are required. In the ITU-T version there are 256 quantization steps, 128 positive and 128 negative, requiring 8 bits. A positive level is represented by having bit 8 (the MSB) at 0, and for a negative level the MSB is 1.
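As a rough illustration of the step-to-bit relationship described above, the following C sketch (illustrative only, not part of the original design) maps an analogue sample in the range [-1.0, +1.0) to an 8-bit code with 128 uniform steps per polarity and the MSB as the sign bit (0 for positive, as described above). Note that the actual ITU-T coders use non-uniform (companded) step sizes rather than uniform ones.

#include <stdio.h>
#include <math.h>

/* Map an analogue sample in [-1.0, +1.0) to an 8-bit code:
 * 1 sign bit (MSB, 1 for negative as in the ITU-T convention above)
 * plus 7 bits selecting one of 128 uniform magnitude steps. */
static unsigned char quantize_uniform(double sample)
{
    unsigned char sign = (sample < 0.0) ? 0x80 : 0x00;
    double mag = fabs(sample);
    if (mag > 1.0) mag = 1.0;          /* clip out-of-range input */
    int step = (int)(mag * 128.0);     /* 128 steps per polarity  */
    if (step > 127) step = 127;
    return sign | (unsigned char)step;
}

int main(void)
{
    double samples[] = { 0.5, -0.25, 0.999, -1.0 };
    for (int i = 0; i < 4; i++)
        printf("%+.3f -> 0x%02X\n", samples[i], quantize_uniform(samples[i]));
    return 0;
}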
In such a hardware/software co-design, the work must be partitioned so that the processor and the custom hardware are both well utilised, such that neither is idle or under-utilized.

A Nios II processor system is equivalent to a microcontroller or computer on a chip that includes a processor and a combination of peripherals and memory on a single chip. A Nios II processor system consists of a Nios II processor core, a set of on-chip peripherals, on-chip memory, and interfaces to off-chip memory, all implemented on a single Altera device. Like a microcontroller family, all Nios II processor systems use a consistent instruction set and programming model.
The DE2 board has many features that allow the user to implement a wide range of designs, from simple circuits to various multimedia projects.
The following hardware is provided on the DE2 board:

- Altera Cyclone II 2C35 FPGA device
- Altera Serial Configuration device (EPCS16)
- USB Blaster (on board) for programming and user API control; both JTAG and Active Serial (AS) programming modes are supported
- 512-Kbyte SRAM
- 8-Mbyte SDRAM
- 4-Mbyte Flash memory (1 Mbyte on some boards)
- SD Card socket
- 4 pushbutton switches
- 18 toggle switches
- 18 red user LEDs
- 9 green user LEDs
- 50-MHz and 27-MHz oscillators for clock sources
- 24-bit CD-quality audio CODEC with line-in, line-out, and microphone-in jacks
- VGA DAC (10-bit high-speed triple DACs) with VGA-out connector
- TV decoder (NTSC/PAL) and TV-in connector
- 10/100 Ethernet controller with a connector
- USB Host/Slave controller with USB type A and type B connectors
- RS-232 transceiver and 9-pin connector
- PS/2 mouse/keyboard connector
- IrDA transceiver
- Two 40-pin expansion headers with diode protection
In addition to these hardware features, the DE2 board has software support for standard I/O interfaces and a control panel facility for accessing various components. Also, software is provided for a number of demonstrations that illustrate the advanced capabilities of the DE2 board.
[Figure: system block diagram. Speech enters through a microphone and passes through speech acquisition, speech preprocessing, and recognition stages (the latter using external hardware), with the recognized text sent to text storage.]
SPEECH ACQUISITION

During speech acquisition, speech samples are obtained from the speaker in real time and stored in memory for preprocessing. The microphone input port with the audio codec receives the signal, amplifies it, and converts it into 16-bit PCM digital samples at a sampling rate of 8 kHz. The system needs a parallel/serial interface to the Nios II processor and an application running on the processor that acquires and stores data in memory, as sketched below. The received samples are stored in memory on the Altera Development and Education (DE2) development board.
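A minimal sketch of such an acquisition application, written against the Nios II HAL, which exposes peripherals as UNIX-style character devices. The device name /dev/uart_0, the buffer length, and the two-bytes-per-sample framing are assumptions for illustration; the actual names come from the SOPC Builder system.

#include <fcntl.h>
#include <unistd.h>

#define NSAMPLES 16000            /* e.g., 2 s of speech at 8 kHz (assumed) */

static short samples[NSAMPLES];   /* sample buffer; the linker places it in SDRAM */

int main(void)
{
    int fd = open("/dev/uart_0", O_RDONLY);   /* assumed SOPC device name */
    if (fd < 0)
        return 1;

    /* Each 16-bit PCM sample arrives as two bytes over the codec UART. */
    int got = 0;
    while (got < NSAMPLES * 2) {
        int n = read(fd, (char *)samples + got, NSAMPLES * 2 - got);
        if (n <= 0)
            break;
        got += n;
    }
    close(fd);
    return 0;
}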
The codec requires initial configuration, which is performed using custom hardware implemented in the Altera Cyclone II FPGA on the board. The audio codec provides a serial communication interface, which is connected to a UART. We used SOPC Builder to add the UART to the Nios II processor to enable the interface. The UART is connected to the processor through the Avalon bus. The C application running on the HAL transfers data from the UART to the SDRAM; direct memory access (DMA) transfers data more efficiently and quickly, and we may use it instead in future designs.

SPEECH PREPROCESSING

The speech signal consists of the uttered digit along with a pause period and background noise. Preprocessing reduces the amount of processing required in later stages. Generally, preprocessing involves taking the speech samples as input, blocking the samples into frames, and returning a unique pattern for each sample, as described in the following steps.
1. The system must identify useful or significant samples from the speech signal. To accomplish this goal, the system divides the speech samples into overlapped frames.
2. The system checks the frames for voice activity using endpoint detection and energy threshold calculations.
3. The speech samples are passed through a pre-emphasis filter.
4. The frames with voice activity are passed through a Hamming window.
5. The system performs autocorrelation analysis on each frame.
6. The system finds linear predictive coding (LPC) coefficients using the Levinson-Durbin algorithm (a sketch of steps 5 and 6 follows this list).
7. From the LPC coefficients, the system determines the cepstral coefficients and weighs them using a tapered window. The cepstral coefficients serve as feature vectors.
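A compact C sketch of the autocorrelation and Levinson-Durbin recursion (illustrative, not the project's exact code; the LPC order P = 10 is an assumption, since the report does not state it):

#define P 10  /* assumed LPC order */

/* Compute LPC coefficients from one windowed frame x[0..N-1].
 * a[] holds the inverse-filter coefficients of
 * A(z) = 1 + a[1]z^-1 + ... + a[P]z^-P; the prediction
 * coefficients are their negatives. */
void lpc_from_frame(const double *x, int N, double a[P + 1])
{
    double r[P + 1], tmp[P + 1], e, k;

    /* Step 5: autocorrelation r[0..P] of the frame */
    for (int i = 0; i <= P; i++) {
        r[i] = 0.0;
        for (int n = 0; n < N - i; n++)
            r[i] += x[n] * x[n + i];
    }

    /* Step 6: Levinson-Durbin recursion */
    for (int i = 0; i <= P; i++) a[i] = 0.0;
    a[0] = 1.0;
    e = r[0];
    for (int i = 1; i <= P; i++) {
        k = -r[i];
        for (int j = 1; j < i; j++)
            k -= a[j] * r[i - j];
        k /= e;                         /* reflection coefficient */
        for (int j = 0; j <= i; j++) tmp[j] = a[j];
        for (int j = 1; j < i; j++)
            a[j] = tmp[j] + k * tmp[i - j];
        a[i] = k;
        e *= (1.0 - k * k);             /* prediction error shrinks each order */
    }
}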
[Figure: preprocessing chain. Each frame passes through the pre-emphasis filter, windowing, autocorrelation, and LPC/cepstral analysis.]
VOICE ACTIVITY DETECTION

The system uses the endpoint detection algorithm to find the start and end points of the speech. The speech is sliced into frames that are 450 samples long. Next, the system finds the energy and number of zero crossings of each frame. The threshold energy and zero-crossing value are determined based on the computed values, and only frames crossing the threshold are considered, removing most background noise. We include a small number of frames beyond the starting and ending frames so that we do not miss starting or ending parts that do not cross the threshold but may be important for recognition.

PRE-EMPHASIS

The digitized speech signal s(n) is put through a low-order digital filter to flatten the signal spectrally and make it less susceptible to finite-precision effects later in the signal processing. The filter is represented by the equation H(z) = 1 - a*z^(-1), where a is 0.9375.

FRAME BLOCKING

Speech frames are formed with a duration of 56.25 ms (N = 450 samples) and a shift of 18.75 ms (M = 150 samples) between adjacent frames, so that consecutive frames overlap. The overlapping ensures that the resulting LPC spectral estimates are correlated from frame to frame and are quite smooth. The frames are given by x_q(n) = s(Mq + n), with n = 0 to N - 1 and q = 0 to L - 1, where L is the number of frames.

WINDOWING

A Hamming window is applied to each frame to minimize signal discontinuities at the beginning and end of the frame, according to the equation x_q(n) = x_q(n) * w(n), where w(n) = 0.54 - 0.46*cos(2*pi*n/(N - 1)).
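These per-frame operations are simple enough to show directly. A minimal C sketch (illustrative; threshold selection and buffer management are omitted) of the pre-emphasis filter, the Hamming window, and the energy and zero-crossing features used for endpoint detection:

#include <math.h>

#define PI        3.14159265358979323846
#define FRAME_LEN 450     /* N, from the report */
#define PRE_EMPH  0.9375  /* a, from the report */

/* Pre-emphasis: s(n) <- s(n) - a*s(n-1), applied in place (backwards,
 * so each sample is updated using the original previous sample). */
void pre_emphasis(double *s, int n)
{
    for (int i = n - 1; i > 0; i--)
        s[i] -= PRE_EMPH * s[i - 1];
}

/* Hamming window: x(n) <- x(n) * (0.54 - 0.46*cos(2*pi*n/(N-1))). */
void hamming(double *x, int n)
{
    for (int i = 0; i < n; i++)
        x[i] *= 0.54 - 0.46 * cos(2.0 * PI * i / (n - 1));
}

/* Energy and zero-crossing count of one frame, for endpoint detection. */
void frame_features(const double *x, int n, double *energy, int *zc)
{
    *energy = x[0] * x[0];
    *zc = 0;
    for (int i = 1; i < n; i++) {
        *energy += x[i] * x[i];
        if ((x[i] >= 0.0) != (x[i - 1] >= 0.0))
            (*zc)++;
    }
}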
The hidden Markov model (HMM) is a powerful statistical tool for modeling generative sequences that can be characterised by an underlying process generating an observable sequence. HMMs have found application in many areas of signal processing, in particular speech processing, but have also been applied with success to low-level natural language processing (NLP) tasks such as part-of-speech tagging, phrase chunking, and extracting target information from documents. Andrei Markov gave his name to the mathematical theory of Markov processes in the early twentieth century, but it was Baum and his colleagues who developed the theory of HMMs in the 1960s [2].

HMM TRAINING

An important part of speech-to-text conversion using pattern recognition is training. Training involves creating a pattern representative of the features of a class using one or more test patterns that correspond to speech sounds of the same class. The resulting pattern (generally called a reference pattern) is an example or template, derived from some type of averaging technique. It can also be a model that characterizes the reference pattern statistics.
A model commonly used for speech recognition is the HMM, which is a statistical model used for modeling an unknown system using an observed output sequence. The system trains the HMM for each digit in the vocabulary using the Baum-Welch algorithm. The codebook index created during preprocessing is the observation vector for the HMM model.
After preprocessing the input speech samples to extract feature vectors, the system builds the codebook. The codebook is the reference code space that we can use to compare input feature vectors. The weighted cepstrum matrices for various users and digits are compared with the codebook. The nearest corresponding codebook vector indices are sent to the Baum-Welch algorithm for training an HMM model.
The HMM characterizes the system using three matrices:

A - the state transition probability distribution.
B - the observation symbol probability distribution.
π - the initial state distribution.
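In code, a per-digit model reduces to these three arrays. A minimal C sketch (the state count and codebook size below are assumed placeholders; the report does not give them):

#define NSTATES  5    /* number of HMM states (assumed)                   */
#define NSYMBOLS 64   /* codebook size, i.e., observation symbols (assumed) */

typedef struct {
    double A[NSTATES][NSTATES];   /* state transition probabilities   */
    double B[NSTATES][NSYMBOLS];  /* observation symbol probabilities */
    double pi[NSTATES];           /* initial state distribution       */
} HmmModel;                       /* one such model per digit */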
Any digit is completely characterized by its corresponding A, B, and π matrices. The A, B, and π matrices are modeled using the Baum-Welch algorithm, which is an iterative procedure (we limit the iterations to 20). The Baum-Welch algorithm gives three matrices for each digit corresponding to the three users with whom we created the vocabulary set. The A, B, and π matrices are averaged over the users to generalize them for user-independent recognition.
For the design to recognize the same digit uttered by a user on which the design has not been trained, the zero probabilities in the B matrix are replaced with a low value so that recognition yields a non-zero probability. To some extent, this arrangement overcomes the problem of limited training data (a sketch of this flooring pass appears below). Training is a one-time process. Due to the complexity and resource requirements, it is performed using standalone PC application software that we created by compiling our C program into an executable. For recognition, we compile the same C program but target it to run on the Nios II processor instead. We were able to accomplish this cross-compilation because of the wide support for the C language in the Nios II processor IDE.
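A sketch of the B-matrix flooring, reusing the HmmModel type from the sketch above (the floor value is an assumption; the report only says "a low value"):

#define FLOOR_PROB 1e-5   /* assumed floor value */

void floor_b_matrix(HmmModel *m)
{
    for (int i = 0; i < NSTATES; i++)
        for (int k = 0; k < NSYMBOLS; k++)
            if (m->B[i][k] == 0.0)
                m->B[i][k] = FLOOR_PROB;  /* avoid zero-probability symbols */
}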
The C program running on the PC takes the digit speech samples from a MATLAB output file and performs preprocessing, feature vector extraction, vector quantization, Baum-Welch modeling, and averaging, which outputs the normalized A, B, and π matrices for each digit. The normalized A, B, and π matrices are then embedded in the recognition C program code and stored in the DE2 development board's SDRAM using the Nios II IDE.

BAUM-WELCH ALGORITHM
The Baum-Welch algorithm is an instance of a general algorithm, the Expectation-Maximisation (EM) algorithm, which maximises the probability of the observations depending on hidden data. The algorithm requires specifying the number of states n of the learnt model. The algorithm finds a local maximum in the parameter space of n-state HMMs, rather than a global maximum.
[Figure: training flow. Preprocessing and weighted-cepstrum extraction feed the training stage, which trains the HMM for each digit for each user and averages the parameters (A, B, π) over all users.]
Recognition or pattern classification is the process of comparing the unknown test pattern with each sound class reference pattern and computing a measure of similarity (distance) between the test pattern and each reference pattern. The digit is recognized using a maximum likelihood estimate, such as the Viterbi decoding algorithm, which implies that the digit whose model has the maximum probability is the spoken digit.
Preprocessing, feature vector extraction, and codebook generation are the same as in HMM training. The input speech sample is preprocessed and the feature vector is extracted. Then, the index of the nearest codebook vector for each frame is sent to all digit models. The model with the maximum probability is chosen as the recognized digit.
After preprocessing in the Nios II processor, the required data is passed to the hardware for Viterbi decoding. Viterbi decoding is computationally intensive so we implemented it in the FPGA for better execution speed, taking advantage of hardware/software co-design. We wrote the Viterbi decoder in Verilog HDL and included it as a custom instruction in the Nios II processor. Data passes through the dataa and datab ports and the prefix port is used for control operations. The custom instruction copies or adds two floating-point numbers from dataa and datab, depending on the prefix input. The output (result) is sent back to the Nios II processor for further maximum likelihood estimation.
VITERBI ALGORITHM

The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states, called the Viterbi path, that results in a sequence of observed events, especially in the context of Markov information sources and, more generally, hidden Markov models. The forward algorithm is a closely related algorithm for computing the probability of a sequence of observed events. These algorithms belong to the realm of probability theory. The algorithm makes a number of assumptions.
First, both the observed events and hidden events must be in a sequence. The sequence is often temporal, i.e., in time order of occurrence. Second, these two sequences need to be aligned: an instance of an observed event needs to correspond to exactly one instance of a hidden event. Third, computing the most likely hidden sequence (which leads to a particular state) up to a certain point t must depend only on the observed event at point t and the most likely sequence which leads to that state at point t - 1.
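Under these assumptions the decoder reduces to a short dynamic-programming loop. A C sketch of the score computation (illustrative, reusing the HmmModel type sketched earlier; log probabilities are used here to avoid numerical underflow, which the report does not specify):

#include <math.h>

static double safe_log(double p) { return p > 0.0 ? log(p) : -1e30; }

/* Log-likelihood of the best state path for observation
 * sequence obs[0..T-1] under one digit model. */
double viterbi_logprob(const HmmModel *m, const int *obs, int T)
{
    double d[NSTATES], dn[NSTATES];

    /* Initialisation: delta_0(j) = log(pi_j) + log(b_j(o_0)) */
    for (int j = 0; j < NSTATES; j++)
        d[j] = safe_log(m->pi[j]) + safe_log(m->B[j][obs[0]]);

    /* Recursion: keep only the best predecessor for each state */
    for (int t = 1; t < T; t++) {
        for (int j = 0; j < NSTATES; j++) {
            double best = -1e30;
            for (int i = 0; i < NSTATES; i++) {
                double v = d[i] + safe_log(m->A[i][j]);
                if (v > best) best = v;
            }
            dn[j] = best + safe_log(m->B[j][obs[t]]);
        }
        for (int j = 0; j < NSTATES; j++) d[j] = dn[j];
    }

    /* Termination: score of the most likely path */
    double best = -1e30;
    for (int j = 0; j < NSTATES; j++)
        if (d[j] > best) best = d[j];
    return best;
}

The recognized digit is then the one whose model returns the highest score, matching the maximum likelihood decision described above.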
[Figure: recognition flow. Preprocessing is followed by finding the index of the nearest codebook vector (in a Euclidean sense) for each frame of the input speech; the indices are scored against the trained digit models for all digits and all users, and the model with the maximum probability gives the recognized digit.]
TEXT STORAGE

Our speech-to-text conversion system can send the recognized digit to a PC via the serial, USB, or Ethernet interface for backup or archiving. For our testing, we used a serial cable to connect the PC to the RS-232 port on the DE2 board. The Nios II processor on the DE2 board sends the digital speech data to a PC; a target program running on the PC receives the text and writes it to the disk.
We wrote the PC program in Visual Basic 6 (VB) using a Microsoft serial port control. The VB program must run in the background for the PC to receive the data and write it to the hard disk. The Windows HyperTerminal software or any other RS-232 serial communication receiver could also be used to receive and view the data. The serial port communication runs at 115,200 bits per second (bps) with 8 data bits, 1 stop bit, and no parity; handshaking signals are not required. The speech-to-text conversion system can also operate as a standalone network device using an Ethernet interface for PC communication and appropriate speech recognition server software designed for the Nios II processor.
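An equivalent receiver is easy to write in C against the Win32 serial API. The following sketch (not the project's VB program; the COM port and file names are assumptions) opens the port with the 115,200-8-N-1 settings above and appends received bytes to a file:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* "COM1" and "speech.txt" are assumed names for illustration. */
    HANDLE port = CreateFileA("COM1", GENERIC_READ, 0, NULL,
                              OPEN_EXISTING, 0, NULL);
    if (port == INVALID_HANDLE_VALUE) return 1;

    DCB dcb = {0};
    dcb.DCBlength = sizeof(dcb);
    GetCommState(port, &dcb);
    dcb.BaudRate = CBR_115200;   /* 115,200 bps           */
    dcb.ByteSize = 8;            /* 8 data bits           */
    dcb.Parity   = NOPARITY;     /* no parity             */
    dcb.StopBits = ONESTOPBIT;   /* 1 stop bit            */
    SetCommState(port, &dcb);    /* no handshaking needed */

    FILE *out = fopen("speech.txt", "ab");
    char buf[256];
    DWORD n;
    for (;;) {                   /* runs until interrupted, like the VB receiver */
        if (!ReadFile(port, buf, sizeof buf, &n, NULL)) break;
        if (n > 0) fwrite(buf, 1, n, out);
    }

    fclose(out);
    CloseHandle(port);
    return 0;
}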
[Flowchart: the PC receiver program starts, asks for a filename, opens the file, writes received data, and closes the file on completion.]
3.3 WORKING
The microphone input port with the audio codec receives the signal, amplifies it, and converts it into 16-bit PCM digital samples at a sampling rate of 8 kHz. These samples are transmitted to the Nios II processor in the FPGA through the communication interface.
Generally, a speech signal consists of noise, then speech, then noise again, so detecting the actual speech in the given samples is important. The speech signal is divided into frames of 450 samples each with an overlap of 300 samples, i.e., two-thirds of a frame length. The speech is separated from the pauses using voice activity detection (VAD) techniques. The received samples are stored in memory on the Altera Development and Education (DE2) development board.
The speech signal consists of the uttered digit along with a pause period and background noise. Preprocessing involves taking the speech samples as input, blocking the samples into frames, and returning a unique pattern for each sample. The system performs speech analysis using the linear predictive coding (LPC) method. From the LPC coefficients we get the weighted cepstral coefficients and cepstral time derivatives, which form the feature vector for a frame. Then, the system performs vector quantization using a vector codebook. The resulting vectors form the observation sequence. For each word in the vocabulary, the system builds a hidden Markov model (HMM) and trains the model during the training phase. The training steps, from VAD to HMM model building, are performed using PC-based C programs. The resulting HMM models are loaded onto the FPGA for the recognition phase.
In the recognition phase, the speech is acquired dynamically from the microphone through a codec and is stored in the FPGA's memory. These speech samples are preprocessed, and the probability of getting the observation sequence for each model is calculated. The uttered word is recognized based on a maximum likelihood estimate.
The following tools and resources were used in the design:

- Sound recorder
- MATLAB version 6
- C
- Dev-C++ with gcc and gdb
- Quartus II version 5.1
- SOPC Builder version 5.1
- Nios II processor
- Nios II IDE version 5.1
- MegaCore IP library
- Altera Development and Education (DE2) board
- Microphone and headset
- RS-232 cable
CHAPTER 4 APPLICATIONS
APPLICATIONS
- Interactive voice response systems (IVRS)
- Voice dialing in mobile phones and telephones
- Hands-free dialing in wireless Bluetooth headsets
- PIN and numeric password entry modules
- Automated teller machines (ATMs)
CHAPTER 5 REFERENCES
REFERENCES
1. Topic taken from seminartopics.co.in/ece-seminar-topics/
2. Garg, Mohit. "Linear Prediction Algorithms." Indian Institute of Technology Bombay, India, April 2003.
3. Li, Gongjun, and Taiyi Huang. "An Improved Training Algorithm in HMM-Based Speech Recognition." National Laboratory of Pattern Recognition, Chinese Academy of Sciences, Beijing.
4. Altera Nios II documentation.