Speech Recognition Using FPGA Technology

My friends David and Kanwen and I implemented a speech recognition system on an FPGA development board (the Altera DE2) for the Design Project course at McGill (ECSE 494). We did this in two steps: first we wrote a prototype of the algorithm in MATLAB (I may port it to Octave), and then we wrote the hardware description for the FPGA.

MATLAB Prototype

Inspired by the algorithm described on a site from the University of Toronto, we wrote two MATLAB scripts: train.m and recogniz.m.

train.m deals with the training phase, in which many versions of a sound (a spoken word, for instance) are input and averaged in the frequency domain, thus generating the sound’s “reference fingerprint”.
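The averaging step can be sketched in a few lines. This is a Python illustration, not the train.m script: the function names are hypothetical, the naive DFT stands in for a real FFT, and averaging the magnitude of each frequency bin is an assumption (the exact quantity that train.m averages is not stated here).

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Magnitude of a naive O(n^2) DFT; slow, but enough for an illustration."""
    n = len(samples)
    spectrum = []
    for k in range(n):
        acc = 0j
        for t, x in enumerate(samples):
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        spectrum.append(abs(acc))
    return spectrum


def reference_fingerprint(recordings):
    """Average, bin by bin, the magnitude spectra of equal-length recordings."""
    if not recordings:
        raise ValueError("need at least one recording")
    length = len(recordings[0])
    if any(len(r) != length for r in recordings):
        raise ValueError("all recordings must have the same length")
    spectra = [magnitude_spectrum(r) for r in recordings]
    return [sum(column) / len(spectra) for column in zip(*spectra)]


# Two toy "recordings" of the same tone at different volumes (8 samples each).
quiet = [0, 1, 0, -1, 0, 1, 0, -1]
loud = [0, 2, 0, -2, 0, 2, 0, -2]
fingerprint = reference_fingerprint([quiet, loud])
```

Both toy recordings put all their energy in the same frequency bin, so the averaged fingerprint peaks there as well.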

recogniz.m deals with the recognition phase, where a sound is input, translated to the frequency domain (i.e. its fingerprint is generated), and compared to the reference fingerprint by computing the Euclidean distance between them (as if both fingerprints were vectors).
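The comparison itself is just the Euclidean distance between two equal-length vectors. A minimal Python sketch follows (again, not the recogniz.m code). The is_match helper and its threshold argument are hypothetical, added only to show how a distance could be turned into a yes/no decision.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two fingerprints treated as vectors."""
    if len(u) != len(v):
        raise ValueError("fingerprints must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def is_match(candidate, reference, threshold):
    """Hypothetical decision rule: accept when the distance is small enough."""
    return euclidean_distance(candidate, reference) <= threshold


distance = euclidean_distance([0.0, 0.0], [3.0, 4.0])
```

A smaller distance means the input sound is closer to the trained reference; identical fingerprints give a distance of 0.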

Both scripts need to detect the beginning of the sound (i.e. tell when the spoken word begins). They do so by averaging two adjacent groups of 1024 sound samples (in the time domain) and computing the difference between the averages. So, if there is a sudden increase in the sound’s amplitude, the difference will be significant and the sound is assumed to start after that sudden increase. The sound’s length is fixed at 1.024 s (see the picture below for more details).
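A rough Python sketch of this kind of onset detector is shown below. Several details are assumptions made for the illustration and are not taken from the scripts: the averages are taken over absolute sample values, the groups advance one whole group at a time, samples are normalized to the range -1 to 1, and the default threshold of 0.05 is a placeholder.

```python
def detect_onset(samples, window=1024, threshold=0.05):
    """Return the index where the sound is assumed to start, or None.

    Compares the average absolute amplitude of two adjacent groups of
    `window` samples and reports the start of the second group as soon
    as its average exceeds the first one by more than `threshold`.
    """
    for start in range(0, len(samples) - 2 * window + 1, window):
        first = samples[start:start + window]
        second = samples[start + window:start + 2 * window]
        avg_first = sum(abs(x) for x in first) / window
        avg_second = sum(abs(x) for x in second) / window
        if avg_second - avg_first > threshold:
            return start + window
    return None


# Silence followed by a constant-amplitude "sound", using a tiny window.
signal = [0.0] * 8 + [0.5] * 8
onset = detect_onset(signal, window=4)
```

With a fixed threshold, a quiet recording may never produce a large enough difference, in which case the sketch returns None instead of an index.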

Note that the scripts use 16-bit WAV files sampled at 22050 Hz as input (this is the default Windows Sound Recorder output; I could not record in Linux because the mic did not want to work). The sound input is downsampled and quantized to bring it down to 8 bits/sample at 5 kHz for processing.
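To make the reduction concrete, here is a simplified Python sketch. It is an assumption-laden illustration, not the method used in the scripts: it picks one source sample for each output instant without any anti-aliasing filter, and it quantizes by keeping the 8 most significant bits of each signed 16-bit sample.

```python
def downsample(samples, src_rate=22050, dst_rate=5000):
    """Pick the source sample at or just before each output instant.

    No low-pass filter is applied, so this is only a rough illustration.
    """
    n_out = len(samples) * dst_rate // src_rate
    return [samples[i * src_rate // dst_rate] for i in range(n_out)]


def quantize_16_to_8(samples):
    """Keep the 8 most significant bits of signed 16-bit samples."""
    return [s >> 8 for s in samples]


# One second of dummy 16-bit input at 22050 Hz becomes 5000 8-bit samples.
one_second = [0] * 22050
reduced = quantize_16_to_8(downsample(one_second))
```

The ratio 22050 / 5000 is not an integer, which is why the sketch computes each source index instead of simply keeping every n-th sample.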

Also, you might encounter problems if the sound file is too short (it should last for more than 1.1 s), or if its volume level is too low (this happens because the detector threshold is fixed).

Hardware Implementation

Once we had played enough with the MATLAB prototype parameters, we mapped the algorithm to combinational logic and finite state machines (FSMs) by breaking it down into independent modules.

For more details about the hardware implementation and the project in general you can read the full project report. You may also want to see the slides for a presentation we did (below).

Unfortunately, I cannot post the project files (i.e. VHDL code).

Here is a little video demo, enjoy:

Note that all the documentation for this project was done using the very excellent OpenOffice.org.

84 thoughts on “Speech Recognition Using FPGA Technology”

  1. Hello Mr. Carlos, I have not understood this sentence about the frequency content:
    “Since the length of a word is 1.024 s and the sound is sampled at 5 kHz, five 1024-points FFTs are required to fully characterize a single word.” Please explain it to me.

    best regards.

  2. @jesska
    Sorry for the delay. This is quite simple: let us say a word (an actual sound) lasts for 1.024 seconds, and that the sound is sampled at 5 kHz (five thousand times per second). This means that a 1.024 s sound contains 1.024 × 5000 = 5120 samples. So, in order to compute the FFT of the whole word we would need a 5120-point FFT, or five 1024-point FFTs (5 × 1024 = 5120).

    I hope this is clearer.

  3. @jesska
    As I said before: I do not have the code any more. It is in some obscure backup, who knows where (that’s how organized I am). Also, I’m sure you can do it yourself: it is a simple mathematical formula (the Euclidean distance).

    Good Luck!

  4. Hello Carlos,
    What is the major advantage of processing speech in an FPGA rather than a DSP?
    Which compiler are you going to use in your project?

  5. I just want a brief idea of the performance of the speech recognition process using the VHDL implementation.

  6. hi carlos,
    I read your report of your project on speech recognition.
    I would like if possible for me to understand the working principle of inputs / outputs of the block Memory Batch operator, and especially “start_addr”, “end_addr” and “done”.
    thank you.

  7. Hi,
    I have read your project; it is good. I just wanted to know whether the threshold value remained 0.05 when you implemented it in the FPGA, or whether it changed.

  8. Hi,
    Consider that one person trains the kit. So, the FFT values of his speech will be stored in training mode. Now, if the speaker during recognition is a different person, what will the recognition accuracy be?

  9. hi carlos,
    I read your report of your project on speech recognition.
    I would like if possible for me to understand the working principle of inputs / outputs of the block Memory Batch operator, and especially “start_addr”, “end_addr” and “done”.
    thank you.

  10. Hi carlos,
    Is the sound length exactly 1.024 s, or can it be greater than that, for the input to the training phase of speech recognition using MATLAB? I gave the input using a microphone, with a length of 9 s. After running the program it gives the error INDEX EXCEEDS MATRIX DIMENSION.

  11. hi carlos,
    Speech recognition is already present in operating systems like Vista. What is the importance of this project, given that the hardware costs more? Please clear my doubt.
    I have also done VHDL code for this project. Please reply.

  12. @vasun
    Speech recognition performed in hardware is different from its software counterpart in the sense that it can be integrated in a single chip and work without a computer. Also, the aim of this project was mainly to combine various disciplines learned in the electrical engineering curriculum in a single capstone project.

  13. hi carlitos,
    I read your project….i like it…and myself i want do the same as my final year project…

    i can understand the matlab codes…i ‘m new to Matlab and Quartus II …

    could u explain the steps from Matlab to QuartusII…

  14. hai,sir
    i’m from india..i already posted a comment..but i didn’t mentioned clearly about my doubts.

    i can understand the matlab programs..and i gave one wave file as input & i got the output as distance 0,word is recognized!…

    could u please guide me…wat can i do next sir…then after this i have to move to Quartus ii or else here itself steps are there sir…

    then sir with the help your documents…i sucessfully created Block design files for memory controller,distance module and mux….

    so please guide sir…wat are all steps i can do next…

  15. @Thiru
    I suggest you make sure you understand the algorithm in general. Then the implementation should be straightforward. It is very important to understand all the basic concepts before you worry about the details of the algorithm.

  16. Hi,
    Your project is good.But I didn’t understand the need of using FFT. Why was FFT used?
    Can’t we recognize speech without using FFT? What will be the effect of not using FFT?

  17. Can you please explain the reason for your selection of the Altera DE2 board among the numerous boards (e.g. Virtex) available on the market?

  18. Best you should make changes to the webpage title Speech Recognition Using FPGA Technology | Carlitos’ Contraptions to more specific for your subject you make. I loved the blog post nevertheless.

  19. Hello sir

    I did this project as my mini project in M.Tech.

    Duration: Jan to May 2010.

    I tried this project on the DE2 board. It was partially executed.

    The problems are:

    1. The FFT code is not synthesized.
    2. The fingerprint is continuously changing.

    We showed our project to an NXP representative and he appreciated it.

    I will send you the code if you send me your mail id.

    I want to work on it.

    Regards
    Vikas Billa
    Harini Nellutla

  20. is it possible for me to view someone who has done the verilog coding for this project ?
    thank you

  21. hi carlitos,

    I need some help regarding this project. I will come straight to the point.
    In the sound fetcher module, what are the two enables ENABLE_shiftreg and ENABLE_8bitFF
    used for? I went through the Quartus help and found that, for the D_FFs, the clock_enable input
    defaults to 1 if omitted.
    One more question: the aim of the downsampler is to sample the data down from 48 to 5 kHz,
    so I just made pulse10000 act like a 5 kHz clock and fed it directly to the clock of the downsampler D_FF.
    (I set the enables of both the quantizer and the downsampler to ‘1’ (high), and enabled the shift_reg for the first 23 bits on BCLK for the left channel of ADCLRCK.)
    Am I thinking about this wrong? I do not need any code, I just want your advice. Any help will be greatly appreciated.
    (I am using a DE2 board.)

    thanking you,
    fester.

  22. hi Mr carlos,
    can you please explain what master block and slave block mean?
    best regards.

  23. Hi Mr Carlos..
    When I run the train.m file, it shows:

    ??? Index exceeds matrix dimensions.
    Error in ==> train at 99
    s = xq(ptr:int32(ptr+l*sf/F)); % Store the detected sound in ‘s’.

    Can u please explain the mistake that I made?
    Thanx in advance ..
    Best Regards

  24. Hello Mr. Carlos,
    Are you really sampling the sound at 5 KHz?
    At 5 KHz you can have the highest frequency of 2500 Hz.
    Isn’t it too low? (The second formant can come up to 3500 Hz, and the fifth formant can come up to 5000 Hz. If you save only low part of spectrum, below 2500, the voice sounds become hardly distinguishable.)
