Pronunciation: the link between phonemes and words

The importance of different pronunciations in word recognition

Image: pronunciation in the speech production and acquisition chain


In the previous article, "Phoneme detection, a key step in speech recognition", we saw how the system recognizes phonemes. We also concluded that, working from phonemes alone, it can output words that do not exist. This article focuses on pronunciation and describes how to constrain recognition to valid words.

Combining the acoustic model and the pronunciation model

The pronunciation of a word consists of a series of phonemes, and the same word can be pronounced in different ways. The Pronunciation Model (also known as a Pronunciation Dictionary or Lexicon) lists all pronunciations of all words that the system will be able to recognize.
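As a rough illustration, a pronunciation dictionary can be sketched as a mapping from each word to one or more phoneme sequences. The entries below use ARPAbet-style symbols and are illustrative only; the names `LEXICON`, `pronunciations` and `words_for` are hypothetical and do not come from any particular toolkit.

```python
# Toy pronunciation dictionary: each word maps to one or more phoneme sequences.
# Symbols are ARPAbet-style; the entries are illustrative, not a real lexicon.
LEXICON = {
    "tomato": [
        ("T", "AH", "M", "EY", "T", "OW"),
        ("T", "AH", "M", "AA", "T", "OW"),
    ],
    "potato": [
        ("P", "AH", "T", "EY", "T", "OW"),
    ],
}


def pronunciations(word):
    """Return all known pronunciations of a word (empty list if unknown)."""
    return list(LEXICON.get(word.lower(), []))


def words_for(phonemes):
    """Return every word whose dictionary entry contains this exact sequence."""
    target = tuple(phonemes)
    return sorted(w for w, prons in LEXICON.items() if target in prons)


print(pronunciations("tomato"))
print(words_for(["T", "AH", "M", "AA", "T", "OW"]))
```

Storing several sequences per word is what lets the same word be recognized under different pronunciations.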
To account for how fast a word is spoken, a mathematical model is used to handle the variable durations of phonemes. The Hidden Markov Model (HMM) is a probabilistic automaton that takes the temporal nature of the audio signal into account, in particular through transitions that loop back onto the same phoneme. The probabilities inside each state come from the phoneme recognition described in the previous article.
This combination of the acoustic model and the pronunciation model is called the acoustic-phonetics model. It assigns an HMM to each word. During the training phase, the transition probabilities between states (here, phonemes) are computed and stored. During the decoding phase, these precomputed probabilities are reused.
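To make the idea concrete, here is a minimal sketch of word-level HMM decoding, under several simplifying assumptions: each word HMM has one state per phoneme, arranged left to right; the probability of staying on the same phoneme (`p_stay`) is a fixed constant rather than a value learned during training; and each audio frame is represented as a dictionary of phoneme probabilities, standing in for the output of the acoustic model. All names (`viterbi_log_score`, `recognize`, `LEXICON`) are hypothetical.

```python
import math

NEG_INF = float("-inf")

# Toy lexicon (illustrative ARPAbet-style entries).
LEXICON = {
    "tomato": [
        ("T", "AH", "M", "EY", "T", "OW"),
        ("T", "AH", "M", "AA", "T", "OW"),
    ],
    "potato": [("P", "AH", "T", "EY", "T", "OW")],
}


def viterbi_log_score(phonemes, frames, p_stay=0.6, floor=1e-6):
    """Best-path log-probability of the frames under a left-to-right word HMM.

    phonemes: one HMM state per phoneme, in order.
    frames:   list of dicts {phoneme: probability}, one per audio frame.
    p_stay:   assumed self-transition probability (a real system learns it).
    """
    if not phonemes or not frames:
        return NEG_INF
    n = len(phonemes)
    log_stay = math.log(p_stay)
    log_next = math.log(1.0 - p_stay)

    def emit(state, frame):
        # Probability the acoustic model gives to this state's phoneme.
        return math.log(max(frame.get(phonemes[state], 0.0), floor))

    # scores[s] = best log-probability of a path that ends in state s.
    scores = [NEG_INF] * n
    scores[0] = emit(0, frames[0])  # the path must start on the first phoneme
    for frame in frames[1:]:
        new = [NEG_INF] * n
        for s in range(n):
            stay = scores[s] + log_stay  # remain on the same phoneme
            move = scores[s - 1] + log_next if s > 0 else NEG_INF
            best = max(stay, move)
            if best > NEG_INF:
                new[s] = best + emit(s, frame)
        scores = new
    return scores[-1]  # the path must end on the last phoneme


def recognize(frames, lexicon):
    """Return the (word, score) whose best pronunciation HMM scores highest."""
    best_word, best_score = None, NEG_INF
    for word, prons in lexicon.items():
        for pron in prons:
            score = viterbi_log_score(pron, frames)
            if score > best_score:
                best_word, best_score = word, score
    return best_word, best_score


# A slowly spoken "tomato": some phonemes last for several frames.
slow = [{p: 0.9} for p in "T T AH M AA AA AA T OW OW".split()]
print(recognize(slow, LEXICON))
```

The self-transition is what absorbs differences in speaking rate: a phoneme held for three frames simply takes the loop twice before moving on. Because only the HMMs built from the dictionary are scored, the output is always a dictionary word.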

Figure 1: HMM of the word "Tomato"

The advantage is that, thanks to this list of HMMs, which constitutes the pronunciation dictionary, the system can only output words it knows. However, there is a flaw: words missing from the pronunciation dictionary, typically acronyms and proper names, cannot be predicted.
Acoustic-phonetics decoding alone is not enough to detect a sentence. At this stage, the system can still predict word sequences that are not correct. For example, "you whereas he or but until however" is a possible output, even though the sentence makes no sense.
We have seen how to reduce the Phoneme Error Rate (PER) by using a pronunciation dictionary.
In a future article, we will look at the language model, which adds consistency to the predicted word sequences.


Authôt. You speak. We write