Phoneme detection, a key step in speech recognition

The main stages of speech recognition: the detection of phonemes

phoneme speech recognition

                                                                                           Speech Production Acquisition Chain

Following the series of articles published in summer 2016 and presenting the fundamental components of the Speech Production Acquisition Chain, we will discuss in more detail how the system is able to make the link between:

– an audio file containing speech and, 
– the text spoken.

For a better understanding, we invite you to read the previous articles:

Now, we will describe the main steps to transcribe an audio file into text.

Phoneme recognition

Phoneme recognition is carried out using the acoustic model. The acoustic model is created using machine learning algorithms. The machine learning is divided into two phases:
training and testing.
The acoustic model is first calculated during the training phase, then the model is used during the decoding phase to transcribe the audio statement into text.

1 The training phase

During this learning phase, we use large audio volumes (several hundred hours), for which the data has been previously transcribed. These data make possible the link between an acoustic realization and a phoneme. For each phoneme, a large number of acoustic realizations will be studied: these different realizations can be variable because of noise, reverberation, different speakers, different phonetic contexts (previous phoneme and following phoneme)…
For example, if we take the case of phoneme [æ] (cat). Analysis of energy behaviour in time-frequency domain of a very large number of phonemes [æ], pronounced under different conditions, will allow the creation of a "general" model of [æ] using a mixture of
Gaussian Mixture Model (GMM).

phoneme occurences

                                                          Figure 1: Modeling phoneme [ae] using multiple occurrences of phoneme [ae].

As can be seen in Figure 1, the [æ] pronounced by different speakers are slightly different. This is due to variations in the “vowel space” that is specific to speakers.

2 Adaptation to speakers

In order to make the best use of our general [æ] model, we will have to adapt this model to unknown speakers during decoding (which automatically transcribes an audio file into text). Since there are a lot of adaptation methods, we'll just look at the basic principle.
The model of [æ], previously calculated during the learning phase,
will undergo a mathematical transformation of these parameters, such as translations and rotations so that the space of these parameters is closest to the space of the parameters of an unknown speaker. Once this transformation is completed, our general model will specialize in order to better recognize phonemes from the unknown speaker.

phoneme speaker                                                                                    Figure 2: Adaptation of the general model [ae] to the speaker x

Once our acoustic model is adapted, it is ready to use.

3 The test phase

Over time, we will analyze the behavior of the energy in the time-frequency marker of the audio file of which we would like to know the most likely pronounced phonemes. If the observation n is closer to the phoneme model [æ], then the phoneme [æ] will be the most likely pronounced phoneme.

phoneme detection
                                                                                       Figure 3: Principle of phoneme detection​

We have seen how the system is able to recognize a phoneme. However, phoneme detection is not always correct.

phoneme mistake

                                                               Figure 4: Current phoneme error rate on the TIMIT database  (read speech corpus)​

For the moment, the system is able to predict phoneme sequences that are not words. For example, [tʃ] [ɡ] [ð] [eɪ] (ch_g_th_ay ) is possible.

In a later article we will look at the pronunciation model, which will force the detection of a series of phonemes in order to recognize words only.


Authôt. You speak. We write