A study of the language model in speech recognition

The language model: anticipating coherent words


In our previous articles we saw how the system recognizes words and why pronunciation matters. We also concluded that, on its own, the system can output sequences of incoherent words. In this article we focus on the language model and describe how it steers the system toward coherent sentences.

Phrase recognition using the language model

To steer detection toward more coherent sentences, we use the language model.
The language model, just like the acoustic model, is built on a statistical study. Many methods exist, but here we will focus specifically on the n-gram model.
During the learning phase, large quantities of text are analyzed in order to estimate the conditional probability of each word. This means that for each word present in a text, we estimate the probability that this word appears given the previous n−1 words.
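The estimation step described above can be sketched in a few lines. The following is a minimal bigram (n = 2) example on a toy corpus; the sentences and the function name `p` are illustrative, not part of a real training pipeline:

```python
from collections import Counter

# Toy corpus standing in for the large text collections analyzed
# during the learning phase (illustrative only).
corpus = [
    "i want to eat a tomato",
    "i want to eat a carrot",
    "i want to drink water",
]

# Count each bigram and each context word.
bigram_counts = Counter()
context_counts = Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    for prev, word in zip(words, words[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1

def p(word, prev):
    """Estimated conditional probability P(word | prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

# Two of the three occurrences of "to" are followed by "eat".
print(p("eat", "to"))
```

Real systems use much larger n-gram orders and smoothing techniques to handle word sequences never seen in training, but the counting principle is the same.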

This study, when carried out on a large amount of text, makes it possible to model links between words. It is more likely that a verb will be preceded by a subject, or that an adjective will be preceded or followed by a noun. Indeed, in the texts used during the learning process, these patterns appear far more frequently.
During the decoding phase, we use the pre-calculated statistics to predict the next word from the previous ones. For example, it is more likely to observe the sequence of words "I want to eat a tomato" than "I want to hit a tomato", and even more likely than "I want to eat a carpet". All these probabilities are modelled in the form of a graph called a Word Lattice.
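To make the tomato example concrete, here is a minimal sketch of how pre-calculated statistics can rank the three sentences. The bigram probabilities below are invented for illustration; a real model would estimate them from its training texts:

```python
import math

# Hypothetical bigram probabilities (illustrative values, not real estimates).
bigram_p = {
    ("i", "want"): 0.2, ("want", "to"): 0.5,
    ("to", "eat"): 0.3, ("to", "hit"): 0.01,
    ("eat", "a"): 0.4, ("hit", "a"): 0.2,
    ("a", "tomato"): 0.05, ("a", "carpet"): 0.0001,
}

def sentence_logprob(sentence, floor=1e-6):
    """Sum of log P(w_i | w_{i-1}); unseen bigrams get a small floor value."""
    words = sentence.lower().split()
    return sum(math.log(bigram_p.get((prev, w), floor))
               for prev, w in zip(words, words[1:]))

for s in ["I want to eat a tomato",
          "I want to hit a tomato",
          "I want to eat a carpet"]:
    print(f"{s}: {sentence_logprob(s):.2f}")
```

With these numbers, "I want to eat a tomato" scores highest and "I want to eat a carpet" lowest, matching the intuition in the text: the language model penalizes word pairs it has rarely seen together.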


Figure 1: Example of a Word Lattice

Thanks to the language model, we are able to create probabilistic links between words, which allows us to obtain a more logical sequence of words. The difficulty is that speech contains syntax errors, hesitations and formulations specific to spoken language. This is simply because we do not speak in the same way as we write. For example, in speech one might say "Went to Barcelona for the weekend. Lots to tell you.", whereas in writing one would put "We went to Barcelona for the weekend. We have a lot of things to tell you.". These differences are more challenging to model.

Decoding: the transformation of speech into text

We have now seen the acoustic model, the pronunciation model and the language model: all the steps involved in transcribing an audio file into text. The decoding phase combines the three models to predict the sentences most likely pronounced by the speaker. Here is a diagram summarizing the process.
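The combination of models mentioned above is often implemented as a weighted sum of log scores. The sketch below assumes two hypotheses for the same audio; the scores and the `lm_weight` parameter are illustrative, not from a real system:

```python
# Illustrative sketch of score combination during decoding.
# A decoder typically ranks hypotheses by the acoustic log-likelihood
# plus a weighted language-model log-probability; `lm_weight` is an
# assumed tuning parameter.
def combined_score(acoustic_logp, lm_logp, lm_weight=0.8):
    return acoustic_logp + lm_weight * lm_logp

# Two hypotheses with similar acoustic scores: the language model
# tips the balance toward the coherent sentence.
hyp_tomato = combined_score(acoustic_logp=-12.0, lm_logp=-4.0)  # "...eat a tomato"
hyp_carpet = combined_score(acoustic_logp=-11.5, lm_logp=-9.0)  # "...eat a carpet"
print(hyp_tomato > hyp_carpet)
```

This is why a slightly worse acoustic match can still win: the language model rewards hypotheses that form plausible sentences.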


Figure 2: Summary of the different models

This concludes our R&D series which aims to show you how the system is able to make the link between:
– an audio file containing speech,
– the pronounced text.

In these 3 articles, we looked at the question of phonemes, the importance of pronunciation, and finally how the language model allows the system to produce coherent word sequences.
All these steps are essential to understand
how speech recognition technology works.


Authot. You speak. We write.