Automatic Speech Recognition System

By Authôt, 9 September 2016

Here we conclude our series of articles presenting the fundamentals of the automatic conversion of speech to text. After discussing the principles of speech production, capture and digitization, we now introduce important concepts about automatic speech recognition systems.

Automatic Speech Recognition is defined as the set of computational methods dedicated to the conversion of speech into text.

Dictation software, first marketed in the 1980s, is undoubtedly the most popular application of this research field. Recent progress in this domain now offers the opportunity to use these technologies in a wider variety of applications, such as:

automatic subtitling and machine translation of videos,
information indexing and retrieval in audiovisual documents,
and vocal human-machine interfaces.

The worldwide potential market of over a billion users of connected devices ranks automatic speech recognition among the most promising technologies of our time.

Advances over time

Innovation in Automatic Speech Recognition is based on more than 50 years of scientific research. In the 1960s, the first systems, known as isolated-word recognition systems, were able to recognize words pronounced individually. Their lexicons were also very limited, containing in some cases only the digits from 0 to 9 or a few vowels.

In the 1980s, a major investment by the US Department of Defense in academic research led to the development of the first systems able to transcribe natural continuous speech. Since then, interest in this field has kept growing, generating major innovations such as:

the growth of system vocabularies, from one thousand words to more than 100,000 words;
the processing of more and more challenging spoken documents. Modern systems are now able to decode conversational speech involving several speakers;
the development of speaker-independent speech recognition systems, with recent breakthroughs relying on deep learning and deep neural networks;
and important progress in technologies robust to degraded recordings. Automatic speech recognition of noisy, reverberant speech is nowadays a major topic in the research community.

The Automatic Speech Recognition framework

A modern Automatic Speech Recognition system, also known as a Large Vocabulary Continuous Speech Recognition system, is typically composed of the following five modules:

an acoustic preprocessing module, which identifies the zones containing speech in the recordings and transforms them into sequences of acoustic parameters;
a pronunciation model, which associates words with their phonetic representations;
an acoustic model, used for predicting the most likely phonemes in a speech segment;
a language model, which predicts the most likely sequence of words among several candidate texts;
and finally a decoder, which combines the predictions of acoustic and language models and highlights the most likely text transcript for a given speech utterance.
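
As a rough illustration of the decoder's role, the sketch below combines acoustic and language model scores for two candidate transcripts and keeps the best one. It is a toy example under stated assumptions, not the way a production system works: the candidate texts, the probabilities, the `decode` function and the `lm_weight` parameter are all hypothetical, and a real decoder searches a far larger space of hypotheses. Scores are added in the log domain, which corresponds to choosing the text W that maximizes P(audio | W) × P(W).

```python
import math

# Hypothetical toy scores: every value below is an illustrative assumption.
# Acoustic model: log-probability of the audio given each candidate text.
ACOUSTIC_LOGPROB = {
    "recognize speech": math.log(0.40),
    "wreck a nice beach": math.log(0.45),
}

# Language model: log-probability of each candidate word sequence.
LANGUAGE_LOGPROB = {
    "recognize speech": math.log(0.010),
    "wreck a nice beach": math.log(0.0001),
}


def decode(candidates, acoustic, language, lm_weight=1.0):
    """Return the candidate text with the highest combined log-score."""
    def score(text):
        # Adding log-probabilities is equivalent to multiplying probabilities.
        return acoustic[text] + lm_weight * language[text]

    return max(candidates, key=score)


candidates = list(ACOUSTIC_LOGPROB)
print(decode(candidates, ACOUSTIC_LOGPROB, LANGUAGE_LOGPROB))
print(decode(candidates, ACOUSTIC_LOGPROB, LANGUAGE_LOGPROB, lm_weight=0.0))
```

With these made-up numbers, the acoustic model alone slightly prefers the second candidate, while the combined score prefers the first: the language model is what tips the decision towards the more plausible sentence.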

The performance of Automatic Speech Recognition is strongly linked to the methods and data used for learning the acoustic and language models. During training, the computing power of large servers is exploited to analyze a very large amount of audio recordings and the corresponding reference transcripts.

The power of the machine learning algorithms used in automatic speech recognition lies in their ability to generalize from the examples available in the training database, so that they can accurately transcribe new utterances never observed before.

Despite important progress, there is currently no universal automatic speech recognition system able to transcribe any recording with equivalent performance. In fact, automatic systems can sometimes perform as well as human annotators, but their performance depends strongly on the recording as well as on the quality of the learning phase with respect to the target task.
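
To make "performance" concrete, one commonly used measure is the word error rate: the number of word substitutions, deletions and insertions needed to turn the automatic transcript into the reference transcript, divided by the number of words in the reference. The article does not name a metric, so choosing this one is an assumption, and the function name `word_error_rate` is hypothetical. A minimal sketch:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j]: edit distance between the reference words processed so far
    # (none yet) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions needed against an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Here one word out of six is missing from the automatic transcript, so the error rate is one sixth. Comparing this figure across recordings of different quality is one way to observe the dependence described above.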

Acoustic and language models can in most cases be adapted to new application areas through the integration, during the learning stage, of additional prior knowledge related, for example, to the quality of the recording, the nature of the speech, the accent or the lexical field.
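
As one possible illustration of adapting a language model to a new lexical field, the sketch below mixes a general word-frequency model with a model estimated on domain text. Linear interpolation of unigram models is an assumption chosen for simplicity, not necessarily the technique used in any given system, and the sample sentences, function names and `domain_weight` value are all hypothetical.

```python
from collections import Counter


def unigram_model(text):
    """Estimate word probabilities from relative frequencies in the text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def interpolate(general, domain, domain_weight=0.5):
    """Mix two unigram models; domain_weight is the share given to the domain model."""
    vocabulary = set(general) | set(domain)
    return {
        word: domain_weight * domain.get(word, 0.0)
        + (1.0 - domain_weight) * general.get(word, 0.0)
        for word in vocabulary
    }


# Hypothetical training texts: a general corpus and a medical one.
general = unigram_model("the weather is nice the train is late")
domain = unigram_model("the patient has a fever the patient is stable")
adapted = interpolate(general, domain, domain_weight=0.7)

# "patient" is unknown to the general model but likely in the adapted one.
print(general.get("patient", 0.0), adapted["patient"])
```

The adapted model still sums to one over the combined vocabulary, and words specific to the target domain receive a probability they did not have in the general model.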

Thank you for your attention!

How is a computer converting speech to text?

This was the topic we developed for you throughout this summer on the Authôt blog, and we hope to see you again soon on Authot.com.


Authôt: You speak. We write.