Speech production system

Speech production

The automatic conversion of speech into text typically consists of four key steps:

  • the production of speech,
  • the audio capture,
  • the signal digitization, and finally
  • the application of speech recognition algorithms.

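The four steps above can be sketched as a minimal pipeline. This is only an illustrative skeleton, assuming hypothetical function names — it is not a real ASR library:

```python
# A minimal sketch of the four-step speech-to-text chain.
# All function names are hypothetical placeholders for illustration.

def capture_audio(duration_s: float, sample_rate: int = 16000):
    """Placeholder for microphone capture: returns a silent signal."""
    n_samples = int(duration_s * sample_rate)
    return [0.0] * n_samples

def digitize(signal, n_bits: int = 16):
    """Quantize each sample to n_bits signed integers (signal digitization)."""
    max_level = 2 ** (n_bits - 1) - 1
    return [round(s * max_level) for s in signal]

def recognize(samples):
    """Placeholder for the speech recognition algorithm itself."""
    return "<transcription goes here>"

audio = capture_audio(duration_s=1.0)   # step 2: audio capture
samples = digitize(audio)               # step 3: signal digitization
text = recognize(samples)               # step 4: recognition
print(len(samples), text)
```

Step 1, the production of speech itself, is the physiological part we discuss in this article.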
Here we discuss speech production, starting with two fundamental concepts that are often confused: voice and speech.


Speech Production Acquisition Chain

Voice and speech

There are fundamental differences between voice and speech.

  • Voice is the entire set of vocal noises a human being can produce. It materializes as mechanical waves propagating through the air and oscillating at frequencies ranging from 40 Hz to 8 kHz. As a reminder, the higher the frequency, the faster the wave vibrates, producing higher-pitched sounds.
  • Speech, by contrast, is the smaller set of vocal sounds dedicated to spoken communication in a given language. The frequency range needed to produce an intelligible message is narrower, with values between 300 Hz and 3.4 kHz. In the past, analog telephones transmitted information over this frequency band only, which caused the characteristic nasal voice effect.

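The claim that a higher frequency means faster vibration can be checked with a few lines of code. This sketch synthesizes two pure tones at the edges of the intelligible speech band (300 Hz and 3.4 kHz) and counts their zero crossings — a pure tone crosses zero roughly twice per period, so the higher tone crosses far more often:

```python
import math

SAMPLE_RATE = 16000  # samples per second

def sine_wave(freq_hz, duration_s=1.0):
    """Synthesize a pure tone at freq_hz, sampled at SAMPLE_RATE."""
    n = int(SAMPLE_RATE * duration_s)
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]

def zero_crossings(signal):
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(signal, signal[1:]) if a * b < 0)

low = sine_wave(300)     # low end of the intelligible speech band
high = sine_wave(3400)   # high end of the band

# An f Hz tone sustained for 1 s crosses zero about 2*f times.
print(zero_crossings(low), zero_crossings(high))
```

The 3.4 kHz tone oscillates more than eleven times faster than the 300 Hz one, which is exactly the ratio of their zero-crossing counts.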
Automatic Speech Transcription aims at recognizing the spoken content of a message and converting it into text.

Conversely, voice recognition technologies are the set of techniques used to identify a person through their voice, for example during a police investigation.

In summary, a speech message can be conveyed by many different voices, but each voice is essentially unique, mainly because it is strongly linked to the shape of the speaker's body. This brings us to some notions about the speech production system.

The vocal apparatus


Speech Production Vocal Apparatus

Speech production is based on complex phenomena, widely studied for their role in human cognition and communication. Here, we focus on physiological aspects.

A healthy human being produces sounds by driving air from the lungs. The coupling between the lungs, vocal folds, and vocal tract (the oral and nasal cavities), together with the positions of the tongue, jaws, lips, and teeth, enables voice modulation and the distribution of energy into specific vibrational modes for different speech sound units.

By simply placing your hand on your throat, you can distinguish two types of sounds.

Voiced sounds are produced by the vibration of the vocal folds and correspond to vowels such as /a/ and /o/. On the left of the red curves in the figure below, these voiced sounds show resonance peaks in the low and medium frequencies.

Non-voiced sounds, such as the fricative /s/ and the plosive /p/, do not require the vocal folds to vibrate. In this case, the positions of the tongue and the lips lead to totally different energy distributions.

These differences are exploited by automatic speech recognition algorithms.
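As a toy illustration of how such a difference can be exploited (this is a teaching sketch, not a production algorithm), the classic zero-crossing rate (ZCR) feature separates the two cases: a voiced frame is dominated by low-frequency vocal-fold vibration and crosses zero rarely, while a noise-like unvoiced frame crosses zero about half the time:

```python
import math
import random

SAMPLE_RATE = 16000

def zcr(signal):
    """Zero-crossing rate: fraction of consecutive sample pairs changing sign."""
    changes = sum(1 for a, b in zip(signal, signal[1:]) if a * b < 0)
    return changes / (len(signal) - 1)

# Voiced-like frame: a 150 Hz tone mimicking vocal-fold vibration.
voiced = [math.sin(2 * math.pi * 150 * i / SAMPLE_RATE)
          for i in range(SAMPLE_RATE)]

# Unvoiced-like frame: white noise mimicking the turbulence of /s/.
random.seed(0)
unvoiced = [random.uniform(-1, 1) for _ in range(SAMPLE_RATE)]

def is_voiced(frame, threshold=0.1):
    """Classify a frame as voiced when its ZCR is below the threshold.

    The 0.1 threshold is an illustrative assumption, not a standard value.
    """
    return zcr(frame) < threshold

print(is_voiced(voiced), is_voiced(unvoiced))  # → True False
```

Real recognizers combine many such features (energy, spectral shape, and more) rather than relying on ZCR alone.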


Speech Frequencies

Thank you for your attention!

How does a computer program automatically convert speech into text?

This is the question we will continue to explore on the Authôt blog this summer. Stay tuned!



Authôt: You speak. We write.