Automatic Speech Recognition (ASR) Trends 2022
When we examine the history of computer science, we can see clear generational lines defined by the method of input. What is the path of information from our brains to the computer? We can link improvements in computation to the ways we interface with the digital world, from early punch-card computers to the familiar keyboard to the latest touch displays we carry in our pockets. Our question, as is often the case with technology, is "what comes next?"
The human voice is the answer. ASR (Automated Speech Recognition) is the technology that makes this transition possible. ASR is essentially the use of computers to convert spoken words into written ones.
Natural Language Processing (NLP) is at the heart of the most advanced form of currently available ASR systems. This ASR variation gets the closest to facilitating actual conversation between humans and artificial intelligence.
What is Automatic Speech Recognition?
Speech recognition is a subfield of computational linguistics concerned with the recognition and translation of spoken language into text by computers, a process sometimes called "speech to text." These systems draw on linguistics, computer science, and electrical engineering. The phrase "speech recognition" refers broadly to converting spoken words into text; related subfields such as voice recognition and speaker identification instead focus on identifying who is speaking rather than what is being said.
Today’s ASR is a subset of machine learning (ML), which is itself a branch of artificial intelligence (AI). AI is the general goal of building machines that exhibit intelligent behavior, whereas ML is a specialized approach that pursues that goal by teaching a computer to learn from data on its own.
Natural Language Processing (NLP) is increasingly included in more advanced versions of ASR systems. These devices record actual human conversations and process them using artificial intelligence. ASR’s accuracy is influenced by a variety of parameters, including speaker volume, background noise, recording equipment, and more.
How does Automatic Speech Recognition work?
There are two types of speech recognition systems: speaker-dependent and speaker-independent. Speaker-dependent systems require training, sometimes known as "enrollment," in which a speaker reads text or a series of isolated vocabulary items into the system. The algorithm then analyzes the voice recordings and links them to the reference text. Speaker-independent systems are speech recognition systems that do not rely on such vocal training.
There are two types of models used in speech recognition systems:
Acoustic Model: An acoustic model is a file containing statistical representations of the distinct sounds that make up words. Each of these statistical representations is labeled with a phoneme. There are approximately 40 distinct sounds in English that are useful for speech recognition, resulting in roughly 40 separate phonemes.
Language Model: To discriminate between words that sound similar, recognized sounds are matched against likely word sequences. Even when the audio is not grammatically perfect or skips words, the model assumes the utterance is roughly grammatical and meaningful, and uses that assumption to rank competing hypotheses. Incorporating a language model into decoding therefore improves ASR accuracy.
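The role of the language model can be sketched with a toy bigram scorer: of two acoustically similar hypotheses, the word order the model rates more probable wins. The word pairs and probabilities below are illustrative, not taken from a real corpus.

```python
# Toy bigram language model: scores word sequences so that, of two
# acoustically similar hypotheses, the more probable word order wins.
# Probabilities are made up for illustration.
import math

bigram_logprob = {
    ("recognize", "speech"): math.log(0.020),
    ("wreck", "a"):          math.log(0.004),
    ("a", "nice"):           math.log(0.010),
    ("nice", "beach"):       math.log(0.008),
}
DEFAULT = math.log(1e-6)  # back-off score for unseen bigrams

def sentence_score(words):
    """Sum of log-probabilities over consecutive word pairs."""
    return sum(bigram_logprob.get(pair, DEFAULT)
               for pair in zip(words, words[1:]))

# Two hypotheses the acoustic model alone could confuse:
h1 = ["recognize", "speech"]
h2 = ["wreck", "a", "nice", "beach"]
best = max([h1, h2], key=sentence_score)
```

Under these toy probabilities the decoder prefers "recognize speech"; a real system combines this score with the acoustic model's score rather than using it alone.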
Steps involved in the process of speech recognition:
Analog-to-Digital Conversion: Speech is usually captured as an analog signal. Standard sampling devices convert the analog voice to digital form using sampling and quantization techniques. Digital speech is typically represented as a one-dimensional vector of voice samples, each of which is an integer.
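The sampling-and-quantization step can be sketched in a few lines. Here a 440 Hz sine stands in for the "analog" input, sampled at 16 kHz (a common rate for ASR) and quantized to 16-bit signed integers; the rate and duration are illustrative choices.

```python
# Minimal sketch of analog-to-digital conversion: sample a continuous
# signal at 16 kHz and quantize each sample to a 16-bit signed integer.
import math

SAMPLE_RATE = 16_000   # samples per second (16 kHz)
DURATION_S  = 0.01     # capture 10 ms of audio

def analog(t):
    """Stand-in for the continuous input signal: a 440 Hz tone."""
    return math.sin(2 * math.pi * 440 * t)

samples = [
    int(round(analog(n / SAMPLE_RATE) * 32767))    # quantize to 16-bit range
    for n in range(int(SAMPLE_RATE * DURATION_S))  # sampling instants
]
# `samples` is the one-dimensional vector of integers the text describes.
```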
Speech Pre-processing: Recorded conversations commonly contain background noise and long stretches of silence. Pre-processing identifies and removes silent frames and applies signal-processing techniques to reduce or eliminate noise. The cleaned speech is then divided into short frames, typically around 20 milliseconds each, for the subsequent feature extraction stages.
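A minimal sketch of this pre-processing step: chop the digitized samples into 20-millisecond frames and drop frames whose energy falls below a threshold, a crude silence detector. The threshold and test signal are illustrative.

```python
# Sketch of speech pre-processing: frame the samples into 20 ms chunks
# and discard low-energy (silent) frames.
SAMPLE_RATE = 16_000
FRAME_LEN   = SAMPLE_RATE * 20 // 1000   # 20 ms -> 320 samples

def frames(samples):
    """Chop the sample vector into consecutive 20 ms frames."""
    return [samples[i:i + FRAME_LEN]
            for i in range(0, len(samples) - FRAME_LEN + 1, FRAME_LEN)]

def energy(frame):
    """Average squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def drop_silence(samples, threshold=1e4):
    return [f for f in frames(samples) if energy(f) >= threshold]

# Example: 40 ms of silence followed by 20 ms of a loud square wave.
audio = [0] * (2 * FRAME_LEN) + [10_000, -10_000] * (FRAME_LEN // 2)
voiced = drop_silence(audio)   # the two silent frames are discarded
```

Real systems use overlapping frames and more robust voice-activity detection, but the shape of the computation is the same.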
Feature Extraction: Each speech frame is converted into a feature vector that indicates which phoneme or syllable is being spoken.
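The frame-to-feature-vector step can be illustrated with two cheap time-domain features. Production systems use spectral features such as MFCCs; energy and zero-crossing rate are used here purely because they fit in a few self-contained lines.

```python
# Toy feature extraction: map each frame to a small feature vector of
# (energy, zero-crossing rate). Real systems use spectral features.
def extract_features(frame):
    energy = sum(s * s for s in frame) / len(frame)
    zero_crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
    )
    return (energy, zero_crossings / len(frame))

tone  = [10_000, -10_000] * 160   # loud, rapidly alternating signal
quiet = [0] * 320                 # silence
f_tone, f_quiet = extract_features(tone), extract_features(quiet)
```

Even these crude features separate the two frames: the tone has high energy and a high crossing rate, while silence scores zero on both, which is the kind of discrimination the later phoneme-matching stage relies on.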
Word Selection: The sequence of phonemes/features is translated into the spoken word using a language model/probability model.
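The word-selection step can be sketched as a lexicon lookup plus a probability tie-break: homophones map to the same phoneme sequence, and the (toy) language model decides which spelling is most likely. The lexicon entries and unigram probabilities below are illustrative.

```python
# Sketch of word selection: map a phoneme sequence to candidate words,
# then pick the candidate the toy unigram model rates most probable.
lexicon = {
    ("T", "UW"): ["two", "too", "to"],   # homophones share phonemes
    ("K", "AE", "T"): ["cat"],
}
unigram_prob = {"to": 0.025, "too": 0.004, "two": 0.002, "cat": 0.001}

def select_word(phonemes):
    """Choose the most probable word matching the phoneme sequence."""
    candidates = lexicon.get(tuple(phonemes), [])
    return max(candidates, key=lambda w: unigram_prob.get(w, 0.0),
               default=None)

word = select_word(["T", "UW"])   # homophone resolved by probability
```

In a full decoder this choice would also be conditioned on the surrounding words, which is exactly what the language model described earlier contributes.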
How ASR is Made to “Learn” from Humans: The Tuning Test
ASR systems, whether NLP or directed dialogue systems, are trained using two major approaches. Human Tuning is the first and most basic variation, whereas Active Learning is the second and more complex variant.
Human Tuning: This is a fairly straightforward way to train an ASR system. Human programmers search the conversation logs of a specific ASR software interface for frequently used words that the system heard but did not have in its pre-programmed vocabulary. Those words are then added to the software, allowing it to improve its speech understanding.
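The log-mining part of this workflow can be sketched as a word count over transcripts: find out-of-vocabulary words that recur often enough to be worth a human's review. The log lines, vocabulary, and threshold are all made up for illustration.

```python
# Sketch of the human-tuning workflow: scan transcribed conversation
# logs for frequent words missing from the system vocabulary.
from collections import Counter

vocabulary = {"please", "play", "some", "music", "stop"}
logs = [
    "please play some lofi music",
    "play lofi please",
    "stop the lofi",
]

def frequent_oov(logs, vocabulary, min_count=2):
    """Out-of-vocabulary words seen at least `min_count` times."""
    counts = Counter(w for line in logs for w in line.split()
                     if w not in vocabulary)
    return {w for w, c in counts.items() if c >= min_count}

candidates = frequent_oov(logs, vocabulary)
```

Here "lofi" appears three times and would be flagged for review, while one-off words like "the" fall below the threshold; the human then decides which candidates actually belong in the vocabulary.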
Active Learning: Active learning is a more advanced version of ASR that is being tested in conjunction with NLP versions of speech recognition technology. With active learning, the software is programmed to learn, retain, and adopt new words on its own, allowing it to continually extend its vocabulary as it is exposed to different ways of speaking and saying things.
Advantages and Disadvantages of Speech Recognition
Using speech recognition software has a number of advantages, including the following:
Easy to use
Speech recognition technology, while useful, still has a few flaws to work out. The following are some restrictions:
Source file issues
Speech recognition systems have a wide range of uses. Here are a few of them.
Automatic subtitling with speech recognition
Mobile telephony, including mobile email
People with disabilities