How Things Work: Voice recognition software

Credit: Paola Mathus/

As technology evolves, so does the ability of machines to create a user experience that imitates human-to-human interaction. Voice recognition software, which carries out spoken commands or takes note of what the user says, is crucial to that experience. Programs such as Siri and Cortana, even with their questionable accuracy, are undoubtedly powerful pieces of work that create unique, customized user experiences based primarily on verbal input.

Voice recognition software works through a complicated process. The audio input, which arrives as an analog signal, is first converted to a digital signal using an analog-to-digital converter (ADC).
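To make the conversion concrete, here is a minimal Python sketch of what an ADC does in principle: it measures the signal at regular intervals (sampling) and rounds each measurement to one of a fixed number of levels (quantization). The function name, the 8,000 samples per second, and the 8-bit depth are assumptions chosen for illustration, not details of any particular product.

```python
import math

def sample_and_quantize(signal, duration_s, sample_rate_hz=8000, bits=8):
    """Sample a continuous signal (a function of time returning a value
    in [-1.0, 1.0]) and round each sample to a signed integer level."""
    max_level = 2 ** (bits - 1) - 1              # 127 for 8 bits
    n_samples = round(duration_s * sample_rate_hz)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                   # time of the n-th sample
        value = max(-1.0, min(1.0, signal(t)))   # clip to the valid range
        samples.append(round(value * max_level)) # quantize to an integer
    return samples

def tone(t):
    """A 440 Hz sine wave standing in for the microphone's analog voltage."""
    return math.sin(2 * math.pi * 440 * t)

digital = sample_and_quantize(tone, duration_s=0.01)
```

A higher sample rate or bit depth captures the original sound more faithfully, at the cost of more data to process.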

During this process, background noise is filtered out of the input, the volume is normalized to a constant level, and the speech is slowed down or sped up to match the speed the software expects.
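Two of these clean-up steps can be sketched in a few lines of Python. This is a simplified illustration, not how any particular product does it: `normalize_peak` and `resample_linear` are hypothetical helper names, and plain linear interpolation is a crude stand-in for real speed adjustment (it also shifts the pitch). Noise filtering is left out because it needs more machinery than fits here.

```python
def normalize_peak(samples, target_peak=1.0):
    """Scale the samples so the loudest one has magnitude target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0:
        return list(samples)           # silence: nothing to scale
    scale = target_peak / peak
    return [s * scale for s in samples]

def resample_linear(samples, new_length):
    """Stretch or squeeze the samples to new_length values by
    interpolating linearly between neighboring samples."""
    if new_length <= 0 or not samples:
        return []
    if len(samples) == 1 or new_length == 1:
        return [samples[0]] * new_length
    step = (len(samples) - 1) / (new_length - 1)
    out = []
    for i in range(new_length):
        pos = i * step                 # fractional position in the input
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out

quiet = [0.1, -0.25, 0.05]
loud = normalize_peak(quiet)           # loudest sample now has magnitude 1.0
slower = resample_linear(loud, 6)      # twice as many samples: half speed
```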

This clean, digital version is then split into segments, some lasting no more than a hundredth of a second, based on the sounds they contain. These short sounds are matched to phonemes in the chosen language. Phonemes are the most basic units of a language; they are the sounds that combine to form words, such as ‘p’ or ‘t’.

Cleaning and slicing the input is significantly easier than the next step: working out which words the matched phonemes represent, given the context of the input.
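The slicing step can be sketched as follows. The sketch assumes fixed-length, 10-millisecond frames for simplicity, whereas the segments described above depend on the sounds they contain; `split_into_frames` and `frame_energy` are hypothetical names, and average energy is just one simple characteristic a system might compute per frame.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=10):
    """Cut the samples into consecutive frames of frame_ms milliseconds.
    The last frame may be shorter if the samples do not divide evenly."""
    frame_len = max(1, sample_rate_hz * frame_ms // 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples), frame_len)]

def frame_energy(frame):
    """Average squared amplitude of a frame: a rough loudness measure."""
    return sum(s * s for s in frame) / len(frame)

# 200 samples at 8,000 samples per second is 25 ms of audio.
frames = split_into_frames(list(range(200)), sample_rate_hz=8000)
energies = [frame_energy(f) for f in frames]
```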

Early voice recognition software used rule-based systems to convert digital data into meaningful sentences. The problem with those systems was that they only worked on input that followed standard patterns.

Slang and varied sentence structures were difficult to recognize. Even continuous conversational speech, without a pause after every word, was hard to handle.

In order to overcome these challenges, today’s voice recognition software employs sophisticated statistical modeling algorithms to predict the most likely and most sensible outcome for the input.

An older, similarity-based approach was dynamic time warping, which, put simply, found the best alignment between two sequences of short segments by comparing a chosen characteristic of each, even when one sequence was spoken faster or slower than the other.
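A minimal Python sketch of dynamic time warping follows. It assumes each segment has already been reduced to a single number (the "characteristic"), and the function name `dtw_distance` is our own; real systems compare richer feature vectors.

```python
def dtw_distance(a, b):
    """Return the smallest total difference between sequences a and b
    when either one may be locally stretched to line up with the other."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])          # difference of this pair
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] also matched b[j-1]'s successor
                                 cost[i][j - 1],      # b[j-1] also matched a[i-1]'s successor
                                 cost[i - 1][j - 1])  # both sequences advance
    return cost[n][m]

template = [1, 2, 3]
spoken_slowly = [1, 1, 2, 2, 3, 3]   # same shape, half the speed
same_word = dtw_distance(template, spoken_slowly)
other_word = dtw_distance(template, [3, 2, 1])
```

The slowly spoken version aligns with the template at zero cost, while the reversed sequence does not, which is why the technique tolerates differences in speaking speed.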

Today, the hidden Markov model is the most widely used approach because of its accuracy, computational feasibility, simplicity, and ability to be trained automatically. Like a typical learning program, it assigns probability scores to a given input based on a predetermined vocabulary and training data. Most programs have a vocabulary of about 60,000 words, so even a three-word sequence could be any of 216 trillion combinations (60,000 × 60,000 × 60,000). Narrowing those possibilities down to a small set of likely candidates is therefore essential.
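The sketch below shows, on a toy scale, how a hidden Markov model picks the most probable sequence of hidden sounds for a series of observations, using the standard Viterbi procedure. Everything in the example data is an assumption made up for illustration: the two phonemes, the "low" and "high" acoustic labels, and all of the probabilities.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable sequence of hidden states."""
    # best[t][s] = (probability of the best path ending in s at step t,
    #               the state that path came from)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None)
             for s in states}]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            # Pick the predecessor r that makes reaching s most likely.
            p, back = max((prev[r][0] * trans_p[r][s], r) for r in states)
            column[s] = (p * emit_p[s][obs], back)
        best.append(column)
    # Start from the most probable final state and walk backwards.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path

# Toy model with made-up numbers (illustration only).
states = ("p", "t")
start_p = {"p": 0.5, "t": 0.5}
trans_p = {"p": {"p": 0.7, "t": 0.3}, "t": {"p": 0.3, "t": 0.7}}
emit_p = {"p": {"low": 0.8, "high": 0.2}, "t": {"low": 0.2, "high": 0.8}}

decoded = viterbi(["low", "low", "high"], states, start_p, trans_p, emit_p)
```

Because the procedure keeps only the best path into each state at every step, it never has to enumerate every possible combination, which is what makes the approach computationally feasible.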

Although voice recognition software is quite sophisticated today, its performance still has limits. For example, even with all the new, though questionably cool, features in the iPhone 7, one thing is still missing: the ability for Siri to recognize puns.

One obvious point of confusion is homophones. Words that sound the same, such as ‘hair’ and ‘hare,’ or ‘bald’ and ‘bawled,’ can produce really interesting misinterpretations. Other limitations include low signal-to-noise ratios, overlapping speech, and the need for heavy computational power.
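One common way statistical systems try to tell such sound-alike words apart is to look at the neighboring words. The Python sketch below illustrates the idea with a tiny table of word-pair counts; the counts are invented for this example, and `pick_homophone` is a hypothetical helper, not part of any real product.

```python
# Made-up counts of how often each word pair appears (illustration only).
BIGRAM_COUNTS = {
    ("brush", "hair"): 50, ("brush", "hare"): 1,
    ("wild", "hare"): 30, ("wild", "hair"): 4,
}

def pick_homophone(previous_word, candidates, counts=BIGRAM_COUNTS):
    """Choose the candidate seen most often after previous_word.
    Unseen pairs count as zero."""
    return max(candidates, key=lambda w: counts.get((previous_word, w), 0))

after_brush = pick_homophone("brush", ["hair", "hare"])
after_wild = pick_homophone("wild", ["hair", "hare"])
```

When the context gives no hint either way, as in a pun, this kind of guess can easily go wrong.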

In terms of user experience, voice recognition can be used for home automation, in-car control, military vehicle operation, and even our very own J.A.R.V.I.S.

Beyond new user experiences, voice recognition serves a variety of other purposes.

Language learning, pronunciation practice, and, one day, maybe even universal translators are fabulous new ways to harness all this potential.

Voice recognition systems have also been employed to assist people with disabilities and injuries.

Thus voice recognition, with all the potential benefits it offers, is set to become an integral part of everyday life over the next 50 years.