SciTech

CMU alumnus sheds light on Google speech project

Pedro Moreno, a Carnegie Mellon alumnus and researcher at Google, spoke on campus last week about Google’s “Speech Internationalization Project,” which allows users to make verbal requests in 300 languages. (credit: Jennifer Coloma/Staff)

Pedro Moreno, a research scientist at Google and a Carnegie Mellon alumnus, spoke in the Giant Eagle Auditorium last Friday about Google’s current efforts to build speech recognition systems for the top 300 languages on the planet. The project, called “The Speech Internationalization Project,” aims to let users make verbal requests to their phones in their own languages; the textual transcription of the spoken words is then used to fulfill the request.

Moreno obtained his Ph.D. from Carnegie Mellon in 1996. His thesis, titled “Speech Recognition in Noisy Environments,” proved to be only the first step in his career. After working as a research scientist at HP Labs, Moreno joined Google seven years ago. He currently leads the global speech engineering group in Google’s Android division in New York. His team is in charge of creating speech recognition services in as many languages as possible.

The development of this speech recognition technology, Moreno explained, has three phases. The first phase involves collecting spoken words for a particular language. To do this, Google employees collect thousands of queries that people have typed into Google’s search engine and record them being read aloud. Each piece of audio is paired with its textual transcription and stored on Google’s servers. These paired audio and text queries form the training data that the speech recognition technology uses to improve its performance. In the last two years, Google has created training data for as many as 50 languages.
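
The pairing step can be pictured as a simple mapping from recordings to the query text that was read aloud. The sketch below is a minimal illustration in Python, not Google’s actual pipeline; the file names and query strings are invented.

```python
from dataclasses import dataclass

@dataclass
class TrainingExample:
    audio_path: str   # hypothetical path to one recorded utterance
    transcript: str   # the search query that was read aloud

def build_training_set(recordings):
    """Pair each recorded audio file with the query text it was read from."""
    return [TrainingExample(path, text) for path, text in recordings.items()]

# Three invented recordings and their transcriptions.
examples = build_training_set({
    "utt_0001.wav": "weather in pittsburgh",
    "utt_0002.wav": "carnegie mellon university",
    "utt_0003.wav": "directions to the airport",
})
print(examples[0].transcript)  # "weather in pittsburgh"
```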

The next phase is producing acoustic models, which map each phoneme to the way it sounds in recorded audio, and language models, which predict how likely a given word is to appear after a particular sequence of words. Spoken language gradually changes over time, and to account for this, the acoustic and language models have to be updated regularly with new spoken queries as fresh training data.
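
The language-model idea can be illustrated with a toy bigram model: count how often each word follows a given preceding word, then turn the counts into probabilities. Production systems use far larger n-gram or neural models; this Python sketch, with made-up training queries, only shows the principle.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count word pairs to estimate P(next_word | previous_word)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probability(counts, prev, nxt):
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

model = train_bigram_model([
    "weather in pittsburgh",
    "weather in new york",
    "restaurants in pittsburgh",
])
# "pittsburgh" follows "in" in 2 of the 3 bigrams starting with "in".
print(next_word_probability(model, "in", "pittsburgh"))  # 0.666...
```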

The final stage is the formation of the lexicons — how each word in a language is expressed as a sequence of phonemes.
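
A lexicon in this sense is just a table from words to phoneme sequences. The toy example below uses ARPAbet-style phoneme symbols (with stress markers omitted) and is purely illustrative, not Google’s lexicon format.

```python
# A toy pronunciation lexicon: each word maps to a sequence of phonemes.
lexicon = {
    "google": ["G", "UW", "G", "AH", "L"],
    "speech": ["S", "P", "IY", "CH"],
    "search": ["S", "ER", "CH"],
}

def pronounce(phrase):
    """Concatenate the phoneme sequences of each word in the phrase."""
    phonemes = []
    for word in phrase.lower().split():
        phonemes.extend(lexicon[word])  # raises KeyError for unknown words
    return phonemes

print(pronounce("google speech search"))
# ['G', 'UW', 'G', 'AH', 'L', 'S', 'P', 'IY', 'CH', 'S', 'ER', 'CH']
```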

One of the valuable lessons that Moreno learned in the development of the project was that, when faced with a problem in a specific language, the solution should be as general and holistic as possible, so it can be applied to the speech recognition of other languages.

Presently, users of the system collectively speak about 55,000 hours’ worth of queries each day, and 40 percent of those users are outside the U.S., showing that “The Speech Internationalization Project” is becoming popular around the world.

Thirty-five languages and dialects have been launched (including Latin and Pig Latin), and 10 more languages are in pre-production.

The speech team at Google is also working through the issues of multilingualism within countries. When many different languages are used in a single country, questions arise about how to build new recognition systems for closely related languages, and capturing the lexicons of those languages becomes difficult.

Despite its challenges, the project has continued to advance. As Moreno put it, why create speech recognizers for three languages “when you can do that for 300 with the same effort?”