Google adopts CMU reCAPTCHA creation
On Sept. 16, Google announced that it had acquired reCAPTCHA, a project first created by Carnegie Mellon computer science professor Luis von Ahn. Most would recognize it as the distorted lines of text that one must type in to create online accounts and complete other tasks; the text is generated by a program, reCAPTCHA, that is able to distinguish between a human and an automated user.
The first CAPTCHA, created by von Ahn and his Ph.D. adviser, Carnegie Mellon professor Manuel Blum, produced a randomly generated series of characters that a user had to type before accessing a web page or other online service. Its uses included preventing spam.
The current reCAPTCHA has a purpose that the original did not: helping preserve the world’s pre-digital-era books for generations to come.
“At some point, we realized that about 200 million of these CAPTCHAs were typed every day by people around the world, and the idea for reCAPTCHA came from trying to make good use of the time spent typing them. Two hundred million times a day is equivalent to [about] 500,000 hours every day,” said von Ahn, who wanted to harness this human effort for the useful project of digitizing old books and newspapers.
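Taken together, the two figures in the quote imply roughly nine seconds of typing per CAPTCHA. The short sketch below only reproduces that arithmetic from the quoted numbers; the variable names are illustrative, and the per-CAPTCHA time is derived here rather than stated by von Ahn.

```python
# Figures as quoted by von Ahn in the article.
captchas_per_day = 200_000_000
hours_per_day = 500_000

# Derived value (not quoted): average seconds spent on one CAPTCHA.
seconds_per_captcha = hours_per_day * 3600 / captchas_per_day
print(seconds_per_captcha)  # 9.0
```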
The current method used to scan old texts is called OCR, or optical character recognition. A page is scanned into a computer as a digital image, and then OCR attempts to decipher the words, but it is not always right. “That program is not very accurate for very old books or newspapers because the ink has faded,” von Ahn said. That is where reCAPTCHA begins to solve the dilemma.
Words that could not be accurately identified by OCR are presented to a user along with a “control word” whose correct reading is already known. A user who types the control word correctly has, most of the time, typed the unknown word correctly as well. As users may have noticed, the words often look distorted: the reCAPTCHA program takes the image of the word from the scanned text, distorts it in various ways, and then presents it to the viewer.
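The control-word check described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not reCAPTCHA's actual code: the names `Challenge`, `GuessLog`, `normalize`, and `check_response` are hypothetical, and the sketch assumes the server already knows which of the two typed words answers the control word.

```python
from dataclasses import dataclass, field


@dataclass
class Challenge:
    control_answer: str   # known reading of the control word
    unknown_word_id: str  # hypothetical ID of the word OCR could not read


@dataclass
class GuessLog:
    # Maps an unknown word's ID to the list of human guesses collected so far.
    guesses: dict = field(default_factory=dict)

    def record(self, word_id: str, guess: str) -> None:
        self.guesses.setdefault(word_id, []).append(guess)


def normalize(text: str) -> str:
    # Assumption: matching ignores case and surrounding or repeated whitespace.
    return " ".join(text.lower().split())


def check_response(challenge: Challenge, typed_control: str,
                   typed_unknown: str, log: GuessLog) -> bool:
    """Pass the user only if the control word matches.

    When it does, the user's reading of the unknown word is kept as one guess.
    """
    if normalize(typed_control) != normalize(challenge.control_answer):
        return False
    log.record(challenge.unknown_word_id, normalize(typed_unknown))
    return True


log = GuessLog()
challenge = Challenge(control_answer="morning", unknown_word_id="word-1")
print(check_response(challenge, "Morning", "overlooks", log))  # True
print(check_response(challenge, "mourning", "overlocks", log))  # False
print(log.guesses)  # {'word-1': ['overlooks']}
```

A guess is recorded only when the control word is right, which is why a failed attempt leaves nothing in the log.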
“Our goal with accuracy is to produce a digital file that is better than a professional human transcriber.... To achieve this, we need to combine the output of OCR software with the human answers from reCAPTCHA and decide, for each word, what the correct spelling of the word should be,” said Colin McMillen, a full-time programmer for reCAPTCHA.
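One simple way to picture the step McMillen describes, combining the OCR output with human answers and deciding on a spelling for each word, is a weighted vote. The sketch below is an assumption made for illustration only: the function `resolve_word`, its weights, and its threshold are hypothetical, and the article does not say how reCAPTCHA actually makes this decision.

```python
from collections import Counter
from typing import Optional


def resolve_word(ocr_guess: Optional[str], human_guesses: list,
                 min_votes: float = 2.0, ocr_weight: float = 0.5,
                 human_weight: float = 1.0) -> Optional[str]:
    """Return the best-supported spelling, or None if support is too weak.

    The weights and threshold are illustrative assumptions.
    """
    votes = Counter()
    if ocr_guess:
        votes[ocr_guess] += ocr_weight
    for guess in human_guesses:
        votes[guess] += human_weight
    if not votes:
        return None
    best, score = votes.most_common(1)[0]
    return best if score >= min_votes else None


# OCR and two users agree, so the spelling is accepted.
print(resolve_word("overlooks", ["overlooks", "overlooks"]))  # overlooks
# A single human answer that disagrees with OCR is not enough yet.
print(resolve_word("0verlooks", ["overlooks"]))  # None
```

Under this scheme, a word that stays unresolved would simply be shown to more users until one spelling gathers enough votes.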
To date, this process has digitized copies of The New York Times from the paper’s 150-year-old archives, as noted on the company’s website, while maintaining high levels of precision. “So far we’ve done very well on this front; even for very old and challenging documents, we get over 99 percent of words completely correct,” McMillen said.
The use of reCAPTCHA is spreading, chiefly as a result of its effective deterrence of spam, but also because it cannot be read by algorithms that worked on the original CAPTCHA, and because it has been beneficial and productive in digitizing the world’s literature.
A reCAPTCHA also takes approximately the same time to solve as a typical CAPTCHA, so the digitizing benefit comes at little extra cost to users. Over 100,000 websites are known to use it, according to the United Kingdom’s Telegraph, and individual users can also download HTML code from the company’s website to protect their e-mail addresses.
In just one year, over 1.2 billion reCAPTCHAs were deciphered worldwide.
Where Google will take the ingenious program remains to be seen. This is not the first time that Google has worked with projects that began at Carnegie Mellon. In 2006, the company licensed the ESP game, developed by von Ahn, which allows images to be labeled through online games; it is now called the Google Image Labeler. Von Ahn, whose accolades include a place on Popular Science’s list of Ten Brilliant Scientists of 2006 and the MacArthur Foundation’s “genius grant,” will continue his research at Carnegie Mellon and remain in Pittsburgh while he works with Google.
The dean of the School of Computer Science, Randal Bryant, commented that “it’s a natural fit for Google, who both have the resources to run the reCAPTCHA system on a much larger scale and have millions of books that have been scanned but not yet digitized.”