Monday, May 25, 2009

Stop Spam, Read Books

An article in this Sunday's New York Times by Anne Eisenberg, titled "New Puzzles That Tell Humans From Machines," provided a fascinating glimpse into the workings of reCaptcha.

I was particularly fascinated by this passage: "The system has an unusual twist that provides an added benefit to projects that are digitizing books and papers in archives: the source of the wiggly images that people must decipher is not random. The images are drawn from books and other media that are being digitized in mass projects, but that machines haven’t been able to read because, for instance, the page is wrinkled.

"Automatic character recognition lets people who are having the work scanned know which words it cannot read. These are the words that reCaptcha farms out and, once they are interpreted, returns to the original document. In this way, word by word, most of the mystery words are deciphered, in this case by humans.

“We are digitizing about 25 million words per day by having people type in captchas,” Dr. von Ahn said.

"The audio captchas are also being used for transcription and digitization projects.  We are doing both speech and text," Dr. von Ahn said.  Take your choice.

"The Times is paying reCaptcha for its help in digitizing its archives, said Marc Frons, chief technology officer, digital operations. So far, puzzling words in archives covering about 30 years have been deciphered with reCaptchas, he said."

The reCaptchas above came from the USPTO PAIR site back in March. I wonder whose phone number found its way into the middle of all this. A Google search for it turned up no results.
