You might have heard that Google let 100 journalists into the Lair last week for a rare “Factory Tour,” previewing some of the goodies we can expect to see in the coming years. (It all sounds a little bit Willy Wonka-ish.) But did you hear this?
One fascinating area Google is aggressively exploring is automated language translation. Engineers have been studying the massive collection of translated documents that the United Nations keeps on its Web site, as well as other document collections, to develop a program that can automatically translate documents back and forth between languages.
To date, the company has examined about 200 billion words to train its system on the structures of various languages.
“If we can make every piece of the Web, every document, accessible to everybody, that will contribute something to the world,” said Alan Eustace, vice president of engineering and research. “And that’s what this project is aimed at.”
Google showed off a few translations it had performed using the new technology, from Arabic to English and from Chinese to English. They appeared nearly flawless.
The way language translation works now, apparently, is that people have created programs telling computers how different languages work. But with the complexity of language, given all its exceptions and colloquialisms, this doesn’t work very well. (First sentence of this paragraph taken from English to French and back: “The translation of language in manner functions now, apparently, is that people created with computers of programs saying how the various languages function.” Which is actually comparatively good.)
Google’s taking a Rosetta Stone approach, teaching the computers to really learn languages by statistically analyzing existing translations. Philipp at Google Blogoscoped gives his thoughts on where this could lead.
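For the curious, here is a rough idea of what "statistically analyzing existing translations" can look like in miniature. This is only a sketch of a textbook word-alignment technique (IBM Model 1, trained with expectation-maximization), not a description of Google's actual system, which the article does not detail. The three-sentence English/German corpus and the function names `train` and `best_translation` are made up for illustration.

```python
from collections import defaultdict

# Toy parallel corpus, invented for this example: (English, German) pairs.
CORPUS = [
    ("the house", "das Haus"),
    ("the book", "das Buch"),
    ("a book", "ein Buch"),
]


def train(corpus, iterations=25):
    """Estimate t[(tgt, src)] = P(target word | source word) with IBM Model 1 EM."""
    pairs = [(src.split(), tgt.split()) for src, tgt in corpus]
    # Start uniform: any constant works because each step renormalizes.
    t = defaultdict(lambda: 1.0)
    for _ in range(iterations):
        count = defaultdict(float)  # expected (tgt, src) co-translation counts
        total = defaultdict(float)  # expected counts per source word
        for src_words, tgt_words in pairs:
            for tgt in tgt_words:
                # E-step: share each target word among the source words
                # of the same sentence pair, in proportion to current t.
                norm = sum(t[(tgt, src)] for src in src_words)
                for src in src_words:
                    share = t[(tgt, src)] / norm
                    count[(tgt, src)] += share
                    total[src] += share
        # M-step: turn expected counts back into probabilities.
        t = defaultdict(float, {
            (tgt, src): c / total[src] for (tgt, src), c in count.items()
        })
    return t


def best_translation(t, src_word):
    """Return the target word with the highest learned probability."""
    candidates = {tgt: p for (tgt, src), p in t.items() if src == src_word}
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    table = train(CORPUS)
    for word in ["the", "house", "book", "a"]:
        print(word, "->", best_translation(table, word))
```

Nobody tells the program that "book" means "Buch". It works that out because "Buch" keeps showing up wherever "book" does, and the other words in those sentences are better explained by something else. Scale the same idea up from three sentences to billions of words and you get a sense of why a large pile of existing translations is so valuable.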