An interview with NLP and MT expert Och:
How often do you add new languages to Google Translate?
The last language we added was Haitian Creole. I myself am quite surprised that we can build MT technologies for very small languages. If you'd asked me three years ago when we would have Haitian Creole, or Yiddish, or Icelandic, I would've said that with statistical machine translation (SMT) the challenge is how much data you have, so probably quite some time -- if it ever works.
How is it possible to make the system work for a language like Yiddish, where there's not much text out there to train the machine with?
What made it possible is that Yiddish is very similar to German, and has a lot of loan-words from Hebrew and Polish.
How did Google figure out so early that it was going to be important to be able to translate the Web?
The language barrier is really a very big problem for communication.
When I joined Google, I actually talked to Larry [Page] about that on the phone, because I was concerned about why Google would do MT -- it's a search engine company. He emphasized that it's really core to the mission of Google, and not just a side thing where if times get hard, then MT will [fall by the wayside].
It's now important in areas like search, where we have the idea of cross-lingual translated search.
How close are you to making that a reality?
It's a hard question. In some sense, I believe we've made a lot of progress, but there is still a long way to go.
So I feel my job is relatively safe. For quite a few years, there will still be things to improve.
When you train the translator, you've got to get so-called parallel data sets, where every document occurs in at least two languages.
When we started, there were standard test sets provided by the Linguistic Data Consortium, which provides data for research and academic institutes. Then there are places like the United Nations, which have all their documents translated into the six official languages of the United Nations.
But then otherwise, it's kind of 'the Web.'
Our algorithms basically mine everything that's out there.
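To make "parallel data sets" concrete, here is a minimal Python sketch of the kind of sentence-aligned corpus an SMT system trains on: two line-aligned text files, one per language. The file names (corpus.en, corpus.fr) and the loader are illustrative assumptions, not a description of Google's actual pipeline.

```python
# A minimal sketch of sentence-aligned parallel data: two plain-text
# files whose lines correspond one-to-one. File names are hypothetical.

def load_parallel_corpus(src_path: str, tgt_path: str):
    """Yield (source, target) sentence pairs from line-aligned files."""
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        for src_line, tgt_line in zip(src, tgt):
            src_line, tgt_line = src_line.strip(), tgt_line.strip()
            if src_line and tgt_line:        # skip empty lines
                yield src_line, tgt_line

# Example: iterate over hypothetical English/French UN-style documents.
if __name__ == "__main__":
    for en, fr in load_parallel_corpus("corpus.en", "corpus.fr"):
        print(en, "|||", fr)
```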
So it's sort of analogous to the way Google's Web crawler spiders Web pages?
It's similar. While the Web crawler is mining the whole Web and indexing it, the translation crawler focuses on the subset of documents that include translations.
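As a hedged illustration of what a "translation crawler" might look for, the toy sketch below uses one well-known heuristic from the parallel-corpus-mining literature: pages whose URLs differ only by a language code are likely translations of each other. The URL pattern, the language list, and the function are assumptions for illustration; the interview does not describe Google's actual method.

```python
# Toy heuristic for spotting candidate translation pairs in a crawl:
# group pages whose URLs differ only by a language code (/en/ vs /fr/).
# A simplification for illustration, not Google's translation crawler.
import re
from collections import defaultdict

LANG_CODES = {"en", "fr", "de", "es", "yi"}

def candidate_pairs(urls):
    """Group crawled URLs that look like translations of the same page."""
    buckets = defaultdict(dict)
    for url in urls:
        m = re.search(r"/([a-z]{2})/", url)
        if m and m.group(1) in LANG_CODES:
            key = url.replace(f"/{m.group(1)}/", "/<lang>/")
            buckets[key][m.group(1)] = url
    # Keep only pages that exist in at least two languages.
    return [langs for langs in buckets.values() if len(langs) >= 2]

print(candidate_pairs([
    "http://example.org/en/about.html",
    "http://example.org/fr/about.html",
    "http://example.org/en/contact.html",
]))
```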
Do you use the data from Google Books as a source of translated data?
That's obviously a very interesting data source because a lot of books have been translated into many different languages.
The Android version of Google Translate allows the user to speak to the application, and have his or her words translated.
The way we are doing speech recognition and MT are conceptually rather similar. Both of them learn from large amounts of data. For MT, we need to mine those translations, but for speech recognition, what you need is a speech signal that you tape somehow, and then the transcription. The more of the transcribed speech you have, the better the speech recognition quality.
You have similar learning algorithms. In translation, we learn the correlations between how words in the source language relate to words in the target language.
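To illustrate "learning how words relate from source to target language", here is a small sketch of IBM Model 1, the classic word-alignment model from the SMT literature, trained with expectation-maximization (EM) on a three-sentence toy corpus. The corpus and the ten-iteration loop are illustrative only; production systems are far more elaborate.

```python
# IBM Model 1 in miniature: learn word-translation probabilities t(f|e)
# from nothing but sentence pairs, via EM. Toy corpus for illustration.
from collections import defaultdict

# Tiny sentence-aligned corpus: (source, target) token lists.
corpus = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]

# Initialize t(f|e) uniformly; the E-step normalizes per sentence.
t = defaultdict(lambda: 1.0)

for _ in range(10):                     # a few EM iterations
    count = defaultdict(float)          # expected counts c(f, e)
    total = defaultdict(float)          # expected counts c(e)
    for f_sent, e_sent in corpus:
        for f in f_sent:
            # E-step: spread each source word's mass over the target
            # words it could align to, in proportion to t(f|e).
            z = sum(t[(f, e)] for e in e_sent)
            for e in e_sent:
                p = t[(f, e)] / z
                count[(f, e)] += p
                total[e] += p
    # M-step: re-estimate t(f|e) from the expected counts.
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]

# After training, 'haus' should strongly prefer 'house', and so on.
print(sorted(((f, e, round(p, 2)) for (f, e), p in t.items()),
             key=lambda x: -x[2])[:6])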
Is it just a short step from here to real-time, speech-to-speech translation, a la "Star Trek's" universal translator?
To really do the integrated speech-to-speech translation, where you can have a phone call with someone and it would be interpreted live? I believe that, based on the technology that we have and the improvement rate we have in the core quality of MT and speech recognition, it should be possible to do that in the not-too-distant future.
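Conceptually, the speech-to-speech scenario described here is a cascade of recognition, translation, and synthesis. The sketch below shows that cascade with hypothetical placeholder functions (recognize_speech, translate_text, synthesize_speech); it is a structural illustration only, and a real system would stream audio and call actual ASR, MT, and TTS services.

```python
# Conceptual speech-to-speech cascade: recognize -> translate -> synthesize.
# All three component functions are hypothetical placeholders.

def recognize_speech(audio_bytes: bytes, lang: str) -> str:
    """Hypothetical ASR step: source-language audio -> text."""
    raise NotImplementedError("plug in a speech recognizer here")

def translate_text(text: str, src: str, tgt: str) -> str:
    """Hypothetical MT step: source-language text -> target-language text."""
    raise NotImplementedError("plug in a machine translator here")

def synthesize_speech(text: str, lang: str) -> bytes:
    """Hypothetical TTS step: target-language text -> audio."""
    raise NotImplementedError("plug in a speech synthesizer here")

def speech_to_speech(audio_bytes: bytes, src: str, tgt: str) -> bytes:
    """Cascade the three components to interpret one utterance."""
    text = recognize_speech(audio_bytes, src)
    translated = translate_text(text, src, tgt)
    return synthesize_speech(translated, tgt)
```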
Franz Josef Och, who leads Google Translate, explains how Google keeps pushing past language barriers to give users a better translation experience. From adding new languages to improving translation quality, Google is working toward capabilities such as cross-lingual search.
