Going back a few years, John talked about being able to talk to people across all different languages, like in Star Trek. At the time it seemed so far-fetched that most thought it was not a possibility, and their lack of foresight often hindered his vision. He wanted to be able to speak in English and have people understand in their home language. As teachers, this would be invaluable when we have new arrivals in our classrooms. We haven't time to wait for an interpreter or translator to arrive, and most schools do not have the finances for a qualified teacher who is also a native speaker, so cheaper, simpler solutions are sought daily as people move around the globe more now than ever.
It is really good to see that Microsoft are nearer to this goal than ever before. The good stuff is at around 7:05, where he speaks in English and out comes Chinese.
As Dr. Rashid’s post explains in detail, this demo is less of a breakthrough than an evolutionary step, representing a new version of a long-established combination of three gradually improving technologies: Automatic Speech Recognition (ASR), Machine Translation (MT), and speech synthesis (no appropriate standard acronym, though TTS for “text to speech” is close).
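To make the shape of that cascade concrete, here is a minimal Python sketch. The three stage functions are hypothetical placeholders rather than calls to any real speech or translation API; the point is only how the pieces chain together, with each stage's output feeding the next.

```python
# A toy sketch of the classic speech-to-speech translation cascade:
# ASR -> MT -> speech synthesis. The stage functions are hypothetical
# placeholders (not any real library); each returns canned output so
# the chaining itself can be run and inspected.

def recognize_speech(audio: bytes, source_lang: str) -> str:
    # ASR stage: source-language audio in, source-language text out.
    return "machine translation is not a solved problem"  # canned result

def translate_text(text: str, source_lang: str, target_lang: str) -> str:
    # MT stage: source-language text in, target-language text out.
    return f"[{target_lang} rendering of: {text}]"  # canned result

def synthesize_speech(text: str, target_lang: str) -> bytes:
    # Synthesis stage: target-language text in, target-language audio out.
    return text.encode("utf-8")  # stand-in for synthesized audio

def speech_to_speech(audio: bytes, source_lang: str, target_lang: str) -> bytes:
    # The cascade itself: recognition errors feed the translator, and
    # translation errors feed the synthesizer, so mistakes compound.
    text = recognize_speech(audio, source_lang)
    translated = translate_text(text, source_lang, target_lang)
    return synthesize_speech(translated, target_lang)

if __name__ == "__main__":
    print(speech_to_speech(b"...", "en", "zh"))
```

In a real system each placeholder would be replaced by a trained recognizer, translation model, and synthesizer, which is the combination of components the demo stitches together.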
In 1986, when the money from the privatization of NTT was used to found the Advanced Telecommunication Research (ATR) Institute in Japan, the centerpiece of ATR’s prospectus was the Interpreting Telephony Laboratory. As explained in Tsuyoshi Morimoto, “Automatic Interpreting Telephone Research at ATR”, Proceedings of a Workshop on Machine Translation, 1990:
An automatic telephone interpretation system will transform a spoken dialogue from the speaker’s language to the listener’s automatically and simultaneously. It will undoubtedly be used to overcome language barriers and facilitate communication among the people of the world.
ATR Interpreting Telephony Research project was started in 1986. The objective is to promote basic research for developing an automatic telephone interpreting system. The project period is seven years.
As of 1986, all of the constituent technologies had been in development for 25 or 30 years. But none of them were really ready for general use in an unrestricted conversational setting, and so the premise of the ATR Interpreting Telephony Laboratory was basically a public-relations device for framing on-going speech technology research, not a plausible R&D project. And so it’s not surprising that the ATR Interpreting Telephony Laboratory completed its seven-year term without producing practical technology — though quite a bit of valuable and interesting speech technology research was accomplished, including important contributions to the type of speech synthesis algorithm used in the Microsoft demo.
In the 26 years since 1986, there have been two crucial changes: Moore’s Law has made computers bigger and faster but smaller and cheaper; and speech recognition, machine translation, and speech synthesis have all gotten gradually better. In both the domain of devices and the domain of algorithms, the developments have been evolutionary rather than revolutionary — the reaction of a well-informed researcher from the late 1980s, transplanted to 2012, would be satisfaction and admiration at the clever ways that familiar devices and algorithms have been improved, not baffled amazement at completely unexpected inventions.
All of the constituent technologies — ASR, MT, speech synthesis — have improved to the point where we all encounter them in everyday life, and some people use them all the time. I’m not sure whether Interpreting Telephony’s time has finally come, but it’s clearly close.
In any case, the folks at Microsoft Research are at or near the leading edge in pushing forward all of the constituent technologies for speech-to-speech translation, and Rashid’s speech-to-speech demo is an excellent way to publicise that fact.