The day of computer simultaneous translation is drawing closer
I just came across a report in The Economist, "Conquering Babel", on the state of the art in computer simultaneous translation (turning spoken language directly into another language and "speaking" it aloud). With technology advancing by the day, that era is no longer far off; it is no longer just a gimmick from science-fiction films.
I remember that when I was working on this kind of recognition, recorded speech first had to go through cropping, normalization, feature extraction, and digitization before it became data a computer could analyze. A single step could eat up a great deal of computing time, so real-time translation was simply out of the question. With the advances in DSP hardware, I suspect today's machines can finish all of that processing in very little time; add the progress in algorithms (the article mentions Microsoft's deep neural networks), and it is a different world entirely.
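To make that front end concrete, here is a minimal sketch of such a pipeline (my own toy illustration, not from the article), assuming only numpy; the silence threshold and the 25 ms frame length are arbitrary choices:

```python
import numpy as np

def preprocess(signal, rate=16000, threshold=0.02):
    """Toy front end: crop silence, normalize, extract per-frame features."""
    peak = np.max(np.abs(signal))
    # Crop: drop leading/trailing samples quieter than a fraction of the peak.
    voiced = np.where(np.abs(signal) > threshold * peak)[0]
    signal = signal[voiced[0]:voiced[-1] + 1]
    # Normalize: scale to unit peak amplitude.
    signal = signal / np.max(np.abs(signal))
    # Feature extraction: one log-energy value per 25 ms frame (half overlap).
    frame = int(0.025 * rate)
    hop = frame // 2
    frames = [signal[i:i + frame] for i in range(0, len(signal) - frame, hop)]
    return np.array([np.log(np.sum(f ** 2) + 1e-10) for f in frames])

# One second of a synthetic tone, padded with silence on both sides.
t = np.linspace(0, 1, 16000)
audio = np.concatenate([np.zeros(4000), 0.5 * np.sin(2 * np.pi * 440 * t), np.zeros(4000)])
print(preprocess(audio).shape)
```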
The original article is well worth reading; it is reproduced below:
Conquering Babel
Simultaneous translation by computer is getting closer
IN “STAR TREK”, a television series of the 1960s, no matter how far across the universe the Starship Enterprise travelled, any aliens it encountered would converse in fluent Californian English. It was explained that Captain Kirk and his crew wore tiny, computerised Universal Translators that could scan alien brainwaves and simultaneously convert their concepts into appropriate English words.
Science fiction, of course. But the best sci-fi has a habit of presaging fact. Many believe the flip-open communicators also seen in that first “Star Trek” series inspired the design of clamshell mobile phones. And, on a more sinister note, several armies and military-equipment firms are working on high-energy laser weapons that bear a striking resemblance to phasers. How long, then, before automatic simultaneous translation becomes the norm, and all those tedious language lessons at school are declared redundant?
Not, perhaps, as long as language teachers, interpreters and others who make their living from mutual incomprehension might like. A series of announcements over the past few months from sources as varied as mighty Microsoft and string-and-sealing-wax private inventors suggest that workable, if not yet perfect, simultaneous-translation devices are now close at hand.
Over the summer, Will Powell, an inventor in London, demonstrated a system that translates both sides of a conversation between English and Spanish speakers—if they are patient, and speak slowly. Each interlocutor wears a hands-free headset linked to a mobile phone, and sports special goggles that display the translated text like subtitles in a foreign film.
In November, NTT DoCoMo, the largest mobile-phone operator in Japan, introduced a service that translates phone calls between Japanese and English, Chinese or Korean. Each party speaks consecutively, with the firm’s computers eavesdropping and translating his words in a matter of seconds. The result is then spoken in a man’s or woman’s voice, as appropriate.
Microsoft’s contribution is perhaps the most beguiling. When Rick Rashid, the firm’s chief research officer, spoke in English at a conference in Tianjin in October, his peroration was translated live into Mandarin, appearing first as subtitles on overhead video screens, and then as a computer-generated voice. Remarkably, the Chinese version of Mr Rashid’s speech shared the characteristic tones and inflections of his own voice.
Que?
Though the three systems are quite different, each faces the same problems. The first challenge is to recognise and digitise speech. In the past, speech-recognition software has parsed what is being said into its constituent sounds, known as phonemes. There are around 25 of these in Mandarin, 40 in English and over 100 in some African languages. Statistical speech models and a probabilistic technique called Gaussian mixture modelling are then used to identify each phoneme, before reconstructing the original word. This is the technology most commonly found in the irritating voice-mail jails of companies’ telephone-answering systems. It works acceptably with a restricted vocabulary, but try anything more free-range and it mistakes at least one word in four.
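As a toy illustration of that classical approach (mine, not the article's; real acoustic models are far richer than these fake 2-D features), one Gaussian mixture can be fitted per phoneme, and a new frame labelled by whichever mixture scores it highest:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Fake 2-D "feature" clouds for three phonemes.
rng = np.random.default_rng(0)
training = {
    "ah": rng.normal([0.0, 0.0], 1.0, (200, 2)),
    "ee": rng.normal([5.0, 5.0], 1.0, (200, 2)),
    "sh": rng.normal([0.0, 5.0], 1.0, (200, 2)),
}
models = {p: GaussianMixture(n_components=2, random_state=0).fit(x)
          for p, x in training.items()}

frame = np.array([[4.7, 5.2]])                           # one incoming frame
scores = {p: m.score(frame) for p, m in models.items()}  # log-likelihoods
print(max(scores, key=scores.get))                       # -> "ee"
```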
The translator Mr Rashid demonstrated employs several improvements. For a start, it aims to identify not single phonemes but sequential triplets of them, known as senones. English has more than 9,000 of these. If they can be recognised, though, working out which words they are part of is far easier than would be the case starting with phonemes alone.
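In real systems senones are tied states of context-dependent triphones, but the basic idea of "a phoneme heard together with its neighbours" fits in a few lines (a simplification of my own):

```python
def senones(phonemes):
    """Each phoneme taken together with its left and right neighbours."""
    padded = ["#"] + phonemes + ["#"]   # '#' marks an utterance boundary
    return [tuple(padded[i - 1:i + 2]) for i in range(1, len(padded) - 1)]

print(senones(["k", "ae", "t"]))   # "cat"
# [('#', 'k', 'ae'), ('k', 'ae', 't'), ('ae', 't', '#')]
```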
Microsoft’s senone identifier relies on deep neural networks, a mathematical technique inspired by the human brain. Such artificial networks are pieces of software composed of virtual neurons. Each neuron weighs the strengths of incoming signals from its neighbours and sends outputs based on those to other neighbours, which then do the same thing. Such a network can be trained to match an input to an output by varying the strengths of the links between its component neurons.
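In code, one such virtual neuron is just a weighted sum pushed through a squashing function; this sketch (mine, with arbitrary numbers) uses a sigmoid:

```python
import numpy as np

def neuron(inputs, weights, bias):
    """Weigh incoming signals, then squash the sum into a 0-1 output."""
    return 1.0 / (1.0 + np.exp(-(np.dot(weights, inputs) + bias)))  # sigmoid

signals = np.array([0.2, 0.9, 0.1])   # outputs of neighbouring neurons
weights = np.array([0.5, -1.2, 2.0])  # link strengths, tuned during training
print(neuron(signals, weights, bias=0.1))
```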
One thing known for sure about real brains is that their neurons are arranged in layers. A deep neural network copies this arrangement. Microsoft’s has nine layers. The bottom one learns features of the processed sound waves of speech. The next layer learns combinations of those features, and so on up the stack, with more sophisticated correlations gradually emerging. The top layer makes a guess about which senone it thinks the system has heard. By using recorded libraries of speech with each senone tagged, the correct result can be fed back into the network, in order to improve its performance.
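The layer stack and the feedback loop can be mimicked with an off-the-shelf classifier; this sketch (a toy stand-in, nothing like Microsoft's actual nine-layer network) trains a small multi-layer net on frames tagged with made-up "senone" labels:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
x = rng.normal(size=(400, 20))              # 400 frames, 20 features each
y = (x[:, :4].sum(axis=1) > 0).astype(int)  # two pretend senone labels

# Three hidden layers; tagged examples are fed back to adjust link strengths.
net = MLPClassifier(hidden_layer_sizes=(64, 64, 32),
                    max_iter=1000, random_state=0).fit(x, y)
print(net.predict(x[:5]))                   # the top layer's guesses
```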
Microsoft’s researchers claim that their deep-neural-network translator makes at least a third fewer errors than traditional systems and in some cases mistakes as few as one word in eight. Google has also started using deep neural networks for speech recognition (although not yet translation) on its Android smartphones, and claims they have reduced errors by over 20%. Nuance, another provider of speech-recognition services, reports similar improvements. Deep neural networks can be computationally demanding, so most speech-recognition and translation software (including that from Microsoft, Google and Nuance) runs in the cloud, on powerful online servers accessible in turn by smartphones or home computers.
Quoi?
Recognising speech is, however, only the first part of translation. Just as important is converting what has been learned not only into foreign words (hard enough, given the ambiguities of meaning which all languages display, and the fact that some concepts are simply untranslatable), but into foreign sentences. These often have different grammatical rules, and thus different conventional word orders. So even when the English words in a sentence are known for certain, computerised language services may produce stilted or humorously inaccurate translations.
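A two-line example (my own) shows why word-for-word substitution is not enough: Spanish puts most adjectives after the noun, so a lexicon alone produces a stilted result:

```python
lexicon = {"the": "el", "red": "rojo", "car": "coche"}
words = "the red car".split()
print(" ".join(lexicon[w] for w in words))  # -> "el rojo coche" (stilted)
# A fluent system must also reorder: "el coche rojo".
```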
Google’s solution for its Translate smartphone app and web service is crowd-sourcing. It compares the text to be translated with millions of sentences that have passed through its software, and selects the most appropriate. Jibbigo, whose translator app for travellers was spun out from research at Carnegie Mellon University, works in a similar way but also pays users in developing countries to correct their mother-tongue translations. Even so, the ultimate elusiveness of language can cause machine-translation specialists to feel a touch of Weltschmerz.
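Google's production system is statistical and vastly more sophisticated, but the core idea of reusing the closest known sentence can be caricatured in a few lines (the phrase memory below is hypothetical, for illustration only):

```python
import difflib

memory = {  # previously translated sentence pairs
    "where is the station": "¿dónde está la estación?",
    "how much does this cost": "¿cuánto cuesta esto?",
    "i would like a coffee": "quisiera un café",
}
query = "where is the train station"
best = max(memory, key=lambda s: difflib.SequenceMatcher(None, s, query).ratio())
print(memory[best])  # -> "¿dónde está la estación?"
```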
For example, although the NTT DoCoMo phone-call translator is fast and easy to use, it struggles—even though it, too, uses a neural network—with anything more demanding than pleasantries. Sentences must be kept short to maintain accuracy, and even so words often get jumbled.
Microsoft is betting that listeners will be more forgiving of such errors when dialogue is delivered in the speaker’s own voice. Its new system can encode the distinctive timbre of this by analysing about an hour’s worth of recordings. It then generates synthesised speech with a similar spread of frequencies. The system worked well in China, where Mr Rashid’s computerised (and occasionally erroneous) Mandarin was met with enthusiastic applause.
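The article does not say how Microsoft encodes timbre; as a crude stand-in for "a similar spread of frequencies", one could summarise a speaker's recordings by their average spectrum, which a synthesiser would then aim to match (my assumption, illustration only):

```python
import numpy as np

rate = 16000
t = np.linspace(0, 1, rate)
# Stand-in "recording": a 120 Hz voice-like fundamental plus one overtone.
recording = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 240 * t)

spectrum = np.abs(np.fft.rfft(recording))
freqs = np.fft.rfftfreq(len(recording), 1 / rate)
centroid = (freqs * spectrum).sum() / spectrum.sum()
print(f"spectral centroid: {centroid:.0f} Hz")  # one-number timbre summary
```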
A universal translator that works only in conference halls, however, would be of limited use to travellers, whether intergalactic or merely intercontinental. Mr Powell’s conversation translator will work anywhere that there is a mobile-phone signal. Speech picked up by the headsets is fed into speech-recognition software on a nearby laptop, and the resulting text is sent over the mobile-phone network to Microsoft’s translation engine online.
One big difficulty when translating conversations is determining who is speaking at any moment. Mr Powell’s system does this not by attempting to recognise voices directly, but rather by running all the speech it hears through two translation engines simultaneously: English to Spanish, and Spanish to English. Since only one of the outputs is likely to make any sense, the system can thus decide who is speaking. That done, it displays the translation in the other person’s goggles.
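The trick is easy to sketch (the stub engines and tiny wordlists below are hypothetical; a real system would call a cloud translator and score fluency properly): run both directions, and keep the output that looks most like real words in its target language:

```python
EN_WORDS = {"where", "is", "the", "hotel"}
ES_WORDS = {"dónde", "está", "el", "hotel"}

def plausibility(text, vocabulary):
    """Fraction of output words found in the target-language wordlist."""
    words = text.lower().split()
    return sum(w in vocabulary for w in words) / max(len(words), 1)

def identify_speaker(utterance, en_to_es, es_to_en):
    as_spanish = en_to_es(utterance)   # assume the speaker used English...
    as_english = es_to_en(utterance)   # ...and assume they used Spanish
    if plausibility(as_spanish, ES_WORDS) >= plausibility(as_english, EN_WORDS):
        return "English speaker", as_spanish
    return "Spanish speaker", as_english

# Stub "engines" for the demo; the real system calls an online service.
fake_en_es = lambda t: {"where is the hotel": "dónde está el hotel"}.get(t, "???")
fake_es_en = lambda t: {"dónde está el hotel": "where is the hotel"}.get(t, "???")
print(identify_speaker("where is the hotel", fake_en_es, fake_es_en))
```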
At the moment, the need for the headsets, cloud services and intervening laptop means Mr Powell’s simultaneous system is still very much a prototype. Consecutive, single-speaker translation is more advanced. The most sophisticated technology currently belongs to Jibbigo, which has managed to squeeze speech recognition and a 40,000-word vocabulary for ten languages into an app that runs on today’s smartphones without needing an internet connection at all.
Nani?
Some problems remain. In the real world, people talk over one another, use slang or chat on noisy streets, all of which can foil even the best translation system. But though it may be a few more years before “Star Trek” style conversations become commonplace, universal translators still look set to beat phasers, transporter beams and warp drives in moving from science fiction into reality.