29 September, 2023

Babble on Babel

Picture this: you’re on a business trip, and you’ve just entered a room full of potential partners, but you can’t understand a word they’re saying. The energy and potential in the room are palpable, yet the language barrier feels like a towering wall that keeps you from connecting. Now suppose you had a tool that could translate their words in real time, breaking down that wall and opening up a world of understanding and collaboration. This is no longer a pipe dream, thanks to SeamlessM4T, Meta’s multimodal and multilingual AI translation model.

SeamlessM4T is not just a tool; it’s a passport to effortless global conversations, enabling users to engage with people worldwide as if they were conversing in their own language. Whether you’re negotiating international business deals, exploring new horizons, or connecting with loved ones spread across the globe, SeamlessM4T helps ensure your voice is understood. By removing language barriers, SeamlessM4T not only promotes a more interconnected and multilingual society, but also opens the door to a future where language serves as a bridge to infinite possibilities rather than a barrier.

Creating a tool that can translate every language as quickly and accurately as the fictional Babel Fish from “The Hitchhiker’s Guide to the Galaxy” might seem simple, yet existing translation systems struggle with two major shortcomings: 1) limited language coverage; and 2) reliance on multiple models, which frequently leads to translation errors, delays, and deployment complexities. To overcome these obstacles, SeamlessM4T builds on advances Meta has made in recent years towards a universal translator. The first is No Language Left Behind (NLLB), a text-to-text machine translation model that covers 200 languages and has been integrated into Wikipedia as one of its translation providers. The second is Universal Speech Translator, the first direct speech-to-speech translation system for Hokkien, a variety of Chinese that is primarily spoken and lacks a widely used writing system. The third is Massively Multilingual Speech (MMS), which offers speech recognition, language identification, and speech synthesis across more than 1,100 languages. Today’s cascaded systems chain separate speech-to-text, text-to-text, and text-to-speech models, so an error at one stage carries into the next. SeamlessM4T instead combines these components into a single model, improving the efficiency and quality of translation and making it the first many-to-many direct speech-to-speech translation system.

When evaluated against other models using both automatic speech-to-speech translation metrics (ASR-BLEU, BLASER 2) and human evaluation, SeamlessM4T surpassed other state-of-the-art models such as OpenAI’s Whisper V2 and AudioPaLM-2. In speech-to-text tasks it also proved more robust to background noise and speaker variation, with average improvements of 38% and 49% respectively over other leading models. In line with responsible AI principles, SeamlessM4T was also tested for bias and added toxicity, where it significantly outperformed previous models, with a 63% reduction in added toxicity in its translation outputs.

Nonetheless, the string-matching approach SeamlessM4T relies on to detect harmful language is not without its imperfections. It can flag words as harmful when they are not, and it struggles with languages that join words together or do not separate them, which leads to missed detections, especially when transcribing speech to text and in languages other than English. Using lists of nouns to detect gender imbalance has problems of its own: the way gender is assigned to words varies across languages, and results obtained from a limited word list do not necessarily apply to all words.
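To make these two failure modes concrete, here is a minimal sketch of word-list matching in Python. It is an illustration only and not SeamlessM4T’s actual detector: the one-word blocklist, the function names, and the sample sentences are all hypothetical.

```python
# Hypothetical toy blocklist; real toxicity word lists are far larger
# and are maintained separately for each language.
BLOCKLIST = {"hell"}


def flag_by_substring(text: str) -> bool:
    """Flag text if a blocklisted string occurs anywhere inside it."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)


def flag_by_token(text: str) -> bool:
    """Flag text only if a whitespace-separated word equals a blocklisted term."""
    tokens = (tok.strip(".,!?") for tok in text.lower().split())
    return any(tok in BLOCKLIST for tok in tokens)


# False positive: "hello" contains a blocklisted string but is harmless.
print(flag_by_substring("Hello, nice to meet you"))  # True
print(flag_by_token("Hello, nice to meet you"))      # False

# Missed detection: without spaces between words, token matching finds nothing.
print(flag_by_token("What the hell!"))  # True
print(flag_by_token("whatthehell"))     # False
```

Matching on substrings catches the unsegmented text but also flags the harmless greeting, while matching on whole words avoids that false alarm but depends on the text being split into words in the first place.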

As we stand on the cusp of a new era of communication, it’s vital to navigate the nuanced pathways that language and culture thread. Idioms, sayings, and colloquial expressions remain a significant challenge: SeamlessM4T’s translation of them is sometimes inconsistent, and these intricacies of human conversation, steeped in centuries of cultural heritage, risk being lost. Moreover, because translation happens instantly and mechanically, there is no chance to make “edits” in a live conversation, so the context of what is being said can be overlooked. Literal translations that do not match the speaker’s sentiment may lead to misunderstandings. There is also much work to be done on speech itself. Speech is not simply spoken text; it carries rhythm, stress, and intonation, vital aspects of human interaction that deserve further attention if translation is to preserve the human touch that enriches conversations.

Although there is still more work to be done, SeamlessM4T remains a groundbreaking tool in the realm of speech-to-speech translation. As the old saying goes, “Rome wasn’t built in a day”. Developed with the ambition of facilitating communication regardless of a language’s resource level or “world-readiness”, this open-source tool has the potential to revolutionise human communication, one conversation at a time.

If you are interested in learning more about AI translation and transcription models, please contact Shayna Lee at [email protected].