US company Meta says it has developed an algorithm that can translate between 200 different languages and delivers, on average, over 40 percent better quality than its predecessors. The tool from the parent company of Facebook, Instagram and WhatsApp is called NLLB-200 (short for “No Language Left Behind”) and is now being released as open source. With additional tools, more languages can be added and more inclusive technologies can be built, explains Meta’s AI department. Translations on Facebook, Instagram and Wikipedia are to be the first to benefit from the algorithm.
Relying on better training data
Meta AI wants to use the algorithm not only to better connect people, but also to ensure that they “can become part of the metaverse tomorrow”. According to the company, NLLB-200 represents a “significant breakthrough”: it can produce “excellent translations” into and from 200 different languages, including many that translation software has so far supported poorly or not at all, for example Kikamba (about four million speakers in Kenya and Tanzania) and Lao (30 million speakers in Southeast Asia). Meta measured the improvement of over 40 percent on average, and in some cases over 70 percent, with an in-house benchmark called FLORES, which was published as open source a year ago and is based on translations by native speakers.
Meta had already announced in February that it was working on an AI-supported real-time translator, but at that time the talk was still of 100 languages. Doubling that number was a significant challenge, Meta AI explains. For example, training the software requires immense amounts of parallel corpora, i.e. texts that are available in several languages. Mining data from the Internet often yields only inferior text quality, which is why Meta relies on professional translations and reviews. Furthermore, it is difficult to optimize a single model for hundreds of languages at once without degrading overall performance. The data-cleaning pipeline, which is meant to filter out “toxic content”, has also been completely revised.
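To give a rough idea of what cleaning a parallel corpus can involve, here is a minimal Python sketch. It is not Meta's pipeline: the function name `clean_parallel_corpus`, the length-ratio threshold and the simple word blocklist are assumptions chosen purely for illustration, and a real system would use far more elaborate filters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentencePair:
    """One aligned sentence in two languages (hypothetical helper type)."""
    source: str
    target: str


def clean_parallel_corpus(pairs, blocklist, max_length_ratio=2.5):
    """Return the pairs that pass four simple, illustrative filters.

    A pair is dropped if one side is empty, if it duplicates an earlier
    pair, if the two sides differ too much in length (a hint that they
    are misaligned), or if either side contains a blocklisted word.
    """
    seen = set()
    kept = []
    for pair in pairs:
        src, tgt = pair.source.strip(), pair.target.strip()
        if not src or not tgt:
            continue  # empty side
        key = (src.lower(), tgt.lower())
        if key in seen:
            continue  # duplicate of an earlier pair
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_length_ratio:
            continue  # lengths too different, probably misaligned
        words = set((src + " " + tgt).lower().split())
        if words & blocklist:
            continue  # contains an unwanted word
        seen.add(key)
        kept.append(SentencePair(src, tgt))
    return kept


if __name__ == "__main__":
    corpus = [
        SentencePair("Good morning", "Guten Morgen"),
        SentencePair("Good morning", "Guten Morgen"),
        SentencePair("Hi", "Dies ist ein sehr langer Satz ohne Bezug"),
        SentencePair("You idiot", "Du Idiot"),
        SentencePair("", "Leer"),
    ]
    for pair in clean_parallel_corpus(corpus, blocklist={"idiot"}):
        print(pair.source, "|", pair.target)
```

In this toy corpus only the first pair survives. The others are removed as a duplicate, a length mismatch, a blocklist hit and an empty source sentence, respectively.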
A list of the languages supported by NLLB-200 is available in a paper published by the research team. It also indicates whether each language is already supported by existing translation tools from Google or Microsoft. According to that list, NLLB-200 is the first to translate from and into Asturian (northern Spain) and Scottish Gaelic, for example. Limburgish (a South Low Franconian variety) and Silesian are listed there, among others, as languages spoken in Germany. A demonstration of the technology is now available on a dedicated Meta AI website. Several children’s books can be translated automatically there, but so far only into one of 15 languages. The rest of the roughly 200 languages are to follow “soon”. According to Meta AI, technology based on NLLB-200 can also be tried out in a translation tool that the Wikimedia Foundation makes available to Wikipedia editors.