Google announces "LaBSE", a language model that can handle even "unknown languages"
Google has announced the Language-agnostic BERT Sentence Embedding model (LaBSE), which brings high-precision multilingual sentence embedding to the natural language model BERT. LaBSE is pre-trained on 109 languages and can perform accurately even on languages that are absent from its training data.
Google AI Blog: Language-Agnostic BERT Sentence Embedding
A natural language model needs "embedding", which maps sentences into a vector space, and a model that handles multiple languages must embed sentences from different languages into the same vector space. Existing multilingual embedding models, such as Facebook's LASER and Google's m-USE, map sentences from one language directly onto another, but they have weaknesses: their performance falls short of dedicated bilingual models limited to a single language pair, and low-resource languages suffer from insufficient data for learning good mappings.
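To illustrate what a shared, language-agnostic vector space buys you, here is a minimal sketch using toy vectors (stand-ins, not real LaBSE embeddings): a sentence and its translation should land close together, while an unrelated sentence lands farther away.

```python
# Toy illustration of a shared multilingual embedding space.
# The vectors below are made-up stand-ins, not real LaBSE outputs.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# In a language-agnostic space, a sentence and its translation sit near
# each other regardless of language (toy 4-dimensional vectors).
en_hello  = np.array([0.9, 0.1, 0.0, 0.2])   # "Hello, world"
ja_hello  = np.array([0.8, 0.2, 0.1, 0.2])   # its Japanese translation
en_stocks = np.array([0.0, 0.1, 0.9, 0.4])   # an unrelated sentence

# The translation pair scores higher than the unrelated pair.
sim_pair      = cosine_similarity(en_hello, ja_hello)
sim_unrelated = cosine_similarity(en_hello, en_stocks)
```

Because every language shares one space, nearest-neighbor search over embeddings can match translations across languages without a dedicated model per language pair.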
"LaBSE" developed by Google this time is a BERT embedding model that allows multilingual embedding in 109 languages. Masked Language Model (MLM), which trains a model by filling in monolingual sentences for 17 billion monolingual sentences and 6 billion bilingual sentences, and MLM is a multilingual bilingual sentence. By implementing the Translation Language Model (TLM) applied to, a model that is effective even for low-resource languages that do not have data during learning is realized. The graph of the number of datasets used for learning is shown below. The blue part is the number of sentences in a single language in each language, and the red part is the number of parallel translations with English.
The core mechanism of LaBSE is the "translation ranking task". Given a sentence in the source language and a set of candidate sentences in the target language, the model is trained to rank which candidate is the best translation.
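The ranking step itself reduces to scoring each candidate's embedding against the source embedding and sorting. The sketch below uses toy vectors and a plain dot-product score; the actual training objective (an in-batch softmax over these scores) is omitted for brevity.

```python
# Sketch of the translation ranking task: score every candidate target
# sentence against the source sentence and rank the true translation
# highest. Vectors are toy values, not real LaBSE embeddings.
import numpy as np

def rank_translations(src_emb: np.ndarray, tgt_embs: np.ndarray) -> np.ndarray:
    """Return candidate indices sorted best-first by dot-product score."""
    scores = tgt_embs @ src_emb      # one similarity score per candidate
    return np.argsort(-scores)       # descending order of score

# Toy embeddings: candidate 2 is the true translation of the source.
src = np.array([0.9, 0.1, 0.3])
candidates = np.array([
    [0.1, 0.9, 0.0],   # wrong translation
    [0.0, 0.2, 0.9],   # wrong translation
    [0.8, 0.2, 0.3],   # true translation (closest to src)
])
ranking = rank_translations(src, candidates)  # true translation ranks first
```

During training the other sentences in the same batch serve as the "wrong" candidates, so the model learns to pull true translation pairs together and push everything else apart.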
Until now, the translation ranking task worked well for embedding two languages, but in the multilingual setting limits on model size and vocabulary size made it difficult to improve accuracy. Building on advances in language model development, including MLM and TLM, LaBSE realizes a 12-layer Transformer with a 500,000-token vocabulary trained on 109 languages, successfully scaling up both the model and its vocabulary.
The results of comparing the accuracy of m-USE, LASER, and LaBSE on Tatoeba, a published collection of example sentences in many languages with their translations, are as follows. "14 Langs" covers the languages supported by m-USE, "36 Langs" covers the languages used in XTREME, a benchmark for evaluating multilingual ability, "82 Langs" covers the languages included in LASER's training data, and "All Langs" covers every language in Tatoeba. LaBSE scores highest in every language group.
In addition, for the more than 30 languages that were not included in its training data, LaBSE achieves over 75% accuracy on about a third of them, demonstrating its strong multilingual ability.
"What we're showing here is just the beginning. We believe there are more important research problems, such as building better models that support all languages," Google commented. LaBSE is published on TensorFlow Hub.
LaBSE | TensorFlow Hub