LASER (Language-Agnostic SEntence Representations) is a method to generate pre-trained sentence representations in multiple languages. It was released by Facebook.
In Part 1 of this post, I will give an overview of LASER, and in Part 2, I will discuss an implementation of it.
A language model is a probability distribution over sequences of words. It is used to estimate the probability of a word appearing in a given context, and it helps represent text in a form a machine can work with.
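To make the definition concrete, here is a minimal sketch of a bigram language model estimated from word counts. The tiny corpus and the probabilities it yields are invented for illustration; they have nothing to do with LASER itself.

```python
# Toy bigram language model: estimate P(word | previous word) from counts.
# The corpus here is made up purely for illustration.
from collections import Counter

corpus = "the dog is brown the dog is small the cat is brown".split()

bigrams = Counter(zip(corpus, corpus[1:]))   # counts of adjacent word pairs
contexts = Counter(corpus[:-1])              # counts of each context word

def prob(word, context):
    """Maximum-likelihood estimate of P(word | context)."""
    if contexts[context] == 0:
        return 0.0
    return bigrams[(context, word)] / contexts[context]

print(prob("dog", "the"))  # "dog" follows "the" in 2 of 3 occurrences
```

Real language models use far richer statistics (or neural networks), but the idea is the same: assign a probability to a word given its context.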
Various language models released in recent years have shown excellent performance. However, most are trained in English or a few local languages, while there is a growing need to handle multiple languages within the same model. Traditional language models offer no multilingual support: they must be trained separately in each language to carry out the same tasks. Moreover, adequate datasets are not available for all languages, which constrains training the model for specific tasks. The release of multilingual language models, e.g. BERT by Google and LASER by Facebook, aims to solve these issues. So, for multilingual support, I have chosen LASER for carrying out the NLP tasks.
LASER (Language-Agnostic SEntence Representations) was released by Facebook on Jan 22, 2019, to provide multilingual support. LASER provides multilingual sentence representations to carry out various NLP tasks. It works with more than 90 languages, written in 28 different alphabets. LASER achieves these results by embedding all languages jointly in a single shared space (rather than having a separate model for each).
LASER's approach is based on zero-shot transfer of NLP models from one language, such as English, to scores of others, including languages where training data is extremely limited. (Zero-shot learning means leveraging networks already trained by supervised learning for new tasks, without additional supervised training.) LASER uses one single model to handle this variety of languages, including very low-resource languages. This makes it possible to build NLP features, such as sentiment analysis, in one language and then deploy them easily in more than 100 other languages, without any separate training on each language.
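The zero-shot idea can be sketched with a toy example: train a classifier on sentence embeddings from one language, then apply it unchanged to embeddings of another language, because both live in the same shared space. The 3-dimensional vectors and labels below are invented for illustration; real LASER embeddings are 1,024-dimensional and come from the trained encoder.

```python
# Toy illustration of zero-shot transfer in a shared embedding space.
# All vectors below are made up; real LASER embeddings are 1,024-dim.
import numpy as np

# "Training" data: English sentence embeddings with sentiment labels.
en_embeddings = np.array([
    [0.9, 0.1, 0.0],   # e.g. "great movie"   -> positive
    [0.8, 0.2, 0.1],   # e.g. "I loved it"    -> positive
    [0.1, 0.9, 0.2],   # e.g. "terrible film" -> negative
    [0.0, 0.8, 0.3],   # e.g. "I hated it"    -> negative
])
labels = ["positive", "positive", "negative", "negative"]

def classify(embedding):
    """Nearest-neighbour classifier over the shared space."""
    dists = np.linalg.norm(en_embeddings - embedding, axis=1)
    return labels[int(np.argmin(dists))]

# A French sentence with the same meaning lands near its English
# counterpart in the shared space, so the English-trained classifier
# still works, with no French training data at all.
fr_embedding = np.array([0.85, 0.15, 0.05])  # e.g. "super film"
print(classify(fr_embedding))  # positive
```

Any classifier trained on top of the shared space inherits this property, which is what makes the "train once, deploy in many languages" workflow possible.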
The diagram shows LASER’s approach: all languages are embedded in a single, shared space, and semantically similar vectors are placed in close proximity to each other.
LASER’s vector representations of sentences are generic with respect to both the input language and the NLP task. It maps a sentence in any language to a point in a high-dimensional space, with the goal that the same statement in any language ends up in the same neighborhood. This representation could be seen as a universal language in a semantic vector space. It can be observed from the figure above that the distance in space correlates very well with the semantic closeness of the sentences.
For example, the English sentence “the dog is brown” and its French translation “Le chien est brun” are placed very close together, as they have the same meaning.
LASER’s architecture is the same as in neural machine translation: an encoder/decoder approach. It uses one shared encoder for all input languages and a shared decoder to generate the output language. The encoder is a five-layer bidirectional LSTM (BiLSTM) network. It does not use an attention mechanism; instead, it represents the input sentence with a 1,024-dimension fixed-size vector, obtained by max-pooling over the last states of the BiLSTM. This makes it possible to compare sentence representations and feed them directly into a classifier. Sharing one encoder across languages was made possible by using a shared BPE (byte-pair encoding) vocabulary trained on the concatenation of all languages.
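The max-pooling step can be sketched in a few lines: whatever the sentence length, taking the element-wise maximum over the BiLSTM outputs for each time step yields one fixed-size vector. Dimensions are shrunk here for readability (LASER pools to 1,024 dimensions), and the random outputs stand in for a real encoder.

```python
# Sketch of max-pooling BiLSTM outputs into a fixed-size sentence vector.
# Random values stand in for real BiLSTM states; dims are shrunk for
# readability (LASER uses 1,024-dimensional vectors).
import numpy as np

seq_len, hidden_dim = 5, 8                      # 5 tokens, 8-dim outputs
rng = np.random.default_rng(0)
bilstm_outputs = rng.normal(size=(seq_len, hidden_dim))

# Element-wise max over the time axis: one vector per sentence,
# independent of sentence length.
sentence_vector = bilstm_outputs.max(axis=0)
print(sentence_vector.shape)  # (8,)
```

Because the pooled vector has the same size for every sentence, sentence representations can be compared directly or fed into a downstream classifier, exactly as described above.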
Long Short-Term Memory (LSTM) networks are a type of recurrent neural network capable of learning order dependence in sequence prediction problems. A bidirectional LSTM (BiLSTM) layer learns long-term dependencies in both directions between the time steps of a sequence. These dependencies are useful when you want the network to learn from the complete sequence at each time step.
For more details, you can find the academic paper here.