NLP - Highlight of the week - LASER
- Hm, a new sentence embedding tool?
- Plain PyTorch 1.0 / numpy / FAISS based;
- Looks like an off-shoot of their "unsupervised" NMT project;
- Alleged pros:
LASER’s vector representations of sentences are generic with respect to both the
input language and the NLP task. The tool maps a sentence in any language to a
point in a high-dimensional space, with the goal that the same statement in any
language will end up in the same neighborhood. This representation could be seen
as a universal language in a semantic vector space. We have observed that the
distance in that space correlates very well with the semantic closeness of the sentences.
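The "same neighborhood" claim boils down to nearest-neighbor search over the embeddings (which is where FAISS comes in for large corpora). A minimal numpy sketch of that idea, assuming the sentence vectors have already been computed; the toy 4-dim vectors below are made up purely for illustration (real LASER embeddings are 1024-dim):

```python
import numpy as np

# Hypothetical pre-computed sentence embeddings (values are invented;
# a real encoder would produce them from the raw sentences).
embeddings = {
    "the cat sits on the mat":        np.array([0.90, 0.10, 0.00, 0.20]),
    "le chat est assis sur le tapis": np.array([0.88, 0.12, 0.05, 0.18]),
    "stock prices fell sharply":      np.array([0.05, 0.90, 0.30, 0.10]),
}

def nearest(query_vec, index):
    """Return the sentence whose embedding has the highest cosine similarity."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(index, key=lambda s: cos(query_vec, index[s]))

query = embeddings["the cat sits on the mat"]
others = {s: v for s, v in embeddings.items() if s != "the cat sits on the mat"}
print(nearest(query, others))  # the French paraphrase, not the finance sentence
```

At scale you would replace the linear scan with a FAISS index over L2-normalized vectors, but the geometry being exploited is the same.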
They essentially trained an NMT model with a shared encoder for many languages.
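A back-of-the-envelope sketch of that shared-encoder setup (all names and shapes here are invented for illustration; the actual LASER encoder is a BiLSTM over a joint BPE vocabulary, not the mean-pooled toy below):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 1000, 8  # joint subword vocabulary shared across all languages

# One embedding table and one encoder weight, shared by every language...
shared_embed = rng.normal(size=(VOCAB, DIM))
shared_enc = rng.normal(size=(DIM, DIM))

# ...while each target language gets its own decoder head.
decoders = {lang: rng.normal(size=(DIM, VOCAB)) for lang in ("en", "fr", "de")}

def encode(token_ids):
    """Language-agnostic sentence vector: pooled encoded token embeddings."""
    return np.tanh(shared_embed[token_ids] @ shared_enc).mean(axis=0)

def decode_logits(sentence_vec, target_lang):
    """Next-token logits from the target language's own decoder head."""
    return sentence_vec @ decoders[target_lang]

vec = encode([3, 14, 159])         # same encoder whatever the source language
logits = decode_logits(vec, "fr")  # only the head depends on the target
print(vec.shape, logits.shape)     # (8,) (1000,)
```

Because the encoder never sees a language label, the training signal pushes translations of the same sentence toward the same vector, which is exactly what makes the embeddings useful downstream.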
- It delivers extremely fast performance, processing up to 2,000 sentences per second on GPU.
- The sentence encoder is implemented in PyTorch with minimal external dependencies.
- Languages with limited resources can benefit from joint training over many languages.
- The model supports the use of multiple languages in one sentence.
- Performance improves as new languages are added, as the system learns to recognize characteristics of language families.
I tried training something similar, but it quickly over-fitted into simply memorizing the indexes of words.