January 23, 11:26

NLP - Highlight of the week - LASER

- Hm, a new sentence embedding tool?

- Plain PyTorch 1.0 / numpy / FAISS based;

- [Release](https://code.fb.com/ai-research/laser-multilingual-sentence-embeddings/), [library](https://github.com/facebookresearch/LASER);

- Looks like an offshoot of their "unsupervised" NMT project;

LASER’s vector representations of sentences are generic with respect to both the
input language and the NLP task. The tool maps a sentence in any language to a
point in a high-dimensional space with the goal that the same statement in any
language will end up in the same neighborhood. This representation could be seen
as a universal language in a semantic vector space. We have observed that the
distance in that space correlates very well to the semantic closeness of the
sentences.

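
The "same neighborhood" idea in the quote can be illustrated with a tiny nearest-neighbour lookup. This is only a sketch: the 3-dimensional vectors are made up for illustration (they are not real LASER output), and `cosine` and `nearest` are hypothetical helper names. At scale you would presumably do the same search with numpy or FAISS.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query, candidates):
    """Return the key of the candidate vector most similar to `query`."""
    return max(candidates, key=lambda key: cosine(query, candidates[key]))


# Made-up toy vectors standing in for sentence embeddings (assumption).
embeddings = {
    "en: The cat sleeps.": [0.90, 0.10, 0.05],
    "es: El gato duerme.": [0.88, 0.12, 0.07],
    "en: Stocks fell sharply.": [0.05, 0.20, 0.95],
}

# Pretend this is the embedding of "fr: Le chat dort."
query = [0.91, 0.09, 0.06]

best = nearest(query, embeddings)
for sentence, vector in embeddings.items():
    print(f"{cosine(query, vector):.3f}  {sentence}")
```

If the encoder does its job, translations of the same sentence score close to 1.0 against each other while unrelated sentences score much lower, regardless of language.
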
- Alleged pros:

  - It delivers extremely fast performance, processing up to 2,000 sentences per second on GPU.
  - The sentence encoder is implemented in PyTorch with minimal external dependencies.
  - Languages with limited resources can benefit from joint training over many languages.
  - The model supports the use of multiple languages in one sentence.
  - Performance improves as new languages are added, as the system learns to recognize characteristics of language families.

They essentially trained an NMT model with a shared encoder for many languages.

I tried training something similar, but it quickly overfitted and just memorized word indices.
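
For reference, a rough PyTorch sketch of what "an NMT model with a shared encoder" could look like. This is not the LASER code: the class names, layer sizes and toy token ids are all assumptions. One encoder sees every language through a shared vocabulary and max-pools its states into a fixed-size sentence vector, and only the decoder is told which language to generate.

```python
import torch
import torch.nn as nn


class SharedEncoder(nn.Module):
    """One encoder for all input languages (shared subword vocabulary)."""

    def __init__(self, vocab_size, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer ids, 0 = padding.
        # Toy simplification: padded steps still pass through the LSTM.
        states, _ = self.rnn(self.embed(tokens))        # (batch, seq_len, 2 * hidden)
        pad = (tokens == 0).unsqueeze(-1)
        states = states.masked_fill(pad, float("-inf"))  # ignore padding in the max
        return states.max(dim=1).values                  # (batch, 2 * hidden)


class Decoder(nn.Module):
    """Decoder conditioned on the sentence vector and a target-language id."""

    def __init__(self, vocab_size, n_langs, sent_dim, emb_dim=32, lang_dim=8, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lang = nn.Embedding(n_langs, lang_dim)
        self.rnn = nn.LSTM(emb_dim + lang_dim + sent_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, sent_emb, prev_tokens, lang_id):
        steps = prev_tokens.size(1)
        # Feed the same conditioning vector at every decoding step.
        cond = torch.cat([sent_emb, self.lang(lang_id)], dim=-1)
        cond = cond.unsqueeze(1).expand(-1, steps, -1)
        states, _ = self.rnn(torch.cat([self.embed(prev_tokens), cond], dim=-1))
        return self.out(states)                          # (batch, steps, vocab)


torch.manual_seed(0)
encoder = SharedEncoder(vocab_size=100)
decoder = Decoder(vocab_size=100, n_langs=2, sent_dim=128)

src = torch.tensor([[5, 6, 7, 0], [8, 9, 10, 11]])  # toy source ids, any language
tgt_in = torch.tensor([[1, 20, 21], [1, 30, 31]])   # decoder inputs (1 = start token)
tgt_out = torch.tensor([[20, 21, 2], [30, 31, 2]])  # expected outputs (2 = end token)
lang = torch.tensor([0, 1])                          # target language per sentence

sent_emb = encoder(src)
logits = decoder(sent_emb, tgt_in, lang)
loss = nn.functional.cross_entropy(logits.reshape(-1, 100), tgt_out.reshape(-1), ignore_index=0)
```

The pooled `sent_emb` is the only thing the decoder sees of the source, so after training it is the part you would keep as the sentence embedding. With a small vocabulary and little data, a model like this can easily memorize token ids instead of learning meaning.
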




LASER natural language processing toolkit - Facebook Code

Our natural language processing toolkit, LASER, performs zero-shot cross-lingual transfer with more than 90 languages and is now open source.