August 11, 04:42

Extreme NLP network miniaturization

Tried some plain RNNs on a custom, in-the-wild NER task.

The dataset is huge - effectively infinite - but manually generated to mimic in-the-wild data.

I use EmbeddingBag + 1M n-grams (an optimal cut-off). On NER / classification it is a handy trick that makes your pipeline essentially misprint / error / OOV agnostic. FAIR themselves just arrived at the same idea (see the MOE note below). Very cool! Just add PyTorch and you are golden.
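A minimal sketch of what that trick can look like, assuming character 3-grams and a deterministic CRC32 hash into 1M buckets - the n-gram size, the hashing scheme and the names here are my guesses for illustration, not the exact recipe from the post:

```python
import zlib

import torch
import torch.nn as nn

NUM_BUCKETS = 1_000_000   # fixed hash space: no vocabulary, hence no OOV
EMB_DIM = 50              # the post reports that even 5 works

def char_ngrams(token, n=3):
    padded = f"<{token}>"
    return [padded[i:i + n] for i in range(max(len(padded) - n + 1, 1))]

def ngram_ids(token):
    # Any string, misprints included, still maps to some buckets.
    return [zlib.crc32(g.encode("utf-8")) % NUM_BUCKETS for g in char_ngrams(token)]

class NgramBagEmbedder(nn.Module):
    def __init__(self):
        super().__init__()
        # mode="mean" averages all n-gram vectors of a token into one vector
        self.bag = nn.EmbeddingBag(NUM_BUCKETS, EMB_DIM, mode="mean")

    def forward(self, tokens):
        ids, offsets = [], []
        for tok in tokens:
            offsets.append(len(ids))
            ids.extend(ngram_ids(tok))
        ids = torch.tensor(ids, dtype=torch.long)
        offsets = torch.tensor(offsets, dtype=torch.long)
        return self.bag(ids, offsets)          # (len(tokens), EMB_DIM)

emb = NgramBagEmbedder()
# a misspelled token still shares most of its n-grams with the correct one
print(emb(["adress", "address"]).shape)        # torch.Size([2, 50])
```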

What is interesting:

- The model works with embedding sizes of 300, 100, 50 and even 5! 5 is dangerously close to OHE, but doing OHE on 1M n-grams kind of does not make sense;

- The model works with various hidden sizes;

- Naturally, all of the models run very fast on CPU, but the smallest model is also very light in terms of its weights;

- The only real difference is convergence time. It seems to scale roughly as the log of model size, i.e. the model with embedding size 5 takes 5-7x longer to converge than the model with 50. I wonder what happens with an embedding size of 1?

As an added bonus, you can just store such a miniature model in git without LFS (a rough size estimate is sketched below).
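To make the "fits in git" point concrete, here is a rough sketch of what such a miniature tagger could look like; the GRU, the hidden size of 16, the 5 tags and the 1M-bucket hash space are placeholders I picked for illustration, not the actual architecture or sizes:

```python
import torch
import torch.nn as nn

NUM_BUCKETS, EMB_DIM, HIDDEN, NUM_TAGS = 1_000_000, 5, 16, 5

class TinyTagger(nn.Module):
    def __init__(self):
        super().__init__()
        self.bag = nn.EmbeddingBag(NUM_BUCKETS, EMB_DIM, mode="mean")
        self.rnn = nn.GRU(EMB_DIM, HIDDEN, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * HIDDEN, NUM_TAGS)

    def forward(self, ids, offsets):           # one sentence of hashed n-grams
        tok = self.bag(ids, offsets)           # (seq_len, EMB_DIM)
        h, _ = self.rnn(tok.unsqueeze(0))      # (1, seq_len, 2 * HIDDEN)
        return self.out(h).squeeze(0)          # (seq_len, NUM_TAGS)

model = TinyTagger()
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,} params, ~{n_params * 4 / 2**20:.0f} MiB in fp32")

# dummy sentence: 6 tokens, 4 hashed n-grams each
ids = torch.randint(0, NUM_BUCKETS, (24,))
offsets = torch.arange(0, 24, 4)
print(model(ids, offsets).shape)               # torch.Size([6, 5])
```

With these made-up settings almost all of the weight sits in the 1M x 5 embedding table, and the fp32 checkpoint stays well under common per-file limits, which is why plain git works.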

And what was that about training transformers on US$250k worth of compute credits, you say?)




A new model for word embeddings that are resilient to misspellings

Misspelling Oblivious Embeddings (MOE) is a new model for word embeddings that are resilient to misspellings, improving the ability to apply word embeddings to real-world situations, where misspellings are common.