Extreme NLP network miniaturization
Tried some plain RNNs on a custom in the wild NER task.
The dataset is huge - literally infinite, but manually generated to mimick in-the-wild data.
I use EmbeddingBag + 1m n-grams (an optimal cut-off). Yeah, on NER / classification it is a handy trick that makes your pipeline totally misprint / error / OOV agnostic. Also FAIR themselves just guessed this too. Very cool! Just add PyTorch and you are golden.
What is interesting:
- Model works with embedding sizes 300, 100, 50 and even 5! 5 is dangerously close to OHE, but doing OHE on 1m n-grams kind-of does not make sense;
- Model works with various hidden sizes
- Naturally all of the models run on CPU very fast, but the smallest model also is very light in terms of its weights;
- The only difference is - convergence time. It kind of scales as a log of model size, i.e. model with 5 takes 5-7x more time to converge compared to model with 50. I wonder what if I use embedding size of 1?;
As added bonus - you can just store such miniature model in git w/o lfs.
What is with training transformers on US$250k worth of compute credits you say?)
Misspelling Oblivious Embeddings (MOE) is a new model for word embeddings that are resilient to misspellings, improving the ability to apply word embeddings to real-world situations, where misspellings are common.