March 31, 12:44

Miniaturize / optimize your ... NLP models?

For CV applications there literally dozens of ways to make your models smaller.

And yeah, I do not mean some "moonshots" or special limited libraries (matrix decompositions, some custom pruning, etc etc).

I mean cheap and dirty hacks, that work in 95% of cases regardless of your stack / device / framework:

- Smaller images (x3-x4 easy);

- FP16 inference (30-40% maybe);

- Knowledge distillation into smaller networks (x3-x10);

- Naïve cascade optimizations (feed only Nth frame using some heuristic);

But what can you do with NLP networks?

Turns out not much.

But here are my ideas:

- Use a simpler model - embedding bag + plain self-attention + LSTM can solve 90% of tasks;

- Decrease embedding size from 300 to 50 (or maybe even more). Tried and tested, works like a charm. For harder tasks you lose just 1-3pp of your target metric, for smaller tasks - it is just the same;

- FP16 inference is supported in PyTorch for nn.Embedding, but not for nn.EmbeddingBag. But you get the idea;

_embedding_bag is not implemented for type torch.HalfTensor

- You can try distilling your vocabulary / embedding-bag model into a char level model. If it works, you can trade model size vs. inference time;

- If you have very long sentences or large batches - try distilling / swapping your recurrent network with a CNN / TCN. This way you can also trade model size vs. inference time but probably in a different direction;