Old news ... but Attention works
Funny enough, but in the past my models :
- Either did not need attention;
- Attention was implemented by @thinline72 ;
- The domain was so complicated (NMT) so that I had to resort to boilerplate with key-value attention;
It was the first time I / we tried manually building a model with plain self attention from scratch.
An you know - it really adds 5-10% to all of the tracked metrics.
Best plain attention layer in PyTorch - simple, well documented ... and it works in real life applications: