Our experiments with Transformers, BERT and generative language pre-training
For morphologically rich languages, pre-trained Transformers are not a silver bullet, and from a practitioner's perspective they are not feasible unless someone invests substantial computational resources both into sub-word tokenization methods that work well and into actually training these large networks.
On the other hand, we have definitively shown that:
- Starting a transformer with an EmbeddingBag initialized via FastText works and is relatively feasible;
- On complicated tasks, such a transformer significantly outperforms both training from scratch and naive models, and shows decent results compared to state-of-the-art specialized models;
- Pre-training worked, but it overfitted more than FastText initialization did, and given the complexity such pre-training requires, it is not useful.
All in all, this was a relatively large gamble that did not pay off: on a more down-to-earth task where we hoped the Transformer would excel, it did not.
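The FastText-based EmbeddingBag initialization from the list above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual code: the vector matrix here is random, whereas in practice it would be loaded from a trained FastText model, and the vocabulary size and dimensionality are placeholders.

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical pre-trained FastText sub-word vectors, one row per
# sub-word unit in the vocabulary. In a real pipeline this matrix
# would be loaded from a trained FastText model, not randomized.
vocab_size, dim = 1000, 300
fasttext_vectors = np.random.randn(vocab_size, dim).astype(np.float32)

# Initialize an EmbeddingBag from the FastText matrix. mode="mean"
# averages the sub-word vectors of a token into a single embedding,
# mirroring how FastText composes word vectors from n-grams.
bag = nn.EmbeddingBag.from_pretrained(
    torch.from_numpy(fasttext_vectors),
    mode="mean",
    freeze=False,  # allow fine-tuning together with the Transformer
)

# Each word is a bag of sub-word indices; offsets mark word starts.
subword_ids = torch.tensor([1, 4, 7, 2, 9])  # words: [1, 4, 7] and [2, 9]
offsets = torch.tensor([0, 3])
word_vectors = bag(subword_ids, offsets)     # shape: (2, dim)
```

The resulting `(num_words, dim)` tensor can then be fed into a standard Transformer encoder in place of a learned-from-scratch token embedding.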
Complexity / generalization / computational cost in modern applied NLP for morphologically rich languages. Towards a new state of the art?
Author's articles - http://spark-in.me/author/snakers41
Blog - http://spark-in.me
An approach to ranking search results with no annotation
Just a small article with a novel idea:
- Instead of training the network with cross-entropy (CE), just train it with binary cross-entropy (BCE);
- Source the supervision from the inner structure of your domain (tags, matrix decomposition methods, heuristics, etc.).
This works best if your ontology is relatively simple.
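The idea above can be sketched as a tiny PyTorch example. Everything here is illustrative and assumed, not taken from the article: the `PairScorer` model, its sizes, and the random inputs are placeholders; the key point is that each (query, document) pair gets a binary relevance target derived from domain structure (e.g. shared tags) and is trained with `BCEWithLogitsLoss` instead of a softmax cross-entropy over a fixed label set.

```python
import torch
import torch.nn as nn

# Hypothetical scorer: embeds a query and a document as bags of token
# ids and scores their match with a dot product. Sizes are placeholders.
class PairScorer(nn.Module):
    def __init__(self, vocab_size=500, dim=64):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, dim, mode="mean")

    def forward(self, query_ids, doc_ids):
        q = self.emb(query_ids)            # (batch, dim)
        d = self.emb(doc_ids)              # (batch, dim)
        return (q * d).sum(dim=1)          # one logit per (query, doc) pair

scorer = PairScorer()
loss_fn = nn.BCEWithLogitsLoss()

# Weak binary labels sourced from domain structure, e.g. 1.0 if the
# query and document share a tag, 0.0 otherwise - no manual annotation.
query_ids = torch.randint(0, 500, (8, 5))   # 8 queries, 5 tokens each
doc_ids = torch.randint(0, 500, (8, 20))    # 8 documents, 20 tokens each
labels = torch.randint(0, 2, (8,)).float()

logits = scorer(query_ids, doc_ids)
loss = loss_fn(logits, labels)
loss.backward()
```

At inference time, documents are ranked for a query by sorting on the raw logits; no calibrated class distribution is needed, which is what makes BCE with weak labels sufficient here.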