Pre-trained BERT in PyTorch
The model code here is just awesome.
The integrated DataParallel / DistributedDataParallel / FP16 wrappers are also great.
FP16 mixed-precision training via APEX just works (though I cannot say anything about convergence yet).
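For reference, a minimal sketch of what the APEX mixed-precision path looks like; the toy Linear model, the "O1" opt level, and the hyperparameters are placeholders, and it assumes apex is installed and a CUDA device is available:

```python
import torch
import torch.nn as nn
from apex import amp  # NVIDIA APEX, assumed installed

model = nn.Linear(768, 2).cuda()       # toy stand-in for a real model
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

# "O1" = mixed precision: FP16 ops where safe, FP32 master weights
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

x = torch.randn(8, 768).cuda()
y = torch.randint(0, 2, (8,)).cuda()

loss = nn.functional.cross_entropy(model(x), y)
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()             # loss scaling avoids FP16 underflow
optimizer.step()
```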
As for the model weights, I cannot really tell yet; there is no dedicated Russian model.
The only problem I am facing now: with large embedding bags, the batch size is literally 1-4 even for smaller models.
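The standard workaround (not specific to this repo) is gradient accumulation: run several tiny forward/backward passes before each optimizer step to simulate a larger effective batch. A sketch, with a toy model and dataset as placeholders:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(768, 2)              # toy stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
loader = DataLoader(
    TensorDataset(torch.randn(64, 768), torch.randint(0, 2, (64,))),
    batch_size=2,                      # tiny micro-batch, as above
)

accumulation_steps = 8                 # effective batch = 2 * 8 = 16
optimizer.zero_grad()
for step, (inputs, labels) in enumerate(loader):
    loss = nn.functional.cross_entropy(model(inputs), labels)
    (loss / accumulation_steps).backward()  # average grads over micro-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```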
And training models with sentencepiece is kind of feasible for morphologically rich languages, but you will always worry about generalization.
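For context, training a sentencepiece model itself takes very little code; the corpus path, vocab size, and model type below are placeholders:

```python
import sentencepiece as spm

# Train a unigram sentencepiece model on a plain-text corpus
spm.SentencePieceTrainer.Train(
    "--input=corpus.txt --model_prefix=ru_sp "
    "--vocab_size=32000 --model_type=unigram --character_coverage=1.0"
)

sp = spm.SentencePieceProcessor()
sp.Load("ru_sp.model")
print(sp.EncodeAsPieces("пример предложения"))  # "example sentence" -> subword pieces
```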
I did not try the generative pre-training (or the next-sentence-prediction pre-training); I hope that properly initializing the embeddings will also work for a closed domain with a smaller model (they pre-train for 4 days on 4+ TPUs, lol).
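To make "properly initializing embeddings" concrete, one way I imagine it: copy pre-trained vectors into a smaller model's embedding layer before in-domain training. Purely a sketch; the sizes and the source of the vectors are placeholders:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 32000, 300               # placeholder sizes
pretrained = torch.randn(vocab_size, emb_dim)  # stand-in for actual loaded vectors

emb = nn.Embedding(vocab_size, emb_dim)
with torch.no_grad():
    emb.weight.copy_(pretrained)               # initialize from pre-trained vectors
emb.weight.requires_grad = True                # then fine-tune on the closed domain
```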
Why even tackle such models?
Chat / dialogue / machine comprehension models are complex and require one-off feature engineering.
Being able to fine-tune something like BERT on publicly available benchmarks and then on your own domain is a good way to embed complex situations (like questions in dialogues).
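As an illustration of how little glue code such tuning needs with this repo (pytorch-pretrained-bert), a sketch of a single fine-tuning step; the multilingual checkpoint name, the example sentence, and the label are placeholders:

```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForSequenceClassification

# Multilingual checkpoint as a stand-in, since there is no dedicated Russian model
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2
)

# "Как оформить возврат?" = "How do I make a return?"
tokens = ["[CLS]"] + tokenizer.tokenize("Как оформить возврат?") + ["[SEP]"]
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
labels = torch.tensor([1])              # placeholder class label

loss = model(input_ids, labels=labels)  # the model returns loss when labels are given
loss.backward()                         # gradients for one fine-tuning step
```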