January 23, 08:12

Pre-trained BERT in PyTorch



Model code here is just awesome.

Integrated DataParallel / DDP wrappers / FP16 wrappers also are awesome.

FP16 precision training from APEX just works (no idea about convergence though yet).


As for model weights - I cannot really tell, there is no dedicated Russian model.

The only problem I am facing now - using large embeddings bags batch size is literally 1-4 even for smaller models.

And training models with sentence piece is kind of feasible for rich languages, but you will always worry about generalization.


Did not try the generative pre-training (and sentence prediction pre-training), I hope that properly initializing embeddings will also work for a closed domain with a smaller model (they pre-train 4 days on 4+ TPUs, lol).


Why even tackle such models?

Chat / dialogue / machine comprehension models are complex / require one-off feature engineering.

Being able to tune something like BERT on publicly available benchmarks and then on your domain can provide a good way to embed complex situations (like questions in dialogues).




📖The Big-&-Extending-Repository-of-Transformers: Pretrained PyTorch models for Google's BERT, OpenAI GPT & GPT-2, Google/CMU Transformer-XL. - huggingface/pytorch-pretrained-BERT