Installing APEX ... in style )
Sometimes you just need to try fp16 training (GANs, large networks, rare cases).
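One reason fp16 training usually needs extra tooling is that small gradients can underflow to zero in half precision. The sketch below is only an illustration using the Python standard library, not code from any training framework. The gradient value and the loss scale are made-up numbers. It rounds a tiny value to fp16 directly, then does the same after multiplying by a scale factor and dividing it back out in higher precision.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8          # hypothetical tiny gradient value
loss_scale = 1024.0  # hypothetical static loss scale

# Stored directly in fp16, the value is below the smallest subnormal and is lost.
naive = to_fp16(grad)

# Scaled up first, it lands in the representable fp16 range.
scaled = to_fp16(grad * loss_scale)

# Unscaling happens in higher precision (a regular Python float here).
recovered = scaled / loss_scale

print("naive fp16 value:", naive)
print("recovered after scaling:", recovered)
```

Mixed-precision libraries automate this kind of scaling (and usually adjust the scale dynamically), so you rarely have to do it by hand.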
There is no better way to do this than to use NVIDIA's APEX library.
Luckily - they have very nice examples:
Well ... it installs on a clean machine, but I want it to always work in my own environment )
So I ploughed through all the conda / environment setup mumbo-jumbo and created a version of our deep-learning / DS Dockerfile that now installs from the PyTorch image (PyTorch GPU / CUDA / cuDNN + APEX).
It was kind of painful: PyTorch images already contain conda / pip, which was not apparent at first and caused all sorts of problems with my miniconda installation.
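A quick way to spot this kind of clash before layering your own miniconda on top is to check what the base image already has on `PATH`. The snippet below is a minimal sketch, not part of the Dockerfile mentioned above. The helper name `find_preinstalled_tools` is made up for illustration, and it only uses the standard library, so it can run inside any container that has Python.

```python
import shutil
import sys


def find_preinstalled_tools(names=("conda", "pip", "python")):
    """Map each tool name to the path found on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # The interpreter running this script, useful to compare with `python` on PATH.
    print("interpreter:", sys.executable)
    for name, path in find_preinstalled_tools().items():
        print(f"{name}: {path or 'not found'}")
```

If `conda` already resolves to a path inside the image, installing a second conda next to it is a likely source of trouble.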
So use it and please report if it is still buggy.
Logging your hardware, with logs, charts and alerts - in style
TLDR - we have been looking for THE software to do this easily, with charts / alerts / easy install.
We found Prometheus. Configuring alerts was a bit of a problem, but enjoy:
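For context on what Prometheus actually collects: it scrapes metrics from HTTP endpoints that serve a simple plain-text format. The sketch below is not the setup referred to above, and in practice hardware metrics normally come from a ready-made exporter rather than hand-written code. It only shows roughly what one gauge looks like in that text format. The function name `render_gauge` and the metric name `demo_disk_free_bytes` are hypothetical.

```python
import shutil


def render_gauge(name, help_text, value, labels=None):
    """Render one gauge sample in the Prometheus text exposition format."""
    label_str = ""
    if labels:
        # Labels are written as key="value" pairs inside curly braces.
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name}{label_str} {value}\n"
    )


if __name__ == "__main__":
    # Hypothetical metric: free bytes on the root filesystem.
    free_bytes = shutil.disk_usage("/").free
    print(render_gauge("demo_disk_free_bytes", "Free disk space in bytes.",
                       free_bytes, {"mount": "/"}), end="")
```

Alerts are then defined separately as rules over metrics like this one, for example firing when free space stays below a threshold for some time.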