July 12, 05:25

Installing apex ... in style )

Sometimes you just need to try fp16 training (GANs, large networks, rare cases).

There is no better way to do this than to use NVIDIA's APEX library.

Luckily - they have very nice examples:

- github.com/NVIDIA/apex/tree/master/examples/docker

Well ... it installs on a clean machine, but I want it to always work with my own environment)

So, I ploughed through all the conda / environment setup mumbo-jumbo and created a version of our deep-learning / ds dockerfile, but now built on top of the PyTorch image (PyTorch GPU / CUDA / CUDNN + APEX).

- github.com/snakers4/gpu-box-setup/blob/master/dockerfile/Dockerfile_apex

It was kind of painful, because PyTorch images already contain conda / pip, which was not apparent at first and caused all sorts of problems with my miniconda installation.

So use it and please report if it is still buggy.
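
If you want something concrete to attach to a bug report, here is a minimal sketch of a sanity check to run inside the container. The function name `check_fp16_env` is made up for illustration and is not part of the repo; it only assumes that the image is supposed to ship `torch` and `apex.amp`, and it just reports what can actually be imported.

```python
import importlib


def check_fp16_env():
    """Report which pieces of the PyTorch + APEX stack are importable here.

    Hypothetical helper, not part of gpu-box-setup or APEX.
    """
    report = {"torch": None, "cuda": False, "apex_amp": False}
    try:
        torch = importlib.import_module("torch")
    except Exception:
        # No usable PyTorch: nothing else is worth checking.
        return report
    report["torch"] = str(torch.__version__)
    report["cuda"] = bool(torch.cuda.is_available())
    try:
        # apex.amp is the mixed precision module of NVIDIA APEX.
        importlib.import_module("apex.amp")
        report["apex_amp"] = True
    except Exception:
        # Missing or broken APEX build: leave the flag as False.
        pass
    return report


if __name__ == "__main__":
    print(check_fp16_env())
```

If `torch` shows up but `apex_amp` is False, the APEX build step is the likely culprit; if `torch` itself is None, check which conda / pip the image is really using.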





Logging your hardware, with logs, charts and alerts - in style

TLDR - we have been looking for THE software to do this easily, with charts / alerts / easy install.

We found Prometheus. Configuring alerts was a bit of a problem, but enjoy:

- github.com/snakers4/gpu-box-setup#prometheus
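
To get a feeling for what an alert condition looks like before touching the alert config, here is a rough sketch that asks Prometheus which scrape targets are down. It uses the standard HTTP API endpoint `/api/v1/query` and the built-in `up` metric. The base URL `http://localhost:9090` (the default Prometheus port) and all function names are assumptions for illustration, not something taken from the repo.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def down_targets(payload):
    """Extract (job, instance) pairs from a response to the query `up == 0`."""
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed: %r" % payload.get("error"))
    return [
        (series["metric"].get("job"), series["metric"].get("instance"))
        for series in payload["data"]["result"]
    ]


def fetch_down_targets(base_url="http://localhost:9090", timeout=5.0):
    """Query a running Prometheus server (assumed address) for down targets."""
    url = build_query_url(base_url, "up == 0")
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return down_targets(json.load(resp))


if __name__ == "__main__":
    print(fetch_down_targets())
```

The same expression, `up == 0`, is the kind of thing that goes into the `expr` field of an alerting rule, so once the query returns what you expect, moving it into the alert config is mostly copy-paste.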



