Spark in me - Internet, data science, math, deep learning, philo

snakers4 @ telegram, 1797 members, 1726 posts since 2016

All this - lost like tears in rain.

Data science, ML, a bit of philosophy and math. No bs.

Our website
Our chat
DS courses review

Posts by tag «deep_learning»:

snakers4 (Alexander), April 14, 06:59

PyTorch DataParallel scalability

TLDR - it works fine for 2-3 GPUs.

For more GPUs - use DDP.



Unsupervised Language Modeling at scale for robust sentiment classification - NVIDIA/sentiment-discovery

snakers4 (Alexander), April 09, 06:00

2019 DS / ML digest number 8

Highlights of the week

- Transformer from Facebook with sub-word information;

- How to generate endless sentiment annotation;

- 1M breast cancer images;



2019 DS/ML digest 08

2019 DS/ML digest 08 Статьи автора - Блог -

snakers4 (Alexander), March 31, 12:44

Miniaturize / optimize your ... NLP models?

For CV applications there literally dozens of ways to make your models smaller.

And yeah, I do not mean some "moonshots" or special limited libraries (matrix decompositions, some custom pruning, etc etc).

I mean cheap and dirty hacks, that work in 95% of cases regardless of your stack / device / framework:

- Smaller images (x3-x4 easy);

- FP16 inference (30-40% maybe);

- Knowledge distillation into smaller networks (x3-x10);

- Naïve cascade optimizations (feed only Nth frame using some heuristic);

But what can you do with NLP networks?

Turns out not much.

But here are my ideas:

- Use a simpler model - embedding bag + plain self-attention + LSTM can solve 90% of tasks;

- Decrease embedding size from 300 to 50 (or maybe even more). Tried and tested, works like a charm. For harder tasks you lose just 1-3pp of your target metric, for smaller tasks - it is just the same;

- FP16 inference is supported in PyTorch for nn.Embedding, but not for nn.EmbeddingBag. But you get the idea;

_embedding_bag is not implemented for type torch.HalfTensor

- You can try distilling your vocabulary / embedding-bag model into a char level model. If it works, you can trade model size vs. inference time;

- If you have very long sentences or large batches - try distilling / swapping your recurrent network with a CNN / TCN. This way you can also trade model size vs. inference time but probably in a different direction;



snakers4 (Alexander), March 26, 15:30


Updated my DL/ML dockerfile with

- cuda 10

- PyTorch 1.0

TF now also works with cuda 10



Contribute to snakers4/gpu-box-setup development by creating an account on GitHub.

snakers4 (Alexander), March 25, 05:31

Good old OLS regression

I needed some quick boilerplate to create an OLS regression with confidence intervals for a very plain task.

Found some nice statsmodels examples here:


2019 DS / ML digest number 7

Highlights of the week

- NN normalization techniques (not batch norm);

- Jetson nano for US$99 released;

- A bitter lesson in AI;



2019 DS/ML digest 07

2019 DS/ML digest 06 Статьи автора - Блог -

snakers4 (Alexander), March 21, 11:15

Normalization techniques other than batch norm:


Weight normalization (used in TCN

- Decouples length of weight vectors from their direction;

- Does not introduce any dependencies between the examples in a minibatch;

- Can be applied successfully to recurrent models such as LSTMs;

- Tested only on small datasets (CIFAR + VAES + DQN);

Instance norm (used in [style transfer](

- Proposed for style transfer;

- Essentially is batch-norm for one image;

- The mean and standard-deviation are calculated per-dimension separately for each object in a mini-batch;

Layer norm (used in Transformers, [paper](

- Designed especially for sequntial networks;

- Computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case;

- The mean and standard-deviation are calculated separately over the last certain number dimensions;

- Unlike Batch Normalization and Instance Normalization, which applies scalar scale and bias for each entire channel/plane with the affine option, Layer Normalization applies per-element scale and bias;



snakers4 (Alexander), March 18, 06:18

6th 2019 DS / ML digest

Highlights of the week

- Cool python features;

- Google's on-device STT;

- Why Facebook invested so much in PyTorch 1.0;




2019 DS/ML digest 06

2019 DS/ML digest 06 Статьи автора - Блог -

snakers4 (Alexander), March 17, 15:40

New large dataset for you GAN or pix2pix pet project

500k fashion images + meta-data + landmarks



DeepFashion2 Dataset - switchablenorms/DeepFashion2

snakers4 (Alexander), March 07, 15:42

Our experiments with Transformers, BERT and generative language pre-training


For morphologically rich languages pre-trained Transformers are not a silver bullet and from a layman's perspective they are not feasible unless someone invests huge computational resources into sub-word tokenization methods that work well + actually training these large networks.

On the other hand we have definitively shown that:

- Starting a transformer with Embedding bag initialized via FastText works and is relatively feasible;

- On complicated tasks - such transformer significantly outperforms training from scratch (as well as naive models) and shows decent results compared to state-of-the-art specialized models;

- Pre-training worked, but it overfitted more thatn FastText initialization and given the complexity required for such pre-training - it is not useful;

All in all this was a relatively large gamble, which did not pay off - on some more down-to-earth task we hoped the Transformer would excel at - it did not.


Complexity / generalization /computational cost in modern applied NLP for morphologically rich languages

Complexity / generalization /computational cost in modern applied NLP for morphologically rich languages. Towards a new state of the art? Статьи автора - Блог -

An approach to ranking search results with no annotation

Just a small article with a novel idea:

- Instead of training a network with CE - just train it with BCE;

- Source additional structure from the inner structure of your domain (tags, matrix decomposition methods, heuristics, etc);

Works best if your ontology is relatively simple.


Learning to rank search results without annotation

Solving search ranking problem Статьи автора - Блог -

snakers4 (Alexander), March 07, 11:21

Inception v1 layers visualized on a map

A joint work by Google and OpenAI:


- Take 1M random images;

- Feed to a CNN, collect some spatial activation;

- Produce a corresponding idealized image that would result in such an activation;

- Plot in 2D (via UMAP), add grid, averaging, etc etc;


Activation Atlas

By using feature inversion to visualize millions of activations from an image classification network, we create an explorable activation atlas of features the network has learned and what concepts it typically represents.

snakers4 (Alexander), March 07, 09:58

Russian STT datasets

Anyone knows more proper datasets?

I found this (60 hours), but I could not find the link to the dataset:

Anyway, here is the list I found:

- 20 hours of Bible;

- - does not say how many hours

- Ofc audio book datasets - + and some scraping scripts

- And some disappointment here


Download 274_Paper.pdf 0.31 MB

snakers4 (Alexander), March 07, 06:47

PyTorch internals


PyTorch under the hood

Presentation about PyTorch internals presented at the PyData Montreal in Feb 2019.

snakers4 (Alexander), March 06, 10:31

5th 2019 DS / ML digest

Highlights of the week

- New Adam version;

- POS tagging and semantic parsing in Russian;

- ML industrialization again;




2019 DS/ML digest 05

2019 DS/ML digest 05 Статьи автора - Блог -

snakers4 (Alexander), February 28, 07:16

LSTM vs TCN vs Trellis network

- Did not try the Trellis network - decided it was too complex;

- All the TCN properties from the digest hold - did not test for very long sequences;

- Looks like a really simple and reasonable alternative for RNNs for modeling and ensembling;

- On a sensible benchmark - performes mostly the same as LSTM from a practical standpoint;


2018 DS/ML digest 31

2018 DS/ML digest 31 Статьи автора - Блог -

snakers4 (Alexander), February 27, 07:50

New variation of Adam?

- [Website](;

- [Code](;

- Eliminate the generalization gap between adaptive methods and SGD;

- TL;DR: A Faster And Better Optimizer with Highly Robust Performance;

- Dynamic bound on learning rates. Inspired by gradient clipping;

- Not very sensitive to the hyperparameters, especially compared with Sgd(M);

- Tested on MNIST, CIFAR, Penn Treebank - no serious datasets;


Adaptive Gradient Methods with Dynamic Bound of Learning Rate

Abstract Adaptive optimization methods such as AdaGrad, RMSProp and Adam have been proposed to achieve a rapid training process with an element-wise scaling term on learning rates. Though prevailing, they are observed to generalize poorly compared with Sgd or even fail to converge due to unstable and extreme learning rates. Recent work has put forward some algorithms such as AMSGrad to tackle this issue but they failed to achieve considerable improvement over existing methods.

snakers4 (Alexander), February 18, 09:24

4th 2019 DS / ML digest

Highlights of the week

- OpenAI controversy;

- BERT pre-training;

- Using transformer for conversational challenges;




2019 DS/ML digest 04

2019 DS/ML digest 04 Статьи автора - Блог -

snakers4 (Alexander), February 17, 10:22

A bit of lazy Sunday admin stuff

Monitoring you CPU temperature with email notifications

- Change CPU temp to any metric you like

- Rolling log

- Sending email only one time, if the metric becomes critical (you can add an email when metric becomes non-critical again)

Setting up a GPU box on Ubuntu 18.04 from scratch



Plain temperature monitoring in Ubuntu 18.04

Plain temperature monitoring in Ubuntu 18.04. GitHub Gist: instantly share code, notes, and snippets.

snakers4 (Alexander), February 13, 09:02

PyTorch NLP best practices

Very simple ideas, actually.

(1) Multi GPU parallelization and FP16 training

Do not bother reinventing the wheel.

Just use nvidia's apex, DistributedDataParallel, DataParallel.

Best examples [here](

(2) Put as much as possible INSIDE of the model

Implement the as much as possible of your logic inside of nn.module.


So that you can seamleassly you all the abstractions from (1) with ease.

Also models are more abstract and reusable in general.

(3) Why have a separate train/val loop?

PyTorch 0.4 introduced context handlers.

You can simplify your train / val / test loops, and merge them into one simple function.

context = torch.no_grad() if loop_type=='Val' else torch.enable_grad()

if loop_type=='Train':
elif loop_type=='Val':

with context:
for i, (some_tensor) in enumerate(tqdm(train_loader)):
# do your stuff here

(4) EmbeddingBag

Use EmbeddingBag layer for morphologically rich languages. Seriously!

(5) Writing trainers / training abstractions

This is waste of time imho if you follow (1), (2) and (3).

(6) Nice bonus

If you follow most of these, you can train on as many GPUs and machines as you wan for any language)

(7) Using tensorboard for logging

This goes without saying.




📖The Big-&-Extending-Repository-of-Transformers: Pretrained PyTorch models for Google's BERT, OpenAI GPT & GPT-2, Google/CMU Transformer-XL. - huggingface/pytorch-pretrained-BERT

PyTorch DataLoader, GIL thrashing and CNNs

Well all of this seems a bit like magic to me, but hear me out.

I abused my GPU box for weeks running CNNs on 2-4 GPUs.

Nothing broke.

And then my GPU box started shutting down for no apparent reason.

No, this was not:

- CPU overheating (I have a massive cooler, I checked - it works);

- PSU;

- Overclocking;

- It also adds to confusion that AMD has weird temperature readings;

To cut the story short - if you have a very fast Dataset class and you use PyTorch's DataLoader with workers > 0 it can lead to system instability instead of speeding up.

It is obvious in retrospect, but it is not when you face this issue.



snakers4 (Alexander), February 11, 06:22

Old news ... but Attention works

Funny enough, but in the past my models :

- Either did not need attention;

- Attention was implemented by @thinline72 ;

- The domain was so complicated (NMT) so that I had to resort to boilerplate with key-value attention;

It was the first time I / we tried manually building a model with plain self attention from scratch.

An you know - it really adds 5-10% to all of the tracked metrics.

Best plain attention layer in PyTorch - simple, well documented ... and it works in real life applications:



SelfAttention implementation in PyTorch

SelfAttention implementation in PyTorch. GitHub Gist: instantly share code, notes, and snippets.

snakers4 (Alexander), February 08, 10:11

Third 2019 DS / ML digest

Highlights of the week

- quaternions;

- ODEs;




2019 DS/ML digest 03

2019 DS/ML digest 03 Статьи автора - Блог -

snakers4 (Alexander), February 04, 08:37

A new paradigm in ML?



Understanding Neural ODE's

In this blogpost I explore how ODE’s can be used to solve data modelling problems. I take a deep dive into the data modelling problem at hand and present ODE’s (which model rates of change) as an a...

snakers4 (Alexander), January 31, 09:41

Second 2019 DS / ML digest

Highlight of the week - Facebook's LASER.




2019 DS/ML digest 02

2019 DS/ML digest 02 Статьи автора - Блог -

snakers4 (Alexander), January 24, 14:48

Neat PyTorch hack

(1) If possible Implement your complex loss / logic within your model.forward()

(2) Enjoy the multi-GPU / multi-node training wrappers from APEX, PyTorch DataParallel, DistributedDataParallel etc



snakers4 (Alexander), January 23, 11:26

NLP - Highlight of the week - LASER

- Hm, a new sentence embedding tool?

- Plain PyTorch 1.0 / numpy / FAISS based;

- [Release](, [library](;

- Looks like an off-shoot of their "unsupervised" NMT project;

LASER’s vector representations of sentences are generic with respect to both the
input language and the NLP task. The tool maps a sentence in any language to
point in a high-dimensional space with the goal that the same statement in any
language will end up in the same neighborhood. This representation could be seen
as a universal language in a semantic vector space. We have observed that the
distance in that space correlates very well to the semantic closeness of the
- Alleged pros:

It delivers extremely fast performance, processing up to 2,000 sentences per second on GPU.
The sentence encoder is implemented in PyTorch with minimal external dependencies.
Languages with limited resources can benefit from joint training over many languages.
The model supports the use of multiple languages in one sentence.
Performance improves as new languages are added, as the system learns to recognize characteristics of language families.
They essentially trained an NMT model with a shared encoder for many languages.

I tried training sth similar - but it quickly over-fitted into just memorizing the indexes of words.




LASER natural language processing toolkit - Facebook Code

Our natural language processing toolkit, LASER, performs zero-shot cross-lingual transfer with more than 90 languages and is now open source.

snakers4 (Alexander), January 23, 08:12

Pre-trained BERT in PyTorch


Model code here is just awesome.

Integrated DataParallel / DDP wrappers / FP16 wrappers also are awesome.

FP16 precision training from APEX just works (no idea about convergence though yet).


As for model weights - I cannot really tell, there is no dedicated Russian model.

The only problem I am facing now - using large embeddings bags batch size is literally 1-4 even for smaller models.

And training models with sentence piece is kind of feasible for rich languages, but you will always worry about generalization.


Did not try the generative pre-training (and sentence prediction pre-training), I hope that properly initializing embeddings will also work for a closed domain with a smaller model (they pre-train 4 days on 4+ TPUs, lol).


Why even tackle such models?

Chat / dialogue / machine comprehension models are complex / require one-off feature engineering.

Being able to tune something like BERT on publicly available benchmarks and then on your domain can provide a good way to embed complex situations (like questions in dialogues).




📖The Big-&-Extending-Repository-of-Transformers: Pretrained PyTorch models for Google's BERT, OpenAI GPT & GPT-2, Google/CMU Transformer-XL. - huggingface/pytorch-pretrained-BERT

snakers4 (Alexander), January 15, 08:33

First 2019 DS / ML digest

No particular highlights - just maybe ML industrialization vector is here to stay?




2019 DS/ML digest 01

2019 DS/ML digest 01 Статьи автора - Блог -

snakers4 (Alexander), January 10, 09:49

Someone implemented instance weighted CE loss for PyTorch


Pytorch instance-wise weighted cross-entropy loss

Pytorch instance-wise weighted cross-entropy loss. GitHub Gist: instantly share code, notes, and snippets.

snakers4 (Alexander), January 09, 09:15

Using nargs

Wrote about this a year ago.

Forgot about it, a friend reminded me.

You can pass lists to the python command line arguments.

parser.add_argument('--classifier_conf', default=[512, 2048, 5005], nargs='+', type=int)

and then just add params to your call as follows

--classifier_conf 512 2048 5005


snakers4 (Alexander), December 30, 2018

Spark in me 2018 annual retrospective


- My personal progress and some views;

- ML is still amazing, but there are no illusions anymore;

- Telegram is still amazing, but commercialization looms;

- FAIR is an inspiration;

- Imcinnes with UMAP and HDBSCAN as well;


Еще написал немного по-русски, немного со спецификой, если вам так удобнее



Spark in me - annual retrospective 2018

Spark in me - annual retrospective 2018 Статьи автора - Блог -

snakers4 (Alexander), December 29, 2018

Yet another repo with all possible pre-trained imagenet models

Now on 4 frameworks...

Looks too good to be true



Sandbox for training large-scale image classification networks for embedded systems - osmr/imgclsmob

older first