Spark in me - Internet, data science, math, deep learning, philo

snakers4 @ telegram, 1818 members, 1744 posts since 2016

All this - lost like tears in rain.

Data science, ML, a bit of philosophy and math. No bs.

Our website
Our chat
DS courses review

Posts by tag «deep_learning»:

snakers4 (Alexander), May 27, 08:43

2019 DS / ML digest 11

Highlights of the week(s)

- New attention block for CV;

- Reducing the amount of data for CV 10x?;

- Brain-to-CNN interfaces start popping up in the mainstream;



2019 DS/ML digest 11


snakers4 (Alexander), May 24, 09:20

Audio noise reduction libraries that really work in the wild

Spectral gating

It works. But you need a sample of your noise.

It will work well out of the box for larger files / files with gaps, where you can pay attention to each file and select a part of it to act as the noise example.
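The gating idea can be sketched with numpy/scipy (a minimal illustration of the technique, not the noisereduce implementation; the mean + n_std threshold rule is an assumption): estimate a per-frequency noise floor from a noise-only segment, then zero the STFT bins that fall below it.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_gate(audio, noise_clip, sr=16000, n_std=1.5):
    """Attenuate STFT bins that fall below a noise-derived threshold."""
    _, _, noise_spec = stft(noise_clip, fs=sr)
    # per-frequency threshold: mean + n_std * std of the noise magnitude
    noise_mag = np.abs(noise_spec)
    thresh = noise_mag.mean(axis=1) + n_std * noise_mag.std(axis=1)

    _, _, spec = stft(audio, fs=sr)
    mask = np.abs(spec) >= thresh[:, None]  # keep only bins above the gate
    _, denoised = istft(spec * mask, fs=sr)
    return denoised

# usage: pass a noise-only region of the same file as noise_clip
rng = np.random.default_rng(0)
noise = rng.normal(0, 0.1, 16000)
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
out = spectral_gate(tone + noise, noise)
```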

RNNoise: Learning Noise Suppression

Works with any arbitrary noise. Just feed your file.

It works more like an adaptive equalizer.

It filters noise when there is no speech.

But it mostly does not change audio when speech is present.

As the authors explain, it improves the SNR overall and makes the sound less "tiring" to listen to.

Description / blog posts



Step-by-step instructions in python





Noise reduction / speech enhancement for python using spectral gating - timsainb/noisereduce

snakers4 (Alexander), May 20, 06:21

New in our Open STT dataset

- An mp3 version of the dataset;

- A torrent for mp3 dataset;

- A torrent for the original wav dataset;

- Benchmarks on the public dataset / files with "poor" annotation marked;





Russian open STT dataset. Contribute to snakers4/open_stt development by creating an account on GitHub.

snakers4 (Alexander), May 14, 03:40

2019 DS / ML digest 10

Highlights of the week(s)

- New MobileNet;

- New PyTorch release;

- Practical GANs?;



2019 DS/ML digest 10


snakers4 (Alexander), May 09, 11:28 / TowardsDataScience post for our dataset

In addition to a GitHub release and a Medium post, we also made a post:


Our post was also accepted into the editors' pick section of TDS:


Share / give us a star / clap if you have not already!

Original release




A huge open dataset of Russian speech

Speech recognition specialists have long lacked a large open corpus of spoken Russian, so only large companies could afford to work on...

snakers4 (Alexander), May 09, 10:51

PyTorch DP / DDP / model parallel

Finally they made proper tutorials:




Model parallel = parts of the same model live on different devices

Data Parallel (DP) = a wrapper to use multiple GPUs within a single parent process

Distributed Data Parallel (DDP) = multiple processes spawned across a cluster / on the same machine
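A minimal sketch of the three modes (the DDP and model-parallel parts are shown as comments, since they need a process group / multiple devices):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

# Data Parallel: one process, a replica of the model on each visible GPU;
# falls back to a plain forward pass when no GPUs are available
dp_model = nn.DataParallel(model)
out = dp_model(torch.randn(8, 16))  # the batch is split across replicas

# Distributed Data Parallel: one process per GPU / machine; requires
# torch.distributed.init_process_group(...) before wrapping:
# ddp_model = nn.parallel.DistributedDataParallel(model)

# Model parallel: place parts of one model on different devices and move
# activations between them inside forward(), e.g.:
# x = self.part2(self.part1(x).to('cuda:1'))
```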


The State of ML, end of 2018, in Russian

Quite a down-to-earth and clever lecture

Some nice examples for TTS and some interesting forecasts (some of them happened already).


Sergei Markov: "Artificial Intelligence and Machine Learning: 2018 in Review."
The lecture took place at the popular-science lecture hall of the "Arkhe" centre on January 16, 2019. Lecturer: Sergei Markov, author of one of the strongest Russian...

snakers4 (Alexander), May 03, 08:58


PyTorch 1.1

- Tensorboard (beta);

- DistributedDataParallel new functionality and tutorials;

- Multi-headed attention;

- EmbeddingBag enhancements;

- Other cool, but more niche features:

- nn.SyncBatchNorm;

- optim.lr_scheduler.CyclicLR;



Tensors and Dynamic neural networks in Python with strong GPU acceleration - pytorch/pytorch

snakers4 (Alexander), May 02, 06:02

Russian Open Speech To Text (STT/ASR) Dataset

4000 hours of STT data in Russian

Made by us. Yes, really. I am not joking.

It was a lot of work.

The dataset:

Accompanying post:


- As of the third release, we have ~4000 hours;

- Contributors and help wanted;

- Let's bring the ImageNet moment for STT closer, together!;

Please repost this as much as you can.






Russian open STT dataset. Contribute to snakers4/open_stt development by creating an account on GitHub.

snakers4 (Alexander), May 02, 05:41

Poor man's computing cluster

So, when I last checked, Amazon's p3.4xlarge instances cost around US$12 per hour (unless you reserve them for a year). A tower supercomputer from Nvidia costs probably US$40-50k or more (it was announced at around US$69k).

It is not difficult to crunch the numbers and see that 1 month of renting such a machine would cost at least US$8-10k. There is also the additional cost / problem of actually storing your large datasets. When I last used Amazon, their cheap storage was sloooooow, and the fast storage was prohibitively expensive.

So, why am I saying this?

Let's assume (based on my miner friends' experience) that consumer Nvidia GPUs can work 2-3 years non-stop given proper cooling and care (test before buying!). Let's also assume that 4x Tesla V100 is roughly equivalent to 7-8x 1080Ti.

Yeah, I know that you will point out at least one reason why this does not hold, but for practical purposes this is fine (yes, I know that Teslas have some cool features like Nvlink).

Now let me drop the bombshell - modern professional motherboards often boast 2-3 Ethernet ports. Sometimes you can even get 2x10Gbit/s ports (!!!).

It means that you can actually connect at least 2 machines (or maybe daisy-chain more?) into a computing cluster.

Now let's crunch the numbers

According to quotes I have collected over the years, you can build a cluster roughly equivalent to Amazon's p3.4xlarge for US$10k (but with storage!) using used GPUs (miners are selling them like crazy now). If you also buy second-hand drives, motherboards and CPUs, you can lower the cost to US$5k or less.

So a cluster that would serve you at least one year (if you test everything properly and take care of it), costing US$10k, is roughly equivalent to:

- 20-25% of DGX desktop;

- 1 month of renting on Amazon;

Assuming that all the hardware will just break in a year:

- It is 4-5x cheaper than buying from Nvidia;

- It is 10x cheaper than renting;

If you buy everything used, then it is 10x and 20x cheaper!

I would buy that for a dollar!

Ofc you have to invest your free time.
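In round numbers, the comparison above works out like this (assumptions: ~730 rental hours per month, US$12/hour, a US$40k lower bound for the Nvidia tower):

```python
aws_hourly = 12          # p3.4xlarge-class instance, US$/hour (approximate)
month_hours = 730
monthly_rent = aws_hourly * month_hours   # cost of one month of renting

build_cost = 10_000      # used-GPU cluster, storage included
used_build = 5_000       # everything bought second hand
dgx_cost = 40_000        # DGX-class tower, lower bound

year_rent = 12 * monthly_rent
print(monthly_rent)                        # 8760  -> the "US$8-10k / month"
print(round(year_rent / build_cost, 1))    # 10.5  -> ~10x cheaper than renting
print(round(dgx_cost / build_cost, 1))     # 4.0   -> 4-5x cheaper than buying
print(round(year_rent / used_build, 1))    # 21.0  -> ~20x with used parts
```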

See my calculations here:




(Spreadsheet preview: per-server parts list with quotes and dates - e.g. a Thermaltake Core X9 case, a Gigabyte X399 AORUS XTREME motherboard with 2x1Gbit/s + 10Gbit/s NICs - prices converted at 65 RUR/USD.)

snakers4 (Alexander), April 22, 11:44

Cool docker function

View aggregate load stats by container


docker stats

Display a live stream of container resource usage statistics. Usage: docker stats [OPTIONS] [CONTAINER...]

2019 DS / ML digest 9

Highlights of the week

- Stack Overlow survey;

- Unsupervised STT (ofc not!);

- A mix between detection and semseg?;



2019 DS/ML digest 09


snakers4 (Alexander), April 14, 06:59

PyTorch DataParallel scalability

TLDR - it works fine for 2-3 GPUs.

For more GPUs - use DDP.



Unsupervised Language Modeling at scale for robust sentiment classification - NVIDIA/sentiment-discovery

snakers4 (Alexander), April 09, 06:00

2019 DS / ML digest number 8

Highlights of the week

- Transformer from Facebook with sub-word information;

- How to generate endless sentiment annotation;

- 1M breast cancer images;



2019 DS/ML digest 08


snakers4 (Alexander), March 31, 12:44

Miniaturize / optimize your ... NLP models?

For CV applications there are literally dozens of ways to make your models smaller.

And no, I do not mean "moonshots" or special limited libraries (matrix decompositions, custom pruning, etc.).

I mean cheap and dirty hacks that work in 95% of cases regardless of your stack / device / framework:

- Smaller images (x3-x4 easy);

- FP16 inference (30-40% maybe);

- Knowledge distillation into smaller networks (x3-x10);

- Naïve cascade optimizations (feed only Nth frame using some heuristic);
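Of these, distillation is the least self-explanatory; a minimal sketch of the standard soft-target loss (the temperature T and weight alpha are illustrative defaults, not a recommendation):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.7):
    """Blend soft teacher targets with the usual hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction='batchmean',
    ) * (T * T)  # rescale so gradient magnitudes stay comparable
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(8, 10)   # small student network outputs
t = torch.randn(8, 10)   # big teacher network outputs
y = torch.randint(0, 10, (8,))
loss = distillation_loss(s, t, y)
```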

But what can you do with NLP networks?

Turns out not much.

But here are my ideas:

- Use a simpler model - embedding bag + plain self-attention + LSTM can solve 90% of tasks;

- Decrease embedding size from 300 to 50 (or maybe even lower). Tried and tested - works like a charm. For harder tasks you lose just 1-3pp of your target metric; for smaller tasks it is just the same;

- FP16 inference is supported in PyTorch for nn.Embedding, but not for nn.EmbeddingBag - you get the idea though;

(the error you get: _embedding_bag is not implemented for type torch.HalfTensor)

- You can try distilling your vocabulary / embedding-bag model into a char level model. If it works, you can trade model size vs. inference time;

- If you have very long sentences or large batches - try distilling / swapping your recurrent network with a CNN / TCN. This way you can also trade model size vs. inference time but probably in a different direction;
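The first idea - an embedding-bag baseline - can be sketched like this (the sizes are illustrative; the attention / LSTM parts are omitted for brevity):

```python
import torch
import torch.nn as nn

class TinyTextClassifier(nn.Module):
    """EmbeddingBag over (sub)word ids -> a very small classifier."""
    def __init__(self, vocab=30_000, dim=50, n_classes=5):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab, dim)  # mean-pools each bag
        self.head = nn.Linear(dim, n_classes)

    def forward(self, token_ids, offsets):
        return self.head(self.emb(token_ids, offsets))

model = TinyTextClassifier()
tokens = torch.tensor([1, 2, 4, 5, 4, 3, 2, 9])  # two concatenated sentences
offsets = torch.tensor([0, 4])                   # start index of each sentence
logits = model(tokens, offsets)
```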



snakers4 (Alexander), March 26, 15:30


Updated my DL/ML dockerfile with

- cuda 10

- PyTorch 1.0

TF now also works with cuda 10



Contribute to snakers4/gpu-box-setup development by creating an account on GitHub.

snakers4 (Alexander), March 25, 05:31

Good old OLS regression

I needed some quick boilerplate to create an OLS regression with confidence intervals for a very plain task.

Found some nice statsmodels examples here:


2019 DS / ML digest number 7

Highlights of the week

- NN normalization techniques (not batch norm);

- Jetson nano for US$99 released;

- A bitter lesson in AI;



2019 DS/ML digest 07


snakers4 (Alexander), March 21, 11:15

Normalization techniques other than batch norm:


Weight normalization (used in TCNs)

- Decouples the length of weight vectors from their direction;

- Does not introduce any dependencies between the examples in a minibatch;

- Can be applied successfully to recurrent models such as LSTMs;

- Tested only on small datasets (CIFAR + VAEs + DQN);

Instance norm (used in [style transfer](

- Proposed for style transfer;

- Essentially batch norm computed for a single image;

- The mean and standard deviation are calculated per-dimension separately for each object in a mini-batch;

Layer norm (used in Transformers, [paper](

- Designed especially for sequential networks;

- Computes the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case;

- The mean and standard deviation are calculated separately over the last several dimensions;

- Unlike batch normalization and instance normalization, which apply a scalar scale and bias to each entire channel/plane (with the affine option), layer normalization applies per-element scale and bias;
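In PyTorch all three techniques are one-liners; a usage sketch:

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

x = torch.randn(4, 8, 16, 16)  # (batch, channels, H, W)

inst = nn.InstanceNorm2d(8)      # per-image, per-channel statistics
ln = nn.LayerNorm([8, 16, 16])   # per-sample stats over the last dims,
                                 # with per-element affine parameters
conv = weight_norm(nn.Conv2d(8, 8, 3, padding=1))  # reparametrizes weights
                                                   # into direction * length
y = ln(inst(x))
z = conv(x)
```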



snakers4 (Alexander), March 18, 06:18

6th 2019 DS / ML digest

Highlights of the week

- Cool python features;

- Google's on-device STT;

- Why Facebook invested so much in PyTorch 1.0;




2019 DS/ML digest 06


snakers4 (Alexander), March 17, 15:40

New large dataset for your GAN or pix2pix pet project

500k fashion images + meta-data + landmarks



DeepFashion2 Dataset - switchablenorms/DeepFashion2

snakers4 (Alexander), March 07, 15:42

Our experiments with Transformers, BERT and generative language pre-training


For morphologically rich languages, pre-trained Transformers are not a silver bullet; from a layman's perspective they are not feasible unless someone invests huge computational resources into sub-word tokenization methods that work well, plus actually training these large networks.

On the other hand we have definitively shown that:

- Starting a transformer with Embedding bag initialized via FastText works and is relatively feasible;

- On complicated tasks, such a transformer significantly outperforms training from scratch (as well as naive models) and shows decent results compared to state-of-the-art specialized models;

- Pre-training worked, but it overfitted more than FastText initialization, and given the complexity required for such pre-training, it is not useful;

All in all this was a relatively large gamble that did not pay off - the Transformer did not excel even at the more down-to-earth task we hoped it would.


Complexity / generalization / computational cost in modern applied NLP for morphologically rich languages. Towards a new state of the art?

An approach to ranking search results with no annotation

Just a small article with a novel idea:

- Instead of training a network with CE - just train it with BCE;

- Source additional structure from the inner structure of your domain (tags, matrix decomposition methods, heuristics, etc);

Works best if your ontology is relatively simple.
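The switch itself is tiny; a sketch where multi-label targets stand in for tag / decomposition-derived structure (my illustration, not the article's exact setup):

```python
import torch
import torch.nn as nn

n_docs, n_tags = 16, 40
scores = torch.randn(n_docs, n_tags)   # model outputs (logits)

# instead of one "correct" class per item (cross-entropy), mark every
# tag / cluster the item legitimately belongs to (binary cross-entropy)
targets = (torch.rand(n_docs, n_tags) > 0.9).float()

loss = nn.BCEWithLogitsLoss()(scores, targets)
```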


Learning to rank search results without annotation

Solving search ranking problem Статьи автора - Блог -

snakers4 (Alexander), March 07, 11:21

Inception v1 layers visualized on a map

A joint work by Google and OpenAI:


- Take 1M random images;

- Feed to a CNN, collect some spatial activation;

- Produce a corresponding idealized image that would result in such an activation;

- Plot in 2D (via UMAP), add grid, averaging, etc etc;


Activation Atlas

By using feature inversion to visualize millions of activations from an image classification network, we create an explorable activation atlas of features the network has learned and what concepts it typically represents.

snakers4 (Alexander), March 07, 09:58

Russian STT datasets

Does anyone know of more proper datasets?

I found this (60 hours), but I could not find a link to the dataset itself:

Anyway, here is the list I found:

- 20 hours of Bible;

- - does not say how many hours

- Ofc audio book datasets - + and some scraping scripts

- And some disappointment here


Download 274_Paper.pdf 0.31 MB

snakers4 (Alexander), March 07, 06:47

PyTorch internals


PyTorch under the hood

Presentation about PyTorch internals presented at the PyData Montreal in Feb 2019.

snakers4 (Alexander), March 06, 10:31

5th 2019 DS / ML digest

Highlights of the week

- New Adam version;

- POS tagging and semantic parsing in Russian;

- ML industrialization again;




2019 DS/ML digest 05


snakers4 (Alexander), February 28, 07:16

LSTM vs TCN vs Trellis network

- Did not try the Trellis network - decided it was too complex;

- All the TCN properties from the digest hold - did not test on very long sequences;

- Looks like a really simple and reasonable alternative to RNNs for modeling and ensembling;

- On a sensible benchmark it performs mostly the same as an LSTM from a practical standpoint;


2018 DS/ML digest 31


snakers4 (Alexander), February 27, 07:50

New variation of Adam?

- [Website](;

- [Code](;

- Eliminates the generalization gap between adaptive methods and SGD;

- TL;DR: A Faster And Better Optimizer with Highly Robust Performance;

- Dynamic bound on learning rates. Inspired by gradient clipping;

- Not very sensitive to hyperparameters, especially compared with SGD(M);

- Tested on MNIST, CIFAR, Penn Treebank - no serious datasets;
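The dynamic bound can be illustrated in a few lines (a schematic of the published bound functions; the constants are my assumptions):

```python
def bounded_lr(step, base_lr=1e-3, final_lr=0.1, gamma=1e-3, adaptive_lr=5.0):
    """Clip an Adam-style per-parameter lr into bounds converging to final_lr."""
    lower = final_lr * (1 - 1 / (gamma * step + 1))
    upper = final_lr * (1 + 1 / (gamma * step))
    # adaptive_lr * base_lr stands in for Adam's lr / (sqrt(v) + eps) term
    return min(max(adaptive_lr * base_lr, lower), upper)

# early in training the bounds are loose (behaves like Adam),
# late in training they pinch toward final_lr (behaves like SGD)
early, late = bounded_lr(1), bounded_lr(1_000_000)
```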


Adaptive Gradient Methods with Dynamic Bound of Learning Rate

Abstract Adaptive optimization methods such as AdaGrad, RMSProp and Adam have been proposed to achieve a rapid training process with an element-wise scaling term on learning rates. Though prevailing, they are observed to generalize poorly compared with Sgd or even fail to converge due to unstable and extreme learning rates. Recent work has put forward some algorithms such as AMSGrad to tackle this issue but they failed to achieve considerable improvement over existing methods.

snakers4 (Alexander), February 18, 09:24

4th 2019 DS / ML digest

Highlights of the week

- OpenAI controversy;

- BERT pre-training;

- Using transformer for conversational challenges;




2019 DS/ML digest 04


snakers4 (Alexander), February 17, 10:22

A bit of lazy Sunday admin stuff

Monitoring your CPU temperature with email notifications

- Change CPU temp to any metric you like

- Rolling log

- Sends the email only once, when the metric becomes critical (you can add another email for when the metric becomes non-critical again)
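A minimal sketch of the same idea (the sysfs path, threshold and addresses are my assumptions; the linked gist has the full version with the rolling log):

```python
import smtplib
from email.message import EmailMessage

CRITICAL_C = 85.0  # assumed threshold, pick one for your CPU

def read_cpu_temp(path='/sys/class/thermal/thermal_zone0/temp'):
    """Linux exposes temperatures in millidegrees Celsius."""
    with open(path) as f:
        return int(f.read().strip()) / 1000.0

def alert(temp_c, smtp_host='localhost', to='admin@example.com'):
    """Send a one-off warning email via a local SMTP relay."""
    msg = EmailMessage()
    msg['Subject'] = f'CPU temp critical: {temp_c:.1f} C'
    msg['From'] = 'monitor@example.com'
    msg['To'] = to
    with smtplib.SMTP(smtp_host) as s:
        s.send_message(msg)

# cron this once a minute; keep a flag file so the email is sent
# only once per critical episode
```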

Setting up a GPU box on Ubuntu 18.04 from scratch



Plain temperature monitoring in Ubuntu 18.04

Plain temperature monitoring in Ubuntu 18.04. GitHub Gist: instantly share code, notes, and snippets.

snakers4 (Alexander), February 13, 09:02

PyTorch NLP best practices

Very simple ideas, actually.

(1) Multi GPU parallelization and FP16 training

Do not bother reinventing the wheel.

Just use nvidia's apex, DistributedDataParallel, DataParallel.

Best examples [here](

(2) Put as much as possible INSIDE of the model

Implement as much of your logic as possible inside of nn.Module.


This way you can seamlessly use all the abstractions from (1).

Also models are more abstract and reusable in general.

(3) Why have a separate train/val loop?

PyTorch 0.4 introduced context handlers.

You can simplify your train / val / test loops, and merge them into one simple function.

context = torch.no_grad() if loop_type == 'Val' else torch.enable_grad()

if loop_type == 'Train':
    model.train()
elif loop_type == 'Val':
    model.eval()

with context:
    for i, (some_tensor) in enumerate(tqdm(loader)):
        # do your stuff here

(4) EmbeddingBag

Use EmbeddingBag layer for morphologically rich languages. Seriously!

(5) Writing trainers / training abstractions

This is a waste of time imho if you follow (1), (2) and (3).

(6) Nice bonus

If you follow most of these, you can train on as many GPUs and machines as you want, for any language.

(7) Using tensorboard for logging

This goes without saying.




📖The Big-&-Extending-Repository-of-Transformers: Pretrained PyTorch models for Google's BERT, OpenAI GPT & GPT-2, Google/CMU Transformer-XL. - huggingface/pytorch-pretrained-BERT

PyTorch DataLoader, GIL thrashing and CNNs

Well, all of this seems a bit like magic to me, but hear me out.

I abused my GPU box for weeks running CNNs on 2-4 GPUs.

Nothing broke.

And then my GPU box started shutting down for no apparent reason.

No, this was not:

- CPU overheating (I have a massive cooler, I checked - it works);

- PSU;

- Overclocking;

- It also adds to confusion that AMD has weird temperature readings;

To cut a long story short - if you have a very fast Dataset class and you use PyTorch's DataLoader with num_workers > 0, it can lead to system instability instead of a speedup.

It is obvious in retrospect, but it is not when you face this issue.



snakers4 (Alexander), February 11, 06:22

Old news ... but Attention works

Funny enough, in the past my models:

- Either did not need attention;

- Attention was implemented by @thinline72 ;

- The domain was so complicated (NMT) that I had to resort to boilerplate with key-value attention;

It was the first time I / we tried manually building a model with plain self attention from scratch.

And you know - it really adds 5-10% to all of the tracked metrics.

Best plain attention layer in PyTorch - simple, well documented ... and it works in real life applications:
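For reference, a generic plain self-attention pooling layer takes only a few lines (a sketch, not necessarily identical to the linked gist):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PlainSelfAttention(nn.Module):
    """Collapse (batch, seq, dim) -> (batch, dim) with learned weights."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, 1, bias=False)

    def forward(self, x, mask=None):
        # one scalar score per timestep
        s = self.score(torch.tanh(self.proj(x))).squeeze(-1)
        if mask is not None:
            s = s.masked_fill(~mask, float('-inf'))  # ignore padding
        w = F.softmax(s, dim=1)
        return (w.unsqueeze(-1) * x).sum(dim=1), w

att = PlainSelfAttention(32)
pooled, weights = att(torch.randn(4, 10, 32))
```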



SelfAttention implementation in PyTorch

SelfAttention implementation in PyTorch. GitHub Gist: instantly share code, notes, and snippets.

snakers4 (Alexander), February 08, 10:11

Third 2019 DS / ML digest

Highlights of the week

- quaternions;

- ODEs;




2019 DS/ML digest 03

