Spark in me - Internet, data science, math, deep learning, philo

snakers4 @ telegram, 1810 members, 1762 posts since 2016

All this - lost like tears in rain.

Data science, ML, a bit of philosophy and math. No bs.


Posts by tag «deep_learning»:

snakers4 (Alexander), August 23, 12:24

How to solve an arbitrary CV task ...

- W/o annotation

- W/o GPUs in production

- And make your model work in real life and help people



How to get your own image classifier / region labelling model without annotation

Multi-head model with a light and fast encoder without annotation / deploy on CPU

snakers4 (Alexander), August 23, 07:14

Our STT Dark Forest post on TDS

Please 👏x50 if you have an account


Navigating the Speech to Text Dark Forest

Make your ASR network 4x faster, 5x smaller and 10x cooler

snakers4 (Alexander), August 21, 07:12

2019 DS / ML digest 14


Highlights of the week(s):

- FAIR embraces embedding bags for misspellings;

- New version of Adam - RAdam. But on the only real test the author ran (Imagenet), SGD is better;

- Yet another LSTM replacement - SRU. Similar to QRNN - it requires additional dependencies;



2019 DS/ML digest 14


snakers4 (Alexander), August 15, 14:42

My foray into the STT Dark Forest

My tongue-in-cheek article on ML in general, and on how to make your STT model train 3-4x faster with 4-5x fewer weights at the same quality




Navigating the Speech to Text Dark Forest

A tongue-in-cheek description of our STT path

snakers4 (Alexander), August 11, 04:42

Extreme NLP network miniaturization

Tried some plain RNNs on a custom in the wild NER task.

The dataset is huge - literally infinite, but manually generated to mimic in-the-wild data.

I use EmbeddingBag + 1m n-grams (an optimal cut-off). Yeah, on NER / classification it is a handy trick that makes your pipeline totally misprint / error / OOV agnostic. FAIR themselves have just figured this out too. Very cool! Just add PyTorch and you are golden.

What is interesting:

- Model works with embedding sizes 300, 100, 50 and even 5! 5 is dangerously close to OHE, but doing OHE on 1m n-grams kind-of does not make sense;

- Model works with various hidden sizes

- Naturally all of the models run on CPU very fast, but the smallest model also is very light in terms of its weights;

- The only difference is convergence time. It scales roughly as a log of model size, i.e. the model with embedding size 5 takes 5-7x more time to converge than the model with 50. I wonder what happens if I use an embedding size of 1?;

As an added bonus - you can just store such a miniature model in git w/o lfs.

What is with training transformers on US$250k worth of compute credits you say?)
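To see why an n-gram bag is misprint-agnostic, here is a minimal dependency-free sketch. It is not the code from the post: the hashing of n-grams into buckets, the 2-4 character window and the helper names are all assumptions made for illustration (the post uses a frequency cut-off of 1m n-grams). The ids it produces are the kind of input that nn.EmbeddingBag would pool into a single vector per word.

```python
import zlib

NUM_BUCKETS = 1_000_000  # "1m n-grams" from the post; hashing into buckets is an assumption


def char_ngrams(word, n_min=2, n_max=4):
    """Return all character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def ngram_ids(word, num_buckets=NUM_BUCKETS):
    """Map each n-gram to a row index of a (hypothetical) embedding table."""
    return [zlib.crc32(g.encode("utf-8")) % num_buckets for g in char_ngrams(word)]


def overlap(word_a, word_b):
    """Jaccard overlap of the two id sets: shared rows / all rows touched."""
    ids_a, ids_b = set(ngram_ids(word_a)), set(ngram_ids(word_b))
    return len(ids_a & ids_b) / len(ids_a | ids_b)


# A misspelled word still hits many of the same rows as the correct one,
# while an unrelated word hits (almost) none of them.
print("language vs langauge:", round(overlap("language", "langauge"), 2))
print("language vs keyboard:", round(overlap("language", "keyboard"), 2))
```

Because the pooled vector is built from shared rows, a typo only changes part of the input, and there is no such thing as an out-of-vocabulary word.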




A new model for word embeddings that are resilient to misspellings

Misspelling Oblivious Embeddings (MOE) is a new model for word embeddings that are resilient to misspellings, improving the ability to apply word embeddings to real-world situations, where misspellings are common.

snakers4 (Alexander), August 09, 03:27

PyTorch 1.2 release


Key features:

- Tensorboard logging is now out of beta;

- They continue improving JIT and ONNX;

- nn.Transformer is a layer now;

- Looks like SyncBn is also more or less stable;

- nn.Embedding: support float16 embeddings on CUDA;

- AdamW;

- Numpy compatibility;



Tensors and Dynamic neural networks in Python with strong GPU acceleration - pytorch/pytorch

snakers4 (Alexander), August 02, 11:19

Managing your DS / ML environment neatly and in style

If you have a sophisticated environment that you need to do DS / ML / DL, then using a set of Docker images may be a good idea.

You can also tap into a vast community of amazing and well-maintained Dockerhub repositories (e.g. nvidia, pytorch).

But what if you have to do this for several people? And use it with a proper IDE via ssh?

Well-known features of Docker include copy-on-write and user "forwarding". If you approach this naively, each user will store their own images, which takes quite some space.

You also have to make your ssh daemon work inside the container as a second service.

So I solved these "challenges" and created 2 public layers so far:

- Basic DS / ML layer - FROM aveysov/ml_images:layer-0 - from dockerfile;

- DS / ML libraries - FROM aveysov/ml_images:layer-0 - from dockerfile;

Your final dockerfile may look something like this just pulling from any of those layers.

Note that when building this, you will need to pass your UID as a variable, e.g.:

docker build --build-arg NB_UID=1000 -t av_final_layer -f Layer_final.dockerfile .

When launched, this starts a notebook with extensions. You can just exec into the machine itself to run scripts or use an ssh daemon inside (do not forget to add your ssh key and run service ssh start).




Contribute to snakers4/gpu-box-setup development by creating an account on GitHub.

Using public Dockerhub account for your private small scale deploy

Also a lifehack - you can just use Dockerhub for your private stuff, just separate the public part and the private part.

Push the public part (i.e. libraries and frameworks) to Dockerhub.

Your private Dockerfile will then be something like:

FROM your_user/your_repo:latest

COPY your_app_folder your_app_folder


CMD ["python3", ""]

snakers4 (Alexander), July 29, 13:17

2019 DS / ML digest 13


Highlights of the week(s):

- x10 faster STT network?

- Train on 1/2 of test resolution - new down-to-earth SOTA approach to image classification? Old news!;

- New workhorse light-weight network - MixNet?



snakers4 (Alexander), July 12, 05:25

Installing apex ... in style )

Sometimes you just need to try fp16 training (GANs, large networks, rare cases).

There is no better way to do this than use Nvidia's APEX library.

Luckily - they have very nice examples:


Well ... it installs on a clean machine, but I want my environment to work with this always)

So, I ploughed through all the conda / environment setup mumbo-jumbo and created a version of our deep-learning / ds dockerfile, but now installing from the pytorch image (pytorch GPU / CUDA / CUDNN + APEX).


It was kind of painful, because PyTorch images already contain conda / pip, which was not apparent at first and caused all sorts of problems with my miniconda installation.

So use it and please report if it is still buggy.




A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch - NVIDIA/apex

Logging your hardware, with logs, charts and alerts - in style

TLDR - we have been looking for THE software to do this easily, with charts / alerts / easy install.

We found prometheus. Configuring alerts was a bit of a problem, but enjoy:





Contribute to snakers4/gpu-box-setup development by creating an account on GitHub.

snakers4 (Alexander), July 03, 13:57

2019 DS / ML digest 12

Highlights of the week(s)

- Cool STT papers;

- End of AI hype?

- How to download tons of images from Google;



2019 DS/ML digest 12


snakers4 (Alexander), July 02, 09:15

A cool old paper - FCN text detector

They were using multi-layer masks for better semantic segmentation supervision before it was mainstream.

Very cool!

Too bad such models are a commodity now, you can just use pre-trained)


EAST: An Efficient and Accurate Scene Text Detector

Previous approaches for scene text detection have already achieved promising performances across various benchmarks. However, they usually fall short when dealing with challenging scenarios, even...

snakers4 (Alexander), July 02, 07:34

New version of our open STT dataset - 0.5, now in beta

Please share and repost!

What is new?

- A new domain - radio (1000+ new hours);

- A larger YouTube dataset with 1000+ additional hours;

- A small (300 hours) YouTube dataset downloaded in maximum quality;

- Ground truth validation sets for YouTube / books / public calls manually annotated;

- Now we will start to focus on actually cleaning and distilling the dataset. We have published a second list of "bad" data;

I'm back from vacation)





Russian open STT dataset. Contribute to snakers4/open_stt development by creating an account on GitHub.

snakers4 (Alexander), May 27, 08:43

2019 DS / ML digest 11

Highlights of the week(s)

- New attention block for CV;

- Reducing the amount of data for CV 10x?;

- Brain-to-CNN interfaces start popping up in the mainstream;



2019 DS/ML digest 11


snakers4 (Alexander), May 24, 09:20

Really working in the wild audio noise reduction libraries

Spectral gating

It works. But you need a sample of your noise.

Will work well out of the box for larger files / files with gaps, where you can pay attention to each file and select a part of the file that would act as a noise example.

RNNoise: Learning Noise Suppression

Works with any arbitrary noise. Just feed your file.

It works more like an adaptive equalizer.

It filters noise when there is no speech.

But it mostly does not change audio when speech is present.

As the authors explain, it improves SNR overall and makes the sound less "tiring" to listen to.

Description / blog posts



Step-by-step instructions in python





Noise reduction / speech enhancement for python using spectral gating - timsainb/noisereduce

snakers4 (Alexander), May 20, 06:21

New in our Open STT dataset

- An mp3 version of the dataset;

- A torrent for mp3 dataset;

- A torrent for the original wav dataset;

- Benchmarks on the public dataset / files with "poor" annotation marked;





Russian open STT dataset. Contribute to snakers4/open_stt development by creating an account on GitHub.

snakers4 (Alexander), May 14, 03:40

2019 DS / ML digest 10

Highlights of the week(s)

- New MobileNet;

- New PyTorch release;

- Practical GANs?;



2019 DS/ML digest 10


snakers4 (Alexander), May 09, 11:28

TowardsDataScience post for our dataset

In addition to a github release and a medium post, we also made a post:


Our post was also accepted as an editor's pick on TDS:


Share / give us a star / clap if you have not already!

Original release




A huge open dataset of Russian speech

Speech recognition specialists have long lacked a large open corpus of spoken Russian, so only large companies could afford to work on...

snakers4 (Alexander), May 09, 10:51

PyTorch DP / DDP / model parallel

Finally they made proper tutorials:




Model parallel = have parts of the same model on different devices

Data Parallel (DP) = wrapper to use multi-GPU within a single parent process

Distributed Data Parallel = multiple processes are spawned across cluster / on the same machine


The State of ML, end of 2018, in Russian

Quite a down-to-earth and clever lecture

Some nice examples for TTS and some interesting forecasts (some of them happened already).


Sergey Markov: "Artificial intelligence and machine learning: results of 2018."
The lecture took place at the popular-science lecture hall of the "Arhe" center on January 16, 2019. Lecturer: Sergey Markov, author of one of the strongest Russ...

snakers4 (Alexander), May 03, 08:58


PyTorch 1.1

- Tensorboard (beta);

- DistributedDataParallel new functionality and tutorials;

- Multi-headed attention;

- EmbeddingBag enhancements;

- Other cool, but more niche features:

- nn.SyncBatchNorm;

- optim.lr_scheduler.CyclicLR;



Tensors and Dynamic neural networks in Python with strong GPU acceleration - pytorch/pytorch

snakers4 (Alexander), May 02, 06:02

Russian Open Speech To Text (STT/ASR) Dataset

4000 hours of STT data in Russian

Made by us. Yes, really. I am not joking.

It was a lot of work.

The dataset:

Accompanying post:


- On the third release, we have ~4000 hours;

- Contributors and help wanted;

- Let's bring the Imagenet moment in STT closer together!;

Please repost this as much as you can.






Russian open STT dataset. Contribute to snakers4/open_stt development by creating an account on GitHub.

snakers4 (Alexander), May 02, 05:41

Poor man's computing cluster

So, when I last checked, Amazon's p3.4xlarge instances cost around US$12 per hour (unless you reserve them for a year). A tower supercomputer from Nvidia costs probably US$40-50k or more (it was announced at around US$69k).

It is not difficult to crunch the numbers and see that 1 month of renting such a machine would cost at least US$8-10k. There will also be the additional cost / problem of actually storing your large datasets. When I last used Amazon - their cheap storage was sloooooow, and fast storage was prohibitively expensive.

So, why am I saying this?

Let's assume (according to my miner friends' experience) - that consumer Nvidia GPUs can work 2-3 years non-stop given proper cooling and care (test before buying!). Also let's assume that 4xTesla V100 is roughly the same as 7-8 * 1080Ti.

Yeah, I know that you will point out at least one reason why this does not hold, but for practical purposes this is fine (yes, I know that Teslas have some cool features like Nvlink).

Now let me drop the ball - modern professional motherboards often boast 2-3 Ethernet ports. And sometimes you can even get 2x10Gbit/s ports (!!!).

It means that you can actually connect at least 2 machines (or maybe you can daisy-chain more?) into a computing cluster.

Now let's crunch the numbers

According to quotes I collected through the years, you can build a cluster roughly equivalent to Amazon's p3.4xlarge for US$10k (but with storage!) with used GPUs (miners sell them like crazy now). If you also buy second-hand drives, motherboards and CPUs, you can lower the cost to US$5k or less.

So, a US$10k cluster that would serve you at least one year (if you test everything properly and take care of it) costs roughly the same as:

- 20-25% of DGX desktop;

- 1 month of renting on Amazon;

Assuming that all the hardware will just break in a year:

- It is 4-5x cheaper than buying from Nvidia;

- It is 10x cheaper than renting;

If you buy everything used, then it is 10x and 20x cheaper!

I would buy that for a dollar!

Ofc you have to invest your free time.
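The back-of-the-envelope numbers above can be reproduced in a few lines. All inputs are the rough figures quoted in this post (US$12 per hour, US$40-50k for the Nvidia box, US$10k / US$5k for the home-built cluster); the 30-day month of non-stop use and the helper names are assumptions made for the sketch.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: a 30-day month of non-stop use

AWS_USD_PER_HOUR = 12          # rough on-demand price quoted in the post
DGX_USD = (40_000, 50_000)     # rough price range quoted in the post
CLUSTER_NEW_USD = 10_000       # home-built cluster with used GPUs
CLUSTER_USED_USD = 5_000       # same, with all parts bought used


def rent_cost(months, usd_per_hour=AWS_USD_PER_HOUR):
    """Cost of renting the instance non-stop for the given number of months."""
    return months * HOURS_PER_MONTH * usd_per_hour


def times_cheaper(alternative_usd, own_usd):
    """How many times cheaper the home-built cluster is than the alternative."""
    return alternative_usd / own_usd


print("1 month of renting, US$:", rent_cost(1))
print("1 year of renting, US$:", rent_cost(12))
for label, own in (("new parts", CLUSTER_NEW_USD), ("used parts", CLUSTER_USED_USD)):
    vs_rent = times_cheaper(rent_cost(12), own)
    vs_dgx = [times_cheaper(price, own) for price in DGX_USD]
    print(f"{label}: {vs_rent:.1f}x cheaper than a year of renting, "
          f"{vs_dgx[0]:.0f}-{vs_dgx[1]:.0f}x cheaper than buying from Nvidia")
```

The comparison assumes the worst case from the post, i.e. that the hardware lasts exactly one year; every extra year of service makes the ratio better.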

See my calculations here:




config
Server, Part, Approx quote, Quote date, Price (USD), Comment
RUR/USD: 65 (Yes, I know that you should have historical exchange rates)
1, Thermaltake Core X9 Black, 12,220, 11/22/2018, 188
1, Gigabyte X399 AORUS XTREME Socket TR4, AMD X399, 8xDDR-4, 7.1CH, 2x1000 Mbit/s, 10000 Mbit/s, Wi-Fi, Bluetooth, U...

snakers4 (Alexander), April 22, 11:44

Cool docker function

View aggregate load stats by container


docker stats

Description Display a live stream of container(s) resource usage statistics Usage docker stats [OPTIONS] [CONTAINER...] Options Name, shorthand Default Description --all , -a Show all containers (default shows just running)...

2019 DS / ML digest 9

Highlights of the week

- Stack Overflow survey;

- Unsupervised STT (ofc not!);

- A mix between detection and semseg?;



2019 DS/ML digest 09


snakers4 (Alexander), April 14, 06:59

PyTorch DataParallel scalability

TLDR - it works fine for 2-3 GPUs.

For more GPUs - use DDP.



Unsupervised Language Modeling at scale for robust sentiment classification - NVIDIA/sentiment-discovery

snakers4 (Alexander), April 09, 06:00

2019 DS / ML digest number 8

Highlights of the week

- Transformer from Facebook with sub-word information;

- How to generate endless sentiment annotation;

- 1M breast cancer images;



2019 DS/ML digest 08


snakers4 (Alexander), March 31, 12:44

Miniaturize / optimize your ... NLP models?

For CV applications there are literally dozens of ways to make your models smaller.

And yeah, I do not mean some "moonshots" or special limited libraries (matrix decompositions, some custom pruning, etc etc).

I mean cheap and dirty hacks, that work in 95% of cases regardless of your stack / device / framework:

- Smaller images (x3-x4 easy);

- FP16 inference (30-40% maybe);

- Knowledge distillation into smaller networks (x3-x10);

- Naïve cascade optimizations (feed only Nth frame using some heuristic);

But what can you do with NLP networks?

Turns out not much.

But here are my ideas:

- Use a simpler model - embedding bag + plain self-attention + LSTM can solve 90% of tasks;

- Decrease embedding size from 300 to 50 (or maybe even more). Tried and tested, works like a charm. For harder tasks you lose just 1-3pp of your target metric, for smaller tasks - it is just the same;

- FP16 inference is supported in PyTorch for nn.Embedding, but not for nn.EmbeddingBag. But you get the idea;

_embedding_bag is not implemented for type torch.HalfTensor

- You can try distilling your vocabulary / embedding-bag model into a char level model. If it works, you can trade model size vs. inference time;

- If you have very long sentences or large batches - try distilling / swapping your recurrent network with a CNN / TCN. This way you can also trade model size vs. inference time but probably in a different direction;
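To put rough numbers on the embedding-size and FP16 ideas above, here is a small sketch of how much space the embedding table alone takes. The table of 1 million rows is an assumption made for illustration, and the helper name is hypothetical; only the embedding sizes come from the post.

```python
NUM_ROWS = 1_000_000  # assumption: a vocabulary / n-gram table with 1m rows


def embedding_table_mb(num_rows, dim, bytes_per_weight=4):
    """Size of a dense embedding table in megabytes (4 bytes = FP32, 2 = FP16)."""
    return num_rows * dim * bytes_per_weight / 1e6


for dim in (300, 50):
    fp32 = embedding_table_mb(NUM_ROWS, dim)
    fp16 = embedding_table_mb(NUM_ROWS, dim, bytes_per_weight=2)
    print(f"embedding size {dim}: {fp32:.0f} MB in FP32, {fp16:.0f} MB in FP16")
```

Going from 300 to 50 shrinks the table 6x on its own, and FP16 storage would halve whatever is left, which is why the embedding size is the first knob worth turning.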



snakers4 (Alexander), March 26, 15:30


Updated my DL/ML dockerfile with

- cuda 10

- PyTorch 1.0

TF now also works with cuda 10



Contribute to snakers4/gpu-box-setup development by creating an account on GitHub.

snakers4 (Alexander), March 25, 05:31

Good old OLS regression

I needed some quick boilerplate to create an OLS regression with confidence intervals for a very plain task.

Found some nice statsmodels examples here:
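For a single regressor the same kind of boilerplate can be sketched without any dependencies. This is not the statsmodels code referred to above, just a minimal illustration: the function name and the toy data are hypothetical, and the intervals use a normal quantile instead of the Student's t quantile, so they come out too narrow on small samples.

```python
import math
from statistics import NormalDist, mean


def ols_with_ci(x, y, alpha=0.05):
    """Fit y = intercept + slope * x; return (estimate, low, high) per coefficient."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Unbiased residual variance: two parameters were estimated from the data.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sigma2 = sum(r * r for r in residuals) / (n - 2)

    se_slope = math.sqrt(sigma2 / sxx)
    se_intercept = math.sqrt(sigma2 * (1 / n + x_bar ** 2 / sxx))

    # Assumption: normal approximation of the quantile (reasonable for large n).
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return {
        "slope": (slope, slope - z * se_slope, slope + z * se_slope),
        "intercept": (intercept, intercept - z * se_intercept, intercept + z * se_intercept),
    }


# Toy data scattered around y = 2x + 1.
xs = [0, 1, 2, 3, 4, 5]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9]
for name, (estimate, low, high) in ols_with_ci(xs, ys).items():
    print(f"{name}: {estimate:.3f} [{low:.3f}, {high:.3f}]")
```

For anything beyond a quick look (several regressors, small samples, proper t-based intervals) a dedicated library is the safer choice.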


2019 DS / ML digest number 7

Highlights of the week

- NN normalization techniques (not batch norm);

- Jetson nano for US$99 released;

- A bitter lesson in AI;



2019 DS/ML digest 07


snakers4 (Alexander), March 21, 11:15

Normalization techniques other than batch norm:


Weight normalization (used in TCN)

- Decouples length of weight vectors from their direction;

- Does not introduce any dependencies between the examples in a minibatch;

- Can be applied successfully to recurrent models such as LSTMs;

- Tested only on small datasets (CIFAR + VAEs + DQN);

Instance norm (used in style transfer)

- Proposed for style transfer;

- Essentially is batch-norm for one image;

- The mean and standard-deviation are calculated per-dimension separately for each object in a mini-batch;

Layer norm (used in Transformers)

- Designed especially for sequential networks;

- Computes the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case;

- The mean and standard deviation are calculated separately over a certain number of last dimensions;

- Unlike Batch Normalization and Instance Normalization, which apply a scalar scale and bias for each entire channel/plane with the affine option, Layer Normalization applies per-element scale and bias;
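The difference between these techniques is mostly about which values the statistics are computed over. A minimal dependency-free sketch, assuming an input laid out as x[sample][channel][position] and leaving out the learnable scale and bias; the function names are hypothetical and the layer norm here normalizes over the last two dimensions, which is only one of the possible choices.

```python
import math


def _normalize(values, eps=1e-5):
    """Shift and scale a flat list of numbers to zero mean and unit variance."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return [(v - m) / math.sqrt(var + eps) for v in values]


def instance_norm(x):
    """Statistics per sample and per channel, i.e. over positions only."""
    return [[_normalize(channel) for channel in sample] for sample in x]


def layer_norm(x):
    """Statistics per sample, over all channels and positions together."""
    out = []
    for sample in x:
        width = len(sample[0])
        flat = _normalize([v for channel in sample for v in channel])
        out.append([flat[c * width:(c + 1) * width] for c in range(len(sample))])
    return out


def weight_norm(v, g):
    """Re-parameterize a weight vector as w = g * v / ||v||: g is its length, v its direction."""
    norm = math.sqrt(sum(vi * vi for vi in v))
    return [g * vi / norm for vi in v]


x = [[[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]]  # 1 sample, 2 channels, 3 positions
print("instance norm:", instance_norm(x))
print("layer norm:   ", layer_norm(x))
print("weight norm:  ", weight_norm([3.0, 4.0], g=2.0))
```

Note how instance norm makes the two channels look identical, while layer norm keeps their relative scale, and none of the three depends on other examples in the minibatch.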



snakers4 (Alexander), March 18, 06:18

6th 2019 DS / ML digest

Highlights of the week

- Cool python features;

- Google's on-device STT;

- Why Facebook invested so much in PyTorch 1.0;




2019 DS/ML digest 06


snakers4 (Alexander), March 17, 15:40

New large dataset for your GAN or pix2pix pet project

500k fashion images + meta-data + landmarks



DeepFashion2 Dataset - switchablenorms/DeepFashion2
