Spark in me - Internet, data science, math, deep learning, philo

snakers4 @ telegram, 1823 members, 1713 posts since 2016

All this - lost like tears in rain.

Data science, ML, a bit of philosophy and math. No bs.

Our website
Our chat
DS courses review

snakers4 (Alexander), May 22, 15:06

Few-Shot Adversarial Learning of Realistic Neural Talking Head Models
Paper: Authors: Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, Victor Lempitsky Abstract: Several recent works have sho...

snakers4 (Alexander), May 20, 06:21

New in our Open STT dataset

- An mp3 version of the dataset;

- A torrent for mp3 dataset;

- A torrent for the original wav dataset;

- Benchmarks on the public dataset / files with "poor" annotation marked;





Russian open STT dataset. Contribute to snakers4/open_stt development by creating an account on GitHub.

snakers4 (Alexander), May 19, 16:03

Forwarded from Just links:

An open source deep learning platform that provides a seamless path from research prototyping to production deployment.

SWA in contrib repo of pytorch )

snakers4 (Alexander), May 14, 03:40

2019 DS / ML digest 10

Highlights of the week(s)

- New MobileNet;

- New PyTorch release;

- Practical GANs?;



2019 DS/ML digest 10

2019 DS/ML digest 10 Статьи автора - Блог -

snakers4 (Alexander), May 09, 11:28 / TowardsDataScience post for our dataset

In addition to a github release and a medium post, we also made post:


Also our post was accepted to an editor's pick part of TDS:


Share / give us a star / clap if you have not already!

Original release




Огромный открытый датасет русской речи

Специалистам по распознаванию речи давно не хватало большого открытого корпуса устной русской речи, поэтому только крупные компании могли позволить себе занима...

snakers4 (Alexander), May 09, 10:51

PyTorch DP / DDP / model parallel

Finally they made proper tutorials:




Model parallel = have parts of the same model on different devices

Data Parallel (DP) = wrapper to use multi-GPU withing a single parent process

Distributed Data Parallel = multiple processes are spawned across cluster / on the same machine


The State of ML, eof 2018 in Russian

Quite down-to-earth and clever lecture

Some nice examples for TTS and some interesting forecasts (some of them happened already).


Сергей Марков: "Искусственный интеллект и машинное обучение: итоги 2018 года."
Лекция состоялась в научно-популярном лектории центра "Архэ" ( 16 января 2019 года. Лектор: Сергей Марков — автор одной из сильнейших росс...

snakers4 (Alexander), May 03, 08:58


PyTorch 1.1

- Tensorboard (beta);

- DistributedDataParallel new functionality and tutorials;

- Multi-headed attention;

- EmbeddingBag enhancements;

- Other cool, but more niche features:

- nn.SyncBatchNorm;

- optim.lr_scheduler.CyclicLR;



Tensors and Dynamic neural networks in Python with strong GPU acceleration - pytorch/pytorch

snakers4 (Alexander), May 02, 06:02

Russian Open Speech To Text (STT/ASR) Dataset

4000 hours of STT data in Russian

Made by us. Yes, really. I am not joking.

It was a lot of work.

The dataset:

Accompanying post:


- On third release, we have ~4000 hours;

- Contributors and help wanted;

- Let's bring the Imagenet moment in STT closer together!;

Please repost this as much as you can.






Russian open STT dataset. Contribute to snakers4/open_stt development by creating an account on GitHub.

snakers4 (Alexander), May 02, 05:41

Poor man's computing cluster

So, when I last checked, Amazon's p3.4xlarge instances cost around US$12 per hour (unless you reserve them for a year). A tower supercomputer from Nvidia costs probably US$40-50k or more (it was announced at around US$69k).

It is not difficult to crunch the numbers and see, that 1 month of renting such a machine would cost at least US$8-10k. Also there will the additional cost / problem of actually storing your large datasets. When I last used Amazon - their cheap storage was sloooooow, and fast storage was prohibitively expensive.

So, why I am saying this?

Let's assume (according to my miner friends' experience) - that consumer Nvidia GPUs can work 2-3 years non-stop given proper cooling and care (test before buying!). Also let's assume that 4xTesla V100 is roughly the same as 7-8 * 1080Ti.

Yeah, I know that you will point out at least one reason why this does not hold, but for practical purposes this is fine (yes, I know that Teslas have some cool features like Nvlink).

Now let me drop the ball - modern professional motherboards often boast 2-3 Ethernet ports. And sometimes you can even get 2x10Gbit/s ports (!!!).

It means, that you actually can connect at least 2 (or maybe you can daisy chain them?) machines into a computing cluster.

Now let's crunch the numbers

According to quotes I collected through the years, you can build a cluster roughly equivalent to Amazon's p3.4xlarge for US$10k (but with storage!) with used GPUs (miners sell them like crazy now). If you buy second market drives, motherboards, CPUs and processors you can lower the cost to US$5k or less.

So, a cluster, that would serve you at least one year (if you test everything properly and take care of it) costing US$10k is roughly equivalent to:

- 20-25% of DGX desktop;

- 1 month of renting on Amazon;

Assuming that all the hardware will just break in a year:

- It is 4-5x cheaper than buying from Nvidia;

- It is 10x cheaper than renting;

If you buy everything used, then it is 10x and 20x cheaper!

I would buy that for a dollar!

Ofc you have to invest your free time.

See my calculations here:




config Server,Part,Approx quote,Quote date,Price, USD,Comment,RUR/USD,65,Yes, I know that you should have historical exchange rates 1,Thermaltake Core X9 Black,12,220,11/22/2018,188 1,Gigabyte X399 AORUS XTREMESocket TR4, AMD X399, 8xDDR-4, 7.1CH, 2x1000 Мбит/с, 10000 Мбит/с, Wi-Fi, Bluetooth, U...

snakers4 (Alexander), May 01, 05:49

correct link

streaming STT lecture now

Deep Learning на пальцах 11 - Аудио и распознавание речи (Юрий Бабуров)
Курс: Слайды:

snakers4 (Alexander), April 30, 09:27

Tricky rsync flags

Rsync is the best program ever.

I find these flags the most useful

--ignore-existing (ignores existing files)
--update (updates to newer versions of files based on ts)
--size-only (uses file-size to compare files)
-e 'ssh -p 22 -i /path/to/private/key' (use custom ssh identity)

Sometimes first three flags get confusing.


More about STT from also us ... soon)

Forwarded from Yuri Baburov:

Вторая экспериментальная гостевая лекция курса.

Один из семинаристов курса, Юрий Бабуров, расскажет о распознавании речи и работе с аудио.

1-го мая в 8:40 Мск (12:40 Нск, 10:40 вечера 30-го апреля по PST).

Deep Learning на пальцах 11 - Аудио и Speech Recognition (Юрий Бабуров)

Deep Learning на пальцах 11 - Аудио и Speech Recognition (Юрий Бабуров)

snakers4 (Alexander), April 22, 11:44

Cool docker function

View aggregate load stats by container


docker stats

Description Display a live stream of container(s) resource usage statistics Usage docker stats [OPTIONS] [CONTAINER...] Options Name, shorthand Default Description --all , -a Show all containers (default shows just running)...

2019 DS / ML digest 9

Highlights of the week

- Stack Overlow survey;

- Unsupervised STT (ofc not!);

- A mix between detection and semseg?;



2019 DS/ML digest 09

2019 DS/ML digest 09 Статьи автора - Блог -

snakers4 (Alexander), April 17, 08:55

Archive team ... makes monthly Twitter archives

With all the BS with politics / "Russian hackers" / Arab spring - twitter how has closed its developer API.

No problem.

Just pay a visit to archive team page[]=year%3A%222018%22

Donate them here




Archive Team: The Twitter Stream Grab : Free Web : Free Download, Borrow and Streaming : Internet Archive

A simple collection of JSON grabbed from the general twitter stream, for the purposes of research, history, testing and memory. This is the Spritzer version, the most light and shallow of Twitter grabs. Unfortunately, we do not currently have access to the Sprinkler or Garden Hose versions of the...

snakers4 (Alexander), April 17, 08:47

Using snakeviz for profiling Python code


To profile complicated and convoluted code.

Snakeviz is a cool GUI tool to analyze cProfile profile files.

Just launch your code like this

python3 -m cProfile -o profile_file.cprofile

And then just analyze with snakeviz.


They have a server GUI and a jupyter notebook plugin.

Also you can launch their tool from within a docker container:

snakeviz -s -H profile_file.cprofile

Do not forget to EXPOSE necessary ports. SSH tunnel to a host is also an option.



SnakeViz is a browser based graphical viewer for the output of Python's cProfile module.

snakers4 (Alexander), April 14, 06:59

PyTorch DataParallel scalability

TLDR - it works fine for 2-3 GPUs.

For more GPUs - use DDP.



Unsupervised Language Modeling at scale for robust sentiment classification - NVIDIA/sentiment-discovery

snakers4 (Alexander), April 09, 06:00

2019 DS / ML digest number 8

Highlights of the week

- Transformer from Facebook with sub-word information;

- How to generate endless sentiment annotation;

- 1M breast cancer images;



2019 DS/ML digest 08

2019 DS/ML digest 08 Статьи автора - Блог -

snakers4 (Alexander), April 07, 12:55

Finally! Cool features like SyncBN or CyclicLR migrate to Pytorch!

Forwarded from Just links:

snakers4 (Alexander), March 31, 16:44

Overview of differential equations | Chapter 1
How do you study what cannot be solved? Home page: Brought to you by you: Need to brush up on calculus? htt...

snakers4 (Alexander), March 31, 12:44

Miniaturize / optimize your ... NLP models?

For CV applications there literally dozens of ways to make your models smaller.

And yeah, I do not mean some "moonshots" or special limited libraries (matrix decompositions, some custom pruning, etc etc).

I mean cheap and dirty hacks, that work in 95% of cases regardless of your stack / device / framework:

- Smaller images (x3-x4 easy);

- FP16 inference (30-40% maybe);

- Knowledge distillation into smaller networks (x3-x10);

- Naïve cascade optimizations (feed only Nth frame using some heuristic);

But what can you do with NLP networks?

Turns out not much.

But here are my ideas:

- Use a simpler model - embedding bag + plain self-attention + LSTM can solve 90% of tasks;

- Decrease embedding size from 300 to 50 (or maybe even more). Tried and tested, works like a charm. For harder tasks you lose just 1-3pp of your target metric, for smaller tasks - it is just the same;

- FP16 inference is supported in PyTorch for nn.Embedding, but not for nn.EmbeddingBag. But you get the idea;

_embedding_bag is not implemented for type torch.HalfTensor

- You can try distilling your vocabulary / embedding-bag model into a char level model. If it works, you can trade model size vs. inference time;

- If you have very long sentences or large batches - try distilling / swapping your recurrent network with a CNN / TCN. This way you can also trade model size vs. inference time but probably in a different direction;



snakers4 (Alexander), March 26, 15:30


Updated my DL/ML dockerfile with

- cuda 10

- PyTorch 1.0

TF now also works with cuda 10



Contribute to snakers4/gpu-box-setup development by creating an account on GitHub.

snakers4 (Alexander), March 26, 04:44

Russian sentiment dataset

In a typical Russian fashion - one of these datasets was deleted by the request of bad people, whom I shall not name.

Luckily, some anonymous backed the dataset up.

Anyway - use it.

Yeah, it is small. But it is free, so whatever.



Download Dataset.tar.gz 1.57 MB

snakers4 (Alexander), March 25, 05:31

Good old OLS regression

I needed some quick boilerplate to create an OLS regression with confidence intervals for a very plain task.

Found some nice statsmodels examples here:


2019 DS / ML digest number 7

Highlights of the week

- NN normalization techniques (not batch norm);

- Jetson nano for US$99 released;

- A bitter lesson in AI;



2019 DS/ML digest 07

2019 DS/ML digest 06 Статьи автора - Блог -

snakers4 (Alexander), March 24, 10:03

Wow, we are not alone with our love for Embedding bag!

Forwarded from Neural Networks Engineering:

FastText embeddings done right

An important feature of FastText embeddings is the usage of subword information.

In addition to the vocabulary FastText also contains word's ngrams.

This additional information is useful for the following: handling Out-Of-Vocabulary words, extracting sense from word's etymology and dealing with misspellings.

But unfortunately all this advantages are not used in most open source projects.

We can easily discover it via GitHub (pic.). The point is that regular Embedding layer maps the whole word into a single stored in memory fixed vector. In this case all the word vectors should be generated in advance, so none of the cool features work.

The good thing is that using FastText correctly is not so difficult! FacebookResearch provides an example of the proper way to use FastText in PyTorch framework.

Instead of Embedding you should choose EmbeddingBag layer. It will combine ngrams into single word vector which can be used as usual.

Now we will obtain all advantages in our neural network.


Library for fast text representation and classification. - facebookresearch/fastText

... or you can just extend collate_fn that is passed to DataLoader in pytorch =)

Forwarded from Neural Networks Engineering:

Parallel preprocessing with multiprocessing

Using multiple processes to construct train batches may significantly reduce total training time of your network.

Basically, if you are using GPU for training, you can reduce additional batch construction time almost to zero. This is achieved through pipelining of computations: while GPU crunches numbers, CPU makes preprocessing. Python multiprocessing module allows us to implement such pipelining as elegant as it is possible in the language with GIL.

PyTorch DataLoader class, for example, also uses multiprocessing in it's internals.

Unfortunately DataLoader suffers lack of flexibility. It's impossible to create batch with any complex structure within standard DataLoader class. So it should be useful to be able to apply raw multiprocessing.

multiprocessing gives us a set of useful APIs to distribute computations among several processes. Processes does not share memory with each other, so data is transmitted via inter-process communication protocols. For example in linux-like operation systems multiprocessing uses pipes. Such organization leads to some pitfalls that I am going to tell you.

* map vs imap

Methods map and imap may be used to apply preprocessing to batches. Both of them take processing function and iterable as argument. The difference is that imap is lazy. It will return processed elements as soon as they are ready. In this case all processed batched should not be stored in RAM simultaneously. For training NN you should always prefer imap:

def process(batch_reader):
with Pool(threads) as pool:
for batch in pool.imap(foo, batch_reader):
yield batch

* Serialization

Other pitfall is associated with the need to transfer objects via pipes. In addition to the processing results, multiprocessing will also serialize transformation object if it is used like this: pool.imap(, batch_reader). transformer will be serialized and send to subprocess. It may lead to some problems if transformer object has large properties. In this case it may be better to store large properties as singleton class variables:

class Transformer():
large_dictinary = None

def __init__(self, large_dictinary, **kwargs):
self.__class__.large_dictinary = large_dictinary

def foo(self, x):
y = self.large_dictinary[x]

Another difficulty that you may encounter is if the preprocessor is faster than GPU learning. In this case unprocessed batches accumulate in memory. If your memory is not to large enough you will get Out-of-Memory error. One way to solve this problem is to limit batch preprocessing until GPU learning is done.

Semaphore is perfect solution for this task:

def batch_reader(semaphore):
for batch in source:
yield batch

def process(x):
return x + 1

def pooling():
with Pool(threads) as pool:
semaphore = Semaphore(limit)
for x in pool.imap(plus, batch_reader(semaphore)):
yield x

for x in pooling():

Semaphore has internal counter syncronized across all working processes. It's logic will block execution if some process tries to increase counet value above limit with semaphore.acquire ()

snakers4 (Alexander), March 21, 11:15

Normalization techniques other than batch norm:


Weight normalization (used in TCN

- Decouples length of weight vectors from their direction;

- Does not introduce any dependencies between the examples in a minibatch;

- Can be applied successfully to recurrent models such as LSTMs;

- Tested only on small datasets (CIFAR + VAES + DQN);

Instance norm (used in [style transfer](

- Proposed for style transfer;

- Essentially is batch-norm for one image;

- The mean and standard-deviation are calculated per-dimension separately for each object in a mini-batch;

Layer norm (used in Transformers, [paper](

- Designed especially for sequntial networks;

- Computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case;

- The mean and standard-deviation are calculated separately over the last certain number dimensions;

- Unlike Batch Normalization and Instance Normalization, which applies scalar scale and bias for each entire channel/plane with the affine option, Layer Normalization applies per-element scale and bias;



snakers4 (Alexander), March 18, 06:18

6th 2019 DS / ML digest

Highlights of the week

- Cool python features;

- Google's on-device STT;

- Why Facebook invested so much in PyTorch 1.0;




2019 DS/ML digest 06

2019 DS/ML digest 06 Статьи автора - Блог -

snakers4 (Alexander), March 17, 15:40

New large dataset for you GAN or pix2pix pet project

500k fashion images + meta-data + landmarks



DeepFashion2 Dataset - switchablenorms/DeepFashion2

snakers4 (Alexander), March 17, 05:41

Cramer's rule, explained geometrically | Essence of linear algebra, chapter 12
This rule seems random to many students, but it has a beautiful reason for being true. Full series: Home page:

New video from 3B1B

Which is kind of relevant

snakers4 (Alexander), March 14, 03:58

GANPaint: An Extraordinary Image Editor AI
📝 The paper " GAN Dissection: Visualizing and Understanding Generative Adversarial Networks " and its web demo is available here: https://gandissect.csail.mi...

snakers4 (Alexander), March 12, 15:45

Our Transformer post was featured by Towards Data Science


Comparing complex NLP models for complex languages on a set of real tasks

Transformer is not yet really usable in practice for languages with rich morphology, but we take the first step in this direction

snakers4 (Alexander), March 12, 11:53

New tricks for training CNNs

Forwarded from Just links:

Bag of Tricks for Image Classification with Convolutional Neural Networks

Much of the recent progress made in image classification research can be credited to training procedure refinements, such as changes in data augmentations and optimization methods. In the...

Forwarded from Just links:

DropBlock: A regularization method for convolutional networks

older first