Spark in me - Internet, data science, math, deep learning, philo

snakers4 @ telegram, 1823 members, 1768 posts since 2016

All this - lost like tears in rain.

Data science, ML, a bit of philosophy and math. No bs.

Our website
Our chat
DS courses review

snakers4 (Alexander), September 13, 06:37

2019 DS / ML digest 15


Highlights of the week(s):

- Facebook's upcoming deep fake detection challenge;

- Lyft competition on Kaggle;

- Waymo open-sources its data;

- Cool ways to deal with imbalanced data and noisy data;



2019 DS/ML digest 15


snakers4 (Alexander), September 06, 08:06

Forwarded from Just links:

Class-Balanced Loss Based on Effective Number of Samples

With the rapid increase of large-scale, real-world datasets, it becomes critical to address the problem of long-tailed data distribution (i.e., a few classes account for most of the data, while...
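For reference, a minimal sketch of the weighting idea behind the linked paper: each class gets a weight inversely proportional to its "effective number of samples", E_n = (1 - beta^n) / (1 - beta). The function name, the example class counts and the normalization (weights sum to the number of classes) are assumptions for illustration, not something stated in the post.

```python
def class_balanced_weights(samples_per_class, beta=0.999):
    """Per-class loss weights based on the effective number of samples."""
    # Effective number of samples for a class with n examples:
    # E_n = (1 - beta**n) / (1 - beta)
    effective = [(1.0 - beta ** n) / (1.0 - beta) for n in samples_per_class]
    # The weight is inversely proportional to the effective number.
    raw = [1.0 / e for e in effective]
    # Rescale so that the weights sum to the number of classes
    # (a convention assumed here, other normalizations are possible).
    scale = len(raw) / sum(raw)
    return [w * scale for w in raw]


if __name__ == "__main__":
    counts = [5000, 500, 50]  # hypothetical long-tailed class counts
    print(class_balanced_weights(counts))
```

The resulting weights can be plugged into any per-class weighted loss. With beta close to 0 all classes get roughly equal weight, with beta close to 1 the weights approach plain inverse frequency.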

snakers4 (Alexander), September 03, 16:47

Support Open STT

Now you can support Open STT on our GitHub page via Open Collective!

Open Collective seemed to be the best platform supported by GitHub for now.


snakers4 (Alexander), September 03, 14:59

Now they stack ... normalization!

Tough to choose between BN / LN / IN?

Now a stacked version with attention exists!

Also, their 1D implementation does not work, but you can hack their 2D (actually BxCxHxW) layer to work with 1D (actually BxCxW) data =)



Code for Switchable Normalization from "Differentiable Learning-to-Normalize via Switchable Normalization" - switchablenorms/Switchable-Normalization

snakers4 (Alexander), August 30, 06:54

ML without train / val split

Yeah, I am not crazy. But probably this applies only to NLP.

Sometimes you just need your pipeline to be flexible enough to work with any possible "in the wild" data.

A cool and weird trick - if you can make your dataset so large that your model just MUST generalize to work on it, then you do not need a validation set.

If you sample data randomly and your data generator is good enough, each new batch is just random and can serve as validation.
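A rough sketch of what this can look like in a training loop: every fresh batch is scored before the model is updated on it, so the running metric is computed only on data the model has not seen yet. The toy data generator (y = 3x + 1 with noise), the linear model and all names are hypothetical stand-ins, not our actual pipeline.

```python
import random


def batch_generator(rng, batch_size=32):
    """Hypothetical endless data source: y = 3x + 1 plus a little noise."""
    while True:
        xs = [rng.uniform(-1.0, 1.0) for _ in range(batch_size)]
        yield [(x, 3.0 * x + 1.0 + rng.gauss(0.0, 0.01)) for x in xs]


def train_with_progressive_validation(steps=500, lr=0.1, seed=0):
    """Fit y = w * x + b with SGD, validating on each batch before using it."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    history = []  # "validation" MSE measured on unseen batches
    batches = batch_generator(rng)
    for _ in range(steps):
        batch = next(batches)
        # 1) Score the fresh batch BEFORE training on it.
        errors = [(w * x + b) - y for x, y in batch]
        history.append(sum(e * e for e in errors) / len(batch))
        # 2) Only then take an SGD step on the same batch.
        grad_w = sum(2.0 * e * x for e, (x, _) in zip(errors, batch)) / len(batch)
        grad_b = sum(2.0 * e for e in errors) / len(batch)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, history


if __name__ == "__main__":
    w, b, history = train_with_progressive_validation()
    print(w, b, history[0], history[-1])
```

This only makes sense if the generator really never repeats itself. With a finite dataset and several epochs the same loop silently turns into measuring train loss.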


snakers4 (Alexander), August 27, 07:16

Poor man's ensembling techniques

So you want to improve your model's performance a bit.

Ensembling helps. But as is ... it's useful only on Kaggle competitions, where people stack over9000 networks trained on 100MB of data.

But for real-life usage / production, there are ensembling techniques that do not require a significant increase in computation cost (!).

All of this is not mainstream yet, but it may work on your dataset!

Especially if your task is easy and the dataset is small.

- SWA (proven to work, usually used as a last stage when training a model);

- Lookahead optimizer (kind of new, not thoroughly tested);

- Multi-Sample Dropout (seems like a cheap ensemble, should work for classification);

Applicability will vary with your task.

Plain vanilla classification can use all of these, s2s networks probably only partially.
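To make the SWA item above more concrete, here is a toy sketch of its core idea: keep a running average of the weights visited during the last stage of training and use that average as the final model. The noisy quadratic "loss", the hyper-parameters and the function name are assumptions made for illustration. A real setup would average the network parameters and, if the network has batch norm, recompute its statistics afterwards.

```python
import random


def sgd_with_swa(steps=1000, swa_start=500, lr=0.05, seed=0):
    """Toy SGD on a noisy quadratic loss with a running average of late weights."""
    rng = random.Random(seed)
    target = [2.0, -1.0]  # hypothetical optimum of the toy loss
    weights = [0.0, 0.0]
    swa_weights, n_averaged = None, 0
    for step in range(steps):
        # Noisy gradient of 0.5 * ||w - target||^2 (noise mimics mini-batch sampling).
        grads = [(w - t) + rng.gauss(0.0, 1.0) for w, t in zip(weights, target)]
        weights = [w - lr * g for w, g in zip(weights, grads)]
        if step >= swa_start:
            # Running mean of the weights visited after swa_start.
            n_averaged += 1
            if swa_weights is None:
                swa_weights = list(weights)
            else:
                swa_weights = [s + (w - s) / n_averaged
                               for s, w in zip(swa_weights, weights)]
    return weights, swa_weights


if __name__ == "__main__":
    last, averaged = sgd_with_swa()
    print("last iterate:", last)
    print("SWA average: ", averaged)
```

The extra cost is one more copy of the weights and one cheap update per step, which is why it qualifies as "poor man's" ensembling.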



An open source deep learning platform that provides a seamless path from research prototyping to production deployment.

snakers4 (Alexander), August 23, 12:24

How to solve an arbitrary CV task ...

- W/o annotation

- W/o GPUs in production

- And make your model work in real life and help people



How to get your own image classifier / region labelling model without annotation

A multi-head model with a light and fast encoder, trained without annotation and deployed on CPU

snakers4 (Alexander), August 23, 07:14

Our STT Dark Forest post on TDS

Please 👏x50 if you have an account


Navigating the Speech to Text Dark Forest

Make your ASR network 4x faster, 5x smaller and 10x cooler

snakers4 (Alexander), August 21, 07:12

2019 DS / ML digest 14


Highlights of the week(s):

- FAIR embraces embedding bags for misspellings;

- New version of Adam - RAdam. But on the only real test the author has conducted (ImageNet), SGD is better;

- Yet another LSTM replacement - SRU. Similar to QRNN, it requires additional dependencies;



2019 DS/ML digest 14


snakers4 (Alexander), August 19, 10:14

Sampler - visualization for any shell command

A cool mix between glances and prometheus



A tool for shell commands execution, visualization and alerting. Configured with a simple YAML file. - sqshq/sampler

snakers4 (Alexander), August 15, 14:42

My foray into the STT Dark Forest

My tongue-in-cheek article on ML in general, and on how to make your STT model train 3-4x faster with 4-5x fewer weights at the same quality




Navigating the Speech to Text Dark Forest

A tongue-in-cheek description of our STT path

snakers4 (Alexander), August 11, 04:42

Extreme NLP network miniaturization

Tried some plain RNNs on a custom in-the-wild NER task.

The dataset is huge - literally infinite, but manually generated to mimic in-the-wild data.

I use EmbeddingBag + 1m n-grams (an optimal cut-off). Yeah, on NER / classification it is a handy trick that makes your pipeline totally misprint / error / OOV agnostic. Also, FAIR themselves have just arrived at the same idea. Very cool! Just add PyTorch and you are golden.

What is interesting:

- The model works with embedding sizes 300, 100, 50 and even 5! 5 is dangerously close to OHE, but doing OHE on 1m n-grams kind of does not make sense;

- The model works with various hidden sizes;

- Naturally, all of the models run very fast on CPU, but the smallest model is also very light in terms of its weights;

- The only difference is convergence time. It kind of scales as a log of model size, i.e. the model with embedding size 5 takes 5-7x longer to converge than the model with size 50. I wonder what happens if I use an embedding size of 1?

As an added bonus, you can just store such a miniature model in git w/o LFS.

What was that about training transformers on US$250k worth of compute credits, you say?)
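A stdlib-only sketch of why the n-gram bag trick is misprint / OOV agnostic: a word is split into character n-grams, each n-gram is mapped to one of 1m rows, and the word vector is the mean of those rows, so a misspelled word still shares most of its rows with the correct one. Note the assumptions: the post uses a cut-off vocabulary of n-grams, while here the hashing trick is swapped in to keep the example short. The n-gram range (2 to 4), the boundary markers and the `TinyEmbeddingBag` class are also made up for illustration and are not the PyTorch layer.

```python
import random
import zlib

NUM_BUCKETS = 1_000_000  # mirrors the "1m n-grams" mentioned above
EMBEDDING_DIM = 5        # the smallest embedding size reported to work


def char_ngrams(word, n_min=2, n_max=4):
    """Character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


def ngram_ids(word):
    """Map each n-gram to a row index via hashing (an assumption, see above)."""
    return [zlib.crc32(g.encode("utf-8")) % NUM_BUCKETS for g in char_ngrams(word)]


class TinyEmbeddingBag:
    """Stand-in for an embedding-bag layer: the mean of the n-gram vectors."""

    def __init__(self, dim=EMBEDDING_DIM, seed=0):
        self.dim = dim
        self.seed = seed
        self.table = {}  # rows are created lazily instead of allocating 1m x dim

    def _row(self, idx):
        if idx not in self.table:
            rng = random.Random(self.seed * NUM_BUCKETS + idx)
            self.table[idx] = [rng.gauss(0.0, 1.0) for _ in range(self.dim)]
        return self.table[idx]

    def __call__(self, word):
        rows = [self._row(i) for i in ngram_ids(word)]
        return [sum(col) / len(rows) for col in zip(*rows)]


if __name__ == "__main__":
    shared = set(ngram_ids("tensor")) & set(ngram_ids("tensro"))
    print(len(shared), "rows shared between 'tensor' and 'tensro'")
    print(TinyEmbeddingBag()("tensor"))
```

In a real model the rows are trained together with the RNN on top, and the table size times the embedding size is what mostly determines the weight file size.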




A new model for word embeddings that are resilient to misspellings

Misspelling Oblivious Embeddings (MOE) is a new model for word embeddings that are resilient to misspellings, improving the ability to apply word embeddings to real-world situations, where misspellings are common.

snakers4 (Alexander), August 09, 03:27

PyTorch 1.2 release


Key features:

- Tensorboard logging is now out of beta;

- They continue improving JIT and ONNX;

- nn.Transformer is a layer now;

- Looks like SyncBn is also more or less stable;

- nn.Embedding: support float16 embeddings on CUDA;

- AdamW;

- Numpy compatibility;



Tensors and Dynamic neural networks in Python with strong GPU acceleration - pytorch/pytorch

snakers4 (Alexander), August 02, 11:19

Managing your DS / ML environment neatly and in style

If you have a sophisticated environment that you need to do DS / ML / DL, then using a set of Docker images may be a good idea.

You can also tap into a vast community of amazing and well-maintained Dockerhub repositories (e.g. nvidia, pytorch).

But what if you have to do this for several people? And use it with a proper IDE via ssh?

Well-known features of Docker include copy-on-write and user "forwarding". If you approach this naively, each user will store their own images, which takes quite some space.

You also have to make your ssh daemon work inside the container as a second service.

So I solved these "challenges" and created 2 public layers so far:

- Basic DS / ML layer - FROM aveysov/ml_images:layer-0 - from dockerfile;

- DS / ML libraries - FROM aveysov/ml_images:layer-0 - from dockerfile;

Your final dockerfile may look something like this just pulling from any of those layers.

Note that when building this, you will need to pass your UID as a variable, e.g.:

docker build --build-arg NB_UID=1000 -t av_final_layer -f Layer_final.dockerfile .

When launched, this starts a notebook with extensions. You can just exec into the machine itself to run scripts, or use an ssh daemon inside (do not forget to add your ssh key and run service ssh start).




Contribute to snakers4/gpu-box-setup development by creating an account on GitHub.

Using public Dockerhub account for your private small scale deploy

Also a lifehack - you can just use Dockerhub for your private stuff, just separate the public part and the private part.

Push the public part (i.e. libraries and frameworks) to Dockerhub.

Your private Dockerfile will then be something like:

FROM your_user/your_repo:latest

COPY your_app_folder your_app_folder


CMD ["python3", ""]

snakers4 (Alexander), July 29, 13:17

2019 DS / ML digest 13


Highlights of the week(s):

- x10 faster STT network?

- Train on 1/2 of test resolution - new down-to-earth SOTA approach to image classification? Old news!;

- New workhorse light-weight network - MixNet?



snakers4 (Alexander), July 18, 04:55

An ideal remote IDE?


No, looks like VScode recently got its remote development extensions (it was only in insiders build a couple of months ago) working just right.

I tried remote-ssh extension and it looks quite polished. No syncing your large data folders and loading all python dependencies locally for hours.

The problem? It took me an hour just to open an ssh session under Windows properly (permissions and Linux folder path substitution are hell on Windows). Once I opened it, it worked like a charm.

So for now (this is personal), the best tools in my opinion are:

- Notebooks - for exploration and testing;

- VScode for codebase;

- Atom - for local scripts;


Visual Studio Code Remote Development

snakers4 (Alexander), July 16, 12:53

Full IDE in a browser?


You all know all the pros and cons of:

- IDEs (PyCharm);

- Advanced text editors (Atom, Sublime Text);

- Interactive environments (notebook / lab, Atom + Hydrogen);

I personally dislike local IDEs, and not because connecting to a remote machine / remote kernel / remote interpreter is a bit of a chore. Setting up is easy, but constantly thinking about what is synced and what is not is just a pain. Also, when your daily driver machine runs Windows, using the Linux subsystem all the time with Windows paths is just a pain. (I also dislike bulky interfaces, but this is just a habit and it depends.)

But what if I told you there is a third option? =)

If you work as a team on a remote machine / set of machines?

TLDR - you can run a modern web "IDE" (it is something between Atom and a real IDE - less bulky, but with fewer features) in a browser.

Now you can just run it with one command.


- It is open source (though shipped as a part of some enterprise packages like Eclipse Che);

- Pre-built images available;

- It is extensible - new modules get released - you can build it yourself or just find a build;

- It has extensive linting, python language server (just a standard library though);

- It has full text search ... kind of;

- Follow definition in your code;

- Docstrings and auto-complete work for your modules and the standard library (not for your packages);

Looks cool af!

If they ship a build with a remote python kernel, then it will be a perfect option for teams!

I hope it will not follow the path taken by another crowd-favourite web editor of a similar kind (it was purchased by Amazon).


- Website;

- Pre-built apps for python;

- Language server they are using;


Theia - Cloud and Desktop IDE

Theia is an open-source cloud & desktop IDE framework implemented in TypeScript.

If you know how to add your python kernel to Theia - please ping me)

snakers4 (Alexander), July 15, 04:48

Trying to migrate to JupyterLab from Jupyter Notebook?

Some time ago I noticed that the Jupyter extensions project was more or less frozen => JupyterLab obviously is trying to shift community attention to npm / nodejs plugins.

Again (like 6-12 months ago) I tried to do this.

This time Lab is more mature:

- Now at version >1;

- Now they have built-in package manager;

- They have some of the most necessary extensions (i.e. git, toc, google drive, etc);

- UI got polished a bit, but window in a window still produces a bit of mental friction. Only the most popular file formats are supported. Text editor inherited the best features, but it is still a bit rudimentary;

- Full screen width by default;

- Some useful things (like code folding) are now turned on in the settings json file;

- Using these extensions is a bit of a chore in edge cases (i.e. some user permission problems / you have to re-build the app each time you add an extension);

But I could not switch mostly for one reason - this one


If you have a Jupyter environment it is very easy to switch. For me, before it was:

# 5.6 because otherwise I have a bug with installing extensions
RUN conda install notebook=5.6

RUN pip install git+ && \
jupyter contrib nbextension install --user

CMD jupyter notebook --port=8888 --ip= --no-browser

And it just became:

RUN conda install -c conda-forge jupyterlab

CMD jupyter lab --port=8888 --ip= --no-browser


Support collapsible hierarchy of sections · Issue #2275 · jupyterlab/jupyterlab

Allow users to toggle open/close sections by clicking on some kind of UI element. This helps with navigating and organizing large notebooks.

snakers4 (Alexander), July 12, 05:25

Installing apex ... in style )

Sometimes you just need to try fp16 training (GANs, large networks, rare cases).

There is no better way to do this than use Nvidia's APEX library.

Luckily - they have very nice examples:


Well ... it installs on a clean machine, but I want my environment to work with this always)

So, I ploughed through all the conda / environment setup mumbo-jumbo and created a version of our deep-learning / ds dockerfile, but now installing from the pytorch image (pytorch GPU / CUDA / CUDNN + APEX).


It was kind of painful, because PyTorch images already contain conda / pip and it was not apparent at first, causing all sorts of problems with my miniconda installation.

So use it and please report if it is still buggy.




A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch - NVIDIA/apex

Logging your hardware, with logs, charts and alerts - in style

TLDR - we have been looking for THE software to do this easily, with charts / alerts / easy install.

We found prometheus. Configuring alerts was a bit of a problem, but enjoy:





Contribute to snakers4/gpu-box-setup development by creating an account on GitHub.

snakers4 (Alexander), July 09, 11:29

Yeah, scraping image labels from Google / other social networks is a really cool idea ...

Forwarded from Just links:

Ross Wightman

@facebookai ResNeXt models pre-trained on Instagram hashtags stand out in their ability to generalized to the 'ImageNetV2' test set. Thanks @beccaroelofs @vaishaal @beenwrekt @lschmidt3 for a useful dataset. #PyTorch

snakers4 (Alexander), July 03, 13:57

2019 DS / ML digest 12

Highlights of the week(s)

- Cool STT papers;

- End of AI hype?

- How to download tons of images from Google;



2019 DS/ML digest 12


snakers4 (Alexander), July 02, 09:15

A cool old paper - FCN text detector

They were using multi-layer masks for better semantic segmentation supervision before it was mainstream.

Very cool!

Too bad such models are a commodity now, you can just use pre-trained)


EAST: An Efficient and Accurate Scene Text Detector

Previous approaches for scene text detection have already achieved promising performances across various benchmarks. However, they usually fall short when dealing with challenging scenarios, even...

snakers4 (Alexander), July 02, 07:34

New version of our open STT dataset - 0.5, now in beta

Please share and repost!

What is new?

- A new domain - radio (1000+ new hours);

- A larger YouTube dataset with 1000+ additional hours;

- A small (300 hours) YouTube dataset downloaded in maximum quality;

- Ground truth validation sets for YouTube / books / public calls manually annotated;

- Now we will start to focus on actually cleaning and distilling the dataset. We have published a second list of "bad" data;

I'm back from vacation)





Russian open STT dataset. Contribute to snakers4/open_stt development by creating an account on GitHub.

snakers4 (Alexander), June 21, 06:47

Forwarded from Just links:

I've uploaded the weights I've got so far (73.53% top-1)

snakers4 (Alexander), June 17, 12:35

This AI Makes Amazing DeepFakes…and More
Check out Lambda Labs here: 📝 The paper "Deferred Neural Rendering: Image Synthesis using Neural Textures" is available here: h...

snakers4 (Alexander), June 02, 05:20

Forwarded from Just links:


A PyTorch implementation of EfficientNet. Contribute to lukemelas/EfficientNet-PyTorch development by creating an account on GitHub.

snakers4 (Alexander), May 27, 08:43

2019 DS / ML digest 11

Highlights of the week(s)

- New attention block for CV;

- Reducing the amount of data for CV 10x?;

- Brain-to-CNN interfaces start popping up in the mainstream;



2019 DS/ML digest 11


snakers4 (Alexander), May 27, 08:17

Do not use AllenNLP though

Forwarded from Neural Networks Engineering:

Have finished building demo and landing page for my project on mention classification. The idea of this project is to create a model which can assign some labels to objects based on their mentions in context. Right now it works only for people mentions, but if I find interest in this work, I will extend the model to other types like organizations or events. For now, you can check out the online demo of the neural network.

The current implementation can take account of several mentions at a time, so it can distinguish relevant parts of the context, not just averaging prediction.

It's also open sourced, and built with AllenNLP framework from training to serving. Take a look at it.

More technical details of implementation coming later.

snakers4 (Alexander), May 24, 09:20

Audio noise reduction libraries that really work in the wild

Spectral gating

It works. But you need a sample of your noise.

It will work well out of the box for larger files / files with gaps, where you can pay attention to each file and select a part of the file that acts as a noise example.
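For intuition, a heavily simplified sketch of what spectral gating does with that noise sample: estimate a per-frequency threshold from the noise clip, then zero out every frequency bin of the signal that stays below it. This is not the linked library's code. The naive DFT, the non-overlapping rectangular frames, the mean + 2 std threshold and all function names are assumptions chosen to keep the example self-contained. Real implementations use an overlapping windowed STFT and smooth the mask.

```python
import cmath
import math
import random


def dft(frame):
    """Naive discrete Fourier transform (fine for tiny frames)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, keeping only the real part of the result."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def frames(signal, size):
    """Non-overlapping frames; a trailing partial frame is dropped."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]


def noise_thresholds(noise_clip, size, n_std=2.0):
    """Per-bin threshold: mean + n_std * std of the noise magnitudes."""
    mags = [[abs(c) for c in dft(f)] for f in frames(noise_clip, size)]
    if not mags:
        raise ValueError("noise clip is shorter than one frame")
    thresholds = []
    for k in range(size):
        column = [m[k] for m in mags]
        mean = sum(column) / len(column)
        std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
        thresholds.append(mean + n_std * std)
    return thresholds


def spectral_gate(signal, noise_clip, size=64, n_std=2.0):
    """Zero out the bins whose magnitude does not exceed the noise threshold."""
    thresholds = noise_thresholds(noise_clip, size, n_std)
    cleaned = []
    for frame in frames(signal, size):
        spectrum = dft(frame)
        gated = [c if abs(c) >= thr else 0j for c, thr in zip(spectrum, thresholds)]
        cleaned.extend(idft(gated))
    return cleaned


if __name__ == "__main__":
    rng = random.Random(0)
    tone = [math.sin(2 * math.pi * 4 * t / 64) for t in range(8 * 64)]
    noisy = [s + rng.gauss(0.0, 0.1) for s in tone]
    noise_only = [rng.gauss(0.0, 0.1) for _ in range(16 * 64)]
    print(len(spectral_gate(noisy, noise_only)))
```

The sketch also shows why the noise sample matters: the thresholds are only as good as the clip they were estimated from.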

RNNoise: Learning Noise Suppression

Works with any arbitrary noise. Just feed your file.

It works more like an adaptive equalizer.

It filters noise when there is no speech.

But it mostly does not change the audio when speech is present.

As the authors explain, it improves the SNR overall and makes the sound less "tiring" to listen to.

Description / blog posts



Step-by-step instructions in python





Noise reduction / speech enhancement for python using spectral gating - timsainb/noisereduce

snakers4 (Alexander), May 22, 15:06

Few-Shot Adversarial Learning of Realistic Neural Talking Head Models
Statement regarding the purpose and effect of the technology (NB: this statement reflects personal opinions of the authors and not of their organizations) We...
