Spark in me - Internet, data science, math, deep learning, philosophy

snakers4 @ telegram, 1818 members, 1744 posts since 2016

All this - lost like tears in rain.

Data science, ML, a bit of philosophy and math. No bs.

Our website
- http://spark-in.me
Our chat
- https://t.me/joinchat/Bv9tjkH9JHbxiV5hr91a0w
DS courses review
- http://goo.gl/5VGU5A
- https://goo.gl/YzVUKf

Posts by tag «deep_learning»:

snakers4 (Alexander), May 27, 08:43

2019 DS / ML digest 11

Highlights of the week(s)

- New attention block for CV;

- Reducing the amount of data for CV 10x?;

- Brain-to-CNN interfaces start popping up in the mainstream;

spark-in.me/post/2019_ds_ml_digest_11

#digest

#deep_learning

2019 DS/ML digest 11

2019 DS/ML digest 11. Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), May 24, 09:20

Audio noise reduction libraries that really work in the wild

Spectral gating

github.com/timsainb/noisereduce

It works. But you need a sample of your noise.

Will work well out of the box for larger files / files with gaps, where you can pay attention to each file and select a part of it to act as the noise example.
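A minimal usage sketch, assuming a sufficiently recent noisereduce version (argument names have changed between releases, so check the README); the file names are just placeholders:

import librosa
import noisereduce as nr
import soundfile as sf

# load the recording and a short clip that contains only the noise
audio, sr = librosa.load("recording.wav", sr=None)
noise_clip, _ = librosa.load("noise_only.wav", sr=sr)

# spectral gating: suppress whatever matches the noise profile
denoised = nr.reduce_noise(y=audio, sr=sr, y_noise=noise_clip)
sf.write("denoised.wav", denoised, sr)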

RNNoise: Learning Noise Suppression

Works with any arbitrary noise. Just feed your file.

It works more like an adaptive equalizer.

It filters noise when there is no speech.

But it mostly does not change audio when speech is present.

As the authors explain, it improves SNR overall and makes the sound less "tiring" to listen to.

Description / blog posts

- people.xiph.org/~jm/demo/rnnoise/

- github.com/xiph/rnnoise

Step-by-step instructions in python

- github.com/xiph/rnnoise/issues/69
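Roughly, the recipe from the issue above boils down to converting your audio to raw 48 kHz 16-bit mono PCM, running the compiled demo binary, and converting back. A rough sketch (assumes rnnoise is built and ffmpeg is installed; the demo path and CLI may differ between revisions):

import subprocess

def denoise(in_wav, out_wav, demo="./rnnoise/examples/rnnoise_demo"):
    # rnnoise_demo expects raw 16-bit mono PCM at 48 kHz
    subprocess.run(["ffmpeg", "-y", "-i", in_wav, "-f", "s16le",
                    "-ac", "1", "-ar", "48000", "noisy.raw"], check=True)
    subprocess.run([demo, "noisy.raw", "denoised.raw"], check=True)
    subprocess.run(["ffmpeg", "-y", "-f", "s16le", "-ac", "1", "-ar", "48000",
                    "-i", "denoised.raw", out_wav], check=True)

denoise("noisy_speech.wav", "clean_speech.wav")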

#audio

#deep_learning

timsainb/noisereduce

Noise reduction / speech enhancement for python using spectral gating - timsainb/noisereduce


snakers4 (Alexander), May 20, 06:21

New in our Open STT dataset

github.com/snakers4/open_stt#updates

- An mp3 version of the dataset;

- A torrent for mp3 dataset;

- A torrent for the original wav dataset;

- Benchmarks on the public dataset / files with "poor" annotation marked;

#deep_learning

#data_science

#dataset

snakers4/open_stt

Russian open STT dataset. Contribute to snakers4/open_stt development by creating an account on GitHub.


snakers4 (Alexander), May 14, 03:40

2019 DS / ML digest 10

Highlights of the week(s)

- New MobileNet;

- New PyTorch release;

- Practical GANs?;

spark-in.me/post/2019_ds_ml_digest_10

#digest

#deep_learning

2019 DS/ML digest 10

2019 DS/ML digest 10. Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), May 09, 11:28

Habr.com / TowardsDataScience post for our dataset

In addition to a github release and a medium post, we also made a habr.com post:

- habr.com/ru/post/450760/

Also our post was accepted to the editor's picks section of TDS:

- bit.ly/ru_open_stt

Share / give us a star / clap if you have not already!

Original release

github.com/snakers4/open_stt/

#deep_learning

#data_science

#dataset

A huge open dataset of Russian speech

Speech recognition specialists have long lacked a large open corpus of spoken Russian, which is why only large companies could afford to...


snakers4 (Alexander), May 09, 10:51

PyTorch DP / DDP / model parallel

Finally they made proper tutorials:

- pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html

- pytorch.org/tutorials/intermediate/model_parallel_tutorial.html

- pytorch.org/tutorials/intermediate/ddp_tutorial.html

Model parallel = have parts of the same model on different devices

Data Parallel (DP) = wrapper to use multi-GPU within a single parent process

Distributed Data Parallel = multiple processes are spawned across cluster / on the same machine
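A minimal sketch of the difference between the two wrappers (single-node case; see the tutorials above for the full launch procedure):

import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# DataParallel: one parent process, each batch is split across the visible GPUs
dp_model = nn.DataParallel(model.cuda())

# DistributedDataParallel: one process per GPU, started e.g. via
# python -m torch.distributed.launch; inside each process you do roughly:
# torch.distributed.init_process_group(backend="nccl")
# ddp_model = nn.parallel.DistributedDataParallel(model.cuda(local_rank),
#                                                 device_ids=[local_rank])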

#deep_learning

The State of ML, end of 2018, in Russian

Quite a down-to-earth and clever lecture

www.youtube.com/watch?v=l6djLCYnOKw

Some nice examples for TTS and some interesting forecasts (some of them happened already).

#deep_learning

Sergey Markov: "Artificial intelligence and machine learning: the results of 2018."
The lecture took place at the popular science lecture hall of the "Arkhe" center (http://arhe.msk.ru) on January 16, 2019. Lecturer: Sergey Markov, the author of one of the strongest Russian...

snakers4 (Alexander), May 03, 08:58

PyTorch

PyTorch 1.1

github.com/pytorch/pytorch/releases/tag/v1.1.0

- Tensorboard (beta) - see the sketch below;

- DistributedDataParallel new functionality and tutorials;

- Multi-headed attention;

- EmbeddingBag enhancements;

- Other cool, but more niche features:

- nn.SyncBatchNorm;

- optim.lr_scheduler.CyclicLR;
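A minimal sketch of the new native TensorBoard support (the writer lives in torch.utils.tensorboard; run tensorboard --logdir runs to view the curves):

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/demo")
for step in range(100):
    writer.add_scalar("train/loss", 1.0 / (step + 1), step)  # tag, value, global step
writer.close()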

#deep_learning

pytorch/pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration - pytorch/pytorch


snakers4 (Alexander), May 02, 06:02

Russian Open Speech To Text (STT/ASR) Dataset

4000 hours of STT data in Russian

Made by us. Yes, really. I am not joking.

It was a lot of work.

The dataset:

github.com/snakers4/open_stt/

Accompanying post:

spark-in.me/post/russian-open-stt-part1

TLDR:

- As of the third release, we have ~4000 hours;

- Contributors and help wanted;

- Let's bring the ImageNet moment for STT closer, together!;

Please repost this as much as you can.

#stt

#asr

#data_science

#deep_learning

snakers4/open_stt

Russian open STT dataset. Contribute to snakers4/open_stt development by creating an account on GitHub.


snakers4 (Alexander), May 02, 05:41

Poor man's computing cluster

So, when I last checked, Amazon's p3.4xlarge instances cost around US$12 per hour (unless you reserve them for a year). A tower supercomputer from Nvidia costs probably US$40-50k or more (it was announced at around US$69k).

It is not difficult to crunch the numbers and see that 1 month of renting such a machine would cost at least US$8-10k. Also there will be the additional cost / problem of actually storing your large datasets. When I last used Amazon - their cheap storage was sloooooow, and fast storage was prohibitively expensive.
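A quick back-of-the-envelope check of the renting figure, using the hourly price above:

hourly_rate = 12                       # US$ per hour for the instance above
hours_per_month = 24 * 30
print(hourly_rate * hours_per_month)   # 8640 -> roughly US$8-10k per month, before storage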

So, why am I saying this?

Let's assume (according to my miner friends' experience) - that consumer Nvidia GPUs can work 2-3 years non-stop given proper cooling and care (test before buying!). Also let's assume that 4xTesla V100 is roughly the same as 7-8 * 1080Ti.

Yeah, I know that you will point out at least one reason why this does not hold, but for practical purposes this is fine (yes, I know that Teslas have some cool features like Nvlink).

Now here is the kicker - modern professional motherboards often boast 2-3 Ethernet ports. And sometimes you can even get 2x10Gbit/s ports (!!!).

It means that you can actually connect at least 2 machines (or maybe you can daisy-chain them?) into a computing cluster.

Now let's crunch the numbers

According to quotes I collected through the years, you can build a cluster roughly equivalent to Amazon's p3.4xlarge for US$10k (but with storage!) with used GPUs (miners sell them like crazy now). If you buy second-hand drives, motherboards and CPUs, you can lower the cost to US$5k or less.

So, a US$10k cluster that would serve you at least one year (if you test everything properly and take care of it) is roughly equivalent to:

- 20-25% of DGX desktop;

- 1 month of renting on Amazon;

Assuming that all the hardware will just break in a year:

- It is 4-5x cheaper than buying from Nvidia;

- It is 10x cheaper than renting;

If you buy everything used, then it is 10x and 20x cheaper!

I would buy that for a dollar!

Ofc you have to invest your free time.

See my calculations here:

bit.ly/spark00001

#deep_learning

#hardware

computing_cluster

Spreadsheet with the parts list and quotes per server (Thermaltake Core X9 Black case, Gigabyte X399 AORUS XTREME motherboard on Socket TR4 / AMD X399 with 2x1Gbit/s + 10Gbit/s networking, etc.; prices quoted in RUR at 65 RUR/USD).


snakers4 (Alexander), April 22, 11:44

Cool docker feature

View aggregate load stats by container

docs.docker.com/engine/reference/commandline/stats/

#linux

docker stats

Description Display a live stream of container(s) resource usage statistics Usage docker stats [OPTIONS] [CONTAINER...] Options Name, shorthand Default Description --all , -a Show all containers (default shows just running)...


2019 DS / ML digest 9

Highlights of the week

- Stack Overflow survey;

- Unsupervised STT (ofc not!);

- A mix between detection and semseg?;

spark-in.me/post/2019_ds_ml_digest_09

#digest

#deep_learning

2019 DS/ML digest 09

2019 DS/ML digest 09. Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), April 14, 06:59

PyTorch DataParallel scalability

TLDR - it works fine for 2-3 GPUs.

For more GPUs - use DDP.

github.com/NVIDIA/sentiment-discovery/blob/master/analysis/scale.md

github.com/SeanNaren/deepspeech.pytorch/issues/211

#deep_learning

NVIDIA/sentiment-discovery

Unsupervised Language Modeling at scale for robust sentiment classification - NVIDIA/sentiment-discovery


snakers4 (Alexander), April 09, 06:00

2019 DS / ML digest number 8

Highlights of the week

- Transformer from Facebook with sub-word information;

- How to generate endless sentiment annotation;

- 1M breast cancer images;

spark-in.me/post/2019_ds_ml_digest_08

#digest

#deep_learning

2019 DS/ML digest 08

2019 DS/ML digest 08. Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), March 31, 12:44

Miniaturize / optimize your ... NLP models?

For CV applications there are literally dozens of ways to make your models smaller.

And yeah, I do not mean some "moonshots" or special limited libraries (matrix decompositions, some custom pruning, etc etc).

I mean cheap and dirty hacks that work in 95% of cases regardless of your stack / device / framework:

- Smaller images (x3-x4 easy);

- FP16 inference (30-40% maybe);

- Knowledge distillation into smaller networks (x3-x10);

- Naïve cascade optimizations (feed only Nth frame using some heuristic);

But what can you do with NLP networks?

Turns out not much.

But here are my ideas:

- Use a simpler model - embedding bag + plain self-attention + LSTM can solve 90% of tasks (see the sketch after this list);

- Decrease embedding size from 300 to 50 (or maybe even more). Tried and tested, works like a charm. For harder tasks you lose just 1-3pp of your target metric, for smaller tasks - it is just the same;

- FP16 inference is supported in PyTorch for nn.Embedding, but not for nn.EmbeddingBag. But you get the idea;

_embedding_bag is not implemented for type torch.HalfTensor

- You can try distilling your vocabulary / embedding-bag model into a char level model. If it works, you can trade model size vs. inference time;

- If you have very long sentences or large batches - try distilling / swapping your recurrent network with a CNN / TCN. This way you can also trade model size vs. inference time but probably in a different direction;
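A minimal sketch of the "simpler model" idea from the first bullet (embedding bag + LSTM; the self-attention part is omitted and all sizes are placeholders):

import torch
import torch.nn as nn

class SmallClassifier(nn.Module):
    # embedding bag over sub-word / n-gram tokens + a small LSTM on top
    def __init__(self, vocab_size=50000, emb_dim=50, hidden=128, n_classes=10):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, emb_dim, mode="mean")
        self.rnn = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len, sub-word tokens per position)
        b, s, t = token_ids.shape
        emb = self.emb(token_ids.view(b * s, t)).view(b, s, -1)
        out, _ = self.rnn(emb)
        return self.head(out[:, -1])  # logits from the last timestep

logits = SmallClassifier()(torch.randint(0, 50000, (4, 16, 5)))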

#nlp

#deep_learning

snakers4 (Alexander), March 26, 15:30

Dockerfile

Updated my DL/ML dockerfile with

- cuda 10

- PyTorch 1.0

github.com/snakers4/gpu-box-setup/

TF now also works with cuda 10

#deep_learning

snakers4/gpu-box-setup

Contribute to snakers4/gpu-box-setup development by creating an account on GitHub.


snakers4 (Alexander), March 25, 05:31

Good old OLS regression

I needed some quick boilerplate to create an OLS regression with confidence intervals for a very plain task.

Found some nice statsmodels examples here:

www.statsmodels.org/devel/examples/notebooks/generated/ols.html
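A minimal sketch along those lines (synthetic data, 95% confidence intervals):

import numpy as np
import statsmodels.api as sm

x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + np.random.normal(scale=2.0, size=x.shape)

X = sm.add_constant(x)               # adds the intercept column
results = sm.OLS(y, X).fit()

print(results.summary())             # coefficients, R^2, std errors
print(results.conf_int(alpha=0.05))  # 95% confidence intervals for the parameters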

#data_science

2019 DS / ML digest number 7

Highlights of the week

- NN normalization techniques (not batch norm);

- Jetson nano for US$99 released;

- A bitter lesson in AI;

spark-in.me/post/2019_ds_ml_digest_07

#digest

#deep_learning

2019 DS/ML digest 07

2019 DS/ML digest 07. Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), March 21, 11:15

Normalization techniques other than batch norm:

(pics.spark-in.me/upload/aecc2c5fb356b6d803b4218fcb0bc3ec.png)

Weight normalization (used in TCN arxiv.org/abs/1602.07868):

- Decouples length of weight vectors from their direction;

- Does not introduce any dependencies between the examples in a minibatch;

- Can be applied successfully to recurrent models such as LSTMs;

- Tested only on small datasets (CIFAR + VAEs + DQN);

Instance norm (used in [style transfer](arxiv.org/abs/1607.08022))

- Proposed for style transfer;

- Essentially it is batch norm for a single image;

- The mean and standard-deviation are calculated per-dimension separately for each object in a mini-batch;

Layer norm (used in Transformers, [paper](arxiv.org/abs/1607.06450))

- Designed especially for sequential networks;

- Computes the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case;

- The mean and standard deviation are calculated separately over the last several dimensions (those specified by normalized_shape);

- Unlike Batch Normalization and Instance Normalization, which applies scalar scale and bias for each entire channel/plane with the affine option, Layer Normalization applies per-element scale and bias;
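How these look in PyTorch, side by side (shapes are placeholders; weight norm is a reparametrization of a layer's weights rather than an activation statistic):

import torch
import torch.nn as nn

x = torch.randn(8, 16, 32, 32)            # (batch, channels, H, W)

batch_norm = nn.BatchNorm2d(16)            # statistics over the whole batch, per channel
instance_norm = nn.InstanceNorm2d(16)      # statistics per image, per channel
layer_norm = nn.LayerNorm([16, 32, 32])    # statistics over the last dims, per-element affine

weight_normed_conv = nn.utils.weight_norm(nn.Conv2d(16, 16, 3, padding=1))

for layer in (batch_norm, instance_norm, layer_norm, weight_normed_conv):
    print(layer(x).shape)                  # all keep the (8, 16, 32, 32) shape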

#deep_learning

#nlp

snakers4 (Alexander), March 18, 06:18

6th 2019 DS / ML digest

Highlights of the week

- Cool python features;

- Google's on-device STT;

- Why Facebook invested so much in PyTorch 1.0;

spark-in.me/post/2019_ds_ml_digest_06

#digest

#data_science

#deep_learning

2019 DS/ML digest 06

2019 DS/ML digest 06. Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), March 17, 15:40

New large dataset for your GAN or pix2pix pet project

500k fashion images + meta-data + landmarks

github.com/switchablenorms/DeepFashion2

#deep_learning

switchablenorms/DeepFashion2

DeepFashion2 Dataset https://arxiv.org/pdf/1901.07973.pdf - switchablenorms/DeepFashion2


snakers4 (Alexander), March 07, 15:42

Our experiments with Transformers, BERT and generative language pre-training

TLDR

For morphologically rich languages pre-trained Transformers are not a silver bullet, and from a layman's perspective they are not feasible unless someone invests huge computational resources into sub-word tokenization methods that work well + into actually training these large networks.

On the other hand we have definitively shown that:

- Starting a Transformer with an embedding bag initialized via FastText works and is relatively feasible;

- On complicated tasks such a Transformer significantly outperforms training from scratch (as well as naive models) and shows decent results compared to state-of-the-art specialized models;

- Pre-training worked, but it overfitted more than FastText initialization, and given the complexity required for such pre-training - it is not useful;

spark-in.me/post/bert-pretrain-ru

All in all this was a relatively large gamble that did not pay off - the Transformer did not excel at the more down-to-earth task we hoped it would.

#deep_learning

Complexity / generalization / computational cost in modern applied NLP for morphologically rich languages

Complexity / generalization / computational cost in modern applied NLP for morphologically rich languages. Towards a new state of the art? Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


An approach to ranking search results with no annotation

Just a small article with a novel idea:

- Instead of training a network with CE - just train it with BCE (sketch below);

- Source additional structure from the inner structure of your domain (tags, matrix decomposition methods, heuristics, etc);

spark-in.me/post/classifier-result-sorting

Works best if your ontology is relatively simple.
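A minimal sketch of the CE -> BCE swap from the first bullet (each class gets its own sigmoid, so the per-class scores can be used / sorted independently):

import torch
import torch.nn as nn

n_classes, batch = 10, 4
logits = torch.randn(batch, n_classes)

# classic single-label setup: one correct class index per example
ce = nn.CrossEntropyLoss()(logits, torch.randint(0, n_classes, (batch,)))

# BCE setup: targets are multi-hot vectors, one independent sigmoid per class
targets = torch.zeros(batch, n_classes)
targets[torch.arange(batch), torch.randint(0, n_classes, (batch,))] = 1.0
bce = nn.BCEWithLogitsLoss()(logits, targets)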

#deep_learning

Learning to rank search results without annotation

Solving the search ranking problem. Author's articles - http://spark-in.me/author/adamnsandle Blog - http://spark-in.me


snakers4 (Alexander), March 07, 11:21

Inception v1 layers visualized on a map

A joint work by Google and OpenAI:

distill.pub/2019/activation-atlas/

distill.pub/2019/activation-atlas/app.html

blog.openai.com/introducing-activation-atlases/

ai.googleblog.com/2019/03/exploring-neural-networks.html

TLDR:

- Take 1M random images;

- Feed to a CNN, collect some spatial activation;

- Produce a corresponding idealized image that would result in such an activation;

- Plot in 2D (via UMAP), add grid, averaging, etc etc;

#deep_learning

Activation Atlas

By using feature inversion to visualize millions of activations from an image classification network, we create an explorable activation atlas of features the network has learned and what concepts it typically represents.


snakers4 (Alexander), March 07, 09:58

Russian STT datasets

Does anyone know more proper datasets?

I found this (60 hours), but I could not find the link to the dataset:

www.lrec-conf.org/proceedings/lrec2010/pdf/274_Paper.pdf

Anyway, here is the list I found:

- 20 hours of Bible github.com/festvox/datasets-CMU_Wilderness;

- www.kaggle.com/bryanpark/russian-single-speaker-speech-dataset - does not say how many hours

- Ofc audio book datasets - www.caito.de/data/Training/stt_tts/ plus some scraping scripts github.com/ainy/shershe/tree/master/scripts

- And some disappointment here voice.mozilla.org/ru/languages

#deep_learning

Download 274_Paper.pdf 0.31 MB

snakers4 (Alexander), March 07, 06:47

PyTorch internals

speakerdeck.com/perone/pytorch-under-the-hood

#deep_learning

PyTorch under the hood

Presentation about PyTorch internals presented at the PyData Montreal in Feb 2019.


snakers4 (Alexander), March 06, 10:31

5th 2019 DS / ML digest

Highlights of the week

- New Adam version;

- POS tagging and semantic parsing in Russian;

- ML industrialization again;

spark-in.me/post/2019_ds_ml_digest_05

#digest

#data_science

#deep_learning

2019 DS/ML digest 05

2019 DS/ML digest 05. Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), February 28, 07:16

LSTM vs TCN vs Trellis network

- Did not try the Trellis network - decided it was too complex;

- All the TCN properties from the digest spark-in.me/post/2018_ds_ml_digest_31 hold - did not test for very long sequences;

- Looks like a really simple and reasonable alternative to RNNs for modeling and ensembling;

- On a sensible benchmark it performs mostly the same as an LSTM from a practical standpoint;

github.com/locuslab/TCN/blob/master/TCN/tcn.py
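A minimal usage sketch of the TemporalConvNet from the linked repo (note the (batch, features, time) input layout, unlike an LSTM's (batch, time, features)):

import torch
from TCN.tcn import TemporalConvNet  # from the locuslab/TCN repo above

tcn = TemporalConvNet(num_inputs=64, num_channels=[128, 128, 128],
                      kernel_size=3, dropout=0.2)
x = torch.randn(8, 64, 100)   # (batch, input channels, sequence length)
y = tcn(x)                    # (batch, 128, 100) - per-timestep features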

#deep_learning

2018 DS/ML digest 31

2018 DS/ML digest 31. Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), February 27, 07:50

New variation of Adam?

- [Website](www.luolc.com/publications/adabound/);

- [Code](github.com/Luolc/AdaBound);

- Eliminate the generalization gap between adaptive methods and SGD;

- TL;DR: A Faster And Better Optimizer with Highly Robust Performance;

- Dynamic bound on learning rates. Inspired by gradient clipping;

- Not very sensitive to the hyperparameters, especially compared with SGD(M);

- Tested on MNIST, CIFAR, Penn Treebank - no serious datasets;
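A minimal usage sketch based on the repo's README (pip install adabound; defaults may change between versions):

import torch.nn as nn
import adabound

model = nn.Linear(10, 2)
# behaves like Adam early on, then the dynamic bounds squeeze the step sizes towards SGD with lr=final_lr
optimizer = adabound.AdaBound(model.parameters(), lr=1e-3, final_lr=0.1)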

#deep_learning

Adaptive Gradient Methods with Dynamic Bound of Learning Rate

Abstract Adaptive optimization methods such as AdaGrad, RMSProp and Adam have been proposed to achieve a rapid training process with an element-wise scaling term on learning rates. Though prevailing, they are observed to generalize poorly compared with Sgd or even fail to converge due to unstable and extreme learning rates. Recent work has put forward some algorithms such as AMSGrad to tackle this issue but they failed to achieve considerable improvement over existing methods.


snakers4 (Alexander), February 18, 09:24

4th 2019 DS / ML digest

Highlights of the week

- OpenAI controversy;

- BERT pre-training;

- Using transformer for conversational challenges;

spark-in.me/post/2019_ds_ml_digest_04

#digest

#data_science

#deep_learning

2019 DS/ML digest 04

2019 DS/ML digest 04. Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), February 17, 10:22

A bit of lazy Sunday admin stuff

Monitoring your CPU temperature with email notifications

- Change CPU temp to any metric you like

- Rolling log

- Sending an email only once if the metric becomes critical (you can add an email for when the metric becomes non-critical again)

gist.github.com/snakers4/cf0ffd57c3ef7f4e2e25f6b3347dcdec
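Not the gist itself - just a minimal sketch of the same idea using psutil (sensor names and the threshold depend on your hardware):

import psutil

def max_cpu_temp():
    temps = psutil.sensors_temperatures()    # e.g. {'coretemp': [...], 'k10temp': [...]}
    readings = [t.current for entries in temps.values() for t in entries]
    return max(readings) if readings else None

temp = max_cpu_temp()
if temp is not None and temp > 85:           # arbitrary threshold - tune for your CPU
    print(f"CPU is running hot: {temp:.1f} C - time to send that email")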

Setting up a GPU box on Ubuntu 18.04 from scratch

github.com/snakers4/gpu-box-setup/

#deep_learning

#linux

Plain temperature monitoring in Ubuntu 18.04

Plain temperature monitoring in Ubuntu 18.04. GitHub Gist: instantly share code, notes, and snippets.


snakers4 (Alexander), February 13, 09:02

PyTorch NLP best practices

Very simple ideas, actually.

(1) Multi GPU parallelization and FP16 training

Do not bother reinventing the wheel.

Just use nvidia's apex, DistributedDataParallel, DataParallel.

Best examples [here](github.com/huggingface/pytorch-pretrained-BERT).

(2) Put as much as possible INSIDE of the model

Implement as much of your logic as possible inside of nn.Module.

Why?

So that you can seamlessly use all the abstractions from (1) with ease.

Also models are more abstract and reusable in general.

(3) Why have a separate train/val loop?

PyTorch 0.4 introduced context managers.

You can simplify your train / val / test loops, and merge them into one simple function.

context = torch.no_grad() if loop_type=='Val' else torch.enable_grad()

if loop_type=='Train':
    model.train()
elif loop_type=='Val':
    model.eval()

with context:
    for i, (some_tensor) in enumerate(tqdm(train_loader)):
        # do your stuff here
        pass

(4) EmbeddingBag

Use EmbeddingBag layer for morphologically rich languages. Seriously!
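A minimal sketch of nn.EmbeddingBag: variable-length bags of sub-word / n-gram indices get averaged into one fixed-size vector per sample, with no padding needed:

import torch
import torch.nn as nn

bag = nn.EmbeddingBag(num_embeddings=100000, embedding_dim=50, mode="mean")

# two samples flattened into one index tensor; offsets mark where each sample starts
indices = torch.tensor([3, 17, 42, 7, 7, 99, 1])
offsets = torch.tensor([0, 3])      # sample 1 = indices[0:3], sample 2 = indices[3:]
vectors = bag(indices, offsets)     # shape: (2, 50)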

(5) Writing trainers / training abstractions

This is a waste of time imho if you follow (1), (2) and (3).

(6) Nice bonus

If you follow most of these, you can train on as many GPUs and machines as you want, for any language.

(7) Using tensorboard for logging

This goes without saying.

#nlp

#deep_learning

huggingface/pytorch-pretrained-BERT

📖The Big-&-Extending-Repository-of-Transformers: Pretrained PyTorch models for Google's BERT, OpenAI GPT & GPT-2, Google/CMU Transformer-XL. - huggingface/pytorch-pretrained-BERT


PyTorch DataLoader, GIL thrashing and CNNs

Well all of this seems a bit like magic to me, but hear me out.

I abused my GPU box for weeks running CNNs on 2-4 GPUs.

Nothing broke.

And then my GPU box started shutting down for no apparent reason.

No, this was not:

- CPU overheating (I have a massive cooler, I checked - it works);

- PSU;

- Overclocking;

- It also adds to the confusion that AMD has weird temperature readings;

To cut the story short - if you have a very fast Dataset class and you use PyTorch's DataLoader with num_workers > 0, it can lead to system instability instead of a speed-up.

It is obvious in retrospect, but it is not when you face this issue.
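For reference, the knob in question (a minimal sketch with a toy dataset):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 3, 224, 224), torch.randint(0, 10, (1024,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True,
                    num_workers=4,     # > 0 spawns worker processes; if the Dataset is already
                    pin_memory=True)   # very fast, compare against num_workers=0 first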

#deep_learning

#pytorch

snakers4 (Alexander), February 11, 06:22

Old news ... but Attention works

Funnily enough, in the past my models:

- Either did not need attention;

- Attention was implemented by @thinline72 ;

- The domain was so complicated (NMT) that I had to resort to boilerplate with key-value attention;

It was the first time I / we tried manually building a model with plain self attention from scratch.

And you know - it really adds 5-10% to all of the tracked metrics.

Best plain attention layer in PyTorch - simple, well documented ... and it works in real life applications:

gist.github.com/cbaziotis/94e53bdd6e4852756e0395560ff38aa4
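Not the gist above - just a minimal sketch of plain self-attention pooling over a sequence of hidden states (e.g. LSTM outputs):

import torch
import torch.nn as nn

class SelfAttentionPool(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, h, mask=None):
        # h: (batch, seq_len, hidden_dim)
        scores = self.score(h).squeeze(-1)              # (batch, seq_len)
        if mask is not None:                            # mask out padding positions
            scores = scores.masked_fill(~mask, float("-inf"))
        weights = torch.softmax(scores, dim=-1)         # attention distribution
        return (weights.unsqueeze(-1) * h).sum(dim=1)   # (batch, hidden_dim)

pooled = SelfAttentionPool(256)(torch.randn(4, 20, 256))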

#nlp

#deep_learning

SelfAttention implementation in PyTorch

SelfAttention implementation in PyTorch. GitHub Gist: instantly share code, notes, and snippets.


snakers4 (Alexander), February 08, 10:11

Third 2019 DS / ML digest

Highlights of the week

- quaternions;

- ODEs;

spark-in.me/post/2019_ds_ml_digest_03

#digest

#data_science

#deep_learning

2019 DS/ML digest 03

2019 DS/ML digest 03. Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me

