Spark in me - Internet, data science, math, deep learning, philo

snakers4 @ telegram, 1812 members, 1759 posts since 2016

All this - lost like tears in rain.

Data science, ML, a bit of philosophy and math. No bs.

Our website
- http://spark-in.me
Our chat
- https://t.me/joinchat/Bv9tjkH9JHbxiV5hr91a0w
DS courses review
- http://goo.gl/5VGU5A
- https://goo.gl/YzVUKf

snakers4 (Alexander), April 30, 09:27

Tricky rsync flags

Rsync is the best program ever.

I find these flags the most useful

--ignore-existing (ignores existing files)
--update (updates to newer versions of files based on timestamps)
--size-only (uses file-size to compare files)
-e 'ssh -p 22 -i /path/to/private/key' (use custom ssh identity)

Sometimes the first three flags can be confusing.
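For example, a typical sketch combining them (the host and paths here are placeholders):

rsync -avP --update -e 'ssh -p 22 -i /path/to/private/key' /local/data/ user@remote-host:/backup/data/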

#linux

More about STT from us as well ... soon)

Forwarded from Yuri Baburov:

The second experimental guest lecture of the course.

One of the course's seminar instructors, Yuri Baburov, will talk about speech recognition and working with audio.

May 1st at 8:40 Moscow time (12:40 Novosibirsk time, 10:40 PM PST on April 30th).

Deep Learning на пальцах 11 - Аудио и Speech Recognition (Юрий Бабуров)

www.youtube.com/watch?v=wm4H2Ym33Io

Deep Learning на пальцах 11 - Аудио и Speech Recognition (Юрий Бабуров)
Course: http://dlcourse.ai

snakers4 (Alexander), April 22, 11:44

Cool docker feature

View aggregate load stats by container

docs.docker.com/engine/reference/commandline/stats/
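For example, a one-off snapshot instead of the live stream (the format string is just an illustration):

docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"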

#linux

docker stats

Description Display a live stream of container(s) resource usage statistics Usage docker stats [OPTIONS] [CONTAINER...] Options Name, shorthand Default Description --all , -a Show all containers (default shows just running)...


2019 DS / ML digest 9

Highlights of the week

- Stack Overflow survey;

- Unsupervised STT (ofc not!);

- A mix between detection and semseg?;

spark-in.me/post/2019_ds_ml_digest_09

#digest

#deep_learning

2019 DS/ML digest 09

2019 DS/ML digest 09. Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), April 17, 08:55

Archive team ... makes monthly Twitter archives

With all the BS around politics / "Russian hackers" / the Arab spring - Twitter has now closed its developer API.

No problem.

Just pay a visit to the Archive Team page

archive.org/details/twitterstream?and[]=year%3A%222018%22

Donate to them here

archive.org/donate/

#data_science

#nlp


Archive Team: The Twitter Stream Grab : Free Web : Free Download, Borrow and Streaming : Internet Archive

A simple collection of JSON grabbed from the general twitter stream, for the purposes of research, history, testing and memory. This is the Spritzer version, the most light and shallow of Twitter grabs. Unfortunately, we do not currently have access to the Sprinkler or Garden Hose versions of the...


snakers4 (Alexander), April 17, 08:47

Using snakeviz for profiling Python code

Why

To profile complicated and convoluted code.

Snakeviz is a cool GUI tool to analyze cProfile profile files.

jiffyclub.github.io/snakeviz/

Just launch your code like this

python3 -m cProfile -o profile_file.cprofile your_script.py

And then just analyze with snakeviz.
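If you only need to profile one code path instead of the whole script, here is a minimal sketch using cProfile directly (the workload function is made up):

import cProfile

def slow_function():
    # made-up workload - replace with the code you actually care about
    return sum(i * i for i in range(10 ** 6))

profiler = cProfile.Profile()
profiler.enable()
slow_function()
profiler.disable()
profiler.dump_stats('profile_file.cprofile')  # open this file with snakeviz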

GUI

They have a server GUI and a jupyter notebook plugin.

Also you can launch their tool from within a docker container:

snakeviz -s -H 0.0.0.0 profile_file.cprofile

Do not forget to EXPOSE the necessary ports. An SSH tunnel to the host is also an option.

#data_science

SnakeViz

SnakeViz is a browser based graphical viewer for the output of Python's cProfile module.


snakers4 (Alexander), April 14, 06:59

PyTorch DataParallel scalability

TLDR - it works fine for 2-3 GPUs.

For more GPUs - use DDP.

github.com/NVIDIA/sentiment-discovery/blob/master/analysis/scale.md

github.com/SeanNaren/deepspeech.pytorch/issues/211
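For reference, a minimal sketch of the single-process DataParallel wrapper (the model and device ids are made up; DDP additionally needs one process per GPU and a process group):

import torch
import torch.nn as nn

model = nn.Linear(512, 10)
if torch.cuda.device_count() > 1:
    # one process drives all GPUs - scales OK for 2-3 of them
    model = nn.DataParallel(model, device_ids=[0, 1])
model = model.cuda()

batch = torch.randn(64, 512).cuda()
out = model(batch)  # the batch is split across the listed GPUs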

#deep_learning

NVIDIA/sentiment-discovery

Unsupervised Language Modeling at scale for robust sentiment classification - NVIDIA/sentiment-discovery


snakers4 (Alexander), April 09, 06:00

2019 DS / ML digest number 8

Highlights of the week

- Transformer from Facebook with sub-word information;

- How to generate endless sentiment annotation;

- 1M breast cancer images;

spark-in.me/post/2019_ds_ml_digest_08

#digest

#deep_learning

2019 DS/ML digest 08

2019 DS/ML digest 08. Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), April 07, 12:55

Finally! Cool features like SyncBN or CyclicLR are migrating to PyTorch!

Forwarded from Just links:

pytorch.org/docs/master/nn.html#torch.nn.SyncBatchNorm
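A rough sketch of what the new APIs look like in recent PyTorch (layer sizes and scheduler bounds are made up; SyncBatchNorm only actually synchronizes under DistributedDataParallel):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())
# swap every BatchNorm*d layer for its synchronized counterpart
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CyclicLR(optimizer, base_lr=1e-4, max_lr=1e-2)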

snakers4 (Alexander), March 31, 16:44

www.youtube.com/watch?v=p_di4Zn4wz4

Differential equations, studying the unsolvable | DE1
An overview of what ODEs are all about Home page: https://3blue1brown.com/ Brought to you by you: http://3b1b.co/de1thanks Need to brush up on calculus? http...

snakers4 (Alexander), March 31, 12:44

Miniaturize / optimize your ... NLP models?

For CV applications there are literally dozens of ways to make your models smaller.

And yeah, I do not mean some "moonshots" or special limited libraries (matrix decompositions, some custom pruning, etc etc).

I mean cheap and dirty hacks that work in 95% of cases regardless of your stack / device / framework:

- Smaller images (x3-x4 easy);

- FP16 inference (30-40% maybe);

- Knowledge distillation into smaller networks (x3-x10);

- Naïve cascade optimizations (feed only Nth frame using some heuristic);

But what can you do with NLP networks?

Turns out not much.

But here are my ideas:

- Use a simpler model - an embedding bag + plain self-attention + LSTM can solve 90% of tasks (a sketch follows this list);

- Decrease embedding size from 300 to 50 (or maybe even more). Tried and tested, works like a charm. For harder tasks you lose just 1-3pp of your target metric, for smaller tasks - it is just the same;

- FP16 inference is supported in PyTorch for nn.Embedding, but not for nn.EmbeddingBag. But you get the idea;

_embedding_bag is not implemented for type torch.HalfTensor

- You can try distilling your vocabulary / embedding-bag model into a char level model. If it works, you can trade model size vs. inference time;

- If you have very long sentences or large batches - try distilling / swapping your recurrent network with a CNN / TCN. This way you can also trade model size vs. inference time but probably in a different direction;
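A minimal sketch of the first two ideas - a 50-dim EmbeddingBag feeding a small LSTM classifier (self-attention omitted, all sizes made up):

import torch
import torch.nn as nn

class SmallClassifier(nn.Module):
    def __init__(self, vocab_size=30000, emb_dim=50, hidden=128, n_classes=2):
        super().__init__()
        # one "bag" per word: the word id plus its ngram ids are mean-pooled
        self.emb = nn.EmbeddingBag(vocab_size, emb_dim, mode='mean')
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, token_ids, offsets):
        words = self.emb(token_ids, offsets)    # (num_words, emb_dim)
        out, _ = self.lstm(words.unsqueeze(0))  # treat the words as one sequence
        return self.head(out[:, -1])            # logits for a single sentence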

#nlp

#deep_learning

snakers4 (Alexander), March 26, 15:30

Dockerfile

Updated my DL/ML dockerfile with

- CUDA 10

- PyTorch 1.0

github.com/snakers4/gpu-box-setup/

TF now also works with CUDA 10

#deep_learning

snakers4/gpu-box-setup

Contribute to snakers4/gpu-box-setup development by creating an account on GitHub.


snakers4 (Alexander), March 26, 04:44

Russian sentiment dataset

In a typical Russian fashion - one of these datasets was deleted at the request of bad people, whom I shall not name.

Luckily, someone anonymous backed the dataset up.

Anyway - use it.

Yeah, it is small. But it is free, so whatever.

#nlp

#data_science

Download Dataset.tar.gz 1.57 MB

snakers4 (Alexander), March 25, 05:31

Good old OLS regression

I needed some quick boilerplate to create an OLS regression with confidence intervals for a very plain task.

Found some nice statsmodels examples here:

www.statsmodels.org/devel/examples/notebooks/generated/ols.html
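A minimal sketch of that boilerplate (the data is synthetic, just to show the calls):

import numpy as np
import statsmodels.api as sm

x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + np.random.normal(scale=2.0, size=x.shape)

X = sm.add_constant(x)               # adds the intercept column
results = sm.OLS(y, X).fit()

print(results.summary())             # coefficients, R^2, etc.
print(results.conf_int(alpha=0.05))  # 95% confidence intervals for the coefficients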

#data_science

2019 DS / ML digest number 7

Highlights of the week

- NN normalization techniques (not batch norm);

- Jetson nano for US$99 released;

- A bitter lesson in AI;

spark-in.me/post/2019_ds_ml_digest_07

#digest

#deep_learning

2019 DS/ML digest 07

2019 DS/ML digest 07. Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), March 24, 10:03

Wow, we are not alone with our love for Embedding bag!

Forwarded from Neural Networks Engineering:

FastText embeddings done right

An important feature of FastText embeddings is the usage of subword information.

In addition to the vocabulary, FastText also stores word n-grams.

This additional information is useful for the following: handling out-of-vocabulary words, extracting sense from a word's etymology and dealing with misspellings.

But unfortunately all these advantages go unused in most open source projects.

We can easily see this via GitHub search (pic.). The point is that a regular Embedding layer maps the whole word into a single fixed vector stored in memory. In this case all word vectors have to be generated in advance, so none of the cool features work.

The good thing is that using FastText correctly is not so difficult! FacebookResearch provides an example of the proper way to use FastText in the PyTorch framework.

Instead of Embedding you should use the EmbeddingBag layer. It combines n-grams into a single word vector which can be used as usual.

This way we get all of these advantages in our neural network.

facebookresearch/fastText

Library for fast text representation and classification. - facebookresearch/fastText


... or you can just extend the collate_fn that is passed to DataLoader in PyTorch =)
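A rough sketch of such a collate_fn producing (ids, offsets, labels) for an EmbeddingBag-based model (the dataset format is made up):

import torch
from torch.utils.data import DataLoader

def collate_fn(batch):
    # batch is a list of (subword_ids, label) pairs
    ids, offsets, labels = [], [0], []
    for token_ids, label in batch:
        ids.extend(token_ids)
        offsets.append(offsets[-1] + len(token_ids))
        labels.append(label)
    return (torch.tensor(ids, dtype=torch.long),
            torch.tensor(offsets[:-1], dtype=torch.long),
            torch.tensor(labels, dtype=torch.long))

# loader = DataLoader(dataset, batch_size=32, collate_fn=collate_fn)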

Forwarded from Neural Networks Engineering:

Parallel preprocessing with multiprocessing

Using multiple processes to construct train batches may significantly reduce total training time of your network.

Basically, if you are using a GPU for training, you can reduce the additional batch construction time almost to zero. This is achieved through pipelining of computations: while the GPU crunches numbers, the CPU does the preprocessing. Python's multiprocessing module allows us to implement such pipelining as elegantly as is possible in a language with a GIL.

The PyTorch DataLoader class, for example, also uses multiprocessing in its internals.

Unfortunately, DataLoader suffers from a lack of flexibility. It is impossible to create a batch with an arbitrarily complex structure within the standard DataLoader class, so it can be useful to be able to apply raw multiprocessing.

multiprocessing gives us a set of useful APIs to distribute computations among several processes. Processes do not share memory with each other, so data is transmitted via inter-process communication. For example, on Linux-like operating systems multiprocessing uses pipes. This organization leads to some pitfalls, which I am going to describe.

* map vs imap

The map and imap methods may be used to apply preprocessing to batches. Both of them take a processing function and an iterable as arguments. The difference is that imap is lazy: it returns processed elements as soon as they are ready, so not all processed batches have to be stored in RAM simultaneously. For training NNs you should always prefer imap:

from multiprocessing import Pool

def process(batch_reader):
    with Pool(threads) as pool:
        for batch in pool.imap(foo, batch_reader):
            ....
            yield batch
            ....

* Serialization

Another pitfall is associated with the need to transfer objects via pipes. In addition to the processing results, multiprocessing will also serialize the transformation object if it is used like this: pool.imap(transformer.foo, batch_reader). The transformer will be serialized and sent to the subprocess, which may lead to problems if the transformer object has large properties. In this case it may be better to store large properties as singleton class variables:

class Transformer():
    large_dictionary = None

    def __init__(self, large_dictionary, **kwargs):
        self.__class__.large_dictionary = large_dictionary

    def foo(self, x):
        ....
        y = self.large_dictionary[x]
        ....

Another difficulty you may encounter is when the preprocessor is faster than the GPU training step. In this case unprocessed batches accumulate in memory, and if your memory is not large enough you will get an Out-of-Memory error. One way to solve this problem is to limit batch preprocessing until GPU training catches up.

A Semaphore is a perfect solution for this task:

from multiprocessing import Pool, Semaphore

def batch_reader(semaphore):
    for batch in source:
        semaphore.acquire()
        yield batch


def process(x):
    return x + 1


def pooling():
    with Pool(threads) as pool:
        semaphore = Semaphore(limit)
        for x in pool.imap(process, batch_reader(semaphore)):
            yield x
            semaphore.release()


for x in pooling():
    learn_gpu(x)

The Semaphore has an internal counter synchronized across all worker processes. semaphore.acquire() decrements it and blocks once the counter reaches zero, so no more than limit unprocessed batches are in flight at any time.

snakers4 (Alexander), March 21, 11:15

Normalization techniques other than batch norm:

(pics.spark-in.me/upload/aecc2c5fb356b6d803b4218fcb0bc3ec.png)

Weight normalization (used in TCN arxiv.org/abs/1602.07868):

- Decouples length of weight vectors from their direction;

- Does not introduce any dependencies between the examples in a minibatch;

- Can be applied successfully to recurrent models such as LSTMs;

- Tested only on small datasets (CIFAR + VAEs + DQN);

Instance norm (used in style transfer, arxiv.org/abs/1607.08022)

- Proposed for style transfer;

- Essentially is batch-norm for one image;

- The mean and standard-deviation are calculated per-dimension separately for each object in a mini-batch;

Layer norm (used in Transformers, paper: arxiv.org/abs/1607.06450)

- Designed especially for sequential networks;

- Computes the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case;

- The mean and standard-deviation are calculated separately over the last certain number of dimensions;

- Unlike Batch Normalization and Instance Normalization, which applies scalar scale and bias for each entire channel/plane with the affine option, Layer Normalization applies per-element scale and bias;
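For reference, a rough sketch of the corresponding PyTorch calls (shapes are made up):

import torch
import torch.nn as nn

x_img = torch.randn(8, 16, 32, 32)  # (batch, channels, H, W)
x_seq = torch.randn(8, 100, 256)    # (batch, seq_len, features)

# weight norm: reparametrizes a layer's weight into direction and length
conv = nn.utils.weight_norm(nn.Conv1d(256, 256, kernel_size=3))

# instance norm: batch-norm-like statistics computed per sample, per channel
print(nn.InstanceNorm2d(16)(x_img).shape)

# layer norm: statistics over the last dimension(s), per-element scale and bias
print(nn.LayerNorm(256)(x_seq).shape)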

#deep_learning

#nlp

snakers4 (Alexander), March 18, 06:18

6th 2019 DS / ML digest

Highlights of the week

- Cool python features;

- Google's on-device STT;

- Why Facebook invested so much in PyTorch 1.0;

spark-in.me/post/2019_ds_ml_digest_06

#digest

#data_science

#deep_learning

2019 DS/ML digest 06

2019 DS/ML digest 06. Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), March 17, 15:40

New large dataset for your GAN or pix2pix pet project

500k fashion images + meta-data + landmarks

github.com/switchablenorms/DeepFashion2

#deep_learning

switchablenorms/DeepFashion2

DeepFashion2 Dataset https://arxiv.org/pdf/1901.07973.pdf - switchablenorms/DeepFashion2


snakers4 (Alexander), March 17, 05:41

youtu.be/jBsC34PxzoM

Cramer's rule, explained geometrically | Essence of linear algebra, chapter 12
This rule seems random to many students, but it has a beautiful reason for being true. Home page: https://www.3blue1brown.com/ Brought to you by you: http://...

New video from 3B1B

Which is kind of relevant

snakers4 (Alexander), March 14, 03:58

youtu.be/iM4PPGDQry0

GANPaint: An Extraordinary Image Editor AI
📝 The paper " GAN Dissection: Visualizing and Understanding Generative Adversarial Networks " and its web demo is available here: https://gandissect.csail.mi...

snakers4 (Alexander), March 12, 15:45

Our Transformer post was featured by Towards Data Science

medium.com/p/complexity-generalization-computational-cost-in-nlp-modeling-of-morphologically-rich-languages-7fa2c0b45909?source=email-f29885e9bef3--writer.postDistributed&sk=a56711f1436d60283d4b672466ba258b

#nlp

Comparing complex NLP models for complex languages on a set of real tasks

Transformer is not yet really usable in practice for languages with rich morphology, but we take the first step in this direction


snakers4 (Alexander), March 12, 11:53

New tricks for training CNNs

Forwarded from Just links:

arxiv.org/abs/1812.01187

Bag of Tricks for Image Classification with Convolutional Neural Networks

Much of the recent progress made in image classification research can be credited to training procedure refinements, such as changes in data augmentations and optimization methods. In the...


Forwarded from Just links:

DropBlock: A regularization method for convolutional networks arxiv.org/abs/1810.12890

snakers4 (Alexander), March 08, 14:38

Forwarded from Just links:

callingbullshit.org/index.html

Calling Bullshit: Data Reasoning in a Digital World

The world is awash in bullshit. Politicians are unconstrained by facts. Science is conducted by press release. Higher education rewards bullshit over analytic thought. Startup culture elevates bullshit to high art. Advertisers wink conspiratorially and invite us to join them in seeing through all the bullshit — and take advantage of our lowered guard to bombard us with bullshit of the second order. The majority of administrative activity, whether in private business or the public sphere, seems to be little more than a sophisticated exercise in the combinatorial reassembly of bullshit.


snakers4 (Alexander), March 08, 12:56

snakers4 (Alexander), March 07, 15:42

Our experiments with Transformers, BERT and generative language pre-training

TLDR

For morphologically rich languages pre-trained Transformers are not a silver bullet, and from a layman's perspective they are not feasible unless someone invests huge computational resources into sub-word tokenization methods that work well and into actually training these large networks.

On the other hand we have definitively shown that:

- Starting a transformer with Embedding bag initialized via FastText works and is relatively feasible;

- On complicated tasks - such transformer significantly outperforms training from scratch (as well as naive models) and shows decent results compared to state-of-the-art specialized models;

- Pre-training worked, but it overfitted more than FastText initialization, and given the complexity required for such pre-training - it is not useful;

spark-in.me/post/bert-pretrain-ru

All in all this was a relatively large gamble that did not pay off - the Transformer did not excel at the more down-to-earth task we hoped it would.

#deep_learning

Complexity / generalization /computational cost in modern applied NLP for morphologically rich languages

Complexity / generalization / computational cost in modern applied NLP for morphologically rich languages. Towards a new state of the art? Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


An approach to ranking search results with no annotation

Just a small article with a novel idea:

- Instead of training a network with CE - just train it with BCE;

- Source additional supervision from the inner structure of your domain (tags, matrix decomposition methods, heuristics, etc);

spark-in.me/post/classifier-result-sorting

Works best if your ontology is relatively simple.
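A tiny sketch of the CE vs. BCE switch in PyTorch (shapes and class count are made up):

import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(4, 10)           # (batch, n_classes)
targets = torch.randint(0, 10, (4,))

# usual multi-class setup: softmax cross-entropy, classes compete
ce = nn.CrossEntropyLoss()(logits, targets)

# BCE setup: each class is an independent yes/no question,
# so the per-class sigmoid scores can later double as ranking scores
multi_hot = F.one_hot(targets, num_classes=10).float()
bce = nn.BCEWithLogitsLoss()(logits, multi_hot)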

#deep_learning

Learning to rank search results without annotation

Solving the search ranking problem. Author's articles - http://spark-in.me/author/adamnsandle Blog - http://spark-in.me


snakers4 (Alexander), March 07, 11:21

Inception v1 layers visualized on a map

A joint work by Google and OpenAI:

distill.pub/2019/activation-atlas/

distill.pub/2019/activation-atlas/app.html

blog.openai.com/introducing-activation-atlases/

ai.googleblog.com/2019/03/exploring-neural-networks.html

TLDR:

- Take 1M random images;

- Feed to a CNN, collect some spatial activation;

- Produce a corresponding idealized image that would result in such an activation;

- Plot in 2D (via UMAP), add grid, averaging, etc etc;

#deep_learning

Activation Atlas

By using feature inversion to visualize millions of activations from an image classification network, we create an explorable activation atlas of features the network has learned and what concepts it typically represents.


snakers4 (Alexander), March 07, 09:58

Russian STT datasets

Does anyone know more proper datasets?

I found this (60 hours), but I could not find the link to the dataset:

www.lrec-conf.org/proceedings/lrec2010/pdf/274_Paper.pdf

Anyway, here is the list I found:

- 20 hours of Bible github.com/festvox/datasets-CMU_Wilderness;

- www.kaggle.com/bryanpark/russian-single-speaker-speech-dataset - does not say how many hours

- Ofc audio book datasets - www.caito.de/data/Training/stt_tts/ + and some scraping scripts github.com/ainy/shershe/tree/master/scripts

- And some disappointment here voice.mozilla.org/ru/languages

#deep_learning

Download 274_Paper.pdf 0.31 MB

snakers4 (Alexander), March 07, 06:47

PyTorch internals

speakerdeck.com/perone/pytorch-under-the-hood

#deep_learning

PyTorch under the hood

Presentation about PyTorch internals presented at the PyData Montreal in Feb 2019.


snakers4 (Alexander), March 06, 10:31

5th 2019 DS / ML digest

Highlights of the week

- New Adam version;

- POS tagging and semantic parsing in Russian;

- ML industrialization again;

spark-in.me/post/2019_ds_ml_digest_05

#digest

#data_science

#deep_learning

2019 DS/ML digest 05

2019 DS/ML digest 05. Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), March 05, 09:23

Does anyone know anyone from TopCoder?

As usual with competition platforms, the organization sometimes has its issues

Forwarded from Анна:

Hi!

If anyone does not know: besides the prize money for the top places, the SpaceNet challenge had another cool feature - a student's prize - a prize for the _student_ with the highest score. It all turned out to be rather murky, as there was no separate leaderboard for students. For a long time I tried to reach the admins, wrote to their e-mail and on the forum to find out more details. A month later an admin finally replied that I was the only candidate for the prize and that, supposedly, there were no problems, we are sorting everything out, send over your student ID. And then he disappeared again. I periodically reminded them of my existence and asked how things were going and whether there was any progress - and got ignored in response. *There is still no answer.* This is my first time participating in a serious competition and I do not quite understand what can be done in such a situation. Wait for news? Write posts on Twitter? Is there any way to reach the admins?

Also, I wrote a small article here about my solution. spark-in.me/post/spacenet4

How I got to Top 10 in Spacenet 4 Challenge

Spacenet 4 Challenge: Building Footprints. Author's articles - http://spark-in.me/author/islanna Blog - http://spark-in.me


snakers4 (Alexander), March 04, 08:46

Tracking your hardware ... for data science

For a long time I thought that if you really want to track all your servers' metrics you need Zabbix (which is very complicated).

A friend recommended an amazing tool to me

- prometheus.io/docs/guides/node-exporter/

It installs and runs literally in minutes.
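A tiny sanity check from Python that the exporter is up (it serves plain-text metrics on port 9100 by default; the host is a placeholder):

import urllib.request

with urllib.request.urlopen('http://localhost:9100/metrics') as resp:
    metrics = resp.read().decode()

# e.g. the 1-minute load average that Prometheus will scrape
print([line for line in metrics.splitlines() if line.startswith('node_load1')])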

If you want to auto-start it properly, there are even a bit older Ubuntu packages and systemd examples

- github.com/prometheus/node_exporter/tree/master/examples/systemd

Dockerized metric exporters for GPUs by Nvidia

- github.com/NVIDIA/gpu-monitoring-tools/tree/master/exporters/prometheus-dcgm

It also has extensive alerting features, but they are difficult to get started with, as there is no minimal example

- prometheus.io/docs/alerting/overview/

- github.com/prometheus/docs/issues/581

#linux

Monitoring Linux host metrics with the Node Exporter | Prometheus

An open-source monitoring system with a dimensional data model, flexible query language, efficient time series database and modern alerting approach.


snakers4 (Alexander), March 02, 04:49

youtu.be/eUzB0L0mSCI

Can You Recover Sound From Images?
Is it possible to reconstruct sound from high-speed video images? Part of this video was sponsored by LastPass: http://bit.ly/2SmRQkk Special thanks to Dr. A...
