Spark in me - Internet, data science, math, deep learning, philosophy

snakers4 @ telegram, 1361 members, 1660 posts since 2016

All this - lost like tears in rain.

Data science, deep learning, sometimes a bit of philosophy and math. No bs.

Our website
- spark-in.me
Our chat
- goo.gl/WRm93d
DS courses review
- goo.gl/5VGU5A
- goo.gl/YzVUKf

snakers4 (Alexander), October 05, 16:47

Russian post on Habr

habr.com/post/425507/

Please support if you have an account.

#nlp

Parsing Wikipedia for NLP tasks in 4 commands

The gist: it turns out that all it takes is running just this set of commands: git clone https://github.com/attardi/wikiextractor.git cd wikiextractor wget http...


snakers4 (Alexander), October 05, 16:29

Head of DS in Ostrovok (Moscow)

Please contact @eshneyderman (Evgeny Shneyderman) if you are up to the challenge.

#jobs

Forwarded from Felix Shpilman:

Evgeny Shneyderman:

hh.ru/vacancy/27723418

Head of Data Science vacancy in Moscow, a job at Ostrovok.ru

Head of Data Science vacancy. Salary: not specified. Moscow. Required experience: 3–6 years. Full-time employment. Published: 07.11.2018.


snakers4 (Alexander), October 04, 06:03

Amazingly simple code to mimic fastText's n-gram subword routine

Nuff said. Try it for yourself.

from fastText import load_model

model = load_model('official_fasttext_wiki_200_model')

def find_ngrams(string, n):
    # all character n-grams of length n
    ngrams = zip(*[string[i:] for i in range(n)])
    ngrams = [''.join(_) for _ in ngrams]
    return ngrams

# fastText wraps each word in '<' and '>' and collects n-grams of lengths 3..6
string = 'грёзоблаженствующий'

ngrams = []
for i in range(3, 7):
    ngrams.extend(find_ngrams('<' + string + '>', i))

ft_ngrams, ft_indexes = model.get_subwords(string)

ngrams = set(ngrams)
ft_ngrams = set(ft_ngrams)

print(sorted(ngrams), sorted(ft_ngrams))
print(set(ft_ngrams).difference(set(ngrams)), set(ngrams).difference(set(ft_ngrams)))

#nlp

snakers4 (Alexander), October 03, 18:16

Parsing Wikipedia in 4 plain commands in Python

Wrote a small take on using Wikipedia as a corpus for NLP.

spark-in.me/post/parsing-wikipedia-in-four-commands-for-nlp

medium.com/@aveysov/parsing-wikipedia-in-4-simple-commands-for-plain-nlp-corpus-retrieval-eee66b3ba3ee

Please like / share / repost the article =)

#nlp

#data_science

Parsing Wikipedia in 4 simple commands for plain NLP corpus retrieval

Parsing Wikipedia in 4 simple commands for plain NLP corpus retrieval. Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), October 03, 15:15

Forwarded from Админим с Буквой:

GitHub and SSH keys

Learned about a neat GitHub feature: you can fetch an account's public SSH keys via a link like the one below. Handy for passing a key around as a plain URL.

github.com/bykvaadm.keys
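
For illustration, a minimal Python sketch fetching such keys (reusing the account above as an example):

import requests

# GitHub serves an account's public SSH keys as plain text at
# https://github.com/<username>.keys
keys = requests.get('https://github.com/bykvaadm.keys').text

for key in keys.strip().splitlines():
    print(key)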

#github

snakers4 (Alexander), October 02, 09:59

PyTorch 1.0 PRE-RELEASE

github.com/pytorch/pytorch/releases/tag/v1.0rc0

Looks like it features tools to deploy PyTorch models...

#data_science

pytorch/pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration - pytorch/pytorch


snakers4 (Alexander), October 02, 03:01

New release of keras

github.com/keras-team/keras/releases/tag/2.2.3

#deep_learning

keras-team/keras

Deep Learning for humans. Contribute to keras-team/keras development by creating an account on GitHub.


snakers4 (Alexander), September 29, 10:53

Andrew Ng book

Looks like its draft is finished.

It describes in plain terms how to build ML pipelines:

- drive.google.com/open?id=1aHVZ9pcsGtIcgarZxV-Qfkb0JEvtSLDK

#data_science

snakers4 (Alexander), September 29, 10:48

If you are mining for a large web corpus

... for any language other than English, and you do not want to scrape anything yourself, buy proxies, or learn this somewhat shady "stack".

For Russian, the araneum link posted above offers already-processed data, which may be useless for certain domains.

What to do?

(1)

For Russian, you can write here:

tatianashavrina.github.io/taiga_site/

The author will share her 90+GB RAW corpus with you

(2)

For any other language, there is a second way:

- Go to common crawl website;

- Download the index (200 GB);

- Choose domains in your country / language (now they also have language detection);

- Download only plain-text files you need;

Links to start with

- commoncrawl.org/connect/blog/

- commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/

- www.slideshare.net/RobertMeusel/mining-a-large-web-corpus
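
As a rough illustration of the index step, here is a minimal Python sketch against the CDX index server (the crawl id CC-MAIN-2018-39 and the domain are placeholders; pagination is omitted):

import json
import requests

# Query the Common Crawl CDX index server for captures of a given domain.
resp = requests.get(
    'https://index.commoncrawl.org/CC-MAIN-2018-39-index',
    params={'url': 'example.ru/*', 'output': 'json'})

# One JSON record per line; each record points to the WARC file,
# byte offset and length holding that capture.
for line in resp.text.splitlines():
    record = json.loads(line)
    print(record['url'], record['filename'], record['offset'], record['length'])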

#nlp

Taiga is a corpus in which text sources and their meta-information are collected according to popular ML tasks.

An open-source corpus for machine learning.


snakers4 (Alexander), September 28, 11:12

DS/ML digest 25

spark-in.me/post/2018_ds_ml_digest_25

#digest

#deep_learning

#data_science

2018 DS/ML digest 25

2018 DS/ML digest 25. Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), September 28, 05:40

New fast.ai course

Mainly decision tree practice.

A lot about decision tree visualization

- www.fast.ai/2018/09/26/ml-launch/

I personally would check out the visualization bits.

At least it looks like they are not pushing their crappy library =)

The problem with any such visualizations is that they work only for toy datasets.

The drop / shuffle method (permutation importance) seems more robust.
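
A minimal sketch of the shuffle variant, assuming a fitted sklearn-style model and a pandas DataFrame (all names here are illustrative):

import numpy as np

def permutation_importance(model, X, y, metric):
    # Importance of a feature = drop in score after shuffling its column,
    # which breaks that feature's relationship with the target.
    baseline = metric(y, model.predict(X))
    importances = {}
    for col in X.columns:
        X_perm = X.copy()
        X_perm[col] = np.random.permutation(X_perm[col].values)
        importances[col] = baseline - metric(y, model.predict(X_perm))
    return importances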

#data_science

snakers4 (Alexander), September 27, 01:47

youtu.be/dyzn3Fmtw-E

This Painter AI Fools Art Historians 39% of the Time
Pick up cool perks on our Patreon page: www.patreon.com/TwoMinutePapers Crypto and PayPal links are available below. Thank you very much for your gen...

snakers4 (Alexander), September 26, 13:22

Araneum russicum maximum

TLDR - the largest corpus of the Russian internet. fastText embeddings pre-trained on this corpus work best for broad internet-related domains.

A pre-processed version can be downloaded from rusvectores.

Afaik, this link is not yet on their website (?)

wget rusvectores.org/static/rus_araneum_maxicum.txt.gz

#nlp

snakers4 (Alexander), September 24, 12:53

(RU) most popular ML algorithms explained in simple terms

vas3k.ru/blog/machine_learning/

#data_science

Machine learning for people

Explained in simple words


snakers4 (Alexander), September 20, 16:06

DS/ML digest 24

Key topics of this one:

- New method to calculate phrase/n-gram/sentence embeddings for rare and OOV words;

- So many releases from Google;

spark-in.me/post/2018_ds_ml_digest_24

If you like our digests, you can support the channel via:

- Sharing / reposting;

- Giving an article a decent comment / a thumbs-up;

- Buying me a coffee (links on the digest);

#digest

#deep_learning

#data_science

2018 DS/ML digest 24

2018 DS/ML digest 24. Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), September 19, 09:59

Using sklearn's pairwise cosine similarity

scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html#sklearn.metrics.pairwise.cosine_similarity

On a 7k x 7k example with 300-dimensional vectors it turned out to be MUCH faster than doing the same thing:

- In 10 processes;

- Using numba;
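
For reference, a minimal usage sketch with the shapes from the example above:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# All-pairs cosine similarity between two sets of 300-dimensional vectors,
# computed as a single vectorized matrix product.
a = np.random.rand(7000, 300)
b = np.random.rand(7000, 300)

sim = cosine_similarity(a, b)  # shape (7000, 7000)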

The more you know.

If you have used it - please PM me.

#nlp

snakers4 (Alexander), September 14, 05:32

Understanding the current SOTA NMT / NLP model - transformer

A list of articles that really help to do so:

- Understanding attention jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/

- Annotated transformer nlp.seas.harvard.edu/2018/04/03/attention.html

- Illustrated transformer jalammar.github.io/illustrated-transformer/

Playing with transformer in practice

This repo turned out to be really helpful

github.com/huggingface/pytorch-openai-transformer-lm

It features:

- A decent, well-encapsulated model and loss;

- Several heads for different tasks;

- It works;

- Ofc their data-loading scheme is crappy and over-engineered;

My impressions on actually training the transformer model for classification:

- It works;

- It is high capacity;

- Inference time is ~5x higher than for char-level or plain RNNs;

- It serves as a classifier as well as an LM;

- Capacity is enough to tackle most challenging tasks;

- It can be deployed on CPU for small texts (!);

- On smaller tasks there is no clear difference between plain RNNs and Transformer;

#nlp

Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)

May 25th update: New graphics (RNN animation, word embedding graph), color coding, elaborated on the final attention example. Note: The animations below are videos. Touch or hover on them (if you’re using a mouse) to get play controls so you can pause if needed. Sequence-to-sequence models are deep learning models that have achieved a lot of success in tasks like machine translation, text summarization, and image captioning. Google Translate started using such a model in production in late 2016. These models are explained in the two pioneering papers (Sutskever et al., 2014, Cho et al., 2014). I found, however, that understanding the model well enough to implement it requires unraveling a series of concepts that build on top of each other. I thought that a bunch of these ideas would be more accessible if expressed visually. That’s what I aim to do in this post. You’ll need some previous understanding of deep learning to get through this post. I hope it can be a useful companion to reading the papers mentioned…


snakers4 (Alexander), September 11, 18:12

Gensim's fastText subwords

Some monkey patching to get subwords out of Gensim's fastText implementation:

import gensim
from gensim.models.utils_any2vec import _compute_ngrams, _ft_hash

def subword(self, word):
    # replicate the model's subword lookup: hash each n-gram into
    # the bucket space and keep only n-grams present in the model
    ngram_lst = []
    ngrams = _compute_ngrams(word, self.min_n, self.max_n)
    for ngram in ngrams:
        ngram_hash = _ft_hash(ngram) % self.bucket
        if ngram_hash in self.hash2index:
            ngram_lst.append(ngram)
    return ngram_lst

gensim.models.keyedvectors.FastTextKeyedVectors.subword = subword
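
A hypothetical usage sketch (the model path is an assumption):

from gensim.models import FastText

# the path is illustrative; load_fasttext_format also works for .bin files
model = FastText.load('my_fasttext_model')

# after the monkey patch above, subwords are exposed on the keyed vectors
print(model.wv.subword('грёзоблаженствующий'))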

snakers4 (Alexander), September 11, 06:00

Useful Python / PyTorch bits

dot.notation access to dictionary attributes

class dotdict(dict):
    __getattr__ = dict.get
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__
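
A quick usage sketch:

params = dotdict({'lr': 1e-3})

print(params.lr)    # 0.001, same as params['lr']
params.epochs = 10  # same as params['epochs'] = 10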

PyTorch embedding layer - ignore padding

nn.Embedding has a padding_idx argument that keeps the padding token's embedding from being updated during training.
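
A minimal sketch (vocabulary size and dimension are illustrative):

import torch
import torch.nn as nn

# index 0 is reserved for padding; its row is initialized to zeros
# and receives no gradient updates during training
emb = nn.Embedding(num_embeddings=10000, embedding_dim=300, padding_idx=0)

tokens = torch.tensor([[5, 42, 0, 0]])  # a padded sequence
vectors = emb(tokens)                   # padding positions map to zero vectors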

#python

#pytorch

snakers4 (Alexander), September 06, 16:20

youtu.be/HvH0b9K_Iro

This AI Performs Super Resolution in Less Than a Second
The paper "A Fully Progressive Approach to Single-Image Super-Resolution" is available here: igl.ethz.ch/projects/prosr/ A-Man's Caustic scene: http:/...

snakers4 (Alexander), September 06, 06:18

SENet

- arxiv.org/abs/1709.01507;

- The 2017 ImageNet winner;

- Mostly a ResNet-152-inspired network;

- Transfers well (ResNet);

- The Squeeze-and-Excitation (SE) block adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels;

- Intuitively, it looks like convolution meets the attention mechanism;

- SE block:

- pics.spark-in.me/upload/aa50a2559f56faf705ad6639ac973a38.jpg

- The reduction ratio r is set to 16 in all experiments;

- Results:

- pics.spark-in.me/upload/db2c98330744a6fd4dab17259d5f9d14.jpg
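
For reference, a minimal PyTorch sketch of the SE block (a hedged illustration, not the authors' code; the channel count is arbitrary, r=16 as in the paper):

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    # squeeze spatial dims via global average pooling, excite via a
    # two-layer bottleneck, then rescale the input's channels
    def __init__(self, channels, r=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w

x = torch.randn(2, 64, 32, 32)
print(SEBlock(64)(x).shape)  # torch.Size([2, 64, 32, 32])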

#deep_learning

snakers4 (Alexander), September 06, 05:57

Chainer - a predecessor of PyTorch

Looks like

- PyTorch is based not only on Torch: its autograd was also forked from Chainer;

- Chainer looks like PyTorch ... but built not by Facebook, rather by an independent Japanese group;

- A quick glance through the docs confirms that PyTorch and Chainer APIs look 90% identical (both numpy inspired, but using different back-ends);

- Open Images 2nd place was taken by people using Chainer with 512 GPUs;

- I have yet to confirm myself that PyTorch can work with a cluster (but other people have done it) github.com/eladhoffer/convNet.pytorch;

www.reddit.com/r/MachineLearning/comments/7lb5n1/d_chainer_vs_pytorch/

docs.chainer.org/en/stable/comparison.html

#deep_learning

eladhoffer/convNet.pytorch

ConvNet training using pytorch. Contribute to eladhoffer/convNet.pytorch development by creating an account on GitHub.


Also - thanks to all the DO referral link supporters - hosting of my website is finally free (at least for the next ~6 months)!

Also, today I published the 200th post on spark-in.me. Ofc not all of these are proper long articles, but it's cool nevertheless.

snakers4 (Alexander), September 06, 05:48

DS/ML digest 23

The key topic of this one - this is insanity:

- vid2vid

- unsupervised NMT

spark-in.me/post/2018_ds_ml_digest_23

If you like our digests, you can support the channel via:

- Sharing / reposting;

- Giving an article a decent comment / a thumbs-up;

- Buying me a coffee (links on the digest);

Let's spread the right DS/ML ideas together.

#digest

#deep_learning

#data_science

2018 DS/ML digest 23

2018 DS/ML digest 23. Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), September 05, 06:40

MySQL - replacing window functions

Older versions of MySQL (and maybe newer ones) do not have all the goodness you can find in PostgreSQL. Ofc you can do plain session matching in Python, but sometimes you just need to do it in plain SQL.

In Postgres you usually use window functions for this purpose if you need PLAIN SQL (ofc there are stored procedures / views / mat views etc).

In MySQL it can be elegantly solved like this:

SET @session_number = 0, @last_uid = '0', @current_uid = '0', @dif = 0;

SELECT
    t1.some_field,
    t2.some_field,
    ...
    @last_uid := @current_uid,
    @current_uid := t1.uid,
    @dif := TIMESTAMPDIFF(MINUTE, t2.session_ts, t1.session_ts),
    -- same user: start a new session once the gap exceeds 30 minutes;
    -- different user: reset the session counter
    IF(@current_uid = @last_uid,
       IF(@dif > 30, @session_number := @session_number + 1, @session_number),
       @session_number := 0) AS session
FROM
    table1 t1
JOIN table2 t2 ON t1.id = t2.id + 1

#data_science

snakers4 (Alexander), September 03, 06:27

Training a MNASNET from scratch ... and failing

As a small side hobby we tried training Google's new mobile network from scratch - and failed:

- spark-in.me/post/mnasnet-fail-alas

- github.com/snakers4/mnasnet-pytorch

Maybe you know how to train it properly?

Also, you can now upvote articles on spark-in.me! =)

#deep_learning

Training your own MNASNET

Training your own MNASNET. Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), September 02, 19:49

youtu.be/cEBgi6QYDhQ

Everybody Dance Now! - AI-Based Motion Transfer
Pick up cool perks on our Patreon page: www.patreon.com/TwoMinutePapers The paper "Everybody Dance Now" is available here: arxiv.org/abs/1808...

snakers4 (Alexander), September 02, 06:22

A small hack to spare PyTorch memory when resuming training

When you resume from a checkpoint, consider adding this to save GPU memory:

del checkpoint

torch.cuda.empty_cache()
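
In context, a minimal sketch (the file name, checkpoint keys and the tiny stand-in model are all illustrative):

import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for your real model

checkpoint = torch.load('checkpoint.pth')
model.load_state_dict(checkpoint['state_dict'])

# drop the checkpoint dict so its tensors can be garbage-collected,
# then release PyTorch's cached blocks back to the GPU driver
del checkpoint
torch.cuda.empty_cache()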

#deep_learning

snakers4 (Alexander), August 31, 13:59

DS/ML digest 22

spark-in.me/post/2018_ds_ml_digest_22

#digest

#deep_learning

#data_science

2018 DS/ML digest 22

2018 DS/ML digest 22. Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), August 31, 13:38

AdamW to be integrated into upstream PyTorch?

github.com/pytorch/pytorch/pull/3740

#deep_learning

Fixing Weight Decay Regularization in Adam by jingweiz · Pull Request #3740 · pytorch/pytorch

Hey, We added SGDW and AdamW in optim, according to the new ICLR submission from Loshchilov and Hutter: Fixing Weight Decay Regularization in Adam. We also found some inconsistency of the current i...


snakers4 (Alexander), August 29, 08:16

Crowd-AI maps repo

Just opened my repo for the CrowdAI maps 2018 challenge.

I did not pursue this competition to the end, so it is not polished and the .md is not updated. Use it at your own risk!

github.com/snakers4/crowdai-maps-2018

spark-in.me/post/a-small-case-for-search-of-structure-within-your-data

#deep_learning

snakers4/crowdai-maps-2018

CrowdAI mapping challenge 2018 solution. Contribute to snakers4/crowdai-maps-2018 development by creating an account on GitHub.