Spark in me - Internet, data science, math, deep learning, philo

snakers4 @ telegram, 1356 members, 1614 posts since 2016

All this - lost like tears in rain.

Data science, deep learning, sometimes a bit of philosophy and math. No bs.

Our website
Our chat
DS courses review

snakers4 (Alexander), October 19, 18:34

Detecting Faces (Viola Jones Algorithm) - Computerphile
Deep learning is used for everything these days, but this face detection algorithm is so neat it's still in use today. Dr Mike Pound on the Viola/Jones algori...

snakers4 (Alexander), October 17, 14:41

Forwarded from François Chollet:

Here is the same dynamic RNN implemented in 4 different frameworks (TensorFlow/Keras, MXNet/Gluon, Chainer, PyTorch). Can you tell which is which?

I guess PyTorch is in the bottom left corner, but realistically the author of this snippet hid the frameworks behind a lot of "import A as B" aliasing.

snakers4 (Alexander), October 16, 05:15

Google's super resolution zoom

Finally Google made something interesting

Super Res Zoom

snakers4 (Alexander), October 16, 03:47

Mixed precision distributed training ImageNet example in PyTorch



A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch - NVIDIA/apex

snakers4 (Alexander), October 15, 17:11

Looks like mixed precision training ... is solved in PyTorch

Lol - and I could not find it



A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch - NVIDIA/apex

snakers4 (Alexander), October 15, 16:56

An Open source alternative to Mendeley

Looks like Zotero is also cross-platform, and open-source

Also you can import the whole Mendeley library with 1 button push:


kb:mendeley import [Zotero Documentation]

Zotero is a free, easy-to-use tool to help you collect, organize, cite, and share research.

PyTorch developer conference part 1
Sessions on Applied Research in Industry & Developer Education. Talks from @karpathy (@Tesla), @ctnzr (@nvidia), @ftzo (Pyro/@UberEng), @MarkNeumannnn (@alle...

snakers4 (Alexander), October 15, 09:33

DS/ML digest 26

More interesting NLP papers / material ...




2018 DS/ML digest 26


snakers4 (Alexander), October 12, 19:11

This AI Senses Humans Through Walls

snakers4 (Alexander), October 10, 18:31

Another set of links for common crawl for NLP

Looks like we were not the first, ofc.

Below are some projects dedicated to NLP corpus retrieval on scale:

- Java + license detection + boilerplate removal:$dkpro-c4corpus-doc/doclinks/1/

- Prepared deduplicated CC text archives

- Google group!topic/common-crawl/6F-yXsC35xM



Downloading 200GB files in literally hours

(1) Order 500 Mbit/s Internet connection from your ISP

(2) Use aria2 (aria2c) with -x to open multiple connections per download

(3) Profit
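What -x does is split the file into byte ranges and fetch them over parallel connections. A minimal sketch of that range-splitting idea (split_ranges is a hypothetical helper, not part of aria2):

```python
def split_ranges(total_size, n_connections):
    """Split a file of total_size bytes into one byte range per
    connection - roughly what aria2c -x N does under the hood,
    with each connection issuing an HTTP Range request."""
    chunk = total_size // n_connections
    ranges = []
    start = 0
    for i in range(n_connections):
        # the last range absorbs the remainder
        end = total_size - 1 if i == n_connections - 1 else start + chunk - 1
        ranges.append((start, end))
        start = end + 1
    return ranges

print(split_ranges(100, 4))  # [(0, 24), (25, 49), (50, 74), (75, 99)]
```

The actual invocation is just something like aria2c -x 16 <url>.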


snakers4 (Alexander), October 08, 10:11

A small continuation of the crawling saga

2 takes on the Common Crawl

It turned out to be a bit tougher than expected

But doable


Parsing Common Crawl in 4 plain scripts in python


snakers4 (Alexander), October 08, 06:04

Going from millions of points of data to billions on a single machine

In my experience pandas works fine with tables up to 50-100m rows.

Of course, plain indexing/caching (i.e. pre-process all of your data in chunks and index it somehow) and/or clever map/reduce-style optimizations work.

But sometimes it is just good to know that such things exist:

- for large data-frames + some nice visualizations;

- for large visualizations;

- Also you can use Dask for these purposes I guess;
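The chunked map/reduce idea above can be sketched in plain Python: map each chunk to a partial result, then reduce the partials. A toy example (the in-memory CSV stands in for a file too big for RAM):

```python
import csv
import io
from functools import reduce
from itertools import islice

def iter_chunks(rows, chunk_size):
    """Yield lists of up to chunk_size rows from an iterator."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

def map_chunk(chunk):
    # "map" step: a partial (sum, count) for one chunk
    values = [float(row['value']) for row in chunk]
    return sum(values), len(values)

# small in-memory stand-in for a huge CSV file
data = io.StringIO('value\n' + '\n'.join(str(i) for i in range(10)))
partials = [map_chunk(c) for c in iter_chunks(csv.DictReader(data), 3)]
# "reduce" step: combine the partials into a global mean
total, count = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), partials)
print(total / count)  # mean computed without holding all rows at once
```

The same pattern works with pandas via read_csv(chunksize=...), or with Dask doing the chunking for you.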


Python3 nvidia driver bindings in glances

They used to have only python2 ones.

If you update your drivers and glances, you will get a nice GPU memory / load indicator within glances.

So convenient.


snakers4 (Alexander), October 08, 05:38

Wiki graph database

Just found out that Wikipedia also provides this



May be useful for research in future.

Seems rather theoretical and probably works only for English, but it is best to keep such things on the radar.

Example queries:

People who were born in Berlin before 1900

German musicians with German and English descriptions

Musicians who were born in Berlin
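These example queries run against the Wikidata SPARQL endpoint. A sketch of how I would write the "Musicians who were born in Berlin" query; the IDs (P19 place of birth, P106 occupation, Q64 Berlin, Q639669 musician) and the endpoint URL are from memory, so verify them on query.wikidata.org before relying on this:

```python
# Build the SPARQL query; to actually run it you would POST it to
# https://query.wikidata.org/sparql (left out to keep the sketch offline).
query = """
SELECT ?person ?personLabel WHERE {
  ?person wdt:P19 wd:Q64 .        # place of birth: Berlin
  ?person wdt:P106 wd:Q639669 .   # occupation: musician
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""
print(query)
```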



snakers4 (Alexander), October 06, 13:04

PCIE risers that REALLY WORK for DL

Thermaltake TT Premium PCIE 3.0 extender.

All the others I tried were crap.


snakers4 (Alexander), October 06, 07:24

Monkey patching a PyTorch model

Well, ideally you should not do this.

But sometimes you just need to quickly test something and amend your model on the fly.

This helps:

import torch
import functools

def rsetattr(obj, attr, val):
    pre, _, post = attr.rpartition('.')
    return setattr(rgetattr(obj, pre) if pre else obj, post, val)

def rgetattr(obj, attr, *args):
    def _getattr(obj, attr):
        return getattr(obj, attr, *args)
    return functools.reduce(_getattr, [obj] + attr.split('.'))

for module_path, module_object in model.named_modules():
    # replace an old object with a new one,
    # copying over some settings and its state;
    # torch.nn.SomeClass / SomeOtherClass are placeholders
    if isinstance(module_object, torch.nn.SomeClass):
        new_module = SomeOtherClass(module_object.some_settings)
        rsetattr(model, module_path, new_module)

The above code essentially does the same as assigning model.some_block = some_other_block, but it also works for arbitrarily nested module paths.






snakers4 (Alexander), October 05, 16:47

Russian post on Habr

Please support if you have an account.


Парсим Википедию для задач NLP в 4 команды

The gist: it turns out it is enough to run just this set of commands: git clone cd wikiextractor wget http...

snakers4 (Alexander), October 05, 16:29

Head of DS in Ostrovok (Moscow)

Please contact @eshneyderman (Evgeny Shneyderman) if you are up to the challenge.


Forwarded from Felix Shpilman:

Evgeny Shneyderman:

Head of Data Science vacancy in Moscow, job at Ostrovok.ru

Head of Data Science vacancy. Salary: not specified. Moscow. Required experience: 3-6 years. Full-time. Publication date: 07.10.2018.

snakers4 (Alexander), October 04, 06:03

Amazingly simple code to mimic fastText's n-gram subword routine

Nuff said. Try it for yourself.

from fastText import load_model

model = load_model('official_fasttext_wiki_200_model')

def find_ngrams(string, n):
    ngrams = zip(*[string[i:] for i in range(n)])
    ngrams = [''.join(_) for _ in ngrams]
    return ngrams

string = 'грёзоблаженствующий'

ngrams = []
for i in range(3, 7):
    # fastText pads each word with '<' and '>' before taking sub-words
    # (the default n-gram lengths are 3 to 6, hence range(3, 7))
    ngrams.extend(find_ngrams('<' + string + '>', i))

ft_ngrams, ft_indexes = model.get_subwords(string)

ngrams = set(ngrams)
ft_ngrams = set(ft_ngrams)

# compare the two sets - they should (almost) coincide



snakers4 (Alexander), October 03, 18:16

Parsing Wikipedia in 4 plain commands in Python

Wrote a small take on using Wikipedia as a corpus for NLP.

Please like / share / repost the article =)



Parsing Wikipedia in 4 simple commands for plain NLP corpus retrieval


snakers4 (Alexander), October 03, 15:15

Forwarded from Админим с Буквой:

GitHub and SSH keys

Learned about a GitHub feature: you can fetch an account's public key via a link like this. Handy for passing your key around as just a link.


snakers4 (Alexander), October 02, 09:59


Looks like it features tools to deploy PyTorch models...



Tensors and Dynamic neural networks in Python with strong GPU acceleration - pytorch/pytorch

snakers4 (Alexander), October 02, 03:01

New release of keras



Deep Learning for humans. Contribute to keras-team/keras development by creating an account on GitHub.

snakers4 (Alexander), September 29, 10:53

Andrew Ng book

Looks like its draft is finished.

It describes in plain terms how to build ML pipelines:



snakers4 (Alexander), September 29, 10:48

If you are mining for a large web-corpus

... for any language other than English, and you do not want to scrape anything yourself, buy proxies, or learn this slightly shady "stack".

In the case of Russian, the araneum link posted above contains already-processed data, which may be useless for certain domains.

What to do?


In case of Russian you can write here

The author will share her 90+GB RAW corpus with you


In case of any other language there is a second way

- Go to common crawl website;

- Download the index (200 GB);

- Choose domains in your country / language (now they also have language detection);

- Download only plain-text files you need;
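The filtering step above can be sketched as follows; the index line format here (a SURT key, a timestamp, then a JSON blob with "url", "languages", "filename", "offset", "length" fields) is my recollection of the columnar/CDXJ index, so verify it against the actual index files:

```python
import json

def parse_index_line(line):
    """A Common Crawl CDXJ index line looks roughly like:
    'ru,example)/page 20181001000000 {"url": ..., "languages": "rus", ...}'"""
    surt_key, timestamp, payload = line.split(' ', 2)
    return json.loads(payload)

def filter_by_language(lines, lang='rus'):
    for line in lines:
        record = parse_index_line(line)
        # 'languages' may list several codes, e.g. 'rus,eng'
        if lang in record.get('languages', '').split(','):
            yield record['url']

# tiny fabricated sample to show the shape of the data
sample = [
    'ru,example)/a 20181001 {"url": "http://example.ru/a", "languages": "rus"}',
    'com,example)/b 20181001 {"url": "http://example.com/b", "languages": "eng"}',
]
print(list(filter_by_language(sample)))  # ['http://example.ru/a']
```

The matching records then point you at the WARC/WET file, offset, and length to download only what you need.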

Links to start with





Taiga is a corpus where text sources and their meta-information are collected according to popular ML tasks.

An open-source corpus for machine learning.

snakers4 (Alexander), September 28, 11:12

DS/ML digest 25




2018 DS/ML digest 25


snakers4 (Alexander), September 28, 05:40

New course

Mainly decision tree practice.

A lot about decision tree visualization


I personally would check out the visualization bits.

At least it looks like they are not pushing their crappy library =)

The problem with any such visualizations is that they work only for toy datasets.

Drop / shuffle method seems to be more robust.
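The "drop / shuffle" method, as I read it, is permutation importance: shuffle one feature column, re-score the model, and measure how much the metric degrades. A minimal numpy sketch with a hand-made stand-in model:

```python
import numpy as np

rng = np.random.RandomState(42)
X = rng.randn(1000, 3)
# target depends strongly on feature 0, weakly on 1, not at all on 2
y = 3.0 * X[:, 0] + 0.3 * X[:, 1] + 0.1 * rng.randn(1000)

def model(X):
    # a stand-in for any trained model's predict()
    return 3.0 * X[:, 0] + 0.3 * X[:, 1]

def mse(a, b):
    return float(np.mean((a - b) ** 2))

baseline = mse(model(X), y)
importances = []
for j in range(X.shape[1]):
    Xp = X.copy()
    col = Xp[:, j].copy()
    rng.shuffle(col)        # destroy the feature-target relationship
    Xp[:, j] = col
    # how much worse does the model get without this feature's signal?
    importances.append(mse(model(Xp), y) - baseline)

print(importances)  # feature 0 hurts the most, feature 2 not at all
```

Unlike plot-based inspection, this works on any dataset size and any model.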


snakers4 (Alexander), September 27, 01:47

This Painter AI Fools Art Historians 39% of the Time

snakers4 (Alexander), September 26, 13:22

Araneum russicum maximum

TLDR - the largest corpus of the Russian Internet. fastText embeddings pre-trained on this corpus work best for broad Internet-related domains.

Pre-processed version can be downloaded from rusvectores.

Afaik, this link is not yet on their website (?)



snakers4 (Alexander), September 24, 12:53

(RU) most popular ML algorithms explained in simple terms


Машинное обучение для людей

Breaking it down in simple terms

snakers4 (Alexander), September 20, 16:06

DS/ML digest 24

Key topics of this one:

- New method to calculate phrase/n-gram/sentence embeddings for rare and OOV words;

- So many releases from Google;

If you like our digests, you can support the channel via:

- Sharing / reposting;

- Giving an article a decent comment / a thumbs-up;

- Buying me a coffee (links on the digest);




2018 DS/ML digest 24


snakers4 (Alexander), September 19, 09:59

Using sklearn pairwise cosine similarity

On a 7k x 7k example with 300-dimensional vectors it turned out to be MUCH faster than computing the same thing:

- In 10 processes;

- Using numba;

The more you know.

If you have used it - please PM me.
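The speed-up is not surprising: sklearn's cosine_similarity reduces the whole job to one normalized matrix product running on BLAS, instead of a Python-level loop over pairs. A numpy sketch of that same computation (shapes shrunk from the 7k x 300 matrices in the post):

```python
import numpy as np

def cosine_similarity_np(A, B):
    """All pairwise cosine similarities between rows of A and rows of B:
    normalize the rows, then a single matrix multiplication does the rest."""
    A_norm = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_norm = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A_norm @ B_norm.T

rng = np.random.RandomState(0)
A = rng.randn(70, 300)   # stand-ins for the 7k x 300 matrices
B = rng.randn(70, 300)
S = cosine_similarity_np(A, B)
print(S.shape)  # (70, 70)
```

Spawning 10 processes or JIT-compiling a pairwise loop with numba cannot beat this, because the vectorized matmul is already doing the work in optimized native code.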