Spark in me - Internet, data science, math, deep learning, philo

snakers4 @ telegram, 1369 members, 1636 posts since 2016

All this - lost like tears in rain.

Data science, deep learning, sometimes a bit of philosophy and math. No bs.


Posts by tag «data_science»:

snakers4 (Alexander), November 15, 08:09

DS/ML digest 29




2018 DS/ML digest 29


snakers4 (Alexander), November 10, 18:32

Towards Data Science

Our article was accepted to their publication:


Also, once you have published there, you can publish your work on TDS on a recurring basis =)

I doubt that this will be properly distributed to all 130k of their subs, but nevertheless this is a milestone.


Building client routing / semantic search in the wild

A comparison of novel NLP techniques within an applied business setting

snakers4 (Alexander), November 06, 13:45

DS/ML digest 28

Google open sources pre-trained BERT ... with 102 languages ...




2018 DS/ML digest 28


snakers4 (Alexander), November 03, 09:40

Building client routing / semantic search and clustering arbitrary external corpuses at

A brief executive summary about what we achieved at

If you have similar experience or have anything similar to share - please do not hesitate to contact me.

Also we are planning to extend this article into a small series, if it gains momentum. So please like / share the article if you like it.




Building client routing / semantic search and clustering arbitrary external corpuses at


snakers4 (Alexander), October 23, 06:28

DS/ML digest 27

NLP is in focus again!

Also your humble servant learned how to do proper NMT =)




2018 DS/ML digest 27


snakers4 (Alexander), October 22, 05:43

Amazing articles about image hashing

Also a python library

- Library

- Articles:




A Python Perceptual Image Hashing Module. Contribute to JohannesBuchner/imagehash development by creating an account on GitHub.
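To give a feel for what perceptual image hashing does, here is a minimal from-scratch sketch of an average hash in plain Python. It is not the API of the library above; the function names and the toy 4x4 "images" are made up for illustration, and real inputs would first be downscaled and converted to grayscale.

```python
def average_hash(pixels):
    """Hash a small grayscale image given as a 2D list of numbers.

    Each pixel contributes one bit: 1 if it is brighter than the mean.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Number of differing bits; small distance means similar images."""
    return bin(hash_a ^ hash_b).count("1")


# Hypothetical toy images: a gradient, a brighter copy and an inverted copy
gradient = [[x + 4 * y for x in range(4)] for y in range(4)]
brighter = [[value + 3 for value in row] for row in gradient]
inverted = [[15 - value for value in row] for row in gradient]

same = hamming_distance(average_hash(gradient), average_hash(brighter))
different = hamming_distance(average_hash(gradient), average_hash(inverted))
```

A uniform brightness shift leaves the hash unchanged, while the inverted image flips every bit, which is the property that makes such hashes useful for near-duplicate search.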

Text iterators in PyTorch

Looks like PyTorch has some handy data-processing / loading tools for text models.

There is also an explanation of how to use them with pack_padded_sequence and pad_packed_sequence to boost PyTorch NLP models substantially.
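A minimal sketch of the pack / unpack round trip, assuming a toy batch of token ids that is already zero-padded and sorted by length in descending order. The vocabulary size, layer sizes and the GRU are arbitrary choices for illustration, not something taken from the linked material.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Three sequences padded with 0 to the longest length (4), longest first
lengths = torch.tensor([4, 2, 1])
batch = torch.tensor([
    [5, 7, 2, 9],
    [3, 8, 0, 0],
    [6, 0, 0, 0],
])

embedding = nn.Embedding(num_embeddings=10, embedding_dim=8, padding_idx=0)
rnn = nn.GRU(input_size=8, hidden_size=16, batch_first=True)

embedded = embedding(batch)  # shape: (3, 4, 8)

# Packing lets the RNN skip the padded time steps entirely
packed = pack_padded_sequence(embedded, lengths, batch_first=True)
packed_output, hidden = rnn(packed)

# Unpacking restores a regular padded tensor plus the original lengths
output, output_lengths = pad_packed_sequence(packed_output, batch_first=True)
```

After unpacking, the positions that were padding come back as zeros, so downstream pooling or loss code can rely on `output_lengths` instead of guessing where each sequence ends.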



snakers4 (Alexander), October 15, 16:56

An open-source alternative to Mendeley

Looks like Zotero is also cross-platform and open-source.

Also you can import the whole Mendeley library with 1 button push:


kb:mendeley import [Zotero Documentation]

Zotero is a free, easy-to-use tool to help you collect, organize, cite, and share research.

PyTorch developer conference part 1
Sessions on Applied Research in Industry & Developer Education. Talks from @karpathy (@Tesla), @ctnzr (@nvidia), @ftzo (Pyro/@UberEng), @MarkNeumannnn (@alle...

snakers4 (Alexander), October 15, 09:33

DS/ML digest 26

More interesting NLP papers / material ...




2018 DS/ML digest 26


snakers4 (Alexander), October 10, 18:31

Another set of links for common crawl for NLP

Looks like we were not the first, ofc.

Below are some projects dedicated to NLP corpus retrieval at scale:

- Java + license detection + boilerplate removal (dkpro-c4corpus)

- Prepared deduplicated CC text archives

- Google group topic (common-crawl/6F-yXsC35xM)



Downloading 200GB files in literally hours

(1) Order a 500 Mbit/s Internet connection from your ISP

(2) Use aria2 with the -x option

(3) Profit
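For reference, a small sketch of how such an aria2 call could be assembled from Python. The URL and the helper name are hypothetical; `-x` (max connections per server), `-s` (number of splits) and `-d` (target directory) are regular aria2c options, and 16 is the maximum that `-x` accepts.

```python
import shlex


def build_aria2_command(url, connections=16, output_dir="."):
    """Return the aria2c argument list for a multi-connection download."""
    return [
        "aria2c",
        "-x", str(connections),  # max connections per server
        "-s", str(connections),  # split the file into this many parts
        "-d", output_dir,        # where to store the result
        url,
    ]


# Hypothetical URL, used only to show the resulting command line
command = build_aria2_command("https://example.com/big-file.tar", output_dir="/data")
command_line = " ".join(shlex.quote(part) for part in command)
```

The list form can be passed to `subprocess.run(command, check=True)` as is; the joined string is only there for copy-pasting into a shell.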


snakers4 (Alexander), October 08, 06:04

Going from millions of points of data to billions on a single machine

In my experience pandas works fine with tables up to 50-100m rows.

Ofc plain indexing / caching (i.e. pre-processing all of your data in chunks and indexing it somehow) and / or clever map/reduce-style optimizations work.

But sometimes it is just good to know that such things exist:

- for large data-frames + some nice visualizations;

- for large visualizations;

- Also you can use Dask for these purposes I guess;
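The chunked pre-processing idea mentioned above can be sketched with the standard library alone: read a file in fixed-size chunks, aggregate each chunk (map) and merge the partial results (reduce). The column names and the tiny in-memory CSV are made up for the example; with pandas the same pattern would go through its chunked readers.

```python
import csv
import io
from collections import Counter
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from any iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def count_by_key(csv_file, key, chunk_size=2):
    """Count rows per key without holding the whole file in memory."""
    totals = Counter()
    reader = csv.DictReader(csv_file)
    for chunk in iter_chunks(reader, chunk_size):
        partial = Counter(row[key] for row in chunk)  # map step
        totals.update(partial)                        # reduce step
    return totals


# Hypothetical sample data standing in for a very large file
sample = io.StringIO("user,event\na,click\nb,view\na,view\nc,click\na,click\n")
totals = count_by_key(sample, key="user")
```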


Python3 nvidia driver bindings in glances

They used to have only python2 ones.

If you update your drivers and glances, you will get a nice GPU memory / load indicator within glances.

So convenient.


snakers4 (Alexander), October 08, 05:38

Wiki graph database

Just found out that Wikipedia also provides this



May be useful for research in the future.

Seems very theoretical and probably works only for English, but it is best to keep such things on the radar.

Example queries:

People who were born in Berlin before 1900

German musicians with German and English descriptions

Musicians who were born in Berlin
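A rough sketch of what the first example query could look like in SPARQL, wrapped in Python. Everything specific here is an assumption: the post does not name the endpoint, so DBpedia's public endpoint and its `dbo:` / `dbr:` vocabulary are used purely as an illustration, and the `format` parameter is specific to that kind of server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: DBpedia's public SPARQL endpoint, used only as an illustration
ENDPOINT = "https://dbpedia.org/sparql"

# "People who were born in Berlin before 1900"
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?person ?birth WHERE {
  ?person dbo:birthPlace dbr:Berlin ;
          dbo:birthDate ?birth .
  FILTER (?birth < "1900-01-01"^^xsd:date)
}
LIMIT 20
"""


def build_request_url(endpoint, query):
    """Encode the query into a GET URL; no network access happens here."""
    params = {"query": query.strip(), "format": "application/sparql-results+json"}
    return f"{endpoint}?{urlencode(params)}"


url = build_request_url(ENDPOINT, QUERY)
```

The resulting URL can then be fetched with `urllib.request.urlopen(url)` or any HTTP client and parsed as JSON.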



snakers4 (Alexander), October 03, 18:16

Parsing Wikipedia in 4 plain commands in Python

Wrote a small take on using Wikipedia as a corpus for NLP.

Please like / share / repost the article =)



Parsing Wikipedia in 4 simple commands for plain NLP corpus retrieval


snakers4 (Alexander), October 02, 09:59


Looks like it features tools to deploy PyTorch models...



Tensors and Dynamic neural networks in Python with strong GPU acceleration - pytorch/pytorch

snakers4 (Alexander), September 29, 10:53

Andrew Ng book

Looks like its draft is finished.

It describes in plain terms how to build ML pipelines:



snakers4 (Alexander), September 28, 11:12

DS/ML digest 25




2018 DS/ML digest 25


snakers4 (Alexander), September 28, 05:40

New course

Mainly decision tree practice.

A lot about decision tree visualization


I personally would check out the visualization bits.

At least it looks like they are not pushing their crappy library =)

The problem with any such visualizations is that they work only for toy datasets.

Drop / shuffle method seems to be more robust.
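Assuming "shuffle" refers to permutation-style feature importance (shuffle one column, see how much the score drops), the idea can be sketched in plain Python. The toy data, the stand-in model and the function names are all hypothetical; a real setup would use a fitted model and a held-out set.

```python
import random


def accuracy(model, rows, labels):
    hits = sum(1 for row, label in zip(rows, labels) if model(row) == label)
    return hits / len(rows)


def shuffle_importance(model, rows, labels, column, seed=0):
    """Score drop after shuffling one feature column across rows."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    shuffled_values = [row[column] for row in rows]
    rng.shuffle(shuffled_values)
    shuffled_rows = [
        row[:column] + [value] + row[column + 1:]
        for row, value in zip(rows, shuffled_values)
    ]
    return baseline - accuracy(model, shuffled_rows, labels)


# Toy data: the label depends only on feature 0, feature 1 is pure noise
data_rng = random.Random(42)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [int(row[0] > 0.5) for row in rows]


def model(row):
    """Stand-in for a fitted model that learned the true rule."""
    return int(row[0] > 0.5)


importance_signal = shuffle_importance(model, rows, labels, column=0)
importance_noise = shuffle_importance(model, rows, labels, column=1)
```

Unlike a tree plot, this scales to any number of features and any model, since it only needs predictions.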


snakers4 (Alexander), September 24, 12:53

(RU) most popular ML algorithms explained in simple terms


Machine learning for people

Figuring it out in simple terms

snakers4 (Alexander), September 20, 16:06

DS/ML digest 24

Key topics of this one:

- New method to calculate phrase/n-gram/sentence embeddings for rare and OOV words;

- So many releases from Google;

If you like our digests, you can support the channel via:

- Sharing / reposting;

- Giving an article a decent comment / a thumbs-up;

- Buying me a coffee (links on the digest);




2018 DS/ML digest 24


snakers4 (Alexander), September 06, 05:48

DS/ML digest 23

The key topic of this one: this is insanity

- vid2vid

- unsupervised NMT

If you like our digests, you can support the channel via:

- Sharing / reposting;

- Giving an article a decent comment / a thumbs-up;

- Buying me a coffee (links on the digest);

Let's spread the right DS/ML ideas together.




2018 DS/ML digest 23


snakers4 (Alexander), September 05, 06:40

MySQL - replacing window functions

Older versions of MySQL (and maybe newer ones) do not have all the goodness you can find in PostgreSQL. Ofc you can do plain session matching in Python, but sometimes you just need to do it in plain SQL.

In Postgres you usually use window functions for this purpose if you need PLAIN SQL (ofc there are stored procedures / views / mat views etc).

In MySQL it can be elegantly solved like this:

SET @session_number = 0, @last_uid = '0', @current_id = '0', @dif=0;





@last_uid:=@current_uid,


@dif:=TIMESTAMPDIFF(MINUTE, t2.session_ts, t1.session_ts),

if(@last_uid=@current_uid, if(@dif > 30, @session_number:=@session_number+1, @session_number), @session_number:=0) as session


table1 t1

JOIN table2 t2 on =
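For comparison, the "plain session matching in Python" mentioned above can be sketched like this. It mirrors the logic of the SQL fragment (a new session starts after a gap of more than 30 minutes, and the counter restarts at 0 for every user), but the function name and the sample events are made up.

```python
from datetime import datetime, timedelta


def assign_sessions(events, gap=timedelta(minutes=30)):
    """Label (user_id, timestamp) events with a per-user session number."""
    result = []
    last_uid, last_ts, session = None, None, 0
    for uid, ts in sorted(events):
        if uid != last_uid:
            session = 0        # new user: the counter restarts
        elif ts - last_ts > gap:
            session += 1       # same user, long pause: next session
        result.append((uid, ts, session))
        last_uid, last_ts = uid, ts
    return result


# Hypothetical events for two users
events = [
    ("a", datetime(2018, 9, 5, 10, 0)),
    ("b", datetime(2018, 9, 5, 10, 5)),
    ("a", datetime(2018, 9, 5, 10, 10)),
    ("a", datetime(2018, 9, 5, 11, 0)),
]
sessions = assign_sessions(events)
session_numbers = [session for _, _, session in sessions]
```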


snakers4 (Alexander), August 31, 13:59

DS/ML digest 22




2018 DS/ML digest 22


snakers4 (Alexander), August 29, 05:45

Venn diagrams in python

Comparing sets is as easy as:

# Import the library

import matplotlib.pyplot as plt

from matplotlib_venn import venn3

# Make the diagram (s1, s2, s3 are plain Python sets)

venn3(subsets=(s1, s2, s3), set_labels=('synonyms', 'our_tree', 'add_syns'))

plt.show()

Very simple and useful.


snakers4 (Alexander), August 28, 14:42

Garbage collection and encapsulation in python

Usually this is not an issue.

But when doing a batch-job / loop / something with 100k+ objects it suddenly becomes an issue.

It also does not help that the environment where you test your code and the actual runtime environment are different.

The solution - simply encapsulate everything you can. Garbage is automatically collected.
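A small sketch of what encapsulation buys you, assuming CPython-style reference counting: objects created inside a function are released as soon as the function returns and nothing else refers to them. The `Record` class and `process_batch` function are hypothetical; the weak reference is only there to observe that the object is really gone.

```python
import gc
import weakref


class Record:
    def __init__(self, payload):
        self.payload = payload


def process_batch(size=1000):
    """Build a batch of objects, reduce it to a number and return."""
    records = [Record(bytearray(1024)) for _ in range(size)]
    probe = weakref.ref(records[0])  # does not keep the object alive
    total = sum(len(record.payload) for record in records)
    return total, probe


total, probe = process_batch()
gc.collect()  # not needed on CPython, harmless elsewhere
is_freed = probe() is None
```

If the same loop body lived at module level or in a notebook cell, `records` would stay referenced by a global name until it is overwritten or deleted, which is where the memory tends to pile up in long batch jobs.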


A case against Kaggle

If you thought that Kaggle is the home of Data Science - think again.

This is official - they do not know what the hell they are doing.

There have been several appalling cases already, but this takes the prize.

Following this thread, wrote a small petition to Kaggle

I doubt that they will hear, but why not.


snakers4 (Alexander), August 24, 07:04

Played with FAISS

Played with FAISS on the GPU a bit. Their docs also cover simplistic cases really well, but more sophisticated cases are better inferred from examples, because their Python docs are not really intended for heavy use.

Anyway I managed to build a KNN graph with FAISS on the GPU for 10m points in 2-3 hours.

It does the following:

- KNN graph;

- PCA, K-Means;

- Queries;

- VERY sophisticated indexing with many options;

It supports:

- GPU;

- Multi-GPU (I had to use env. variables to limit GPU list, because there is no option for python);
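The post does not say which variable was used; a common way to limit the GPU list for CUDA-based libraries is `CUDA_VISIBLE_DEVICES`, so the sketch below assumes that. The helper name is made up, and the important detail is that the variable has to be set before the GPU library is imported or initialized.

```python
import os


def restrict_gpus(device_ids):
    """Expose only the given GPU ids to CUDA libraries loaded afterwards."""
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in device_ids)
    return os.environ["CUDA_VISIBLE_DEVICES"]


# Make only the first two GPUs visible (assumed ids, adjust to your machine)
visible = restrict_gpus([0, 1])

# Import the GPU library only after this point, e.g.:
# import faiss
```

The same effect can be achieved from the shell by prefixing the command with `CUDA_VISIBLE_DEVICES=0,1`.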

Also their docs are awesome for such a low-level project.




A library for efficient similarity search and clustering of dense vectors. - facebookresearch/faiss

Playing with Atom and Hydrogen

TLDR - it delivers on turning the Atom editor into something like an interactive notebook, but its auto-completion options are very scarce. You can also connect to a running ipython kernel or a docker container.

But it is no match for either a normal notebook or a python IDE.


snakers4 (Alexander), August 12, 11:15

2018 DS/ML digest 20




2018 DS/ML digest 20


snakers4 (Alexander), August 12, 05:21

New publication format

Decided to try a new, more streamlined, fast and automated approach to publishing somewhat longer posts.

(0) Write a note in md format

(1) Transform to HTML automatically => post on

(2) Repost to medium via automated import

(3) Repost to Reddit / (if they start accepting English articles) via md


(4) Profit - 4 publications at the cost and time of one)

Decided to start with porting the 2 latest articles to Medium



Please tell me what you think in the comments!

Also md can be transformed to almost any format using pandoc)
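A sketch of how the md conversion step could be scripted. The file name and target formats are hypothetical; `pandoc INPUT -o OUTPUT` is the basic pandoc invocation, and pandoc picks the output format from the file extension.

```python
from pathlib import Path


def build_pandoc_commands(source, target_extensions):
    """One pandoc command per target format, derived from the source name."""
    source_path = Path(source)
    return [
        ["pandoc", str(source_path), "-o", str(source_path.with_suffix(extension))]
        for extension in target_extensions
    ]


# Hypothetical note converted to HTML and DOCX
commands = build_pandoc_commands("note.md", [".html", ".docx"])
```

Each entry can be executed with `subprocess.run(command, check=True)` on a machine where pandoc is installed.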


Playing with Crowd-AI mapping challenge — or how to improve your CNN performance with self-supervised techniques

Originally published at on July 15, 2018.

snakers4 (Alexander), August 07, 04:04


Wrote a couple of posts about UMAP before.

Since last time, they extended their docs and published a paper:

- How it works (topology) - I kind of understand 50% of this

- Paper (have not read yet)

What I really like about the UMAP author - he answers questions on the forums / invested a lot of time into explaining how UMAP and HDBSCAN work / built stellar docs and is overall a nice guy.

What I really like in practice - this combination works really well:




Uniform Manifold Approximation and Projection. Contribute to lmcinnes/umap development by creating an account on GitHub.

snakers4 (Alexander), July 31, 04:42

Airbus ship detection challenge

On the surface this looks like a challenging and interesting competition:


- Train / test sets - 14G / 12G

- Downside - Kaggle and very fragile metric

- Upside - a separate significant prize for fast algorithms!

- 768x768 images seem reasonable



Airbus Ship Detection Challenge

Find ships on satellite images as quickly as possible

snakers4 (Alexander), July 23, 06:26

My post on open images stage 1

For posterity

Please comment



Solving class imbalance on Google open images

In this article I propose an approach to solve a severe class imbalance on Google open images

snakers4 (Alexander), July 23, 05:15

2018 DS/ML digest 18

Highlights of the week

(0) RL flaws

(1) An intro to AUTO-ML

(2) Overview of advances in ML in the last 12 months

Market / applied stuff / papers

(0) New Nvidia Jetson released

(1) Medical CV project in Russia - 90% is data gathering

(2) Differentiable architecture search

-- 1800 GPU days of reinforcement learning (RL) (Zoph et al., 2017)

-- 3150 GPU days of evolution (Real et al., 2018)

-- 4 GPU days to achieve SOTA on CIFAR => transferable to ImageNet with 26.9% top-1 error

(3) Some basic thoughts about hyper-param tuning

(4) FB extending fact checking to mark similar articles

(5) Architecture behind Alexa choosing skills

- Char-level RNN + Word-level RNN

- Shared encoder, but attention is personalized

(6) An overview of contemporary NLP techniques

(7) RNNs in particle physics?

(8) Google cloud provides PyTorch images


(0) Use embeddings for positions - no brainer

(1) Chatbots were a hype train - lol

The vast majority of bots are built using decision-tree logic, where the bot’s canned response relies on spotting specific keywords in the user input.

Interesting links

(0) Reasons to use OpenStreetMap

(1) Google deploys its internet balloons

(2) Amazing problem solving

(3) Nice flame thread about CS / ML is not science / just engineering etc




RL’s foundational flaw

RL as classically formulated has lately accomplished many things - but that formulation is unlikely to tackle problems beyond games. Read on to see why!