Spark in me - Internet, data science, math, deep learning, philosophy

snakers4 @ telegram, 1317 members, 1587 posts since 2016

All this - lost like tears in rain.

Data science, deep learning, sometimes a bit of philosophy and math. No bs.

Our website
Our chat
DS courses review

Posts by tag «data_science»:

snakers4 (Alexander), September 24, 12:53

(RU) most popular ML algorithms explained in simple terms


Machine Learning for People

Explained in simple terms

snakers4 (Alexander), September 20, 16:06

DS/ML digest 24

Key topics of this one:

- New method to calculate phrase/n-gram/sentence embeddings for rare and OOV words;

- So many releases from Google;

If you like our digests, you can support the channel via:

- Sharing / reposting;

- Giving an article a decent comment / a thumbs-up;

- Buying me a coffee (links on the digest);




2018 DS/ML digest 24


snakers4 (Alexander), September 06, 05:48

DS/ML digest 23

The key topic of this one - this is insanity:

- vid2vid

- unsupervised NMT


Let's spread the right DS/ML ideas together.




2018 DS/ML digest 23


snakers4 (Alexander), September 05, 06:40

MySQL - replacing window functions

Older versions of MySQL (and maybe newer ones) do not have all the goodness you can find in PostgreSQL. Ofc you can do plain session matching in Python, but sometimes you just need to do it in plain SQL.

In Postgres you usually use window functions for this purpose if you need PLAIN SQL (ofc there are stored procedures / views / mat views etc).

In MySQL it can be elegantly solved like this:

SET @session_number = 0, @last_uid = '0', @current_uid = '0', @dif = 0;

-- NB: parts of this snippet were garbled in the archive; the SELECT below is a reconstruction
SELECT
    @last_uid := @current_uid,
    @current_uid := t1.uid,
    @dif := TIMESTAMPDIFF(MINUTE, t2.session_ts, t1.session_ts),
    IF(@current_uid = @last_uid,
       IF(@dif > 30, @session_number := @session_number + 1, @session_number),
       @session_number := 0) AS session
FROM
    table1 t1
    JOIN table2 t2 ON ... -- the join condition was lost in the original post
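The same 30-minute sessionization logic can be sketched in plain Python for the cases when you do not need it in SQL (the `(uid, timestamp)` event shape here is my own assumption for illustration):

```python
from datetime import timedelta

def sessionize(events, gap_minutes=30):
    """Assign a session number to each (uid, timestamp) event.

    A new session starts when the user changes; within one user,
    the counter increments when the gap between consecutive events
    exceeds `gap_minutes` - same logic as the SQL variables above.
    """
    events = sorted(events, key=lambda e: (e[0], e[1]))
    result = []
    last_uid, last_ts, session_number = None, None, 0
    for uid, ts in events:
        if uid != last_uid:
            session_number = 0          # new user resets the counter
        elif ts - last_ts > timedelta(minutes=gap_minutes):
            session_number += 1         # same user, long gap -> new session
        result.append((uid, ts, session_number))
        last_uid, last_ts = uid, ts
    return result
```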


snakers4 (Alexander), August 31, 13:59

DS/ML digest 22




2018 DS/ML digest 22


snakers4 (Alexander), August 29, 05:45

Venn diagrams in python

Compare sets as easy as:

# Import the library
import matplotlib.pyplot as plt
from matplotlib_venn import venn3

# Make the diagram (s1, s2, s3 are Python sets)
venn3(subsets=(s1, s2, s3), set_labels=['synonyms', 'our_tree', 'add_syns'])
plt.show()

Very simple and useful.


snakers4 (Alexander), August 28, 14:42

Garbage collection and encapsulation in python

Usually this is not an issue.

But when doing a batch-job / loop / something with 100k+ objects it suddenly becomes an issue.

Also it does not help that the environment where you test your code and the actual production environment are different.

The solution - simply encapsulate everything you can in functions. Garbage is then collected automatically when they return.
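A minimal illustration of the point - objects created inside a function become collectable the moment it returns, while the same intermediates at module level would linger for the whole loop:

```python
import gc

def process_batch(batch):
    # Large intermediates live only inside this scope and are freed
    # (refcount drops to zero) as soon as the function returns
    blown_up = [x * 2 for x in batch]
    return sum(blown_up)

totals = []
for i in range(100):
    batch = list(range(1000))
    totals.append(process_batch(batch))

gc.collect()  # usually unnecessary - refcounting has already done the job
```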


A case against Kaggle

If you thought that Kaggle is the home of Data Science - think again.

This is official - they do not know what the hell they are doing.

There have been several appalling cases already, but this takes the prize.

Following this thread, I wrote a small petition to Kaggle.

I doubt that they will listen, but why not.


snakers4 (Alexander), August 24, 07:04

Played with FAISS

Played with FAISS on the GPU a bit. Their docs cover simplistic cases really well, but more sophisticated cases are better inferred from examples, because their Python docs are not really intended for heavy use.

Anyway I managed to build a KNN graph with FAISS on one GPU for 10m points in 2-3 hours.

It does the following:

- KNN graph;

- PCA, K-Means;

- Queries;

- VERY sophisticated indexing with many options;

It supports:

- GPU;

- Multi-GPU (I had to use env. variables to limit the GPU list, because there is no option for that in Python);

Also their docs are awesome for such a low-level project.




A library for efficient similarity search and clustering of dense vectors. - facebookresearch/faiss
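The KNN-graph part that FAISS accelerates is, at its core, this brute-force computation; a pure-numpy sketch for intuition (FAISS does the same thing orders of magnitude faster, especially on GPU and with approximate indexes):

```python
import numpy as np

def knn_graph(points, k):
    """Return, for each point, the indices of its k nearest neighbours
    (excluding the point itself) under L2 distance - brute force."""
    # Pairwise squared L2 distances via ||a-b||^2 = ||a||^2 - 2ab + ||b||^2
    sq = (points ** 2).sum(axis=1)
    dists = sq[:, None] - 2 * points @ points.T + sq[None, :]
    np.fill_diagonal(dists, np.inf)      # exclude self-matches
    return np.argsort(dists, axis=1)[:, :k]

rng = np.random.RandomState(42)
pts = rng.rand(100, 8).astype(np.float32)
graph = knn_graph(pts, k=5)              # shape (100, 5)
```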

Playing with Atom and Hydrogen

TLDR - it delivers on turning the Atom editor into something like an interactive notebook, but its auto-completion options are very scarce. Also you can connect to your running ipython kernel as well as to a docker container.

But it is no match for either a normal notebook or a Python IDE.


snakers4 (Alexander), August 12, 11:15

2018 DS/ML digest 20




2018 DS/ML digest 20


snakers4 (Alexander), August 12, 05:21

New publication format

Decided to try a new, more streamlined, fast and automated approach to publishing longer posts.

(0) Write a note in md format

(1) Transform to HTML automatically => post on

(2) Repost to medium via automated import

(3) Repost to Reddit / (if they start accepting English articles) via md


(4) Profit - 4 publications at the cost and time of one)

Decided to start with porting 2 latest articles to medium



Please tell me what you think in the comments!

Also md can be transformed into almost any format using pandoc)


Playing with Crowd-AI mapping challenge — or how to improve your CNN performance with self-supervised techniques

Originally published at on July 15, 2018.

snakers4 (Alexander), August 07, 04:04


Wrote a couple of posts about UMAP before.

Since last time, they extended their docs and published a paper:

- How it works (topology) - I kind of understand 50% of this

- Paper (have not read yet)

What I really like about the UMAP author - he answers questions on the forums, invested a lot of time into explaining how UMAP and HDBSCAN work, built stellar docs, and is overall a nice guy.

What I really like in practice - this combination works really well:




Uniform Manifold Approximation and Projection. Contribute to lmcinnes/umap development by creating an account on GitHub.

snakers4 (Alexander), July 31, 04:42

Airbus ship detection challenge

On the surface this looks like a challenging and interesting competition:


- Train / test sets - 14G / 12G

- Downside - Kaggle and a very fragile metric

- Upside - a separate significant prize for fast algorithms!

- 768x768 images seem reasonable



Airbus Ship Detection Challenge

Find ships on satellite images as quickly as possible

snakers4 (Alexander), July 23, 06:26

My post on open images stage 1

For posterity

Please comment



Solving class imbalance on Google open images

In this article I propose an approach to solving severe class imbalance on Google open images (author's blog)

snakers4 (Alexander), July 23, 05:15

2018 DS/ML digest 18

Highlights of the week

(0) RL flaws

(1) An intro to AUTO-ML

(2) Overview of advances in ML in last 12 months

Market / applied stuff / papers

(0) New Nvidia Jetson released

(1) Medical CV project in Russia - 90% is data gathering

(2) Differentiable architecture search

-- 1800 GPU days of reinforcement learning (RL) (Zoph et al., 2017)

-- 3150 GPU days of evolution (Real et al., 2018)

-- 4 GPU days to achieve SOTA in CIFAR => transferable to Imagenet with 26.9% top-1 error

(3) Some basic thoughts about hyper-param tuning

(4) FB extending fact checking to mark similar articles

(5) Architecture behind Alexa choosing skills

- Char-level RNN + Word-level RNN

- Shared encoder, but attention is personalized

(6) An overview of contemporary NLP techniques

(7) RNNs in particle physics?

(8) Google cloud provides PyTorch images


(0) Use embeddings for positions - no brainer

(1) Chatbots were a hype train - lol

The vast majority of bots are built using decision-tree logic, where the bot's canned response relies on spotting specific keywords in the user input.

Interesting links

(0) Reasons to use OpenStreetMap

(1) Google deploys its internet balloons

(2) Amazing problem solving

(3) Nice flame thread about CS / ML is not science / just engineering etc




RL’s foundational flaw

RL as classically formulated has lately accomplished many things - but that formulation is unlikely to tackle problems beyond games. Read on to see why!

snakers4 (Alexander), July 21, 11:02

Found an amazing explanation about Python's super here

Understanding Python super() with __init__() methods

I'm trying to understand the use of super(). From the looks of it, both child classes can be created, just fine. I'm curious to know about the actual difference between the following 2 child clas...
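The gist of the linked answer: super() delegates along the class's MRO instead of hard-coding a parent, which is what makes cooperative multiple inheritance work. A minimal sketch:

```python
class Base:
    def __init__(self):
        self.trace = ['Base']

class Left(Base):
    def __init__(self):
        super().__init__()      # delegates to the NEXT class in the MRO, not simply to Base
        self.trace.append('Left')

class Right(Base):
    def __init__(self):
        super().__init__()
        self.trace.append('Right')

class Child(Left, Right):
    def __init__(self):
        super().__init__()      # runs Left, Right and Base exactly once each
        self.trace.append('Child')
```

With hard-coded `Base.__init__(self)` calls, Base would run twice in the diamond; super() threads through Child -> Left -> Right -> Base instead.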

Playing with focal loss for multi-class classification

Playing with this Loss

If anyone has a better option - please PM me / or comment in the gist.



Multi class classification focal loss

Multi class classification focal loss . GitHub Gist: instantly share code, notes, and snippets.
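The loss being played with is FL(p_t) = -(1 - p_t)^gamma * log(p_t); a minimal numpy sketch of the multi-class case (my own illustration, not the code from the gist):

```python
import numpy as np

def focal_loss(logits, targets, gamma=2.0):
    """Multi-class focal loss on raw logits; `targets` are class indices."""
    # Numerically stable softmax
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)
    # Probability assigned to the true class of each sample
    p_t = probs[np.arange(len(targets)), targets]
    # The (1 - p_t)^gamma factor down-weights easy, confident examples
    return float(np.mean(-((1 - p_t) ** gamma) * np.log(p_t)))
```

With gamma=0 this reduces to plain cross-entropy; larger gamma focuses the loss on hard misclassified samples.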

snakers4 (Alexander), July 17, 08:32

snakers4 (Alexander), July 15, 08:54

Sometimes in supervised ML tasks leveraging the data structure in a self-supervised fashion really helps!

Playing with CrowdAI mapping competition

In my opinion it is a good test ground for trying out your SemSeg ideas - the dataset is really clean and balanced




Playing with Crowd-AI mapping challenge - or how to improve your CNN performance with self-supervised techniques

In this article I tell about a couple of neat optimizations / tricks / useful ideas that can be applied to many SemSeg / ML tasks (author's blog)

snakers4 (spark_comment_bot), July 15, 06:16

Feeding images / tensors of different size using PyTorch dataloader classes

Struggled to do this properly on DS Bowl (I resorted to random crops there for training and 1-image sized batches for validation).

Suppose your dataset has some internal structure in it.

For example - you may have images of vastly different aspect ratios (3x1, 1x3 and 1x1) and you would like to squeeze every bit of performance from your pipeline.

Of course, you may pad your images / center-crop them / random crop them - but in this case you will lose some of the information.

I played with this on some tasks - sometimes force-resize works better than crops, but applying your model convolutionally worked really well on SemSeg challenges.

So it may work very well on plain classification as well.

So, if you apply your model convolutionally, you will end up with differently-sized feature maps for each cluster of images.

Within the model, it can be fixed with:

(0) Adaptive avg pooling layers

(1) Some simple logic in .forward statement of the model

But anyway you end up with a small technical issue - PyTorch cannot concatenate tensors of different sizes using standard collation function.

Theoretically, there are several ways to fix this:

(0) Stupid solution - create N datasets, train on them sequentially.

In practice I tried that on DS Bowl - it worked poorly - the model overfitted to each cluster, and then performed poorly on the next one;

(1) Crop / pad / resize images (suppose you deliberately want to avoid that);

(2) Insert some custom logic into the PyTorch collation function, i.e. resize there;

(3) Just sample images so that only images of one size end up within each batch;

(0) and (1) I would like to avoid intentionally.

(2) seems a bit stupid as well, because resizing should be done as a pre-processing step (the collation function deals with normalized tensors, not images), and it is better not to mix the purposes of your modules;

Ofc, you can try to produce N tensors in (2) - i.e. tensor for each image size, but that would require additional loop downstream.

In the end, I decided that (3) is the best approach - because it can be easily transferred to other datasets / domains / tasks.

Long story short - here is my solution - I just extended their sampling function:
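The idea boils down to a batch sampler that first buckets indices by image size and only then forms batches, so every batch is homogeneous. A framework-agnostic sketch of approach (3) - hypothetical names, not the actual gist:

```python
import random
from collections import defaultdict

def size_bucketed_batches(sizes, batch_size, seed=0):
    """Yield lists of dataset indices such that all images in a batch
    share the same size; `sizes[i]` is the (h, w) tuple of image i."""
    buckets = defaultdict(list)
    for idx, size in enumerate(sizes):
        buckets[size].append(idx)
    rng = random.Random(seed)
    batches = []
    for idxs in buckets.values():
        rng.shuffle(idxs)
        # Batches are cut within one bucket only -> no mixed sizes
        for i in range(0, len(idxs), batch_size):
            batches.append(idxs[i:i + batch_size])
    rng.shuffle(batches)  # interleave clusters so the model does not overfit to one
    return batches
```

With PyTorch this kind of logic plugs into DataLoader via its batch-sampler argument, and the default collation then never sees tensors of different sizes.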

Maybe it is worth a PR on GitHub?

What do you think?



Like this post or have something to say => tell us more in the comments or donate!

[feature request] Support tensors of different sizes as batch elements in DataLoader #1512

Motivating example is returning bounding box annotation for images along with an image. An annotation list can contain variable number of boxes depending on an image, and padding them to a single length (and storing that length) may be n...

snakers4 (Alexander), July 13, 09:15

Tensorboard + PyTorch

6 months ago I looked at this - and it was messy.

Now it looks really polished.



tensorboard-pytorch - tensorboard for pytorch (and chainer, mxnet, numpy, ...)

snakers4 (spark_comment_bot), July 13, 05:22

2018 DS/ML digest 17

Highlights of the week

(0) Troubling trends with ML scholars

(1) NLP close to its ImageNet stage?

Papers / posts / articles

(0) Working with multi-modal data

- concatenation-based conditioning

- conditional biasing or scaling ("residual" connections)

- sigmoidal gating

- all in all this approach seems like a mixture of attention / gating for multi-modal problems

(1) Glow, a reversible generative model which uses invertible 1x1 convolutions

(2) Facebooks moonshots - I kind of do not understand much here


(3) RL concept flaws?


(4) Intriguing failures of convolutions - this is fucking amazing

(5) People are only STARTING to apply ML to reasoning

Yet another online book on Deep Learning

(1) Kind of standard - Grokking Deep Learning

Libraries / code

(0) Data version control continues to develop





Troubling Trends in Machine Learning Scholarship

By Zachary C. Lipton* & Jacob Steinhardt* *equal authorship Originally presented at ICML 2018: Machine

snakers4 (Alexander), July 09, 09:04

2018 DS/ML digest 16

Papers / posts

(0) RL now solves Quake

(1) A post about AdamW

-- Adam generally requires more regularization than SGD, so be sure to adjust your regularization hyper-parameters when switching from SGD to Adam

-- Amsgrad turns out to be very disappointing

-- Refresher article

(2) How to tackle new classes in CV

(3) A new word in GANs?



(4) Using deep learning representations for search


-- library for fast search on python

(5) One more paper on GAN convergence

(6) Switchable normalization - adds a bit to ResNet50 + pre-trained models


(0) Disney starts to release datasets

Market / interesting links

(0) A motion to open-source GitHub

(1) Allegedly GTX 1180 sales are starting to appear in Asia (?)

(2) Some controversy regarding Andrew Ng and self-driving cars

(3) National AI strategies overviewed -

-- Canada C$135m

-- China has the largest strategy

-- Notably - countries like Finland also have one

(4) Amazon allegedly sells face recognition to the USA



Google’s DeepMind taught AI teamwork by playing Quake III Arena

Google’s DeepMind today shared the results of training multiple AI systems to play Capture the Flag on Quake III Arena, a multiplayer first-person shooter game. The AI played nearly 450,000 g…

snakers4 (Alexander), July 08, 06:06

A new multi-threaded addition to pandas stack?

Read about this some time ago (when it was still in development) - found essentially 3 alternatives:

- just being clever about optimizing your operations + using what is essentially a multi-threaded map/reduce in pandas

- pandas on ray

- dask (overkill)

So... I ran a test in the notebook I had on hand. It works. More tests will be done in the future.
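The "clever map/reduce" option boils down to: split into chunks, map in parallel, concatenate. A sketch with a plain list standing in for a dataframe (for CPU-bound pandas work you would swap in ProcessPoolExecutor and real partitions):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_map(func, items, n_chunks=4):
    """Map `func` over chunks of `items` in parallel, then flatten -
    the same pattern pandas-on-ray / dask apply to DataFrame partitions."""
    chunk_size = max(1, len(items) // n_chunks)
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        # pool.map preserves chunk order, so the result is deterministic
        mapped = list(pool.map(lambda chunk: [func(x) for x in chunk], chunks))
    return [y for chunk in mapped for y in chunk]
```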



Spark in me - Internet, data science, math, deep learning, philosophy

Pandas on Ray - RISE Lab

snakers4 (spark_comment_bot), July 07, 12:29

Playing with VAEs and their practical use

So, I played a bit with Variational Auto Encoders (VAE) and wrote a small blog post on this topic

Please like, share and repost!




Playing with Variational Auto Encoders - PCA vs. UMAP vs. VAE on FMNIST / MNIST

In this article I thoroughly compare the performance of VAE / PCA / UMAP embeddings on a simplistic domain (author's blog)

snakers4 (Alexander), July 04, 07:57

2018 DS/ML digest 15

What I filtered through this time

Market / news

(0) Letters by big company employees against using ML for weapons

- Microsoft

- Amazon

(1) Facebook open sources Dense Pose (essentially this is Mask-RCNN)


Papers / posts / NLP

(0) One more blog post about text / sentence embeddings

- key idea: different weighting

(1) One more sentence embedding calculation method

- ?

(2) Posts explaining NLP embeddings

- - some basics - SVD / Word2Vec / GloVe

-- SVD improves embedding quality (as compared to ohe)?

-- use log-weighting, use TF-IDF weighting (the above weighting)

- - word embedding properties

-- dimensions vs. embedding quality

(3) Spacy + Cython = 100x speed boost - - good to know about this as a last resort

- described use-case

you are pre-processing a large training set for a DeepLearning framework like pyTorch/TensorFlow

or you have a heavy processing logic in your DeepLearning batch loader that slows down your training

(4) Once again stumbled upon this -

(5) Papers

- Simple NLP embedding baseline

- NLP decathlon for question answering

- Debiasing embeddings

- Once again transfer learning in NLP by open-AI -





snakers4 (Alexander), July 02, 04:51

2018 DS/ML digest 14

Amazing article - why you do not need ML


- I personally love plain-vanilla SQL and in 90% of cases people under-use it

- I even wrote 90% of my JSON API on our blog in pure PostgreSQL xD

Practice / papers

(0) Interesting papers from CVPR

(1) Some down-to-earth obstacles to ML deploy

(2) Using synthetic data for CNNs (by Nvidia) -

(3) This puzzles me - so much effort and engineering spent on something ... strange and useless -

On paper they do a cool thing - investigate transfer learning between different domains, but in practice it is done on TF and there is no clear conclusion of any kind

(4) VAE + real datasets - only small Imagenet (64x64)

(5) Understanding the speed of models deployed on mobile -

(6) A brief overview of multi-modal methods

Visualizations / explanations

(0) Amazing website with ML explanations

(1) PCA and linear VAEs are close




No, you don't need ML/AI. You need SQL

A while ago, I did a Twitter thread about the need to use traditional and existing tools to solve everyday business problems other than jumping on new buzzwords, sexy and often times complicated technologies.

snakers4 (Alexander), July 01, 11:48

Measuring feature importance properly

Once again stumbled upon an amazing article about measuring feature importance for any ML algorithms:

(0) Permutation importance - if your ML algorithm is costly, then you can just shuffle a column and check importance

(1) Drop column importance - drop a column, re-train a model, check performance metrics

Why it is useful / caveats

(0) If you really care about understanding your domain - feature importances are a must have

(1) All of this works only for powerful models

(2) Landmines include - correlated or duplicate variables, data normalization

Correlated variables

(0) For RF - correlated variables share permutation importance roughly proportionally to their correlation

(1) Drop column importance can behave unpredictably

I personally like engineering different kinds of features and doing ablation tests:

(0) Among feature sets, sharing similar purpose

(1) Within feature sets
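Permutation importance can be sketched for any black-box scorer in a few lines of numpy (the toy linear scorer below is my own illustration; in practice you would wrap a trained model's validation metric):

```python
import numpy as np

def permutation_importance(score_fn, X, y, n_repeats=5, seed=0):
    """Importance of column j = baseline score minus the score
    after shuffling column j (averaged over several shuffles)."""
    rng = np.random.RandomState(seed)
    baseline = score_fn(X, y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])        # break the column <-> target link
            drops.append(baseline - score_fn(Xp, y))
        importances[j] = np.mean(drops)
    return importances

# Toy setup: y depends on column 0 only; score = negative MSE of a fixed model
rng = np.random.RandomState(1)
X = rng.randn(500, 3)
y = 3 * X[:, 0]
score = lambda X_, y_: -np.mean((3 * X_[:, 0] - y_) ** 2)
imp = permutation_importance(score, X, y)  # imp[0] large, imp[1:] ~ 0
```

Note the caveat from the post applies here too: with correlated columns the importance gets shared between them, so interpret per-column numbers with care.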


snakers4 (Alexander), June 10, 15:35

And now the article is also live -

Please support us with your likes!



Adversarial attacks in the Machines Can See 2018 competition

Or how I ended up on the winning team of the Machines Can See 2018 adversarial competition. The essence of any adversarial attack, shown by example. ...

snakers4 (Alexander), June 10, 06:50

An interesting idea from a CV conference

Imagine that you have some kind of algorithm, that is not exactly differentiable, but is "back-propable".

In this case you can have very convoluted logic in your "forward" statement (essentially something in between trees and dynamic programming) - for example a set of clever if-statements.

In this case you will be able to get the best of both worlds - both your algorithm (which you will have to re-implement in your framework) and backprop + CNN. Nice.

Ofc this works only for dynamic deep-learning frameworks.



Machines Can See 2018 adversarial competition

Happened to join forces with a team that won 2nd place in this competition


It was very entertaining and a new domain to me.

Read more materials:

- Our repo

- Our presentation

- All presentations




Playing with adversarial attacks on Machines Can See 2018 competition

This article is about the MCS 2018 competition and my participation in it, adversarial attack methods and how our team won (author's blog)

snakers4 (Alexander), June 07, 14:13

snakers4 (Alexander), May 25, 07:29

New competitions on Kaggle

Kaggle has started a new competition with video ... which is one of those competitions (read between the lines - blatant marketing)


- TensorFlow Record files

- Each of the top 5 ranked teams will receive $5,000 per team as a travel award - no real prizes

- The complete frame-level features take about 1.53TB of space (and yes, these are not videos, but extracted CNN features)

So, they are indeed using their platform to promote their business interests.

Released free datasets are really cool, but only when you can use them for transfer learning, which implies also seeing the underlying ground-level data (i.e. the images or videos themselves).



The 2nd YouTube-8M Video Understanding Challenge

Can you create a constrained-size model to predict video labels?