Spark in me - Internet, data science, math, deep learning, philosophy

snakers4 @ telegram, 1319 members, 1513 posts since 2016

All this - lost like tears in rain.

Data science, deep learning, sometimes a bit of philosophy and math. No bs.


Posts by tag «data_science»:

snakers4 (Alexander), July 15, 08:54

Sometimes in supervised ML tasks, leveraging the data structure in a self-supervised fashion really helps!

Playing with CrowdAI mapping competition

In my opinion it is a good testing ground for your SemSeg ideas, as the dataset is really clean and balanced




Playing with Crowd-AI mapping challenge - or how to improve your CNN performance with self-supervised techniques

In this article I describe a couple of neat optimizations / tricks / useful ideas that can be applied to many SemSeg / ML tasks

snakers4 (Alexander), July 13, 09:15

Tensorboard + PyTorch

I looked at this 6 months ago - and it was messy

now it looks really polished



tensorboard-pytorch - tensorboard for pytorch (and chainer, mxnet, numpy, ...)

snakers4 (spark_comment_bot), July 13, 05:22

2018 DS/ML digest 17

Highlights of the week

(0) Troubling trends with ML scholars

(1) NLP close to its ImageNet stage?

Papers / posts / articles

(0) Working with multi-modal data

- concatenation-based conditioning

- conditional biasing or scaling ("residual" connections)

- sigmoidal gating

- all in all this approach seems like a mixture of attention / gating for multi-modal problems

(1) Glow, a reversible generative model which uses invertible 1x1 convolutions

(2) Facebook's moonshots - I kind of do not understand much here


(3) RL concept flaws?


(4) Intriguing failures of convolutions - this is fucking amazing

(5) People are only STARTING to apply ML to reasoning

Yet another online book on Deep Learning

(1) Kind of standard - !/book/grokking-deep-learning/chapter-1/v-10/1

Libraries / code

(0) Data version control continues to develop




Like this post or have something to say => tell us more in the comments or donate!

Troubling Trends in Machine Learning Scholarship

By Zachary C. Lipton* & Jacob Steinhardt* *equal authorship Originally presented at ICML 2018: Machine

snakers4 (Alexander), July 09, 09:04

2018 DS/ML digest 16

Papers / posts

(0) RL now solves Quake

(1) A post about AdamW

-- Adam generally requires more regularization than SGD, so be sure to adjust your regularization hyper-parameters when switching from SGD to Adam

-- Amsgrad turns out to be very disappointing

-- Refresher article

(2) How to tackle new classes in CV

(3) A new word in GANs?



(4) Using deep learning representations for search


-- a library for fast search in Python

(5) One more paper on GAN convergence

(6) Switchable normalization - adds a bit to ResNet50 + pre-trained models


(0) Disney starts to release datasets

Market / interesting links

(0) A motion to open-source GitHub

(1) Allegedly, GTX 1180 cards are starting to appear on sale in Asia (?)

(2) Some controversy regarding Andrew Ng and self-driving cars

(3) National AI strategies overviewed -

-- Canada C$135m

-- China has the largest strategy

-- Notably - countries like Finland also have one

(4) Amazon allegedly sells face recognition to the USA



Google’s DeepMind taught AI teamwork by playing Quake III Arena

Google’s DeepMind today shared the results of training multiple AI systems to play Capture the Flag on Quake III Arena, a multiplayer first-person shooter game. The AI played nearly 450,000 g…

snakers4 (Alexander), July 08, 06:06

A new multi-threaded addition to pandas stack?

I read about this some time ago (when it was still in development) and found essentially 3 alternatives:

- just being clever about optimizing your operations + using what is essentially a multi-threaded map/reduce in pandas

- pandas on ray

- dask (overkill)
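
The first alternative above can be sketched roughly as follows: split the frame into row chunks, map a function over the chunks in a thread pool, then concatenate the results. This is a minimal sketch, not the code from the post. The names `parallel_apply` and `add_ratio` are hypothetical, and it assumes the per-chunk work is mostly vectorized NumPy / pandas operations; for pure-Python work a process pool would probably be the better choice.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd


def parallel_apply(df, func, n_chunks=4, max_workers=4):
    """Split df row-wise, run func on each chunk in a thread pool, concatenate."""
    # Chunk boundaries as integer row positions (hypothetical helper logic).
    bounds = np.linspace(0, len(df), n_chunks + 1, dtype=int)
    chunks = [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(func, chunks))  # "map" step, order is preserved
    return pd.concat(results)  # "reduce" step


def add_ratio(chunk):
    """Example per-chunk transformation: add a derived column."""
    chunk = chunk.copy()
    chunk["ratio"] = chunk["a"] / chunk["b"]
    return chunk


df = pd.DataFrame({"a": np.arange(1, 11), "b": np.arange(11, 21)})
result = parallel_apply(df, add_ratio)
print(result.head())
```

Because `pool.map` returns results in input order, the concatenated frame keeps the original row order.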





So...I ran a test in the notebook I had on hand. It works. More tests will be done in future.



Spark in me - Internet, data science, math, deep learning, philosophy

Pandas on Ray - RISE Lab

snakers4 (spark_comment_bot), July 07, 12:29

Playing with VAEs and their practical use

So, I played a bit with Variational Auto Encoders (VAE) and wrote a small blog post on this topic

Please like, share and repost!




Playing with Variational Auto Encoders - PCA vs. UMAP vs. VAE on FMNIST / MNIST

In this article I thoroughly compare the performance of VAE / PCA / UMAP embeddings on a simplistic domain - UMAP

snakers4 (Alexander), July 04, 07:57

2018 DS/ML digest 15

What I filtered through this time

Market / news

(0) Letters by big company employees against using ML for weapons

- Microsoft

- Amazon

(1) Facebook open sources Dense Pose (essentially this is Mask-RCNN)


Papers / posts / NLP

(0) One more blog post about text / sentence embeddings

- key idea: different weighting

(1) One more sentence embedding calculation method

- ?

(2) Posts explaining NLP embeddings

- - some basics - SVD / Word2Vec / GloVe

-- SVD improves embedding quality (as compared to one-hot encoding)?

-- use log-weighting, use TF-IDF weighting (the above weighting)

- - word embedding properties

-- dimensions vs. embedding quality

(3) Spacy + Cython = 100x speed boost - - good to know about this as a last resort

- described use-case

you are pre-processing a large training set for a DeepLearning framework like pyTorch/TensorFlow

or you have a heavy processing logic in your DeepLearning batch loader that slows down your training

(4) Once again stumbled upon this -

(5) Papers

- Simple NLP embedding baseline

- NLP decathlon for question answering

- Debiasing embeddings

- Once again transfer learning in NLP by OpenAI -




Download full.pdf 0.04 MB

snakers4 (Alexander), July 02, 04:51

2018 DS/ML digest 14

Amazing article - why you do not need ML


- I personally love plain-vanilla SQL and in 90% of cases people under-use it

- I even wrote 90% of my JSON API on our blog in pure PostgreSQL xD

Practice / papers

(0) Interesting papers from CVPR

(1) Some down-to-earth obstacles to ML deployment

(2) Using synthetic data for CNNs (by Nvidia) -

(3) This puzzles me - so much effort and engineering spent on something ... strange and useless -

On paper they do a cool thing - investigate transfer learning between different domains, but in practice it is done on TF and there is no clear conclusion of any kind

(4) VAE + real datasets - only small Imagenet (64x64)

(5) Understanding the speed of models deployed on mobile -

(6) A brief overview of multi-modal methods

Visualizations / explanations

(0) Amazing website with ML explanations

(1) PCA and linear VAEs are close




No, you don't need ML/AI. You need SQL

A while ago, I did a Twitter thread about the need to use traditional and existing tools to solve everyday business problems other than jumping on new buzzwords, sexy and often times complicated technologies.

snakers4 (Alexander), July 01, 11:48

Measuring feature importance properly

Once again stumbled upon an amazing article about measuring feature importance for any ML algorithms:

(0) Permutation importance - if your ML algorithm is costly, then you can just shuffle a column and check importance

(1) Drop column importance - drop a column, re-train a model, check performance metrics
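
Both approaches can be sketched in a few lines of NumPy. This is only an illustration under loud assumptions: a least-squares linear fit stands in for the real model, validation R^2 is the metric, the data is synthetic, and every function name here (`fit_linear`, `r2_score`, `permutation_importance`, `drop_column_importance`) is hypothetical rather than taken from the article.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares fit with an intercept; stands in for any model."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def r2_score(coef, X, y):
    """Coefficient of determination of the fitted model on (X, y)."""
    A = np.column_stack([X, np.ones(len(X))])
    residual = y - A @ coef
    return 1.0 - np.sum(residual ** 2) / np.sum((y - y.mean()) ** 2)


def permutation_importance(coef, X_val, y_val, rng):
    """Drop in validation R^2 after shuffling each column (no re-training)."""
    baseline = r2_score(coef, X_val, y_val)
    drops = []
    for j in range(X_val.shape[1]):
        X_perm = X_val.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        drops.append(baseline - r2_score(coef, X_perm, y_val))
    return np.array(drops)


def drop_column_importance(X_tr, y_tr, X_val, y_val):
    """Drop in validation R^2 after removing each column and re-training."""
    baseline = r2_score(fit_linear(X_tr, y_tr), X_val, y_val)
    drops = []
    for j in range(X_tr.shape[1]):
        keep = [k for k in range(X_tr.shape[1]) if k != j]
        coef = fit_linear(X_tr[:, keep], y_tr)
        drops.append(baseline - r2_score(coef, X_val[:, keep], y_val))
    return np.array(drops)


# Synthetic data: column 0 matters most, column 1 a little, column 2 is noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=2000)
X_tr, X_val, y_tr, y_val = X[:1500], X[1500:], y[:1500], y[1500:]

coef = fit_linear(X_tr, y_tr)
perm_imp = permutation_importance(coef, X_val, y_val, rng)
drop_imp = drop_column_importance(X_tr, y_tr, X_val, y_val)
print("permutation:", perm_imp)
print("drop column:", drop_imp)
```

Permutation importance re-uses one trained model, which is why it suits costly algorithms; drop column importance re-trains once per feature.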

Why it is useful / caveats

(0) If you really care about understanding your domain - feature importances are a must have

(1) All of this works only for powerful models

(2) Landmines include - correlated or duplicate variables, data normalization

Correlated variables

(0) For RF - correlated variables share permutation importance roughly proportionally to their correlation

(1) Drop column importance can behave unpredictably

I personally like engineering different kinds of features and doing ablation tests:

(0) Among feature sets, sharing similar purpose

(1) Within feature sets


snakers4 (Alexander), June 10, 15:35

And now the article is also live -

Please support us with your likes!



Adversarial attacks in the Machines Can See 2018 competition

Or how I ended up on the winning team of the Machines Can See 2018 adversarial competition. The essence of any adversarial attack, by example. It just so...

snakers4 (Alexander), June 10, 06:50

An interesting idea from a CV conference

Imagine that you have some kind of algorithm, that is not exactly differentiable, but is "back-propable".

In this case you can have very convoluted logic in your "forward" statement (essentially something in between trees and dynamic programming) - for example a set of clever if-statements.

This way you get the best of both worlds - your own algorithm (which you will have to re-implement in your framework) plus backprop + CNN. Nice.

Ofc this works only for dynamic deep-learning frameworks.
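
To make the idea concrete, here is a toy sketch in pure Python of what "dynamic" means: the graph is recorded while the forward pass runs, so an ordinary if-statement simply selects which operations get recorded. This is not the method from the conference and not a real framework; the `Value` class and `forward` function are hypothetical stand-ins for what a dynamic framework such as PyTorch does with tensors.

```python
class Value:
    """Scalar that records how it was computed, so gradients can flow back."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._local_grads = local_grads  # d(self) / d(parent) for each parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        """Reverse-mode pass over the graph recorded during the forward pass."""
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad


def forward(x, w):
    """Piecewise logic: the branch taken depends on the input value."""
    h = x * w
    if h.data > 0:       # plain Python control flow inside the forward pass
        return h * h     # d(out)/dw = 2 * x^2 * w
    return h * -0.5      # d(out)/dw = -0.5 * x


x, w = Value(2.0), Value(3.0)
out = forward(x, w)
out.backward()
print(out.data, w.grad)
```

The function as a whole is only piecewise differentiable, but each executed branch is, and that is all backprop needs.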



Machines Can See 2018 adversarial competition

Happened to join forces with a team that won 2nd place in this competition


It was very entertaining and a new domain to me.

Read more materials:

- Our repo

- Our presentation

- All presentations




Playing with adversarial attacks on Machines Can See 2018 competition

This article is about MCS 2018 competition and my participation in it, adversarial attack methods and how our team won

snakers4 (Alexander), June 07, 14:13

snakers4 (Alexander), May 25, 07:29

New competitions on Kaggle

Kaggle has started a new competition with video ... which is one of those competitions (read between the lines - blatant marketing)


- TensorFlow Record files

- Each of the top 5 ranked teams will receive $5,000 per team as a travel award - no real prizes

- The complete frame-level features take about 1.53TB of space (and yes, these are not videos, but extracted CNN features)

So, they are indeed using their platform to promote their business interests.

Released free datasets are really cool, but only when you can use them for transfer learning, which also implies seeing the underlying ground level data (i.e. images or videos).



The 2nd YouTube-8M Video Understanding Challenge

Can you create a constrained-size model to predict video labels?

snakers4 (Alexander), May 19, 14:32

A thorough and short guide to Matplotlib API

A bit of history, small look under the hood and logical explanation of how to use it best:


Python Plotting With Matplotlib (Guide) – Real Python

This article is a beginner-to-intermediate-level walkthrough on Python and matplotlib that mixes theory with example.

snakers4 (Alexander), May 15, 05:04

A great presentation about current state of particle tracking + ML

Also Kaggle failed to share this for some reason

Key problem - the current algorithm (Kalman filter) faces time constraints


snakers4 (Alexander), May 09, 06:55

A couple of articles about the harsh reality of DS / ML jobs

In a nutshell:

- politics

- unjustified decisions

- same as everywhere



True story, especially about political decisions, fight for power, useless dashboards and data monkeys.


The most difficult thing in data science: politics

Deep learning looks difficult to you? Come back after you get to know company politics, it will feel like a breeze...

snakers4 (Alexander), May 05, 13:13

Andrew Ng book

It is being released on a chapter-by-chapter basis


This book is not really technical, though - it's more or less a collection of advice on how to build ML models as a business process

Interesting idea - splitting your dev set into black box and eyeball dev set ... but this can be replaced by properly using Tensorboard when training...


snakers4 (Alexander), May 01, 16:52

2018 DS/ML digest 9

Market / libraries

(0) Tensorflow + Swift - wtf -

(1) Geektimes / going international -

(2) A service for renting GPUs ... from people

- Reddit

- Link

- Looks LXC based (afaik - the only user friendly alternative to Docker)

- Cool in theory, no idea how secure this is - we can assume it is as secure as providing a docker container to a stranger

- They did not reply to me within a week

(3) A friend sent me a new list of ... new yet another PyTorch NLP libraries

- (AllenNLP is the biggest library like this)

- I believe that such libraries are more or less useless for real tasks, but cool to know they exist

(4) New SpaceNet 4?

(5) A new super cool competition on Kaggle about particle physics?

Tutorials / basics

(0) Bias vs. Variance (RU)

(1) Yet another magic Jupyter guideline collection -

Real world ML applications

(0) Resnet + object detection (RU) - detecting people without helmets, 90% accuracy -

(1) about using embeddings with Tabular data -

Very similar to our approach on electricity

I personally do not recommend using their library by any means

(2) Comparing Google TPU vs. V100 with ResNet50 -

- speed -

- pricing -

- but ... buying GPUs is much cheaper

(3) Other blog posts about embeddings + tabular data

- Sales prediction

- Taxi drive prediction

MLP + classification + embeddings - /

(4) Albu's solution to SpaceNet - augmentations

CNN overview

Neural network part:

Split data to 4 folds randomly but the same number of each city tiles in every fold

Use resnet34 as encoder and unet-like decoder (conv-relu-upsample-conv-relu) with skip connection from every layer of network. Loss function: 0.8*binary_cross_entropy + 0.2*(1 – dice_coeff). Optimizer – Adam with default params.

Train on image crops 512*512 with batch size 11 for 30 epochs (8 times more images in one epoch)

Train 20 epochs with lr 1e-4

Train 5 epochs with lr 2e-5

Train 5 epochs with lr 4e-6

Predict on full image with padding 22 on borders (1344*1344).

Merge folds by mean
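
The loss and the learning-rate steps from the overview above can be sketched in NumPy. This is a hedged illustration, not Albu's code: the soft Dice formulation with a smoothing term, the clipping epsilon, and the reading of the three "Train N epochs" lines as consecutive stages of the 30 epochs are all assumptions, and the function names are hypothetical.

```python
import numpy as np


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))


def dice_coeff(pred, target, smooth=1.0):
    """Soft Dice coefficient; `smooth` is an assumed stabilizing term."""
    intersection = np.sum(pred * target)
    return float((2.0 * intersection + smooth) / (np.sum(pred) + np.sum(target) + smooth))


def combined_loss(pred, target):
    """0.8 * binary_cross_entropy + 0.2 * (1 - dice_coeff), as quoted above."""
    return 0.8 * bce(pred, target) + 0.2 * (1.0 - dice_coeff(pred, target))


def learning_rate(epoch):
    """20 epochs at 1e-4, then 5 at 2e-5, then 5 at 4e-6 (epoch is 0-based)."""
    if epoch < 20:
        return 1e-4
    if epoch < 25:
        return 2e-5
    return 4e-6


target = np.array([[1.0, 0.0], [0.0, 1.0]])
print(combined_loss(target, target))        # near zero for a perfect mask
print(combined_loss(1.0 - target, target))  # large for a fully wrong mask
print([learning_rate(e) for e in (0, 20, 25)])
```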

Jobs / job market

(0) Developers by country by scraping GitHub -

- developers count vs. GDP R^2 = 84%

- developers count vs. population - R^2 = 50%


(0) Interactive tool for visualizing convolutions -


(0) Open Images v4 open-sourced


- the dataset itself

- categories





swift - Swift for TensorFlow documentation repository.

snakers4 (Alexander), May 01, 05:37

Playing with unsupervised learning in genetics

A small blog post on this topic

The first thing that springs to mind is an RNN, but what if there is no annotation and it is not known whether the data consists of valid sequences?


Playing with genetic markers, clustering and visualization

Mesmerizing structures found in data: encoding, dimension reduction, clustering and visualization of a dataset with genetic markers

snakers4 (Alexander), April 28, 08:45

Using Mendeley to read papers

Looks like when you migrate to a new PC it also can migrate your literature library.



snakers4 (Alexander), April 27, 09:58

A handy snippet for `IOU` calculation
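
The linked snippet itself is not reproduced here, so below is a minimal sketch of the usual calculation, assuming axis-aligned boxes given as `(x_min, y_min, x_max, y_max)`. The function name `iou` and the box format are assumptions, not necessarily what the linked answer uses.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle (if there is one).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width / height collapse to zero when the boxes do not overlap.
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap area 1, union area 7
```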


Calculating percentage of Bounding box overlap, for image detector evaluation

In testing an object detection algorithm in large images, we check our detected bounding boxes against the coordinates given for the ground truth rectangles. According to the Pascal VOC challenges,

Widen Jupyter editor to 100% wide screen

Just apply this CSS

#texteditor-container {
    width: 95%;
}



snakers4 (Alexander), April 26, 04:24

On the surface looks like an interesting competition

Well, I said that about Power Laws - but then it turned out otherwise.

So far I can see CV, NLP and tables in one mix.


Avito Demand Prediction Challenge

Predict demand for an online classified ad

snakers4 (Alexander), April 22, 15:02

DWT article on

DS Bowl article is live on


Please support us with your likes.


Applying Deep Watershed Transform in the Kaggle Data Science Bowl 2018 competition

Applying Deep Watershed Transform in the Kaggle Data Science Bowl 2018 competition. We present a translation of the article at the link, along with the original dockerized code.

snakers4 (Alexander), April 22, 11:40

snakers4 (Alexander), April 20, 04:58

Useful Python abstractions / sugar / patterns

I already shared a book about patterns, which contains mostly high level / more complicated patterns. But for writing ML code sometimes a simple imperative / functional programming style is ok.

So - I will be posting about simple and really powerful python tips I am learning now.

This time I found out about map and filter, which are super useful for data preprocessing:


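# map: apply a function to every item
items = [1, 2, 3, 4, 5]
squared = list(map(lambda x: x**2, items))
print(squared)

# filter: keep only the items that satisfy a predicate
number_list = range(-5, 5)
less_than_zero = list(filter(lambda x: x < 0, number_list))
print(less_than_zero)

Also found this book -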



snakers4 (Alexander), April 17, 19:14

Andrew Ng released the first 4 chapters of his new book

So far looks not really technical



Download Ng_MLY01.pdf 1.52 MB

snakers4 (Alexander), April 17, 08:50

DS Bowl 2018 top solution


This is really interesting...their approach to separation is cool

snakers4 (Alexander), April 16, 10:17

A draft of the article about DS Bowl 2018 on Kaggle.

This time this was a lottery.

Good that I did not really spend much time, but this time I learned a lot about watershed and some other instance segmentation methods!

The article is accompanied by a dockerized PyTorch code release on GitHub:



This is a beta, you are welcome to comment and respond.





Applying Deep Watershed Transform to Kaggle Data Science Bowl 2018 (dockerized solution)

In this article I will describe my solution to the DS Bowl 2018 and why it was a lottery and post a link to my dockerized solution

snakers4 (Alexander), April 15, 08:06

2018 DS/ML digest 8

As usual my short bi-weekly (or less) digest of everything that passed my BS detector

Market / blog posts

(0) about the importance of accessibility in ML -

(1) Some interesting news about market, mostly self-driving cars (the rest is crap) -

(2) US$600m investment into Chinese face recognition -

Libraries / frameworks / tools

(0) New 5 point face detector in Dlib for face alignment task -

(1) Finally a more proper comparison of XGB / LightGBM / CatBoost - (also see my thoughts here)

(3) CNNs on FPGAs by ZFTurbo



(4) Data version control - looks cool



-- but I will not use it - because proper logging and treating data as immutable solves the issue

-- looks like over-engineering for the sake of overengineering (unless you create 100500 datasets per day)


(0) TF Playground to see how the simplest CNNs work -


(0) Looks like GAN + ResNet + Unet + content loss - can easily solve simpler tasks like deblurring

(1) You can apply dilated convolutions to NLP tasks -

(2) High level overview of face detection in -

(3) Alternatives to DWT and Mask-RCNN / RetinaNet?

- Has anybody tried anything here?


(0) A more disciplined approach to training CNNs - (LR regime, hyper param fitting etc)

(1) GANs for image compression -

(2) Paper reviews from ODS - mostly moonshots, but some are interesting



(3) SqueezeNext - the new SqueezeNet -




snakers4 (Alexander), April 12, 08:08

DS Bowl 2018 stage 2 data was released.

It has completely different distribution from stage 1 data.

How do you like them, apples?

Looks like Kaggle admins really have no idea about dataset curation, or all of this is meant to misguide manual annotators.

Anyway - looks like random bs.