Spark in me - Internet, data science, math, deep learning, philo

snakers4 @ telegram, 1252 members, 1404 posts since 2016

All this - lost like tears in rain.

Internet, data science, math, deep learning, philosophy. No bs.

Our website
Our chat
DS courses review

Posts by tag «data_science»:

snakers4 (Alexander), April 17, 19:14

Andrew Ng released the first 4 chapters of his new book

So far it does not look very technical



Download Ng_MLY01.pdf 1.52 MB

snakers4 (Alexander), April 17, 08:50

DS Bowl 2018 top solution


This is really interesting...their approach to separation is cool

snakers4 (Alexander), April 16, 10:17

A draft of the article about DS Bowl 2018 on Kaggle.

This time this was a lottery.

Good that I did not really spend much time, but this time I learned a lot about watershed and some other instance segmentation methods!

An article is accompanied by a dockerized PyTorch code release on GitHub:



This is a beta, you are welcome to comment and respond.





Applying Deep Watershed Transform to Kaggle Data Science Bowl 2018 (dockerized solution)

In this article I will describe my solution to the DS Bowl 2018, explain why it was a lottery, and post a link to my dockerized solution

snakers4 (Alexander), April 15, 08:06

2018 DS/ML digest 8

As usual my short bi-weekly (or less) digest of everything that passed my BS detector

Market / blog posts

(0) about the importance of accessibility in ML -

(1) Some interesting news about market, mostly self-driving cars (the rest is crap) -

(2) US$600m investment into Chinese face recognition -

Libraries / frameworks / tools

(0) New 5 point face detector in Dlib for face alignment task -

(1) Finally a more proper comparison of XGB / LightGBM / CatBoost - (also see my thoughts here)

(3) CNNs on FPGAs by ZFTurbo



(4) Data version control - looks cool



-- but I will not use it - because proper logging and treating data as immutable solve the issue

-- looks like over-engineering for the sake of over-engineering (unless you create a zillion datasets per day)


(0) TF Playground to see how the simplest neural networks work -


(0) Looks like GAN + ResNet + Unet + content loss - can easily solve simpler tasks like deblurring

(1) You can apply dilated convolutions to NLP tasks -

(2) High level overview of face detection in -

(3) Alternatives to DWT and Mask-RCNN / RetinaNet?

- Has anybody tried anything here?


(0) A more disciplined approach to training CNNs - (LR regime, hyper param fitting etc)

(1) GANs for image compression -

(2) Paper reviews from ODS - mostly moonshots, but some are interesting



(3) SqueezeNext - the new SqueezeNet -




snakers4 (Alexander), April 12, 08:08

DS Bowl 2018 stage 2 data was released.

It has completely different distribution from stage 1 data.

How do you like them apples?

Looks like Kaggle admins really have no idea about dataset curation, or all of this is meant to mislead manual annotators.

Anyway - looks like random bs.



snakers4 (Alexander), April 10, 09:42

Yolov3 - best paper.

Not in terms of scientific contribution, but as a rebuttal of DS community BS.

Very funny read.


If you want a proper comparison of object detection algorithms - use this paper

Looks like SSD and YOLO are reasonably good and fast, and RCNN can be properly tuned to be 3-5x slower (not 100x) and more accurate.



Download YOLOv3.pdf 2.29 MB

snakers4 (Alexander), April 08, 13:48

As you may know (for newer people on the channel), sometimes we publish small articles on the website.

This time it covers a recent Power Laws challenge on DrivenData, which at first seemed legit and cool, but in the end turned back into a pumpkin.

Here is an article:





Playing with electricity - forecasting 5000 time series

In this article I share our experience participating in a recent time series challenge on DrivenData and my personal ideas about ML competitions

snakers4 (Alexander), March 30, 03:43

Finally a good piece on RF feature selection


snakers4 (Alexander), March 29, 08:02

Pandas vs. numpy speed benchmarks




Fast-Pandas - Benchmark for different operations in pandas against various dataframe sizes.

snakers4 (Alexander), March 26, 13:26

NLP project peculiarities

(0) Always handle new words somehow

(1) Easy evaluation of test results - you can just look at it

(2) Key difference is always in the domain - short or long sequences / sentences / whole documents - require different features / models / transfer learning

Basic Approaches to modern NLP projects

(0) Basic pipeline

(1) Basic preprocessing

- Stemming / lemmatization

- Regular expressions

(2) Naive / old school approaches that can just work

- Bag of Words => simple model

- Bag of Words => tf-idf => SVD / PCA / NMF => simple model

(3) Embeddings

- Average / sum of Word2Vec embeddings

- Word2Vec * tf-idf >> Doc2Vec

- Small documents => embeddings work better

- Big documents => bag of features / high level features

(4) Sentiment analysis features


- n-chars => won several Kaggle competitions

(5) Also a couple of articles for developing intuition for sentence2vec



(6) Transfer learning in NLP - looks like it may become more popular / prominent

- Jeremy Howard's preprint on NLP transfer learning -



ML tutorial for NLP, Alexey Natekin
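The bag-of-words => tf-idf step from the "naive / old school approaches" above can be sketched with the stdlib alone. This is a toy illustration on hypothetical tokenized documents; a real project would use sklearn's CountVectorizer / TfidfVectorizer and then feed the matrix into SVD / PCA / NMF and a simple model:

```python
import math
from collections import Counter

def bow(tokens, vocab):
    # bag-of-words: raw count vector over a fixed vocabulary
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def tfidf(docs):
    # docs: list of token lists; returns the vocabulary and a tf-idf matrix
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    idf = {w: math.log(n / df[w]) for w in vocab}
    rows = []
    for d in docs:
        counts = Counter(d)
        rows.append([counts[w] / len(d) * idf[w] for w in vocab])
    return vocab, rows

docs = [["cats", "like", "milk"],
        ["dogs", "like", "bones"],
        ["cats", "and", "dogs"]]
vocab, X = tfidf(docs)
# X can now go into SVD / PCA / NMF => simple model
```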

snakers4 (Alexander), March 26, 09:56

So, I have briefly watched Andrew Ng's series on RNNs.

It's super cool if you do not know much about RNNs and / or want to refresh your memories and / or want to jump start your knowledge about NLP.

Also he explains stuff with really simple and clear illustrations.

Tasks in the course are also cool (notebook + submit button), but they are quite removed from real practice, as they imply coding gradients and forward passes from scratch in python.

(which I did enough during his classic course)

Also no GPU tricks / no research / production boilerplate ofc.

Below are ideas and references that may be useful to know about NLP / RNNs for everyone.

Also for NLP:

(0) Key NLP sota achievements in 2017



(1) Consider courses and notebooks

(2) Consider NLP newsletter

(3) Consider excellent PyTorch tutorials

(4) There is a lot of quality code in the PyTorch community (e.g. 1-2 page GloVe implementations!)

(5) Brief 1-hour intro to practical NLP

Also related posts on the channel / libraries:

(1) Pre-trained vectors in Russian -

(2) How to learn about CTC loss (when our seq2seq )

(3) Most popular NLP libraries for English -

(4) NER in Russian -

(5) Lemmatization library in Russian - - recommended by a friend

Basic tasks considered more or less solved by RNNs

(1) Speech recognition / trigger word detection

(2) Music generation

(3) Sentiment analysis

(4) Machine translation

(5) Video activity recognition / tagging

(6) Named entity recognition (NER)

Problems with standard CNN when modelling sequences:

(1) Different length of input and output

(2) Features for different positions in the sequence are not shared

(3) Enormous number of params

Typical word representations

(1) One-hot encoded vectors (10-50k for typical solutions, 100k-1m for commercial / sota solutions)

(2) Learned embeddings - reduce computation burden to ~300-500 dimensions instead of 10k+

Typical rules of thumb / hacks / practical approaches for RNNs

(0) Typical architectures - deep GRU (lighter) and LSTM cells

(1) Tanh or ReLU for hidden layer activation

(2) Sigmoid for output when classifying

(3) Usage of EndOfSentence, UNKnown word, StartOfSentence, etc tokens

(4) Usually word level models are used (not character level)

(5) Passing hidden state in encoder-decoder architectures

(6) Vanishing gradients - typically GRUs / LSTMs are used

(7) Very long sequences for time series - AR features are used instead of long windows; typically GRUs and LSTMs are good for sequences of 200-300 steps (easier and more straightforward than attention in this case)

(8) Exploding gradients - standard solution - clipping, though it may lead to inferior results (from practice)

(9) Teacher forcing - substitute predicted y_t+1 with real value when training seq2seq model during forward pass

(10) Peephole connections - let the GRU or LSTM see c_t-1 from the previous step

(11) Finetune imported embeddings for smaller tasks with smaller datasets

(12) On big datasets - may make sense to learn embeddings from scratch

(13) Usage of bidirectional LSTMs / GRUs / Attention where applicable
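Point (8), clipping against exploding gradients, can be illustrated with a minimal global-norm clipping sketch. Pure python here for clarity; in practice this is what e.g. torch.nn.utils.clip_grad_norm_ does over all parameters:

```python
import math

def clip_grad_norm(grads, max_norm):
    # rescale gradients so their global L2 norm is at most max_norm;
    # gradients below the threshold pass through unchanged
    total = math.sqrt(sum(g * g for g in grads))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads
```

Note the caveat from the text: clipping stabilizes training but may lead to slightly inferior results, so max_norm is worth tuning.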

Typical similarity functions for high-dim vectors

(0) Cosine (angle)

(1) Euclidean
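Both similarity functions in a minimal pure-python sketch (numpy/scipy versions are used in practice):

```python
import math

def cosine_similarity(a, b):
    # angle-based similarity: ignores vector magnitude, which is why it is
    # the default choice for comparing word / document embeddings
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    # straight-line distance: sensitive to magnitude
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```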

Seminal papers / constructs / ideas:

(1) Training embeddings - the later the methods came out - the simpler they are

- Matrix factorization techniques

- Naive approach using a language model + softmax (intractable for large corpora)

- Negative sampling + skip gram + logistic regression = Word2Vec (2013)


-- useful ideas

-- if there is information - a simple model (i.e. logistic regression) will work

-- negative subsampling - sample words with frequency between uniform and the 3/4 power of corpus frequency to downweight frequent words

-- train only a limited number of classifiers (i.e. 5-15, 1 positive sample + k negative) on each update

-- skip-gram model in a nutshell -

- GloVe - Global Vectors (2014)


-- supposedly GloVe is better than Word2Vec given the same resources -

-- in practice word vectors with 200 dimensions are enough for applied tasks

-- considered to be one of sota solutions now (afaik)
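The 3/4-power noise distribution mentioned above under Word2Vec negative sampling can be sketched as follows (toy word frequencies, assumed for illustration):

```python
def negative_sampling_probs(freqs, power=0.75):
    # word2vec noise distribution: raise raw counts to the 3/4 power,
    # which flattens the gap between very frequent and rare words
    weighted = {w: f ** power for w, f in freqs.items()}
    total = sum(weighted.values())
    return {w: v / total for w, v in weighted.items()}

# hypothetical corpus counts: "the" is 100x more frequent than "cat",
# but after the 3/4 power it is sampled only ~32x more often
freqs = {"the": 1000, "cat": 10}
probs = negative_sampling_probs(freqs)
```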

(2) BLEU score for translation

- essentially a brevity penalty times the exp of the average log of the modified precisions of 1- to 4-grams



(3) Attention is all you need


To be continued.
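The BLEU formulation from (2) above can be sketched as a simplified single-reference implementation (real toolkits add smoothing and multi-reference support):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    # simplified single-reference BLEU: geometric mean of modified
    # n-gram precisions (n = 1..4) times a brevity penalty
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        # "modified" precision: each reference n-gram can be matched
        # at most as many times as it occurs in the reference
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        if overlap == 0:
            return 0.0  # real implementations smooth instead of returning 0
        log_precisions.append(math.log(overlap / total))
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(log_precisions) / max_n)
```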






snakers4 (Alexander), March 23, 10:26

Finally a proper LightGBM / XGB / CatBoost practical comparison!


CatBoost vs. Light GBM vs. XGBoost

Who is going to win this war of predictions, and at what cost? Let's explore.

snakers4 (Alexander), March 20, 13:40

A video about realistic state of chat-bots (RU)




Neural Networks and Deep Learning lab at MIPT

When you call a service line or a bank, the support dialogue follows the same script with minor variations. Answering questions "by the book" can be done not by a human but by a chat-bot. On how neural...

snakers4 (Alexander), March 20, 06:57

A practical note on using df.to_feather()

Works really well, if you have an NVME drive and you want to save a large dataframe to disk in binary format.

If your NVME is properly installed it will give you 1.5-2+GB/s read/write speed, so even if your df is 20+GB in size, it will read literally in seconds.

The ETL process to produce such a df may take minutes.


snakers4 (Alexander), March 16, 13:15

New cool trick - use

df.to_feather() instead of pickle or csv

Supposed to work much faster as it dumps the data the same way it is laid out in RAM


snakers4 (Alexander), March 13, 09:14

An article about how to use CLI params in python with argparse


If this is too slow - then just use this as a starter boilerplate

- (this is how I learned it)

Why do you need this?

- Run long overnight (or even day long) jobs in python

- Run multiple experiments

- Make your code more tractable for other people

- Expose a simple API for others to use

The same can be done via newer frameworks, but why learn an abstraction that may die soon, instead of using instruments that have worked for decades?


Python, argparse, and command line arguments - PyImageSearch

In this tutorial I discuss what command line arguments are, why we use them, and how to use argparse + Python for command line arguments.
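A minimal argparse starter of the kind described above; the flags (--epochs, --lr, --resume) are hypothetical examples for a training script:

```python
import argparse

# minimal CLI for a long-running training job
parser = argparse.ArgumentParser(description="Train a model overnight")
parser.add_argument("--epochs", type=int, default=10,
                    help="number of passes over the data")
parser.add_argument("--lr", type=float, default=1e-3,
                    help="learning rate")
parser.add_argument("--resume", action="store_true",
                    help="resume from the last checkpoint")

# in a real script this would be parser.parse_args() reading sys.argv;
# an explicit list is passed here so the example is self-contained
args = parser.parse_args(["--epochs", "50", "--resume"])
```

This is also what makes experiments reproducible: the exact invocation documents the run.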

snakers4 (Alexander), March 07, 05:03

2018 DS/ML digest 6


(1) A new amazing post by Google on Distill -

This is really amazing work, but their notebooks tell me that it is a far cry from being usable by the community -

This is how the CNN sees the image -

Expect this to be packaged as part of Tensorboard in a year or so)


(1) New landmark dataset by Google - - looks cool, but ...

Prizes in the accompanying Kaggle competitions are laughable

Given that datasets are really huge...~300G

Also, if you win, you will have to buy a ticket to the USA with your own money ...

(2) Useful script to download the images

(3) Imagenet for satellite imagery - - pre-register paper

(4) CVPR 2018 for satellite imagery -

Papers / new techniques

(1) Improving RNN performance via auxiliary loss -

(2) Satellite imaging for emergencies -

(3) Baidu - neural voice cloning -


(1) Google TPU benchmarks -

As usual such charts do not show consumer hardware.

My guess is that a single 1080Ti may deliver comparable performance (i.e. 30-40% of it) for ~US$700-1000, i.e. ~150 hours of rent (this is ~1 week!)

Miners say that 1080Ti can work 1-2 years non-stop

(2) MIT and SenseTime announce effort to advance artificial intelligence research

(3) Google released its ML course - - but generally it is a big TF ad ... Andrew Ng is better for grasping concepts


(1) Interesting thing - all ISPs have some preferential agreements between each other -




The Building Blocks of Interpretability

Interpretability techniques are normally studied in isolation. We explore the powerful interfaces that arise when you combine them -- and the rich structure of this combinatorial space.

snakers4 (Alexander), March 06, 13:09

So, I have benched XGB vs LightGBM vs CatBoost. I also endured xgb and lgb GPU installation. This is just a general usage impression, not a hard benchmark.

My thoughts are below.

(1) Installation - CPU

(all) - are installed via pip or conda in one line

(2) Installation - GPU

(xgb) - easily done via following their instructions, only nvidia drivers required;

(lgb) - easily done on Azure cloud. on Linux requires some drivers that may be lagging. Could not integrate my Dockerfile with their instructions, but their Dockerfile worked perfectly;

(cb) - instructions were too convoluted for me to follow;

(3) Docs / examples

(xgb) - the worst one, fine-tuning guidelines are murky and unpolished;

(lgb) - their python API is not entirely well documented (e.g. some options can be found only on forums), but overall the docs are very decent + some ft hints;

(cb) - overall docs are nice, a lot of simple examples + some boilerplate in .ipynb format;

(4) Regression

(xgb) - works poorly. Maybe my params are bad, but out-of-the-box params of sklearn API use 5-10x more time than the rest and lag in accuracy;

(lgb) - best performing one out of the box, fast + accurate;

(cb) - fast but less accurate;

(5) Classification

(xgb) - best accuracy

(lgb) - fast, high accuracy

(cb) - fast, worse accuracy

(6) GPU usage

(xgb) - for some reason accuracy when using full GPU options is really really bad. Forum advice does not help.


snakers4 (Alexander), March 05, 06:45

Found some starter boilerplate of how to use hyperopt instead of gridsearch for faster search:

- here -

- and here -



catboost - CatBoost is an open-source gradient boosting on decision trees library with categorical features support out of the box for Python, R
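The advantage of hyperopt over grid search is that it samples hyperparameters from distributions instead of walking a fixed grid (and its TPE algorithm does so adaptively). A stdlib sketch of the sampling idea, with a hypothetical loss surface standing in for a real model's validation score:

```python
import random

random.seed(0)

def objective(lr, depth):
    # hypothetical validation-loss surface with its optimum
    # near lr=0.07, depth=5 (a real objective would train a model)
    return (lr - 0.07) ** 2 + 0.01 * (depth - 5) ** 2

# grid search: 9 fixed evaluations on a coarse grid
grid = [(lr, d) for lr in (0.01, 0.1, 1.0) for d in (3, 6, 9)]
best_grid = min(objective(lr, d) for lr, d in grid)

# sampled search with the same budget: log-uniform lr, integer depth;
# values between grid points get explored
# (hyperopt's fmin + tpe.suggest does this adaptively, using history)
samples = [(10 ** random.uniform(-2, 0), random.randint(3, 9))
           for _ in range(9)]
best_sampled = min(objective(lr, d) for lr, d in samples)
```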

snakers4 (Alexander), March 04, 06:26

Amazing article about the most popular warning in Pandas



A guide

Everything you need to know about the most common (and most misunderstood) warning in #pandas. #python
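The warning in question is SettingWithCopyWarning, triggered by chained indexing; a minimal sketch of the anti-pattern and the single-.loc fix:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# chained indexing may assign into a temporary copy and trigger
# SettingWithCopyWarning (the write can silently not reach df):
# df[df["a"] > 1]["b"] = 0

# one .loc call: a single indexing operation, no ambiguity
df.loc[df["a"] > 1, "b"] = 0
```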

snakers4 (Alexander), March 03, 09:03

Modern Pandas series about classic time-series algorithms


Some basic boilerplate and baselines



datas-frame – Modern Pandas (Part 7): Timeseries

Posts and writings by Tom Augspurger

snakers4 (Alexander), March 02, 02:50

It is tricky to launch XGB fully on GPU. People report that on the same data CatBoost has inferior quality w/o tweaking (but is faster). LightGBM is reported to be faster and to have the same accuracy.

So I tried adding LightGBM w GPU support to my Dockerfile - - but I encountered some driver Docker issues.

One of the caveats I understood - it supports only older Nvidia drivers, up to 384.

Luckily, there is a Dockerfile by MS that seems to be working (+ jupyter, but I could not install extensions)



LightGBM - A fast, distributed, high performance gradient boosting (GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine lea...

snakers4 (Alexander), February 27, 15:49

A great survey - how to work with imbalanced data


8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset

Has this happened to you? You are working on your dataset. You create a classification model and get 90% accuracy immediately. “Fantastic” you think. You dive a little deeper and discover that 90% of the data belongs to one class. Damn! This is an example of an imbalanced dataset and the frustrating results it can …
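One of the 8 tactics, naive random oversampling of minority classes, can be sketched as follows (toy data; libraries like imbalanced-learn offer production versions):

```python
import random

random.seed(0)

def oversample(samples, labels):
    # duplicate minority-class rows at random until every class
    # matches the majority count
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    target = max(len(rows) for rows in by_class.values())
    out_x, out_y = [], []
    for y, rows in by_class.items():
        extra = [random.choice(rows) for _ in range(target - len(rows))]
        for s in rows + extra:
            out_x.append(s)
            out_y.append(y)
    return out_x, out_y

# 9:1 imbalance, like the 90%-accuracy trap described above
X = [[i] for i in range(10)]
y = [0] * 9 + [1]
Xb, yb = oversample(X, y)
```

After balancing, a 90% "accuracy" baseline from predicting the majority class disappears.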

snakers4 (Alexander), February 24, 05:56

2017 DS/ML digest 5

Fun stuff

(1) Hardcore metal + CNNs + style transfer -

SpaceNet challenge

(1) Post by Nvidia

(2) Some links to sota semseg articles

(3) Useful tools for CV - floodfill and grabcut, but the guys from Nvidia did not notice ... that the road width was in the geojson data...

(4) Looks like they replicated the results just for PR, but their masks do not look appealing

Research / papers / libraries

(1) Neural Voice Cloning with a Few Samples - (demos

(2) A library for CRFs in Python -

(3) 1000x faster CNN architecture search - still on CIFAR - (PyTorch

(4) URLs + CNN - malicious link detection -


(1) 3m anime image dataset -

(2) Google HDR dataset -


(1) Idea - AMT + blockchain -

(2) ARM to make processors for CNNs? -

(3) Google TPU in beta - - very expensive. + Note the rumours that Google's own people do not use their TPU quota

(4) One guy managed to deploy a PyTorch model using ONNX -




Hardcore Anal Hydrogen "Jean-Pierre" (2018, Apathia Records)

Order "Hypercut" : Bandcamp : « A gigantic piece of art here to mess with wh...

snakers4 (Alexander), February 22, 09:41

So ofc I tried the new Jupyter lab.

And it is really cool that something so simple / cool / useful is completely free / no strings attached (yet). But I will not use it professionally.

Use my Dockerfile if you want to check it out with my DL environment:


But in a nutshell it worked with these params inside the container:

CMD jupyter lab --port=8888 --ip= --no-browser

And installation is as easy as:

conda install -c conda-forge jupyterlab

Docs are a bit sparse for now


But this is a list of reasons, why you might consider sticking to ssh pass-through for auto-complete / terminal and jupyter notebook with extensions:

(0) It is still in beta, so unless your professional path is connected with node-js / web - you better pass now

(1) The existence of amazing extensions for Jupyter notebook that do 95% of what you might need -

(2) The built-in terminal is much better than before, but it pales in comparison with Putty or even a standard linux shell (autocomplete?)

(3) Some of built-in extensions like image viewer are really useful, but overall the product is a bit beta (which they openly say it is)

And here is why turning Jupyter notebook into a real environment is really cool:

(1) Building everything based on extensions IS REALLY COOL - and in the long run will encourage people to port jupyter extensions and build a really powerful tool. Also this implies diversity and freedom unlike shitty tools like Zeppelin

(2) After some effort, it may really replace the terminal, IDE, desktop environment and notebooks for data-oriented people (I guess in 6-12 months)

(3) Structuring extensions and npm packages lures the most fast developing web-developer community to support the project and provides transparency and clarity


Dockerfile update

snakers4 (Alexander), February 15, 09:50

Visualizing Large-scale and High-dimensional Data:

A paper behind an awesome library


Follows the success of T-SNE, but is MUCH faster

Typical visualization pipeline

Also works awesomely with

Convergence speed

(1) (on a machine with 512GB memory, 32 cores at 2.13GHz)

(2) 3m data points * 100 dimensions: LargeVis is up to 30x faster at graph construction and 7x at graph visualization





T-SNE drawbacks

(1) K-nearest neighbor graph = computational bottleneck

(2) T-SNE constructs the graph using the technique of vantage-point trees, the performance of which significantly deteriorates for high dimensions

(3) Parameters of the t-SNE are very sensitive on different data sets

Algorithm itself

(1) Create a small number of projection trees (similar to random forest). Then for each node of the graph search the neighbors of its neighbors, which are also likely to be candidates of its nearest neighbors

(2) Use SGD (or asynchronous SGD) to minimize the graph loss

(3) Clever sampling - sample the edges with the probability proportional to their weights and then treat the sampled edges as binary edges. Also sample some negative (not observed) edges




umap - Uniform Manifold Approximation and Projection

snakers4 (Alexander), February 14, 11:48

2017 DS/ML digest 4

Applied cool stuff

- How Dropbox built their OCR - via CTC loss -

Fun stuff

- CNN forward pass done in Google Sheets -

- New Boston Robotics robot - opens doors now -

- Cool but toothless list of jupyter notebooks with illustrations and models

- Best CNN filter visualization tool ever -

New directions / moonshots / papers

- IMPALA from Google - DMLab-30, a set of new tasks that span a large variety of challenges in a visually unified environment with a common action space



- Trade crypto via RL -

- SparseNets? -

- Use Apple watch data to predict diseases

- Google - Evolution in auto ML kicks in faster than RL -

- R-CNN for human pose estimation + dataset

-- Website + video

-- Paper

Google's Colaboratory gives free GPUs?

- Old GPUs

- 12 hours limit, but very cool in theory



Sick sad world

- China has police Google Glass with face recognition

- Why slack sucks -

-- Email + google docs is better for real communication


- Globally there are 22k ML developers

- One more AI chip moonshot -

- Google made their TPUs public in beta - US$6 per hour

- CNN performance comparable to human level in dermatology (R-CNN) -

- Deep learning is greedy, brittle, opaque, and shallow

- One more medical ML investment - US$25m for cancer -




snakers4 (Alexander), February 14, 04:54

Article on SpaceNet Challenge Three in Russian on habrhabr - please support us with your comments / upvotes


Also if you missed:

- The original article

- The original code release

... and Jeremy Howard from retweeted our solution, lol



But to give some idea which pain the TopCoder platform induces on the contestants, you can read

- Data Download guide

- Final testing guide

- Code release for their verification process




From satellite images to graphs (the SpaceNet Road Detector competition) - a top-10 finish and code (translation)

Hi, Habr! I present a translation of the article. This is Vegas with the provided labels and the test dataset; the white squares are probably the held-out validation...

snakers4 (Alexander), February 12, 04:18

Useful links about Datashader

- Home -

- Youtube presentation, practical presentations

-- OpenSky

-- 300M census data

-- NYC Taxi data

- Readme (md is broken)

- Datashader pipeline - what you need to understand to use it with examples -

Also see 2 images above)



Datashader — Datashader 0.6.5 documentation

Turns even the largest data into images, accurately.

snakers4 (Alexander), February 10, 17:31

So, I accidentally got to talk to the Vice President of GameWorks at Nvidia in person =)

All of this should be taken with a grain of salt. I am not endorsing Nvidia.

- In the public part of the speech he spoke about public Nvidia research projects - most notable / fresh was Nvidia Holodeck - their VR environment

- Key insight - even though Rockstar forbade using GTA images for deep learning, he believes that artificial images used for annotation are the future of ML, because game engines and OSes are the most complicated software ever

Obviously, I asked interesting questions afterwards =) Most notably about the GPU market and forces

- GameWorks = 200 people doing AR / VR / CNN research

- The biggest team in Nvidia is 2000 - drivers

- Ofc he refused to reply when the next generation of GPUs will be released and whether the rumour about their current generation GPUs no longer being produced is true

- He says they are mostly software company focusing on drivers

- Each generation cycle takes 3 years, Nvidia has only one architecture per generation, all the CUDA / ML stuff was planned in 2012-2014

- A rumour about Google TPU. Google has an internal quota - allegedly (!) they cannot buy more GPUs than TPUs, but TPUs are 1% utilized, and allegedly they lure Nvidia people to optimize their GPUs to make sure they use this quota efficiently

- AMD R&D spend on both CPU and GPU is less than Nvidia spend on GPU

- He says that the newest AMD cards have 30-40% more FLOPs, but they are compared against previous generation consumer GT cards on synthetic tests. AMD does not have a 2000-people driver team...

- He says that Intel has 3-5 new architectures in the works - which may be a problem