Spark in me - Internet, data science, math, deep learning, philo

snakers4 @ telegram, 1227 members, 1357 posts since 2016

All this - lost like tears in rain.

Internet, data science, math, deep learning, philosophy. No bs.

Our website
Our chat
DS courses review

snakers4 (Alexander), March 20, 05:03

Internet / tech

(1) LIDAR - bridge technology

(2) VW to invest US$25bn in batteries

(3) Self-driving car kills a pedestrian

(4) Terminal case of marketing bs - Theranos -

(5) Spotify was a P2P app at first lol -

(6) Stack Overflow survey 2018 -


(1) Prototype of small flying car -


Bridges and LIDAR

A bridge product says 'of course x is the right way to do this, but the technology or market environment to deliver x is not available yet, or is too expensive, and so here is something that gives some of the same benefits but works now.'  Sometimes that’s a great business, and sometimes it

snakers4 (Alexander), March 19, 09:26

There used to be an unofficial Kaggle CLI tool; now there is an official Kaggle API tool


Lol..and ofc data download did not work...unlike the unofficial tool.

Maybe submits will work.


kaggle-api - Official Kaggle API

snakers4 (Alexander), March 18, 03:50

AI Learns Human Pose Estimation From Videos | Two Minute Papers #237
The paper "DensePose: Dense Human Pose Estimation In The Wild" is available here: Our Patreon page: ht...

snakers4 (Alexander), March 16, 13:15

New cool trick - use

df.to_feather() instead of pickle or csv

Supposed to work much faster, as it dumps the data in the same way it is laid out in RAM


snakers4 (Alexander), March 15, 17:32

AI-Based Animoji Without The iPhone X | Two Minute Papers #236
The paper "Avatar Digitization From a Single Image For Real-Time Rendering" is available here:

snakers4 (Alexander), March 14, 09:53

A nice thread about ML reproducibility


[D] Are the hyper-realistic results of... • r/MachineLearning

**Tacotron-2:** **Wavenet:**...

snakers4 (Alexander), March 13, 16:27

Why GitHub Won't Help You With Hiring

snakers4 (Alexander), March 13, 10:08

Internet digest

(1) Ben Evans -


(1) Waymo launching pilot for the self-driving trucks -

(2) Netflix to spend US$8bn on ~700 shows in 2018 - (sic!)

(3) Intel vs Qualcomm and Broadcom - + Intel considering buying Broadcom -

(4) Amazon buys Ring -

(5) Latest darkmarket bust - Hansa - - it was not busted at once, but put under surveillance first

- As with Silk Road - it all started with officials finding a server and making a copy of its hard drive

- This time - it was a dev server

- It contained ... owners' IRC accounts and some personal info

Internet + ML

(1) Netflix uses ML to generate thumbnails for its shows automatically -

- Features collected: manual annotation, meta-data, object detection, brightness, colour, face detection, blur, motion detection, actors, mature content



Also also

(1) Dropbox -

(2) And Spotify

filed for IPOs


snakers4 (Alexander), March 13, 09:14

An article about how to use CLI params in python with argparse


If this is too slow - then just use this as a starter boilerplate

- (this is how I learned it)

Why do you need this?

- Run long overnight (or even day long) jobs in python

- Run multiple experiments

- Make your code more tractable for other people

- Expose a simple API for others to use

The same can be done via newer frameworks, but why learn an abstraction that may die soon instead of using instruments that have worked for decades?
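A minimal argparse starter in the spirit of the boilerplate above (flag names and defaults are illustrative, not from the linked article):

```python
import argparse

parser = argparse.ArgumentParser(description='Long-running training job')
parser.add_argument('--epochs', type=int, default=10, help='number of epochs')
parser.add_argument('--lr', type=float, default=1e-3, help='learning rate')
parser.add_argument('--resume', action='store_true', help='resume from checkpoint')

# in a real script you would call parser.parse_args() with no arguments
# and pass flags on the command line; a list is used here for illustration
args = parser.parse_args(['--epochs', '50', '--resume'])
print(args.epochs, args.lr, args.resume)
```

This exposes exactly the simple API mentioned above: another person can run your script overnight with different flags without touching the code.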


Python, argparse, and command line arguments - PyImageSearch

In this tutorial I discuss what command line arguments are, why we use them, and how to use argparse + Python for command line arguments.

snakers4 (Alexander), March 11, 15:51

A Photo Enhancer AI | Two Minute Papers #235
The paper "DSLR-Quality Photos on Mobile Devices with Deep Convolutional Networks" and its demo is available here: http:/...

snakers4 (Alexander), March 11, 05:30

Nice intuitions behind watershed algorithm


The Watershed Transformation page

Image segmentation by watershed transformation.
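The flooding intuition can be sketched with scikit-image (assumed installed; the toy "ridge" image is illustrative): basins flood outward from the markers and meet at the ridge.

```python
import numpy as np
from skimage.segmentation import watershed

# a flat landscape with a ridge down the middle column
image = np.zeros((9, 9), dtype=int)
image[:, 4] = 5  # the ridge separating two basins

# one marker seed per basin
markers = np.zeros_like(image)
markers[2, 2] = 1
markers[6, 6] = 2

# flood from the markers; each pixel joins the basin that reaches it first
labels = watershed(image, markers)
```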

snakers4 (Alexander), March 10, 17:29

Pandas on Ray - RISE Lab

Pandas on Ray - RISE Lab

View the code on Gist.

snakers4 (Alexander), March 10, 13:59

Interesting / noteworthy semseg papers

In practice - UNet and LinkNet are the best and simplest solutions.

People rarely report that something like Tiramisu works properly.

Though I once saw a good solution based on DenseNet + a standard decoder in the last Konica competition.

So I decided to read some of the newer and older Semseg papers.

Classic papers

UNet, LinkNet - nuff said

(0) Links

- UNet -

- LinkNet -

Older, overlooked, but interesting papers

SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation

One of original papers before UNet


(1) Basically UNet w/o skip connections, but it stores pooling indices

(2) SegNet uses the max pooling indices to upsample (without learning) the feature map(s) and convolves with a trainable decoder filter bank
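The pooling-indices trick maps directly onto a pair of standard PyTorch layers (a minimal sketch, assuming PyTorch is available):

```python
import torch
import torch.nn as nn

# the encoder remembers WHERE each max came from
pool = nn.MaxPool2d(2, stride=2, return_indices=True)
# the decoder uses those indices to upsample without any learned parameters
unpool = nn.MaxUnpool2d(2, stride=2)

x = torch.randn(1, 3, 8, 8)
y, idx = pool(x)        # downsample to 4x4, keep the argmax positions
up = unpool(y, idx)     # values go back to their original positions, zeros elsewhere
assert up.shape == x.shape
```

In SegNet a trainable convolution then densifies these sparse upsampled maps, which is what "convolves with a trainable decoder filter bank" refers to.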

ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation

Paszke, Adam / Chaurasia, Abhishek / Kim, Sangpil / Culurciello, Eugenio

(0) Link


(1) Key facts

- Up to 18× faster, 75× fewer FLOPs, 79× fewer parameters vs SegNet or FCN

- Supposedly runs on NVIDIA Jetson TX1 Embedded Systems

- Essentially a mixture of ResNet and Inception architectures

- Overview of the architecture



(2) Interesting ideas

- Visual information is highly spatially redundant, and thus can be compressed into a more efficient representation

- Highly asymmetric - the decoder is much smaller

- Dilated convolutions in the middle => significant accuracy boost

- Dropout > L2

- Pooling operation in parallel with a convolution of stride 2, and concatenate resulting feature maps

Newer papers

xView: Objects in Context in Overhead Imagery - new "Imagenet" for satellite images

(0) Link

- Will be available here

(1) Examples



(2) Stats

- 0.3m ground sample distance

- 60 classes in 7 different parent classes

- 1 million labeled objects covering over 1,400 km2 of the earth’s surface

- classes

(3) Baseline

- Their baseline using SSD has very poor performance ~20% mAP

Rethinking Atrous Convolution for Semantic Image Segmentation

(0) Link


- Liang-Chieh Chen / George Papandreou / Florian Schroff / Hartwig Adam

- Google Inc.

(1) Problems to be solved

- Reduced feature resolution

- Objects at multiple scales

(2) Key approaches

- Image pyramid (reportedly works poorly and requires a lot of memory)

- Encoder-decoder

- Spatial pyramid pooling (reportedly works poorly and requires a lot of memory)

(3) Key ideas

- Atrous (dilated) convolution -

- ResNet + Atrous convolutions -

- Atrous Spatial Pyramid Pooling block -
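The atrous (dilated) convolution idea above can be sketched in one line of PyTorch (assumed available): inserting holes into a 3×3 kernel enlarges its receptive field to 5×5 without adding parameters or reducing feature resolution.

```python
import torch
import torch.nn as nn

# 3x3 kernel, dilation=2 -> effective 5x5 receptive field;
# padding=2 keeps the spatial resolution unchanged
conv = nn.Conv2d(1, 1, kernel_size=3, dilation=2, padding=2)

x = torch.randn(1, 1, 16, 16)
out = conv(x)
assert out.shape == x.shape
```

An ASPP block simply applies several such convolutions with different dilation rates in parallel and concatenates the results, capturing objects at multiple scales.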

(4) Performance

- As with the latest semseg methods, true performance boost is unclear

- I would argue that such methods may be useful for large objects



snakers4 (Alexander), March 09, 05:49

PyTorch caught up with keras


François Chollet

TensorFlow is the platform of choice for deep learning in the research community. These are deep learning framework mentions on arXiv over the past 3 months

snakers4 (Alexander), March 08, 18:49

The Gini coefficient. From economics to machine learning / Habrahabr

The Gini coefficient. From economics to machine learning

An interesting fact: in 1912 the Italian statistician and demographer Corrado Gini published his famous work "Variability and Mutability", and in the same year...

snakers4 (Alexander), March 07, 06:03

New articles about picking GPUs for DL



Also, for 4-GPU set-ups in a single case, some of the GPUs will most likely require water cooling to avoid thermal throttling.


Picking a GPU for Deep Learning

Buyer’s guide at the beginning of 2018

snakers4 (Alexander), March 07, 05:03

2018 DS/ML digest 6


(1) A new amazing post by Google on distil -

This is really amazing work, but their notebooks tell me it is a far cry from being usable by the community -

This is how the CNN sees the image -

Expect this to be packaged as part of Tensorboard in a year or so)


(1) New landmark dataset by Google - - looks cool, but ...

Prizes in the accompanying Kaggle competitions are laughable

Given that the datasets are really huge... ~300G

Also, if you win, you will have to buy a ticket to the USA at your own expense ...

(2) Useful script to download the images

(3) Imagenet for satellite imagery - - pre-register paper

(4) CVPR 2018 for satellite imagery -

Papers / new techniques

(1) Improving RNN performance via auxiliary loss -

(2) Satellite imaging for emergencies -

(3) Baidu - neural voice cloning -


(1) Google TPU benchmarks -

As usual, such charts do not show consumer hardware.

My guess is that a single 1080Ti may deliver comparable performance (i.e. 30-40% of it) for ~US$700-1000, i.e. ~150 hours of rent (this is ~1 week!)

Miners say that a 1080Ti can work 1-2 years non-stop

(2) MIT and SenseTime announce effort to advance artificial intelligence research

(3) Google released its ML course - - but generally it is a big TF ad ... Andrew Ng is better for grasping concepts


(1) Interesting thing - all ISPs have some preferential agreements between each other -




The Building Blocks of Interpretability

Interpretability techniques are normally studied in isolation. We explore the powerful interfaces that arise when you combine them -- and the rich structure of this combinatorial space.

snakers4 (Alexander), March 07, 02:00


keras - Deep Learning for humans

snakers4 (Alexander), March 07, 01:53

Building Blocks of AI Interpretability | Two Minute Papers #234
The paper "Building Blocks of Interpretability" is available here: Our Patreon page:

This is amazing

The Building Blocks of Interpretability

Interpretability techniques are normally studied in isolation. We explore the powerful interfaces that arise when you combine them -- and the rich structure of this combinatorial space.

snakers4 (Alexander), March 06, 14:34

I have seen questions on forums - how to add a Keras-like progress bar to PyTorch for simple models?

The answer is to use tqdm and this property


This example is also great

from tqdm import trange
from random import random, randint
from time import sleep

t = trange(100)
for i in t:
    # Description will be displayed on the left
    t.set_description('GEN %i' % i)
    # Postfix will be displayed on the right, and will format automatically
    # based on argument's datatype
    t.set_postfix(loss=random(), gen=randint(1, 999), str='h', lst=[1, 2])
    sleep(0.01)




Can I add message to the tqdm progressbar?

When using the tqdm progress bar: can I add a message to the same line as the progress bar in a loop? I tried using the "tqdm.write" option, but it adds a new line on every write. I would like each

snakers4 (Alexander), March 06, 13:09

So, I have benchmarked XGB vs LightGBM vs CatBoost. I also endured xgb and lgb GPU installation. This is just a general usage impression, not a hard benchmark.

My thoughts are below.

(1) Installation - CPU

(all) - are installed via pip or conda in one line

(2) Installation - GPU

(xgb) - easily done via following their instructions, only nvidia drivers required;

(lgb) - easily done on Azure cloud. On Linux it requires some drivers that may be lagging. I could not integrate their instructions into my Dockerfile, but their Dockerfile worked perfectly;

(cb) - instructions were too convoluted for me to follow;

(3) Docs / examples

(xgb) - the worst one, fine-tuning guidelines are murky and unpolished;

(lgb) - their python API is not entirely well documented (e.g. some options can be found only on forums), but overall the docs are very decent + some fine-tuning hints;

(cb) - overall docs are nice, a lot of simple examples + some boilerplate in .ipynb format;

(4) Regression

(xgb) - works poorly. Maybe my params are bad, but with out-of-the-box params the sklearn API takes 5-10x more time than the rest and lags in accuracy;

(lgb) - best performing one out of the box, fast + accurate;

(cb) - fast but less accurate;

(5) Classification

(xgb) - best accuracy

(lgb) - fast, high accuracy

(cb) - fast, worse accuracy

(6) GPU usage

(xgb) - for some reason accuracy when using full GPU options is really really bad. Forum advice does not help.


snakers4 (Alexander), March 05, 06:45

Found some starter boilerplate on how to use hyperopt instead of grid search for faster hyper-parameter search:

- here -

- and here -



catboost - CatBoost is an open-source gradient boosting on decision trees library with categorical features support out of the box for Python, R

snakers4 (Alexander), March 04, 06:26

Amazing article about the most popular warning in Pandas



A guide

Everything you need to know about the most common (and most misunderstood) warning in #pandas. #python
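The warning in question is presumably the SettingWithCopyWarning, which chained indexing triggers; a minimal before/after sketch (assuming pandas is installed):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# chained indexing - df[df['a'] > 1]['b'] = 0 - may assign into a
# temporary copy and is what raises the SettingWithCopyWarning

# the recommended fix: a single .loc call that selects rows and column at once
df.loc[df['a'] > 1, 'b'] = 0
assert df['b'].tolist() == [4, 0, 0]
```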

snakers4 (Alexander), March 03, 09:03

Modern Pandas series about classic time-series algorithms


Some basic boilerplate and baselines



datas-frame – Modern Pandas (Part 7): Timeseries

Posts and writings by Tom Augspurger

snakers4 (Alexander), March 02, 02:50

It is tricky to launch XGB fully on GPU. People report that on the same data CatBoost has inferior quality w/o tweaking (but is faster). LightGBM is reported to be faster and to have the same accuracy.

So I tried adding LightGBM w/ GPU support to my Dockerfile - - but I encountered some Docker driver issues.

One of the caveats I understood - it supports only older Nvidia drivers, up to 384.

Luckily, there is a Dockerfile by MS that seems to be working (+ jupyter, but I could not install extensions)



LightGBM - A fast, distributed, high performance gradient boosting (GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine lea...

snakers4 (Alexander), March 02, 02:19

A framework to deploy and maintain models by Instacart - - please tell me if anybody has tried it

How to build a deep learning model in 15 minutes

An open source framework for configuring, building, deploying and maintaining deep learning models in Python.

snakers4 (Alexander), March 01, 17:10

DeepMind's WaveNet, 1000 Times Faster | Two Minute Papers #232
The paper "Parallel WaveNet: Fast High-Fidelity Speech Synthesis" is available here: Our Patreon page: www.patreon.c...

snakers4 (Alexander), March 01, 07:47


tensorflow - Computation using data flow graphs for scalable machine learning

snakers4 (Alexander), February 28, 10:40

Forwarded from Data Science:

Most common libraries for Natural Language Processing:

CoreNLP from Stanford group:

NLTK, the most widely-mentioned NLP library for Python:

TextBlob, a user-friendly and intuitive NLTK interface:

Gensim, a library for document similarity analysis:

SpaCy, an industrial-strength NLP library built for performance:


#nlp #digest #libs

Stanford CoreNLP

High-performance human language analysis tools. Widely used, available open source; written in Java.

snakers4 (Alexander), February 27, 15:49

A great survey - how to work with imbalanced data


8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset - Machine Learning Mastery

Has this happened to you? You are working on your dataset. You create a classification model and get 90% accuracy immediately. “Fantastic” you think. You dive a little deeper and discover that 90% of the data belongs to one class. Damn! This is an example of an imbalanced dataset and the frustrating results it can …
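One of the simplest tactics from such surveys - random oversampling of the minority class - can be sketched with the standard library alone (the toy 90/10 split mirrors the example in the excerpt above):

```python
import random

random.seed(0)

# a 90/10 imbalanced dataset: (features, label) pairs
majority = [(x, 0) for x in range(90)]
minority = [(x, 1) for x in range(10)]

# duplicate random minority samples until the classes balance
extra = [random.choice(minority) for _ in range(len(majority) - len(minority))]
balanced = majority + minority + extra

pos = sum(1 for _, y in balanced if y == 1)
neg = sum(1 for _, y in balanced if y == 0)
assert pos == neg
```

The same survey's other tactics (undersampling, class weights, different metrics like F1 or AUC) attack the same failure mode: 90% accuracy by always predicting the majority class.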