Spark in me - Internet, data science, math, deep learning, philo

snakers4 @ telegram, 1234 members, 1316 posts since 2016

All this - lost like tears in rain.

Internet, data science, math, deep learning, philosophy. No bs.

Our website
Our chat
DS courses review

Posts by tag «data_science»:

snakers4 (Alexander), February 15, 09:50

Visualizing Large-scale and High-dimensional Data:

A paper behind an awesome library


Follows the success of T-SNE, but is MUCH faster

Typical visualization pipeline

Also works awesomely with

Convergence speed

(1) (on a machine with 512GB memory, 32 cores at 2.13GHz)

(2) 3m data points * 100 dimensions, LargeVis is up to 30x faster at graph construction and 7x at graph visualization





T-SNE drawbacks

(1) K-nearest neighbor graph = computational bottleneck

(2) T-SNE constructs the graph using the technique of vantage-point trees, the performance of which significantly deteriorates for high dimensions

(3) The parameters of t-SNE are very sensitive across different data sets

Algorithm itself

(1) Create a small number of projection trees (similar to random forest). Then for each node of the graph search the neighbors of its neighbors, which are also likely to be candidates of its nearest neighbors

(2) Use SGD (or asynchronous SGD) to minimize the graph loss

(3) Clever sampling - sample the edges with the probability proportional to their weights and then treat the sampled edges as binary edges. Also sample some negative (not observed) edges
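Step (3) can be sketched in a few lines of stdlib Python. This is only an illustration of the sampling idea, not the paper's actual code; `sample_edges` and its signature are made up, and the negative sampler does not bother checking that a drawn pair is really unobserved:

```python
import random

def sample_edges(edges, weights, n_samples, n_negative, n_nodes, rng):
    """Sketch of LargeVis-style sampling: draw edges with probability
    proportional to their weight, treat each draw as a binary edge, and
    pair the positives with random negative (unobserved) node pairs."""
    # positive edges, sampled proportionally to weight
    positives = rng.choices(edges, weights=weights, k=n_samples)
    # negative edges: random node pairs, assumed unobserved (no check here)
    negatives = [(rng.randrange(n_nodes), rng.randrange(n_nodes))
                 for _ in range(n_samples * n_negative)]
    return positives, negatives

rng = random.Random(0)
edges = [(0, 1), (1, 2), (2, 3)]
pos, neg = sample_edges(edges, [10.0, 1.0, 1.0], 100, 5, 4, rng)
```

With the skewed weights above, edge (0, 1) dominates the positive samples, which is exactly the point: heavy edges get visited more often but always with unit weight.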




umap - Uniform Manifold Approximation and Projection

snakers4 (Alexander), February 14, 11:48

2017 DS/ML digest 4

Applied cool stuff

- How Dropbox built their OCR - via CTC loss -

Fun stuff

- CNN forward pass done in Google Sheets -

- New Boston Robotics robot - opens doors now -

- Cool but toothless list of jupyter notebooks with illustrations and models

- Best CNN filter visualization tool ever -

New directions / moonshots / papers

- IMPALA from Google - DMLab-30, a set of new tasks that span a large variety of challenges in a visually unified environment with a common action space



- Trade crypto via RL -

- SparseNets? -

- Use Apple watch data to predict diseases

- Google - Evolution in auto ML kicks in faster than RL -

- R-CNN for human pose estimation + dataset

-- Website + video

-- Paper

Google's Colaboratory gives free GPUs?

- Old GPUs

- 12-hour limit, but very cool in theory



Sick sad world

- China has police Google Glass with face recognition

- Why slack sucks -

-- Email + google docs is better for real communication


- Globally there are 22k ML developers

- One more AI chip moonshot -

- Google made their TPUs public in beta - US$6 per hour

- CNN performance comparable to human level in dermatology (R-CNN) -

- Deep learning is greedy, brittle, opaque, and shallow

- One more medical ML investment - US$25m for cancer -




snakers4 (Alexander), February 14, 04:54

Article on SpaceNet Challenge Three in Russian on Habrahabr - please support us with your comments / upvotes


Also if you missed:

- The original article

- The original code release

... and Jeremy Howard retweeted our solution, lol



But to give you some idea of the pain the TopCoder platform inflicts on contestants, you can read:

- Data Download guide

- Final testing guide

- Code release for their verification process




From satellite imagery to graphs (the SpaceNet Road Detector competition) - a top-10 finish and code (translation)

Hi, Habr! I present to you a translation of the article. This is Vegas with the provided labels and the test dataset; the white squares are probably the held-out validation...

snakers4 (Alexander), February 12, 04:18

Useful links about Datashader

- Home -

- Youtube presentation, practical presentations

-- OpenSky

-- 300M census data

-- NYC Taxi data

- Readme (md is broken)

- Datashader pipeline - what you need to understand to use it with examples -

Also see the 2 images above)



Datashader — Datashader 0.6.5 documentation

Turns even the largest data into images, accurately.

snakers4 (Alexander), February 10, 17:31

So, I unexpectedly got to talk in person to the Vice President of GameWorks at Nvidia =)

All of this should be taken with a grain of salt. I am not endorsing Nvidia.

- In the public part of the speech he spoke about public Nvidia research projects - most notable / fresh was Nvidia Holodeck - their VR environment

- Key insight - even though Rockstar forbade using GTA images for deep learning, he believes that artificial images used for annotation are the future of ML, because game engines and OSes are the most complicated software ever

Obviously, I asked interesting questions afterwards =) Most notably about the GPU market and its forces

- GameWorks = 200 people doing AR / VR / CNN research

- The biggest team in Nvidia is 2000 people - drivers

- Ofc he refused to say when next-generation GPUs will be released, and whether the rumour is true that their current-generation GPUs are no longer being produced

- He says they are mostly software company focusing on drivers

- Each generation cycle takes 3 years, Nvidia has only one architecture per generation, all the CUDA / ML stuff was planned in 2012-2014

- A rumour about the Google TPU. Google has an internal quota - allegedly (!) they cannot buy more GPUs than TPUs, but the TPUs are 1% utilized, and allegedly they lure Nvidia people to optimize their GPUs to make sure they use this quota efficiently

- AMD's R&D spend on both CPU and GPU is less than Nvidia's spend on GPU alone

- He says the newest AMD cards have 30-40% more FLOPs, but they are compared against previous-generation consumer GT cards on synthetic tests. AMD does not have a 2000-people driver team...

- He says that Intel has 3-5 new architectures in the works - which may be a problem


snakers4 (Alexander), February 10, 09:02

Took a first stab at playing with XGB on GPU and updated my Dockerfile


May not work, but below links / snippet may help




```
# compile with GPU support
git clone --recursive https://github.com/dmlc/xgboost &&
cd xgboost &&
mkdir build &&
cd build &&
cmake .. -DUSE_CUDA=ON &&
make -j &&
cd ../

# install the python package
cd python-package &&
python3 setup.py install &&
cd ../

# test all is ok
python3 tests/benchmark/
```

Also LightGBM depends on old drivers and does not (yet) work with nvidia-390 on Ubuntu.


Dockerfile update

snakers4 (Alexander), February 10, 07:13

So we started publishing articles / code / solutions to the recent SpaceNet3 challenge. A Russian article will also be published soon.

- The original article

- The original code release

... and Jeremy Howard retweeted our solution, lol



But to give you some idea of the pain the TopCoder platform inflicts on contestants, you can read:

- Data Download guide

- Final testing guide

- Code release for their verification process




How we participated in SpaceNet three Road Detector challenge

This article is about our SpaceNet Challenge participation, semantic segmentation in general, and transforming masks into graphs

snakers4 (Alexander), February 08, 09:13

Lesson 11 notes:

- Links

-- Video


- Semantic embeddings + imagenet can be powerful, but not deployable per se

- Training nets on smaller images usually works

- Comparing activation functions

- lr annealing

- linear learnable colour swap trick

- adding Batchnorm

- replacing max-pooling with avg_pooling

- lr vs batch-size

- dealing with noisy labels

- FC / max-pooling layer models are better for transfer-learning?

- size vs. flops vs. speed

- cyclical learning rate paper

- Some nice intuitions about mean shift clustering





Lesson 11: Cutting Edge Deep Learning for Coders
We’ve covered a lot of different architectures, training algorithms, and all kinds of other CNN tricks during this course—so you might be wondering: what sho...

Meta research on the CNNs

(also see this amazing post)

An Analysis of Deep Neural Network Models for Practical Applications

Key findings:

(1) power consumption is independent of batch size and architecture;

(2) accuracy and inference time are in a hyperbolic relationship;

(3) energy constraint = upper bound on the maximum achievable accuracy and model complexity;

(4) the number of operations is a reliable estimate of the inference time


- Accuracy and param number -

- Param efficiency -

Also a summary of architectural patterns



Deep Learning Scaling is Predictable, Empirically




- various empirical learning curves show robust power-law region

- new architectures slightly shift learning curves downwards

- model architecture exploration should be feasible with small training data sets

- it can be difficult to ensure that training data is large enough to see the power-law learning curve region

- irreducible error region

- each new hardware generation with improved FLOP rate can provide a predictable step-function improvement in relative DL model accuracy



Neural Network Architectures

Deep neural networks and Deep Learning are powerful and popular algorithms. And a lot of their success lies in the careful design of the…

snakers4 (Alexander), February 08, 05:20

Looks useless...but so cool!

Maybe in 1-2 years Reinforcement Learning will become a thing



IMPALA - a new and efficient distributed architecture capable of solving many tasks at the same time in DeepMind Lab. - blog - paper - the new DMLab-30 environments @GitHub

snakers4 (Alexander), February 07, 14:09

Following our blog post, we also posted a Russian translation of the Jungle competition article to Habrahabr




The Pri-matrix Factorization competition on DrivenData with 1TB of data - how we took 3rd place (translation)

Hi, Habr! I present to you a translation of the article "Animal detection in the jungle — 1TB+ of data, 90%+ accuracy and 3rd place in the competition". Or...

snakers4 (Alexander), February 07, 13:04

A bokeh based library to visualize huge datasets





datashader - Turns even the largest data into images, accurately.

snakers4 (Alexander), February 06, 13:40

If, by any chance, you have to pass sequential args in bash (```sh /param/one /param/two```) through to python, this will help.

- bash

```
python3 --params $*
```

- python

```
import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--params', nargs='*', dest='params',
                        help='topcoder args', default=argparse.SUPPRESS)
    args = parser.parse_args()
```
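To sanity-check the snippet above, the parser can be fed an explicit argv list instead of sys.argv:

```python
import argparse

# Same parser as in the snippet above, fed an explicit argv list
# to show how nargs='*' gathers the values into a single list.
parser = argparse.ArgumentParser()
parser.add_argument('--params', nargs='*', dest='params',
                    help='topcoder args', default=argparse.SUPPRESS)
args = parser.parse_args(['--params', '/param/one', '/param/two'])
print(args.params)  # ['/param/one', '/param/two']
```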



snakers4 (Alexander), February 06, 05:23

We are starting to publish our code / solutions / articles from recent competitions (Jungle and SpaceNet three).

This time the code will be more polished / idiomatic, so that you can learn something from it!

Jungle competition

- Finally it was verified that we indeed won the 3rd place)


Blog posts



- An adaptation will be coming soon

Code release and architecture:

- Code

- Architecture

-- 1st place (kudos to Dmytro) - simple and nice

-- Ours

-- 2nd place - 4-5 levels of stacking

Please comment under posts / share / buy us a coffee!

- Buy a coffee

- Rate our channel tg://resolve?domain=tchannelsbot&start=snakers4




Pri-matrix Factorization

Chimp&See has collected nearly 8,000 hours of footage reflecting chimpanzee habitats from camera traps across Africa. Your challenge is to build a model that identifies the wildlife in these videos.

snakers4 (Alexander), February 05, 15:46

Interesting thoughts from Deep Learning Summit 2018

- Brief history of Deep Learning


- Malware detection - code => CNN - supposedly 90% accuracy

- Uber uses LSTMs to model week-level driver data on city level

- Uber metrics

-- 4 Billion Trips in 2017

-- 15 Million Uber trips per day

-- 75 Million monthly active riders

-- 600+ cities across 78 countries

- Uber fraud prevention

-- Anomaly detection (when account is stolen) - city2vec, LSTM + MLP, tops 70% precision / 50-60% recall

-- Card image recognition and detection => via TF CNN models

- Facebook

-- The following models are deployed on mobile - object detection, style transfer, mask rcnn

-- Low bit-width networks (<8 bits)

- Google

-- Unsupervised video next action prediction ~ 70% accuracy

-- SoundNet - classifying sounds - 70% accuracy

- GAN applications

-- Simulated environments and training data

-- Missing data

-- Semi-supervised learning

-- Multiple correct answers

-- Realistic generation tasks

- Slack

-- used to learn embeddings


Schedule | Deep Learning Summit

RE•WORK events combine entrepreneurship, technology and science to solve some of the world's greatest challenges using emerging technology. We showcase the opportunities of exponentially accelerating technologies and their impact on business and society.

snakers4 (Alexander), February 05, 14:57

We also managed to get into top-10 in SpaceNet3 Road Detection challenge


(Final confirmation awaits)

Here is a sneak peek of our solution


A blog post + repo will follow





Flowchart Maker & Online Diagram Software is a free online diagramming application and flowchart maker . You can use it to create UML, entity relationship, org charts, BPMN and BPM, database schema and networks. Also possible are telecommunication network, workflow, flowcharts, maps overlays and GIS, electronic circuit and social network diagrams.

snakers4 (Alexander), February 05, 10:51

LDA is another technique used for topic mining (like NMF), but based on probabilistic graphical models
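A minimal sketch with scikit-learn (the library the linked post uses); the toy documents are mine, and the point is only the API shape - LDA consumes raw term counts, unlike NMF, which is usually fed tf-idf:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "neural networks deep learning gradients",
    "stock market trading prices returns",
    "deep learning convolutional networks",
    "market prices volatility trading",
]

# LDA works on raw term counts (not tf-idf)
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # (n_docs, n_topics), rows sum to 1
```

Each row of `doc_topics` is the document's distribution over the 2 topics; `lda.components_` gives the topic-word weights for inspecting the topics themselves.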


Topic Modeling with Scikit Learn

Latent Dirichlet Allocation (LDA) is an algorithm used to discover the topics that are present in a corpus. A few open source libraries…

snakers4 (Alexander), February 02, 08:42

Very interesting high quality dataset with satellite images + LB


Just scientific interest.



snakers4 (Alexander), February 01, 11:25

2017 DS/ML digest 2


- One more RL library (last year saw 1 or 2)

- Speech recognition from facebook -

- Even better speech generation than WaveNet - - I cannot tell the computer apart from a human

Industry (overdue news)

- Nvidia does not like its consumer GPUs deployed in data centers

- Clarifai kills Forevery

- Google search and gorillas vs. black people -

Blog posts

- Baidu - dataset size vs. accuracy (log-scale)




- New Youtube actions dataset -


Papers - current topic - meta learning / CNN optimization and tricks

- Systematic evaluation of CNN advances on the ImageNet




- Cyclical Learning Rates for Training Neural Networks





- Large batch => train Imagenet in 15 mins


- Practical analysis of CNNs





snakers4 (Alexander), February 01, 09:31

Cyclical learning rates are not merged into PyTorch yet, but there is a PR at the review stage




Adds Cyclical Learning Rates by thomasjpfan · Pull Request #2016 · pytorch/pytorch

Adds feature requested in #1909. Mimics the parameters from Since Cyclical Learning Rate (CLR) requires updating the learning rate after every batch, I added batc...
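As far as I understand, the policy this PR implements is the triangular schedule from Smith's cyclical learning rate paper; a sketch (parameter names and defaults are mine, not the PR's):

```python
def triangular_clr(iteration, base_lr=1e-4, max_lr=1e-2, step_size=2000):
    """Triangular cyclical learning rate: ramp linearly from base_lr to
    max_lr over step_size iterations, back down over the next step_size,
    then repeat. Updated after every batch, not every epoch."""
    cycle = iteration // (2 * step_size)
    x = abs(iteration / step_size - 2 * cycle - 1)  # 1 -> 0 -> 1 within a cycle
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

# lowest at cycle boundaries, peaks mid-cycle
print(triangular_clr(0), triangular_clr(2000), triangular_clr(4000))
```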

snakers4 (Alexander), January 31, 10:40

Interesting intuitions for understanding the mean-shift algorithm


Downside - the sklearn implementation is slow; you will have to write your own GPU implementation.
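For reference, the sklearn API looks like this (toy two-blob data, mine; this shows the API and the fact that mean shift discovers the number of clusters itself, not its speed):

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

rng = np.random.default_rng(0)
# two well-separated 2-D blobs
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(5, 0.3, (50, 2))])

# bandwidth is the kernel radius; estimate_bandwidth picks it from the data
bandwidth = estimate_bandwidth(X, quantile=0.3)
labels = MeanShift(bandwidth=bandwidth).fit_predict(X)
print(len(set(labels)))  # the number of clusters is inferred, not given
```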


Mean Shift Clustering Overview

An overview of mean shift clustering (one of my favorite algorithms) and some of its strengths and weaknesses.

snakers4 (Alexander), January 30, 11:01

Some advice on using UMAP algorithm properly from the author



Multi CPU / GPU capabilities? · Issue #37 · lmcinnes/umap

@lmcinnes As you may have guessed I have several CPUs and GPUs at hand and I work with high-dimensional data. Now I am benching a 500k * 5k => 500k * 2 vector vs. PCA (I need a high level clusterin...

snakers4 (Alexander), January 30, 04:57

Simple Keras + web service deploy guidelines from FChollet + PyImageSearch




Also an engineer from our team told me that this architecture sucks under high load, because redis requires object serialization, which takes a lot of time for images. Native python process management works better.



snakers4 (Alexander), January 29, 04:45

Classic / basic CNN papers

Aggregated Residual Transformations for Deep Neural Networks (ResNeXt)

- Authors Xie Saining / Girshick Ross / Dollár Piotr / Tu Zhuowen / He Kaiming

- Link

- Resnet and VGG go deeper

- Inception nets go wider. Despite their efficiency, they are hard to re-purpose and design

- key idea - add group convolutions to the residual block

- illustrations

-- basic building block

-- same block in terms of group convolutions

-- overall architecture

-- performance - - +1% vs resnet
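The group-convolution idea can be sketched in plain numpy. `grouped_pointwise_conv` is a hypothetical helper of mine, restricted to 1x1 convolutions (no padding/stride), just to show the channel split and the parameter saving:

```python
import numpy as np

def grouped_pointwise_conv(x, weights):
    """1x1 convolution applied per channel group (the ResNeXt idea):
    split the channels into groups, convolve each group with its own
    small weight matrix, and concatenate the results."""
    groups = len(weights)
    xs = np.split(x, groups, axis=0)  # x: (C, H, W)
    return np.concatenate(
        [np.einsum('oc,chw->ohw', w, xg) for w, xg in zip(weights, xs)],
        axis=0)

x = np.ones((8, 4, 4))  # 8 input channels
# 4 groups mapping 2 -> 2 channels: 4 * (2*2) = 16 params,
# vs 8*8 = 64 for a dense 1x1 conv over the same channels
weights = [np.ones((2, 2)) for _ in range(4)]
y = grouped_pointwise_conv(x, weights)
print(y.shape)  # (8, 4, 4)
```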



snakers4 (Alexander), January 28, 11:50

Dockerfile update for CUDA9 - CUDNN7:


Hello-world examples in PyTorch and TensorFlow seem to be working.



Dockerfile update

snakers4 (Alexander), January 26, 04:59

New dimensionality reduction technique - UMAP


I will write more as I test it / learn more.

Works well with HDBSCAN and CNNs I guess


Usage examples




umap - Uniform Manifold Approximation and Projection

snakers4 (Alexander), January 25, 04:15

Playing with HDBSCAN in practice.

What I learned: if you have a non-sparse feature vector, i.e. 1000+ to 5000+ dimensions, then you should apply PCA before using HDBSCAN.

Their scalability how-to does all the benchmarks on 10-dimensional vectors. In practice anything above 50-100 dimensions hit some kind of bottleneck - memory consumption was low, CPU consumption was also low - but pretty much nothing happened for hours.

Also if you want large clusters and set the min_samples value to >> 100, there will be a memory explosion due to some kind of caching issue. So if your cluster size should be 5000+, you are compelled to use min_samples ~ 100.
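The PCA-first pipeline is trivial; a sketch with scikit-learn on random dense data (the HDBSCAN call is left as a comment since the hdbscan package may not be installed, and the parameter values are just the ones discussed above):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2000))  # dense, high-dimensional features

# reduce to ~50 dims first; HDBSCAN stalls well above 50-100 dims
X_reduced = PCA(n_components=50, random_state=0).fit_transform(X)
print(X_reduced.shape)  # (1000, 50)

# then, e.g.:
# import hdbscan
# labels = hdbscan.HDBSCAN(min_cluster_size=5000,
#                          min_samples=100).fit_predict(X_reduced)
```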


snakers4 (Alexander), January 24, 04:57

Kaggle stats for 2017


I literally choked when I read this:

- $1.5MM competition with TSA to identify threat objects from body scans

- this was a competition where only US citizens were granted prizes => 10 stacked resnets won

- $1.2MM competition with Zillow to improve the Zestimate home valuation algorithm - this has 2 stages, first stage prize was US$50k

- $1MM competition with NIH and Booz Allen to diagnose lung cancer from CT scans - this one was really great - but I did not know much back then, it was early 2017 =(

Also I am not a great data scientist per se, but just comparing the amount of cringe and the shitty train/test splits - DrivenData is much better than Kaggle in terms of data SCIENCE.


Reviewing 2017 and Previewing 2018

2017 was a huge year for Kaggle. Aside from joining Google, it also marks the year that our community expanded from being primarily focused on machine [...]

snakers4 (Alexander), January 23, 17:23

Key / classic CNN papers


ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices

- a small resnet-like network that uses pointwise group convolutions, depthwise separable convolutions and a shuffle layer

- authors - Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, Jian Sun

- paper -

- key

-- on ARM devices 13x faster than AlexNet

-- lower top1 error than MobileNet at 40 MFLOPs

- comparable to small versions of NASNET

- 2 ideas

-- use depth-wise separable convolutions for 3x3 and 1x1 convolutions

-- use a shuffle layer (reshape into groups, transpose, flatten back to the original dimension)

- illustrations

-- shuffle idea -

-- building blocks -

-- vs. key architectures

-- vs. MobileNet

-- actual inference on mobile device -
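The shuffle layer is literally a reshape + transpose; a numpy sketch (channels tagged with their index so the permutation is visible):

```python
import numpy as np

def channel_shuffle(x, groups):
    """ShuffleNet's shuffle layer: reshape channels to
    (groups, C // groups), transpose the two group axes, and flatten
    back, so later group convolutions mix information across groups."""
    c, h, w = x.shape
    return (x.reshape(groups, c // groups, h, w)
             .transpose(1, 0, 2, 3)
             .reshape(c, h, w))

# 6 channels tagged 0..5, shuffled in 2 groups:
# [0, 1, 2, 3, 4, 5] -> [0, 3, 1, 4, 2, 5]
x = np.arange(6, dtype=float).reshape(6, 1, 1)
print(channel_shuffle(x, 2).ravel())  # [0. 3. 1. 4. 2. 5.]
```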



snakers4 (Alexander), January 23, 11:56

A couple more articles about idiomatic pandas



What was useful for me

- Faster reading of the dataframes

- Stack, unstack

- Melt



datas-frame – Modern Pandas (Part 4): Performance

Posts and writings by Tom Augspurger

snakers4 (Alexander), January 23, 06:51

A new interesting competition on topcoder


At least at first glance)



Pre-register now for the KONICA MINOLTA Image Segmentation Challenge

This contest aims to create new image recognition technology that can detect abnormalities in a product, to be used for visual inspection purposes.