Spark in me - Internet, data science, math, deep learning, philosophy

snakers4 @ telegram, 1319 members, 1513 posts since 2016

All this - lost like tears in rain.

Data science, deep learning, sometimes a bit of philosophy and math. No bs.


snakers4 (Alexander), July 15, 08:54

Sometimes in supervised ML tasks, leveraging the data structure in a self-supervised fashion really helps!

Playing with CrowdAI mapping competition

In my opinion it is a good testing ground for your SemSeg ideas, as the dataset is really clean and balanced




Playing with Crowd-AI mapping challenge - or how to improve your CNN performance with self-supervised techniques

In this article I describe a couple of neat optimizations / tricks / useful ideas that can be applied to many SemSeg / ML tasks

snakers4 (Alexander), July 13, 17:05

Forwarded from Админим с Буквой:

Git commit messages

How to commit to git properly. A good article from Habr:

#thirdparty #read #git

How to write commit messages

Translator's preface: Over many years of software development, having been part of many teams and worked with various good and experienced people, I often...

snakers4 (Alexander), July 13, 09:15

Tensorboard + PyTorch

I looked at this 6 months ago and it was messy

Now it looks really polished



tensorboard-pytorch - tensorboard for pytorch (and chainer, mxnet, numpy, ...)

snakers4 (spark_comment_bot), July 13, 05:22

2018 DS/ML digest 17

Highlights of the week

(0) Troubling trends with ML scholars

(1) NLP close to its ImageNet stage?

Papers / posts / articles

(0) Working with multi-modal data

- concatenation-based conditioning

- conditional biasing or scaling ("residual" connections)

- sigmoidal gating

- all in all this approach seems like a mixture of attention / gating for multi-modal problems

(1) Glow, a reversible generative model which uses invertible 1x1 convolutions

(2) Facebook's moonshots - I kind of do not understand much here


(3) RL concept flaws?


(4) Intriguing failures of convolutions - this is fucking amazing

(5) People are only STARTING to apply ML to reasoning
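The three conditioning options listed under item (0) above can be sketched in a few lines. This is only an illustrative sketch in plain Python: the function names and toy vectors are hypothetical, and the scale, bias and gate values are hard-coded here, whereas in a real model they would typically be predicted from the conditioning input by a small network.

```python
import math


def concat_conditioning(features, condition):
    """Concatenation-based conditioning: append the condition vector to the features."""
    return features + condition


def scale_and_bias(features, gamma, beta):
    """Conditional scaling / biasing: per-feature affine modulation."""
    return [g * f + b for f, g, b in zip(features, gamma, beta)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoidal_gating(features, gate_logits):
    """Sigmoidal gating: each feature is multiplied by a gate in (0, 1)."""
    return [sigmoid(z) * f for f, z in zip(features, gate_logits)]


# Toy inputs (hypothetical): features from one modality, a condition from another.
image_features = [1.0, -2.0, 0.5]
text_condition = [0.3, 0.7]

# Hard-coded modulation parameters, standing in for the output of a conditioning network.
gamma, beta = [2.0, 1.0, 0.0], [0.0, 0.5, 1.0]
gate_logits = [0.0, 10.0, -10.0]

concatenated = concat_conditioning(image_features, text_condition)
modulated = scale_and_bias(image_features, gamma, beta)
gated = sigmoidal_gating(image_features, gate_logits)
```

A large positive gate logit passes a feature through almost unchanged and a large negative one suppresses it, which is why gating reads a lot like a soft form of attention.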

Yet another online book on Deep Learning

(1) Kind of standard - !/book/grokking-deep-learning/chapter-1/v-10/1

Libraries / code

(0) Data version control continues to develop




Like this post or have something to say => tell us more in the comments or donate!

Troubling Trends in Machine Learning Scholarship

By Zachary C. Lipton* & Jacob Steinhardt* *equal authorship Originally presented at ICML 2018: Machine

snakers4 (Alexander), July 12, 17:45

DeepMind's AI Learns To See | Two Minute Papers #263

snakers4 (Alexander), July 12, 05:11

Hadoop job in Moscow

No bullshit. The salary is net.


Job opening: Junior / Middle / Senior Hadoop developer


Junior / Middle / Senior Hadoop developer, 500-1500 RUB/hour net (negotiable based on the interview).

From my own experience I can say that we try to do everything flexibly, quickly, and with a minimum of bullshit.

We are looking for a Hadoop developer to take part in a project running from June to December 2018, possibly longer. The team has a team lead and admins.

The project has dozens of big data sources; you will need to develop integrations and compute derived data marts.

In plain terms: there is a lot of data from different systems. It needs to be loaded into our platform, then something is computed from it, then computed again (repeat as many times as needed), and possibly passed further down the chain. Unlike standard ETL, the distinguishing features are the ecosystem and the data volume.

Full-time work in the office is preferred, but remote options can be considered.

The workload may vary between 30 and 50 hours per week. We expect this to be your main job. Combining it with another full-time job is not an option.


- understanding of the overall Hadoop architecture

- git, good development practices

- confident work with Linux

- knowledge of SQL (Hive)

- knowledge of Spark (Python)


- familiarity with HortonWorks Data Platform

- Java

An additional note: if you are a junior who wants to gain expertise in a domain that is currently fairly closed and short of people on the market, this is a good chance. A junior is expected to know at least Python, be comfortable with Linux and git, show initiative, and be willing to learn, including on their own.

Contact: @dilyara_tchk, and no, I am not HR.

snakers4 (Alexander), July 11, 16:26

Looks cool

But Mendeley so far looks better


Scholarcy reads and summarises research papers, generates a background reading list, highlights important sentences, and finds open access versions of referenced papers.

snakers4 (Alexander), July 11, 06:51

TF 1.9

Funnily enough, they call Keras not "Keras with TF back-end", but "tf.keras"




tensorflow - Computation using data flow graphs for scalable machine learning

snakers4 (Alexander), July 10, 04:38

Forwarded from Hacker News:

NLP's ImageNet moment has arrived (Score: 100+ in 5 hours)



NLP's ImageNet moment has arrived

The time is ripe for practical transfer learning to make inroads into NLP.

snakers4 (Alexander), July 09, 09:04

2018 DS/ML digest 16

Papers / posts

(0) RL now solves Quake

(1) A post about AdamW

-- Adam generally requires more regularization than SGD, so be sure to adjust your regularization hyper-parameters when switching from SGD to Adam

-- AMSGrad turns out to be very disappointing

-- Refresher article

(2) How to tackle new classes in CV

(3) A new word in GANs?



(4) Using deep learning representations for search


-- a library for fast search in Python

(5) One more paper on GAN convergence

(6) Switchable normalization - adds a bit to ResNet50 + pre-trained models
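The AdamW note in item (1) above is easier to see on a single scalar weight. The sketch below is a simplified, hypothetical re-implementation of one Adam step (not any library's actual code; `adam_step` and its parameters are illustrative names). It shows one reason regularization behaves differently under Adam: an L2 penalty added to the gradient gets rescaled by the adaptive denominator, whereas decoupled weight decay (the AdamW idea) is applied directly to the weight.

```python
import math


def adam_step(w, grad, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8,
              l2=0.0, decoupled_wd=0.0):
    """One Adam update for a scalar weight (simplified sketch).

    l2           - classic L2 penalty, folded into the gradient
    decoupled_wd - AdamW-style decay, applied to the weight directly
    """
    grad = grad + l2 * w                      # L2 enters the adaptive statistics
    m = b1 * m + (1 - b1) * grad              # first-moment estimate
    v = b2 * v + (1 - b2) * grad * grad       # second-moment estimate
    m_hat = m / (1 - b1 ** t)                 # bias correction
    v_hat = v / (1 - b2 ** t)
    adaptive_step = lr * m_hat / (math.sqrt(v_hat) + eps)
    decay_step = lr * decoupled_wd * w        # bypasses the adaptive scaling
    return w - adaptive_step - decay_step, m, v


# The loss gradient is zero, so only the regularization moves the weight.
w_l2_small, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, l2=0.1)
w_l2_big, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, l2=10.0)
w_wd_small, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled_wd=0.1)
w_wd_big, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled_wd=1.0)
```

On this first step the two L2 runs end up at practically the same weight even though the penalty differs by 100x, while the decoupled runs shrink the weight in proportion to the decay coefficient.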


(0) Disney starts to release datasets

Market / interesting links

(0) A motion to open-source GitHub

(1) Allegedly, GTX 1180 sales listings are starting to appear in Asia (?)

(2) Some controversy regarding Andrew Ng and self-driving cars

(3) National AI strategies overviewed -

-- Canada C$135m

-- China has the largest strategy

-- Notably - countries like Finland also have one

(4) Amazon allegedly sells face recognition to the USA



Google’s DeepMind taught AI teamwork by playing Quake III Arena

Google’s DeepMind today shared the results of training multiple AI systems to play Capture the Flag on Quake III Arena, a multiplayer first-person shooter game. The AI played nearly 450,000 g…

snakers4 (Alexander), July 08, 06:16

Disclaimer - it does not support pivot tables or complicated group_by ...

snakers4 (Alexander), July 08, 06:06

A new multi-threaded addition to pandas stack?

I read about this some time ago (when it was still in development) and found essentially 3 alternatives

- just being clever about optimizing your operations + using what is essentially a multi-threaded map/reduce in pandas

- pandas on ray

- dask (overkill)
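The first alternative (a hand-rolled map/reduce over chunks) can be sketched as follows. This is a minimal illustration using only the standard library, with a plain list of strings standing in for a DataFrame column; `split_into_chunks`, `map_chunk` and `parallel_map_reduce` are hypothetical helper names. Threads are used here to keep the example simple; for CPU-bound work in Python a process pool is usually the better fit.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(rows, n_chunks):
    """Split rows into at most n_chunks contiguous chunks of similar size."""
    size = max(1, -(-len(rows) // n_chunks))  # ceiling division, at least 1
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def map_chunk(chunk):
    """Map step: the per-chunk work (here, counting words in each text)."""
    return sum(len(text.split()) for text in chunk)


def parallel_map_reduce(rows, n_workers=4):
    """Run map_chunk on every chunk in parallel, then reduce by summing."""
    chunks = split_into_chunks(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(map_chunk, chunks))
    return sum(partial_results)


texts = ["pandas on ray", "dask is overkill", "map reduce"] * 100
total_words = parallel_map_reduce(texts)
```

With a real DataFrame the same pattern would apply to row chunks, with the reduce step concatenating or aggregating the per-chunk results.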





So...I ran a test in the notebook I had on hand. It works. More tests will be done in future.



Spark in me - Internet, data science, math, deep learning, philosophy

Pandas on Ray - RISE Lab

snakers4 (spark_comment_bot), July 07, 12:29

Playing with VAEs and their practical use

So, I played a bit with Variational Auto Encoders (VAE) and wrote a small blog post on this topic

Please like, share and repost!




Playing with Variational Auto Encoders - PCA vs. UMAP vs. VAE on FMNIST / MNIST

In this article I thoroughly compare the performance of VAE / PCA / UMAP embeddings on a simplistic domain

snakers4 (Alexander), July 07, 08:27

Epicycles, complex Fourier and Homer Simpson's orbit
Today’s video was motivated by an amazing animation by Santiago Ginnobili of a picture of Homer Simpson being drawn using epicycles. This video is about maki...

snakers4 (Alexander), July 04, 08:11

Forwarded from :

Comparing Sentence Similarity Methods

Natural Language Processing Consultancy and Development.

snakers4 (Alexander), July 04, 07:57

2018 DS/ML digest 15

What I filtered through this time

Market / news

(0) Letters by big company employees against using ML for weapons

- Microsoft

- Amazon

(1) Facebook open sources Dense Pose (essentially this is Mask-RCNN)


Papers / posts / NLP

(0) One more blog post about text / sentence embeddings

- key idea: different weighting

(1) One more sentence embedding calculation method

- ?

(2) Posts explaining NLP embeddings

- - some basics - SVD / Word2Vec / GloVe

-- SVD improves embedding quality (as compared to one-hot encoding)?

-- use log-weighting, use TF-IDF weighting (the above weighting)

- - word embedding properties

-- dimensions vs. embedding quality

(3) spaCy + Cython = 100x speed boost - - good to know about this as a last resort

- described use-cases:

-- you are pre-processing a large training set for a deep learning framework like PyTorch / TensorFlow

-- or you have heavy processing logic in your deep learning batch loader that slows down your training

(4) Once again stumbled upon this -

(5) Papers

- Simple NLP embedding baseline

- NLP decathlon for question answering

- Debiasing embeddings

- Once again transfer learning in NLP by OpenAI -





snakers4 (Alexander), July 04, 05:12

Open Images Object detection on Kaggle


- Key ideas

-- 1.2 images, high-res, 500 classes

-- decent prizes, but short time-span (2 months)

-- object detection


Google AI Open Images - Object Detection Track

Detect objects in varied and complex images.

snakers4 (Alexander), July 03, 07:15

A cool article from Ben Evans about how to think about ML

Ways to think about machine learning

We're now four or five years into the current explosion of machine learning, and pretty much everyone has heard of it, and every big company is working on projects around ‘AI’. We know this is a Next Big Thing. I don't think, though, that we yet have a settled sense of quite what machine learning m

My recent PyTorch 0.4 Dockerfile for CV


My PyTorch 0.4 Dockerfile

snakers4 (Alexander), July 02, 04:51

2018 DS/ML digest 14

Amazing article - why you do not need ML


- I personally love plain-vanilla SQL and in 90% of cases people under-use it

- I even wrote 90% of my JSON API on our blog in pure PostgreSQL xD

Practice / papers

(0) Interesting papers from CVPR

(1) Some down-to-earth obstacles to ML deployment

(2) Using synthetic data for CNNs (by Nvidia) -

(3) This puzzles me - so much effort and engineering spent on something ... strange and useless -

On paper they do a cool thing - investigate transfer learning between different domains, but in practice it is done on TF and there is no clear conclusion of any kind

(4) VAE + real datasets - only small Imagenet (64x64)

(5) Understanding the speed of models deployed on mobile -

(6) A brief overview of multi-modal methods

Visualizations / explanations

(0) Amazing website with ML explanations

(1) PCA and linear VAEs are close




No, you don't need ML/AI. You need SQL

A while ago, I did a Twitter thread about the need to use traditional and existing tools to solve everyday business problems other than jumping on new buzzwords, sexy and often times complicated technologies.

snakers4 (Alexander), July 01, 11:48

Measuring feature importance properly

Once again stumbled upon an amazing article about measuring feature importance for any ML algorithm:

(0) Permutation importance - if your ML algorithm is costly to re-train, you can just shuffle a column in the validation data and check how much the performance metric degrades

(1) Drop column importance - drop a column, re-train a model, check performance metrics
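Both approaches fit in a few lines. The sketch below is a toy illustration rather than the article's code: the 1-nearest-neighbour "model", the synthetic data (the target depends only on the first feature) and all function names are assumptions made for the example.

```python
import random


def fit_1nn(X, y):
    """'Train' a 1-nearest-neighbour regressor: it simply memorises the data."""
    def predict(row):
        nearest = min(range(len(X)),
                      key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], row)))
        return y[nearest]
    return predict


def mse(predict, X, y):
    return sum((predict(row) - target) ** 2 for row, target in zip(X, y)) / len(y)


def permutation_importance(predict, X_val, y_val, col, rng):
    """Error increase after shuffling one validation column (no re-training)."""
    baseline = mse(predict, X_val, y_val)
    shuffled = [row[col] for row in X_val]
    rng.shuffle(shuffled)
    X_perm = [row[:col] + [value] + row[col + 1:] for row, value in zip(X_val, shuffled)]
    return mse(predict, X_perm, y_val) - baseline


def drop_column_importance(fit, X_train, y_train, X_val, y_val, col):
    """Error increase after removing one column and re-training the model."""
    def drop(X):
        return [row[:col] + row[col + 1:] for row in X]
    baseline = mse(fit(X_train, y_train), X_val, y_val)
    reduced_model = fit(drop(X_train), y_train)
    return mse(reduced_model, drop(X_val), y_val) - baseline


rng = random.Random(0)


def make_data(n):
    # Feature 0 drives the target, feature 1 is pure noise.
    X = [[rng.random(), rng.random()] for _ in range(n)]
    y = [3.0 * row[0] for row in X]
    return X, y


X_train, y_train = make_data(200)
X_val, y_val = make_data(100)
model = fit_1nn(X_train, y_train)

perm = [permutation_importance(model, X_val, y_val, c, rng) for c in range(2)]
dropped = [drop_column_importance(fit_1nn, X_train, y_train, X_val, y_val, c)
           for c in range(2)]
```

Permutation importance reuses the already trained model, so it is the cheap option; drop-column importance re-trains once per feature, and its score can even turn negative when removing a noisy feature helps.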

Why it is useful / caveats

(0) If you really care about understanding your domain - feature importances are a must have

(1) All of this works only for powerful models

(2) Landmines include - correlated or duplicate variables, data normalization

Correlated variables

(0) For RF - correlated variables share permutation importance roughly proportionally to their correlation

(1) Drop column importance can behave unpredictably

I personally like engineering different kinds of features and doing ablation tests:

(0) Among feature sets, sharing similar purpose

(1) Within feature sets


snakers4 (Alexander), June 28, 15:22

Playing with PyTorch 0.4

It was released some time ago

If you are not aware - this is the best summary

My first-hand experiences

- Multi-GPU support works strangely

- If you just launch your 0.3 code it will work on 0.4 with warnings - not a really breaking change

- All the new features are really cool, useful and make using PyTorch even more delightful

- I especially liked how they added context managers and cleaned up the device mess


snakers4 (Alexander), June 28, 11:18

DL Framework choice - 2018

If you are still new to DL / DS / ML and have not yet chosen your framework, consider reading this before proceeding



snakers4 (Alexander), June 28, 07:59

Forwarded from Hacker News:

Python 3.7 released (Score: 100+ in 2 hours)



Python Release Python 3.7.0

The official home of the Python Programming Language

snakers4 (Alexander), June 28, 07:43

2018 DS/ML digest 13

Blog posts / articles:

(0) Google notes on CNN generalization -

(1) Google on teaching robots in a virtual environment and then transferring the models to reality -

(2) Google's object tracking via image colorization -

(3) Interesting articles about VAEs:

- A small intro into VAEs

- A small intuitive intro (super super cool and intuitive)

- KL divergence explained

- A more formal write-up

- In (RU)

- Converting a FC layer into a conv layer

- A post by Fchollet
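The "FC layer into a conv layer" item above boils down to reshaping the weights: a fully-connected layer over a flattened H x W input computes the same thing as a convolution whose kernel is H x W. A minimal sketch in plain Python (single channel, single output unit, hypothetical toy weights; deep learning frameworks call this operation convolution although it is technically cross-correlation):

```python
def fully_connected(flat_input, weights, bias):
    """One output unit of a fully-connected layer."""
    return sum(w * x for w, x in zip(weights, flat_input)) + bias


def conv2d_valid(image, kernel, bias):
    """Single-channel 'valid' convolution (cross-correlation, as in DL frameworks)."""
    kh, kw = len(kernel), len(kernel[0])
    output = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            acc = bias
            for di in range(kh):
                for dj in range(kw):
                    acc += kernel[di][dj] * image[i + di][j + dj]
            row.append(acc)
        output.append(row)
    return output


# FC weights for a flattened 2x2 input, and the same weights reshaped into a 2x2 kernel.
fc_weights, bias = [1.0, 2.0, 3.0, 4.0], 0.5
kernel = [fc_weights[0:2], fc_weights[2:4]]

patch = [[1.0, 0.0],
         [2.0, -1.0]]
flat_patch = [value for row in patch for value in row]

fc_out = fully_connected(flat_patch, fc_weights, bias)
conv_out = conv2d_valid(patch, kernel, bias)

# On a larger input the conv version slides the former FC layer over every 2x2 window.
larger = [[1.0, 0.0, 1.0],
          [2.0, -1.0, 0.0],
          [0.0, 1.0, 1.0]]
dense_map = conv2d_valid(larger, kernel, bias)
```

The practical payoff is the last step: the converted layer accepts inputs larger than the original training size and returns a map of predictions instead of a single one.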

A good in-depth write-up on object detection:


- finally a decent explanation of YOLO parametrization

- best comparison of YOLO and SSD ever -

Papers with interesting abstracts (just good to know such things exist)

- Low-bit CNNs -

- Automated Meta ML -

- Idea - use ResNet blocks for boosting -

- 2D-discrete-Fourier transform (2D-DFT) to encode rotational invariance in neural networks -

- Smallify the CNNs -

- BLEU review as a metric - conclusion - it is good on average to measure MT performance -

"New" ideas in SemSeg:

- UNET + conditional VAE

- Dilated convolutions for large satellite images - it looks like this works only if you have high resolution with small objects



How Can Neural Network Similarity Help Us Understand Training and Generalization?

Posted by Maithra Raghu, Google Brain Team and Ari S. Morcos, DeepMind In order to solve tasks, deep neural networks (DNNs) progressively...

snakers4 (Alexander), June 26, 19:25

This AI Detects DeepFakes | Two Minute Papers #259
The paper "FaceForensics: A Large-scale Video Dataset for Forgery Detection in Human Faces" is available here:

snakers4 (Alexander), June 26, 07:02

If someone needs a dataset, Kaggle launched ImageNet object detection


There is an open images dataset, which I guess is bigger though


ImageNet Object Localization Challenge

Identify the objects in images

snakers4 (Alexander), June 25, 15:57

Forwarded from Just links:

OpenAI Five

Our team of five neural networks, OpenAI Five, has started to defeat amateur human teams at Dota 2.

snakers4 (Alexander), June 25, 10:53

A subscriber sent a really decent CS university scientific ranking

Useful, if you want to apply for CS/ML based Ph.D. there


Transformer in PyTorch

Looks like somebody implemented OpenAI's recent transformer fine-tuning in PyTorch





pytorch-openai-transformer-lm - A PyTorch implementation of OpenAI's finetuned transformer language model with a script to import the weights pre-trained by OpenAI

snakers4 (Alexander), June 23, 16:46

Better Video Impersonations with AI | Two Minute Papers #258
The paper "Deep Video Portraits" is available here:

snakers4 (Alexander), June 23, 12:10

Interesting links about Internet

- Ben Evans' digest -

- China plans to track cars -

- Ben Evans - content is not king anymore - distribution / eco-system are

- Google opens AI center in Ghana -

- (RU) A funny case on censorship in Russia - funny article deleted from habr -

-- It kind of clearly shows that you cannot safely post anything to habr

- India + WhatsApp + lynch mobs -

- Tor foundation about web-tracking and Facebook -

- Docker image jacking for crypto-mining -

- Ethereum - 75% transactions automated bots -

- (RU) - analyzing fake elections in Russia - 3-10M votes are fake -


2018 DS/ML digest 12

As usual, this is whatever I found really interesting / worth reading.

Implementations / papers / ideas


You can count bees well with UNet -


A really super cool idea - use affine transformations in 3D to stack augmentations on the level of transformation matrices

(3D augs are costly)


- both for rotation and scaling

- note a couple of things for easier understanding:

-- there is an offset in the transformations, because the coordinate origin is not at the center of the volume

-- zoom essentially scales unit vectors after applying the offset

- 3Blue1Brown videos about linear algebra -
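The stacking idea above can be sketched with 4x4 homogeneous matrices: each augmentation is wrapped with the center offset, the matrices are multiplied once, and the (costly) resampling then only has to happen once. The code below is an illustrative sketch with hypothetical helper names and a made-up 32x32x32 volume; it only transforms a single coordinate and leaves out the resampling itself.

```python
import math


def matmul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def translation(tx, ty, tz):
    return [[1.0, 0.0, 0.0, tx], [0.0, 1.0, 0.0, ty], [0.0, 0.0, 1.0, tz], [0.0, 0.0, 0.0, 1.0]]


def rotation_z(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0, 0.0], [s, c, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]


def zoom(factor):
    return [[factor, 0.0, 0.0, 0.0], [0.0, factor, 0.0, 0.0],
            [0.0, 0.0, factor, 0.0], [0.0, 0.0, 0.0, 1.0]]


def about_center(matrix, center):
    """Apply the offset: move the center to the origin, transform, move back."""
    cx, cy, cz = center
    return matmul(translation(cx, cy, cz), matmul(matrix, translation(-cx, -cy, -cz)))


def apply(matrix, point):
    vec = (point[0], point[1], point[2], 1.0)
    return tuple(sum(matrix[i][k] * vec[k] for k in range(4)) for i in range(3))


center = (16.0, 16.0, 16.0)                      # center of a 32x32x32 volume (assumed)
rotate = about_center(rotation_z(math.pi / 2), center)
scale = about_center(zoom(2.0), center)
combined = matmul(scale, rotate)                 # rotate first, then scale

point = (17.0, 16.0, 16.0)
stacked = apply(combined, point)                 # one matrix, one pass
sequential = apply(scale, apply(rotate, point))  # two separate passes
```

Both routes land on the same coordinate, which is exactly what makes it safe to multiply the matrices up front.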


A top solution from Google's Landmark Challenge -


- ensemble of features / skip connections from a CNN (ResNeXt)


- use KNN + augment the extracted features by averaging with similar images

- query expansion (use the fact that different crops of the same landmark remain the same landmark)
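A rough sketch of the retrieval part, under loud assumptions: the descriptors below are made-up 2D vectors instead of CNN features, and `expand_query` implements a simple average-based query expansion (the query is averaged with its top-k neighbours and re-normalized), which may differ in detail from what the top solution actually did.

```python
import math


def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity; assumes both vectors are already L2-normalized."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query, database, k):
    """Indices of the k database descriptors most similar to the query."""
    ranked = sorted(range(len(database)),
                    key=lambda i: cosine(query, database[i]), reverse=True)
    return ranked[:k]


def expand_query(query, database, k):
    """Average the query with its k nearest neighbours and re-normalize."""
    summed = list(query)
    for i in top_k(query, database, k):
        summed = [s + d for s, d in zip(summed, database[i])]
    return normalize(summed)


# Toy descriptors: two images of "landmark A" and two of "landmark B".
database = [normalize(v) for v in ([1.0, 0.1], [1.0, -0.1], [0.1, 1.0], [-0.1, 1.0])]
query = normalize([1.0, 0.6])                    # a noisy view of landmark A

neighbours = top_k(query, database, k=2)
expanded = expand_query(query, database, k=2)
```

The expanded query sits closer to the landmark A cluster than the original noisy query, so a second retrieval round with it tends to be more robust.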


(RU) A super cool series about interesting clustering algorithms

- Affinity propagation




- (spoiler - in practice use awesome HDBSCAN library)


Brief review of image super-resolution techniques


- In a nutshell, try in this order: FCN CNNs, auto-encoders with skip connections, or GANs


SOTA NLP by OpenAI

Key ideas

- Train a transformer language model on a large corpus in an unsupervised way

- Fine-tune on a smaller task

- Profit


- "Our approach requires an expensive pre-training step - 1 month on 8 GPUs" (probably this should be discounted somewhat)

- TF and unreadable enterprise code


One more claimed SOTA word embedding set


A cool github page by Sebastian Ruder to track major NLP tasks



Amazing visual explanations of how decision trees work


- it explains visually how overfitting occurs in decision tree models


CIFAR T-SNE can be done in real-time on the GPU + tensorflow.js integration

- Blog

- Website

- Arxiv -

- Demo -

(2) Why people fail to use d3.js -


(0) Nice idea - use available tools and videos to collect datasets