Spark in me - Internet, data science, math, deep learning, philosophy

snakers4 @ telegram, 1317 members, 1587 posts since 2016

All this - lost like tears in rain.

Data science, deep learning, sometimes a bit of philosophy and math. No bs.

Our website
Our chat
DS courses review

snakers4 (Alexander), July 21, 07:51

Yet another Kaggle competition with high prizes and an easy challenge


TGS Salt Identification Challenge

Segment salt deposits beneath the Earth's surface

snakers4 (Alexander), July 18, 05:39

Lazy failsafe in PyTorch Data Loader

Sometimes you train a model and testing all the combinations of augmentations / keys / params in your dataloader is too difficult. Or the dataset is too large, so it would take some time to check it properly.

In such cases I usually used some kind of failsafe try/except.

But it looks like an even simpler approach works:

if img is None:
    # do not return anything
    pass
else:
    return img
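A sample that "returns nothing" is still passed on as None to the collation step, so something there has to tolerate it. A minimal sketch of one way to do that, assuming a custom collate function is passed to the loader: drop the None samples and collate the rest. The name `drop_none_collate` and the `base_collate` parameter are made up for illustration; with PyTorch, `base_collate` would typically be `torch.utils.data.dataloader.default_collate`.

```python
def drop_none_collate(batch, base_collate=list):
    """Drop samples the dataset returned as None, then collate the rest.

    `base_collate` is a hypothetical hook: any callable that turns a list
    of samples into a batch (plain `list` here to stay dependency-free).
    """
    kept = [sample for sample in batch if sample is not None]
    if not kept:
        # Every sample in this batch failed; the caller has to skip it.
        return None
    return base_collate(kept)


# Toy usage: the second "sample" failed to load.
print(drop_none_collate([1, None, 3]))
```

Passed as `collate_fn=drop_none_collate`, this gives smaller batches when some samples fail and None when all of them fail, so the training loop should skip None batches.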



snakers4 (Alexander), July 17, 08:51

Colab SeedBank

- TF is everywhere (naturally) - but at least they use Keras

- On the other hand - all of the files are (at least now) downloadable via .ipynb or .py

- So - it may be a good place to look for boilerplate code

Also some interesting facts that are not mentioned openly

- Looks like they use Tesla K80s, which in practice are 2.5-3x slower than a 1080 Ti


- Full screen notebook format is clearly inspired by Jupyter plugins

- Ofc there is a time limit for GPU scripts and GPU availability is not guaranteed (reported by people who used it)

- Personally - it looks a bit like the slow instances from FloydHub - time limitations / slow GPU etc.

In a nutshell - a perfect source of boilerplate code + a playground for new people.


Benchmarking Tensorflow Performance and Cost Across Different GPU Options

Machine learning practitioners - from students to professionals - understand the value of moving their work to GPUs. Without one, certain…

snakers4 (Alexander), July 17, 08:32

snakers4 (Alexander), July 16, 17:37

Style Transfer...For Smoke and Fluids! | Two Minute Papers #264
The paper "Example-based Turbulence Style Transfer" is available here: Pick up cool p...

snakers4 (Alexander), July 15, 08:54

Sometimes in supervised ML tasks leveraging the data structure in a self-supervised fashion really helps!

Playing with CrowdAI mapping competition

In my opinion it is a good testing ground for your SemSeg ideas - as the dataset is really clean and balanced




Playing with Crowd-AI mapping challenge - or how to improve your CNN performance with self-supervised techniques

In this article I describe a couple of neat optimizations / tricks / useful ideas that can be applied to many SemSeg / ML tasks

snakers4 (spark_comment_bot), July 15, 06:16

Feeding images / tensors of different size using PyTorch dataloader classes

Struggled to do this properly on DS Bowl (I resorted to random crops there for training and 1-image sized batches for validation).

Suppose your dataset has some internal structure in it.

For example - you may have images of vastly different aspect ratios (3x1, 1x3 and 1x1) and you would like to squeeze every bit of performance from your pipeline.

Of course, you may pad your images / center-crop them / random crop them - but in this case you will lose some of the information.

I played with this on some tasks - sometimes force-resize works better than crops, but applying the model convolutionally worked really well on SemSeg challenges.

So it may work very well on plain classification as well.

So, if you apply your model convolutionally, you will end up with differently-sized feature maps for each cluster of images.

Within the model, it can be fixed with:

(0) Adaptive avg pooling layers

(1) Some simple logic in .forward statement of the model
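To illustrate option (0): global average pooling collapses every channel to a single number, so feature maps of any spatial size end up as vectors of the same length. Below is a dependency-free sketch on nested lists; the function name and the toy inputs are made up for illustration. In PyTorch the corresponding layer is `torch.nn.AdaptiveAvgPool2d(1)`.

```python
def global_avg_pool(feature_map):
    """Average each channel of a C x H x W feature map given as nested lists."""
    pooled = []
    for channel in feature_map:
        values = [value for row in channel for value in row]
        pooled.append(sum(values) / len(values))
    return pooled


# Two feature maps with 2 channels each but different spatial sizes.
wide = [[[1.0, 2.0, 3.0]], [[4.0, 5.0, 6.0]]]   # 2 x 1 x 3
tall = [[[1.0], [3.0]], [[2.0], [4.0]]]         # 2 x 2 x 1

# Both reduce to a vector with one entry per channel.
print(global_avg_pool(wide), global_avg_pool(tall))
```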

But anyway you end up with a small technical issue - PyTorch cannot concatenate tensors of different sizes using the standard collation function.

Theoretically, there are several ways to fix this:

(0) Stupid solution - create N datasets, train on them sequentially.

In practice I tried that on DS Bowl - it worked poorly - the model overfitted to each cluster, and then performed poorly on next one;

(1) Crop / pad / resize images (suppose you deliberately want to avoid that);

(2) Insert some custom logic into the PyTorch collation function, i.e. resize there;

(3) Just sample images so that only images of one size end up within each batch;

(0) and (1) I would like to avoid intentionally.

(2) seems a bit stupid as well, because resizing should be done as a pre-processing step (the collation function deals with normalized tensors, not images) and it is better not to mix the purposes of your modules.

Ofc, you can try to produce N tensors in (2) - i.e. tensor for each image size, but that would require additional loop downstream.

In the end, I decided that (3) is the best approach - because it can be easily transferred to other datasets / domains / tasks.

Long story short - here is my solution - I just extended their sampling function:

Maybe it is worth a PR on Github?

What do you think?
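As a minimal sketch of approach (3) (not the author's actual code), assuming the size of every image is known up front: group the indices by size and only build batches inside one group. The class name and its arguments are hypothetical. The object is a plain iterable of index lists, which is the kind of object the `batch_sampler` argument of PyTorch's `DataLoader` accepts.

```python
import random
from collections import defaultdict


class SameSizeBatchSampler:
    """Yield batches of dataset indices where all items share one size."""

    def __init__(self, sizes, batch_size, shuffle=True, seed=None):
        # sizes[i] is any hashable description of item i, e.g. (height, width)
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.rng = random.Random(seed)
        self.buckets = defaultdict(list)
        for index, size in enumerate(sizes):
            self.buckets[size].append(index)

    def __iter__(self):
        batches = []
        for indices in self.buckets.values():
            indices = list(indices)
            if self.shuffle:
                self.rng.shuffle(indices)  # shuffle inside each size group
            for start in range(0, len(indices), self.batch_size):
                batches.append(indices[start:start + self.batch_size])
        if self.shuffle:
            self.rng.shuffle(batches)  # mix size groups across the epoch
        return iter(batches)

    def __len__(self):
        # Ceiling division per bucket: the last batch of a bucket may be smaller.
        return sum(-(-len(v) // self.batch_size) for v in self.buckets.values())


sizes = [(3, 1), (1, 3), (3, 1), (1, 1), (1, 3), (3, 1)]
for batch in SameSizeBatchSampler(sizes, batch_size=2, seed=0):
    print(batch, {sizes[i] for i in batch})
```

Shuffling the batches across size groups is a deliberate choice here: it avoids feeding the model one cluster after another, which is the failure mode described for option (0).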



Like this post or have something to say => tell us more in the comments or donate!

[feature request] Support tensors of different sizes as batch elements in DataLoader #1512

Motivating example is returning bounding box annotation for images along with an image. An annotation list can contain variable number of boxes depending on an image, and padding them to a single length (and storing that length) may be n...

snakers4 (Alexander), July 14, 10:57

Once again stumbled upon this amazing PyTorch related post

For those learning PyTorch



Feedback on PyTorch for Kaggle competitions

Hello team, Great work on PyTorch, keep the momentum. I wanted to try my hands on it with the launch of the new MultiLabeling Amazon forest satellite images on Kaggle. Note: new users can only post 2 links in a post so I can’t direct link everything I created the following code as an example this weekend to load and train a model on Kaggle data and wanted to give you my feedback on PyTorch. I hope it helps you. Loading data that is not from a regular dataset like MNIST or CIFAR is confusin...

snakers4 (Alexander), July 13, 17:05

Forwarded from Админим с Буквой:

Git commit messages

How to commit to git properly. A good article from Habr:

#thirdparty #read #git

How to write commit messages

Translator's preface: Over many years of software development, having been part of many teams and worked with various good and experienced people, I often...

snakers4 (Alexander), July 13, 09:15

Tensorboard + PyTorch

Looked at this 6 months ago - and it was messy

Now it looks really polished



tensorboard-pytorch - tensorboard for pytorch (and chainer, mxnet, numpy, ...)

snakers4 (spark_comment_bot), July 13, 05:22

2018 DS/ML digest 17

Highlights of the week

(0) Troubling trends with ML scholars

(1) NLP close to its ImageNet stage?

Papers / posts / articles

(0) Working with multi-modal data

- concatenation-based conditioning

- conditional biasing or scaling ("residual" connections)

- sigmoidal gating

- all in all this approach seems like a mixture of attention / gating for multi-modal problems

(1) Glow, a reversible generative model which uses invertible 1x1 convolutions

(2) Facebook's moonshots - I kind of do not understand much here


(3) RL concept flaws?


(4) Intriguing failures of convolutions - this is fucking amazing

(5) People are only STARTING to apply ML to reasoning

Yet another online book on Deep Learning

(1) Kind of standard - "Grokking Deep Learning" (/book/grokking-deep-learning/chapter-1/v-10/1)

Libraries / code

(0) Data version control continues to develop





Troubling Trends in Machine Learning Scholarship

By Zachary C. Lipton* & Jacob Steinhardt* *equal authorship Originally presented at ICML 2018: Machine

snakers4 (Alexander), July 12, 17:45

DeepMind's AI Learns To See | Two Minute Papers #263
Pick up cool perks on our Patreon page: Crypto and PayPal links are available below. Thank you very much for your gen...

snakers4 (Alexander), July 12, 05:11

Hadoop job in Moscow

No bullshit. The salary is net.


Vacancy: Junior / Middle / Senior Hadoop developer


Junior / Middle / Senior Hadoop developer, 500-1500 RUB/hour net (negotiated based on the interview).

From my own experience I can say that we try to do everything flexibly, quickly, and with a minimum of bullshit.

We are looking for a Hadoop developer to take part in a project from June to December 2018, and possibly beyond. The team has a team lead and admins.

The project has dozens of big data sources; the job is to develop integrations and compute derived data marts.

In plain terms, there is a lot of data from different systems; it needs to be loaded into our platform, then something has to be computed on top of it, then computed again (repeat as many times as necessary), and perhaps passed further down the chain. Unlike standard ETL, the catch is in the ecosystem and the data volume.

Full-time work in the office is preferred, but remote options may be considered.

The workload may vary, 30-50 hours per week. We expect this to be your main job. Combining it with another full-time job is not an option.


- understanding of the overall Hadoop architecture

- git, good development practices

- confident work with Linux

- knowledge of SQL (Hive)

- knowledge of Spark (Python)


- familiarity with Hortonworks Data Platform

- Java

Additional note: if you are a junior who wants to gain expertise in a domain that is currently fairly closed and short of people on the market, this is a good chance. A junior is expected to know at least Python, be comfortable with Linux and git, and show initiative and a willingness to learn, including on their own.

Contact: @dilyara_tchk, and no, I am not HR.

snakers4 (Alexander), July 11, 16:26

Looks cool

But Mendeley so far looks better


Scholarcy reads and summarises research papers, generates a background reading list, highlights important sentences, and finds open access versions of referenced papers.

snakers4 (Alexander), July 11, 06:51

TF 1.9

Funnily enough, they call Keras not "Keras with TF back-end", but "tf.keras"




An Open Source Machine Learning Framework for Everyone - tensorflow/tensorflow

snakers4 (Alexander), July 10, 04:35

Ofc such experiments are done on toy datasets - but it's nice to know

Forwarded from Just links:

Adaptive Blending Units: Trainable Activation Functions for Deep Neural Networks

Forwarded from Hacker News:

NLP's ImageNet moment has arrived (Score: 100+ in 5 hours)



NLP's ImageNet moment has arrived

The time is ripe for practical transfer learning to make inroads into NLP.

snakers4 (Alexander), July 09, 16:17

An Introduction to Hashing in the Era of Machine Learning

In December 2017, researchers at Google and MIT published a provocative research paper about their efforts into “learned index structures”…

snakers4 (Alexander), July 09, 09:04

2018 DS/ML digest 16

Papers / posts

(0) RL now solves Quake

(1) A post about AdamW

-- Adam generally requires more regularization than SGD, so be sure to adjust your regularization hyper-parameters when switching from SGD to Adam

-- Amsgrad turns out to be very disappointing

-- Refresher article
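For context on what AdamW changes, here is a minimal single-parameter sketch, assuming the usual Adam update with bias correction. With classic L2 regularization the decay term is added to the gradient and then rescaled by Adam's adaptive step; with decoupled weight decay (the AdamW variant) it is subtracted from the weight directly. The function name and default values are illustrative only.

```python
import math


def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              weight_decay=0.0, decoupled=False):
    """One Adam update for a scalar weight; returns the new (w, m, v)."""
    w_old = w
    if weight_decay and not decoupled:
        # L2 regularization: the decay goes through the adaptive scaling.
        grad = grad + weight_decay * w_old
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w_old - lr * m_hat / (math.sqrt(v_hat) + eps)
    if weight_decay and decoupled:
        # Decoupled weight decay: shrink the weight directly.
        w = w - lr * weight_decay * w_old
    return w, m, v


# Zero loss gradient, so only the regularization moves the weight.
l2_w, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.01, weight_decay=0.1)
dec_w, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.01, weight_decay=0.1,
                        decoupled=True)
print(l2_w, dec_w)
```

In this toy case the L2 variant moves the weight by roughly the full learning rate regardless of how small the decay coefficient is, while the decoupled variant shrinks it by `lr * weight_decay * w`. That difference is one reason regularization hyper-parameters do not transfer directly between the two.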

(2) How to tackle new classes in CV

(3) A new word in GANs?



(4) Using deep learning representations for search


-- a library for fast search in Python

(5) One more paper on GAN convergence

(6) Switchable normalization - adds a bit to ResNet50 + pre-trained models


(0) Disney starts to release datasets

Market / interesting links

(0) A motion to open-source GitHub

(1) Allegedly GTX 1180 cards are starting to appear on sale in Asia (?)

(2) Some controversy regarding Andrew Ng and self-driving cars

(3) National AI strategies overviewed -

-- Canada C$135m

-- China has the largest strategy

-- Notably - countries like Finland also have one

(4) Amazon allegedly sells face recognition to the USA



Google’s DeepMind taught AI teamwork by playing Quake III Arena

Google’s DeepMind today shared the results of training multiple AI systems to play Capture the Flag on Quake III Arena, a multiplayer first-person shooter game. The AI played nearly 450,000 g…

snakers4 (Alexander), July 08, 17:47

Yet another proxy - shadowsocks

If someone needs another proxy guide, someone with an Arabic username shared some alternative advice for proxy configuration

- (wait a bit till link resolves)



Playing with a simple SOCKS5 proxy server on Digital Ocean and Ubuntu 16

This article tells you how to start your SOCKS5 proxy with little to no experience

snakers4 (Alexander), July 08, 06:16

Disclaimer - it does not support pivot tables or complicated group_by ...

snakers4 (Alexander), July 08, 06:06

A new multi-threaded addition to pandas stack?

Read about this some time ago (when this was just in development) - found essentially 3 alternatives

- just being clever about optimizing your operations + using what is essentially a multi-threaded map/reduce in pandas

- pandas on ray

- dask (overkill)





So...I ran a test in the notebook I had on hand. It works. More tests will be done in future.



Spark in me - Internet, data science, math, deep learning, philosophy

Pandas on Ray - RISE Lab

snakers4 (Alexander), July 08, 04:52

Convolutional Network Demo from 1993
This is a demo of "LeNet 1", the first convolutional network that could recognize handwritten digits with good speed and accuracy. It was developed between 1...

snakers4 (spark_comment_bot), July 07, 12:29

Playing with VAEs and their practical use

So, I played a bit with Variational Auto Encoders (VAE) and wrote a small blog post on this topic

Please like, share and repost!




Playing with Variational Auto Encoders - PCA vs. UMAP vs. VAE on FMNIST / MNIST

In this article I thoroughly compare the performance of VAE / PCA / UMAP embeddings on a simplistic domain - UMAP

snakers4 (Alexander), July 07, 08:27

Epicycles, complex Fourier series and Homer Simpson's orbit
Today’s video was motivated by an amazing animation of a picture of Homer Simpson being drawn using epicycles. This video is about making sense of the mathem...

snakers4 (Alexander), July 06, 15:33

Forwarded from Админим с Буквой:

Bash shortcuts

I wrote a micro lab exercise for learning bash hotkeys.

#bash_tips_and_tricks #junior

bash shortcuts

A small lab for learning the main bash hotkeys. Prepare a line like this for yourself:

snakers4 (Alexander), July 05, 05:54

XGB - now on GPU properly?

Joshua Patterson

#XGBoost is faster than ever, with better scaling, on #GPU thanks to the hard work of @nvidia & @h2oai! Check out the latest paper, and more is coming very soon! #lightgbm #catboost #GBDT

snakers4 (Alexander), July 04, 16:56

Forwarded from Just links:


You want to know something about how bullshit insane our brains are? OK, so there's a physical problem with our eyes: We move them in short fast bursts called "saccades", right? very quick, synchronized movements. The only problem is: they go all blurry and useless during this

snakers4 (Alexander), July 04, 14:16

Machine Learning Research & Interpreting Neural Networks
Machine learning and neural networks change how computers and humans interact, but they can be complicated to understand. In this episode of Coffee with a Go...

snakers4 (Alexander), July 04, 08:11

Forwarded from Savva Kolbachev:

Comparing Sentence Similarity Methods

Natural Language Processing Consultancy and Development.

snakers4 (Alexander), July 04, 07:57

2018 DS/ML digest 15

What I filtered through this time

Market / news

(0) Letters by big company employees against using ML for weapons

- Microsoft

- Amazon

(1) Facebook open sources Dense Pose (essentially this is Mask-RCNN)


Papers / posts / NLP

(0) One more blog post about text / sentence embeddings

- key idea - different weighting

(1) One more sentence embedding calculation method

- ?

(2) Posts explaining NLP embeddings

- some basics - SVD / Word2Vec / GloVe

-- SVD improves embedding quality (as compared to ohe)?

-- use log-weighting, use TF-IDF weighting (the above weighting)

- word embedding properties

-- dimensions vs. embedding quality
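The weighting idea mentioned above (log-weighting plus TF-IDF weighting) can be sketched for sentence embeddings as follows. This is a minimal illustration, not the method from the linked posts: it assumes pre-tokenized text and a plain dict of word vectors, and all function names are made up.

```python
import math
from collections import Counter


def idf_weights(tokenized_docs):
    """Inverse document frequency (+1 so common tokens keep a non-zero weight)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each token once per document
    return {tok: math.log(n_docs / df) + 1.0 for tok, df in doc_freq.items()}


def weighted_sentence_embedding(tokens, vectors, idf):
    """Average word vectors, weighting each by log-scaled TF times IDF."""
    counts = Counter(tok for tok in tokens if tok in vectors)
    if not counts:
        return None  # no known words in the sentence
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, tf in counts.items():
        weight = (1.0 + math.log(tf)) * idf.get(tok, 1.0)
        weight_sum += weight
        for i, value in enumerate(vectors[tok]):
            total[i] += weight * value
    return [value / weight_sum for value in total]


docs = [["the", "cat"], ["the", "dog"]]
vectors = {"the": [1.0, 0.0], "cat": [0.0, 1.0]}  # toy 2-d embeddings
idf = idf_weights(docs)
# "cat" is rarer than "the", so it dominates the sentence vector.
print(weighted_sentence_embedding(["the", "cat"], vectors, idf))
```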

(3) spaCy + Cython = 100x speed boost - good to know about this as a last resort

- described use-case

you are pre-processing a large training set for a deep learning framework like PyTorch/TensorFlow

or you have heavy processing logic in your deep learning batch loader that slows down your training

(4) Once again stumbled upon this -

(5) Papers

- Simple NLP embedding baseline

- NLP decathlon for question answering

- Debiasing embeddings

- Once again transfer learning in NLP by OpenAI -



