Spark in me - Internet, data science, math, deep learning, philo

snakers4 @ telegram, 1319 members, 1513 posts since 2016

All this - lost like tears in rain.

Data science, deep learning, sometimes a bit of philosophy and math. No bs.

Our website
- spark-in.me
Our chat
- goo.gl/WRm93d
DS courses review
- goo.gl/5VGU5A
- goo.gl/YzVUKf

snakers4 (Alexander), July 15, 08:54

Sometimes in supervised ML tasks, leveraging the data structure in a self-supervised fashion really helps!

Playing with CrowdAI mapping competition

In my opinion it is a good testing ground for your SemSeg ideas - the dataset is really clean and balanced

spark-in.me/post/a-small-case-for-search-of-structure-within-your-data

#deep_learning

#data_science

#satellite_imaging

Playing with Crowd-AI mapping challenge - or how to improve your CNN performance with self-supervised techniques

In this article I describe a couple of neat optimizations / tricks / useful ideas that can be applied to many SemSeg / ML tasks. Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), July 13, 17:05

Forwarded from Админим с Буквой:

Git commit messages

How to commit to git properly. A good article from Habr:

habr.com/post/416887/

#thirdparty #read #git

How to write commit messages

Translator's foreword: Over many years of software development, having been part of many teams and worked with various good and experienced people, I often...


snakers4 (Alexander), July 13, 09:15

Tensorboard + PyTorch

I looked at this 6 months ago - and it was messy

Now it looks really polished

github.com/lanpa/tensorboard-pytorch

#data_science

lanpa/tensorboard-pytorch

tensorboard-pytorch - tensorboard for pytorch (and chainer, mxnet, numpy, ...)


snakers4 (spark_comment_bot), July 13, 05:22

2018 DS/ML digest 17

Highlights of the week

(0) Troubling trends with ML scholars

approximatelycorrect.com/2018/07/10/troubling-trends-in-machine-learning-scholarship/

(1) NLP close to its ImageNet stage?

thegradient.pub/nlp-imagenet/

Papers / posts / articles

(0) Working with multi-modal data distill.pub/2018/feature-wise-transformations/

- concatenation-based conditioning

- conditional biasing or scaling ("residual" connections)

- sigmoidal gating

- all in all this approach seems like a mixture of attention / gating for multi-modal problems
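
The three conditioning styles listed in item (0) can be sketched in a few lines of NumPy. This is only an illustration, not code from the distill.pub article: the function names, the shapes and the random projection matrices are all assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

batch, n_features, n_cond = 4, 8, 3
x = rng.normal(size=(batch, n_features))   # features from the main modality
z = rng.normal(size=(batch, n_cond))       # conditioning input (second modality)

# Hypothetical linear projections of the conditioning input
W_gamma = rng.normal(size=(n_cond, n_features))
W_beta = rng.normal(size=(n_cond, n_features))


def concat_conditioning(x, z):
    """Concatenation-based conditioning: stack both inputs side by side."""
    return np.concatenate([x, z], axis=1)


def scale_and_bias(x, z):
    """Conditional scaling and biasing: gamma(z) * x + beta(z)."""
    gamma = z @ W_gamma
    beta = z @ W_beta
    return gamma * x + beta


def sigmoid_gate(x, z):
    """Sigmoidal gating: scale each feature by a value in (0, 1)."""
    gate = 1.0 / (1.0 + np.exp(-(z @ W_gamma)))
    return gate * x


concat_out = concat_conditioning(x, z)
film_out = scale_and_bias(x, z)
gated_out = sigmoid_gate(x, z)
```

In a real network the projections would be learned layers and the modulated features would feed the next layer; here they are random only to keep the sketch self-contained.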

(1) Glow, a reversible generative model which uses invertible 1x1 convolutions

blog.openai.com/glow/

(2) Facebook's moonshots - I kind of do not understand much here

- research.fb.com/facebook-research-at-icml-2018/

(3) RL concept flaws?

- thegradient.pub/why-rl-is-flawed/

(4) Intriguing failures of convolutions

eng.uber.com/coordconv/ - this is fucking amazing

(5) People are only STARTING to apply ML to reasoning

deepmind.com/blog/measuring-abstract-reasoning/

Yet another online book on Deep Learning

(1) Kind of standard livebook.manning.com/#!/book/grokking-deep-learning/chapter-1/v-10/1

Libraries / code

(0) Data version control continues to develop dvc.org/features

#deep_learning

#data_science

#digest

Like this post or have something to say => tell us more in the comments or donate!

Troubling Trends in Machine Learning Scholarship

By Zachary C. Lipton* & Jacob Steinhardt* *equal authorship Originally presented at ICML 2018: Machine


snakers4 (Alexander), July 12, 17:45

youtu.be/gnctSz2ofU4

DeepMind's AI Learns To See | Two Minute Papers #263
Pick up cool perks on our Patreon page: www.patreon.com/TwoMinutePapers PayPal and crypto links are available below. Thank you very much for your gen...

snakers4 (Alexander), July 12, 05:11

Hadoop job in Moscow

No bullshit. The salary is net.

telegra.ph/Vakansiya-Junior--Middle--Senior-hadoop-developer-07-12

#jobs

Job opening: Junior / Middle / Senior hadoop developer

dilyara_tchk

Junior / Middle / Senior hadoop developer, 500-1500 RUB/hour net (negotiable based on the interview).


From my own experience I can say that we try to do everything flexibly, quickly and with a minimum of bullshit.

We are looking for a hadoop developer to join a project running June-December 2018. Possibly longer. The team has a team lead and admins.

The project has dozens of big data sources; you will need to develop integrations and compute derived data marts.

In plain terms: there is a lot of data from different systems. It has to be loaded into our platform, then something has to be computed from it, then computed again (repeat as many times as needed), and possibly passed further down the chain. Unlike standard ETL, the catch is in the ecosystem and the data volume.

Full-time work in the office is preferred, but remote options can be considered.

The workload may vary, 30-50 hours per week. We expect this to be your main job. Combining it with another full-time job is not an option.

Requirements:

- understanding of the overall hadoop architecture

- git, good development practices

- confident work with Linux

- knowledge of SQL (Hive)

- knowledge of Spark (python)

Nice to have:

- familiarity with HortonWorks Data Platform

- Java

Additional note: if you are a junior who wants to gain expertise in a domain that is currently fairly closed and short of people on the market, this is a good chance. A junior is expected to know at least python, be comfortable with linux and git, show initiative and be willing to learn, including on their own.

Contact: @dilyara_tchk, and no, I am not HR.


snakers4 (Alexander), July 11, 16:26

www.scholarcy.com/

Looks cool

But Mendeley so far looks better

Home

Scholarcy reads and summarises research papers, generates a background reading list, highlight important sentences, and finds open access versions of referenced papers.  


snakers4 (Alexander), July 11, 06:51

TF 1.9

github.com/tensorflow/tensorflow/releases/tag/v1.9.0

Funnily enough, they call Keras not "Keras with TF back-end", but "tf.keras"

xD

#deep_learning

tensorflow/tensorflow

tensorflow - Computation using data flow graphs for scalable machine learning


snakers4 (Alexander), July 10, 04:38

Forwarded from Hacker News:

NLP's ImageNet moment has arrived (Score: 100+ in 5 hours)

Link: readhacker.news/s/3MB3E

Comments: readhacker.news/c/3MB3E

NLP's ImageNet moment has arrived

The time is ripe for practical transfer learning to make inroads into NLP.


snakers4 (Alexander), July 09, 09:04

2018 DS/ML digest 16

Papers / posts

(0) RL now solves Quake

venturebeat.com/2018/07/03/googles-deepmind-taught-ai-teamwork-by-playing-quake-iii-arena/

(1) A fast.ai post about AdamW

www.fast.ai/2018/07/02/adam-weight-decay/

-- Adam generally requires more regularization than SGD, so be sure to adjust your regularization hyper-parameters when switching from SGD to Adam

-- Amsgrad turns out to be very disappointing

-- Refresher article ruder.io/optimizing-gradient-descent/index.html#nadam
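
The distinction behind AdamW, weight decay applied directly to the weights versus L2 regularization folded into the gradient, can be sketched as follows. This is a plain-NumPy illustration with a hypothetical `adam_step` function and textbook default hyper-parameters, not the fast.ai or any library implementation.

```python
import numpy as np


def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=True):
    """One Adam update. With decoupled=True the decay bypasses the moment
    estimates (AdamW style); otherwise it is added to the gradient as
    plain L2 regularization."""
    if not decoupled:
        grad = grad + weight_decay * w
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)          # bias correction
    v_hat = v / (1 - beta2 ** t)
    w_new = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        w_new = w_new - lr * weight_decay * w
    return w_new, m, v


w = np.array([1.0, -2.0, 0.5])
zero_grad = np.zeros_like(w)
state = np.zeros_like(w)

# With a zero loss gradient only the decay term moves the weights
w_adamw, _, _ = adam_step(w, zero_grad, state, state, t=1,
                          weight_decay=0.1, decoupled=True)
w_l2, _, _ = adam_step(w, zero_grad, state, state, t=1,
                       weight_decay=0.1, decoupled=False)
```

In the decoupled variant every weight shrinks proportionally to its size. In the L2 variant Adam's normalization rescales the decay gradient, so on this first step every weight moves by roughly `lr` regardless of its magnitude, which is one way to see why the two are not interchangeable.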

(2) How to tackle new classes in CV

petewarden.com/2018/07/06/what-image-classifiers-can-do-about-unknown-objects/

(3) A new word in GANs?

-- ajolicoeur.wordpress.com/RelativisticGAN/

-- arxiv.org/pdf/1807.00734.pdf

(4) Using deep learning representations for search

-- goo.gl/R1vhTh

-- library for fast search on python github.com/spotify/annoy

(5) One more paper on GAN convergence

avg.is.tuebingen.mpg.de/publications/meschedericml2018

(6) Switchable normalization - adds a bit to ResNet50 + pre-trained models

github.com/switchablenorms/Switchable-Normalization

Datasets

(0) Disney starts to release datasets

www.disneyanimation.com/technology/datasets

Market / interesting links

(0) A motion to open-source GitHub

github.com/dear-github/dear-github/issues/304

(1) Allegedly, GTX 1180 cards are starting to appear on sale in Asia (?)

(2) Some controversy regarding Andrew Ng and self-driving cars goo.gl/WNW4E3

(3) National AI strategies overviewed - goo.gl/BXDCD7

-- Canada C$135m

-- China has the largest strategy

-- Notably - countries like Finland also have one

(4) Amazon allegedly sells face recognition to the USA goo.gl/eDzekn

#data_science

#deep_learning

Google’s DeepMind taught AI teamwork by playing Quake III Arena

Google’s DeepMind today shared the results of training multiple AI systems to play Capture the Flag on Quake III Arena, a multiplayer first-person shooter game. The AI played nearly 450,000 g…


snakers4 (Alexander), July 08, 06:16

Disclaimer - it does not support pivot tables or complicated group_by ...

snakers4 (Alexander), July 08, 06:06

A new multi-threaded addition to the pandas stack?

I read about this some time ago (when it was still in development snakers41.spark-in.me/1850) and found essentially 3 alternatives

- just being clever about optimizing your operations + using what is essentially a multi-threaded map/reduce in pandas snakers41.spark-in.me/1981

- pandas on ray

- dask (overkill)
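
The first alternative, a hand-rolled map/reduce over chunks of a DataFrame, might look roughly like the sketch below. It is an assumption about the general pattern, not the code from the linked post; `parallel_apply` and `add_total` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd


def parallel_apply(df, func, n_chunks=4, max_workers=4):
    """Split df into row chunks, apply func to each chunk in a thread pool,
    then concatenate the partial results in the original order."""
    bounds = np.linspace(0, len(df), n_chunks + 1, dtype=int)
    chunks = [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        parts = list(pool.map(func, chunks))   # map step
    return pd.concat(parts)                    # reduce step


def add_total(chunk):
    # Hypothetical per-chunk transformation
    return chunk.assign(total=chunk["a"] + chunk["b"])


df = pd.DataFrame({"a": np.arange(10), "b": np.arange(10) * 2})
result = parallel_apply(df, add_total)
```

Threads only help when the per-chunk work releases the GIL; for pure-Python functions a process pool is the usual substitute, at the cost of pickling each chunk.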

Links:

(0) rise.cs.berkeley.edu/blog/pandas-on-ray-early-lessons/

(1) www.reddit.com/comments/8wuz7e

(2) github.com/modin-project/modin

So...I ran a test in the notebook I had on hand. It works. More tests will be done in future.

pics.spark-in.me/upload/2c7a2f8c8ce1dd7a86a54ec3a3dcf965.png

#data_science

#pandas

Spark in me - Internet, data science, math, deep learning, philosophy

Pandas on Ray - RISE Lab https://rise.cs.berkeley.edu/blog/pandas-on-ray/


snakers4 (spark_comment_bot), July 07, 12:29

Playing with VAEs and their practical use

So, I played a bit with Variational Auto Encoders (VAE) and wrote a small blog post on this topic

spark-in.me/post/playing-with-vae-umap-pca

Please like, share and repost!

#deep_learning

#data_science

Like this post or have something to say => tell us more in the comments or donate!

Playing with Variational Auto Encoders - PCA vs. UMAP vs. VAE on FMNIST / MNIST

In this article I thoroughly compare the performance of VAE / PCA / UMAP embeddings on a simplistic domain - UMAP. Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), July 07, 08:27

youtu.be/qS4H6PEcCCA

Epicycles, complex Fourier and Homer Simpson's orbit
Today’s video was motivated by an amazing animation by Santiago Ginnobili of a picture of Homer Simpson being drawn using epicycles. This video is about maki...

snakers4 (Alexander), July 04, 08:11

Forwarded from :

nlp.town/blog/sentence-similarity/

Comparing Sentence Similarity Methods

Natural Language Processing Consultancy and Development.


snakers4 (Alexander), July 04, 07:57

2018 DS/ML digest 15

What I filtered through this time

Market / news

(0) Letters by big company employees against using ML for weapons

- Microsoft

- Amazon

(1) Facebook open sources Dense Pose (essentially this is Mask-RCNN)

- research.fb.com/facebook-open-sources-densepose/

Papers / posts / NLP

(0) One more blog post about text / sentence embeddings goo.gl/Zm8C2c

- key idea - different weighting

(1) One more sentence embedding calculation method

- openreview.net/pdf?id=SyK00v5xx ?

(2) Posts explaining NLP embeddings

- www.offconvex.org/2015/12/12/word-embeddings-1/ - some basics - SVD / Word2Vec / GloVe

-- SVD improves embedding quality (as compared to ohe)?

-- use log-weighting, use TF-IDF weighting (the above weighting)

- www.offconvex.org/2016/02/14/word-embeddings-2/ - word embedding properties

-- dimensions vs. embedding quality www.cs.princeton.edu/~arora/pubs/LSAgraph.jpg

(3) Spacy + Cython = 100x speed boost - goo.gl/9TwVqu - good to know about this as a last resort

- described use-cases

-- you are pre-processing a large training set for a deep learning framework like PyTorch / TensorFlow

-- or you have heavy processing logic in your deep learning batch loader that slows down your training

(4) Once again stumbled upon this - blog.openai.com/language-unsupervised/

(5) Papers

- Simple NLP embedding baseline goo.gl/nGujzS

- NLP decathlon for question answering goo.gl/6HHi7q

- Debiasing embeddings arxiv.org/abs/1806.06301

- Once again transfer learning in NLP by open-AI - goo.gl/82VR4U

#deep_learning

#digest

#data_science


snakers4 (Alexander), July 04, 05:12

Open Images Object detection on Kaggle

- www.kaggle.com/c/google-ai-open-images-object-detection-track#Description

- Key ideas

-- 1.2 images, high-res, 500 classes

-- decent prizes, but short time-span (2 months)

-- object detection

#deep_learning

Google AI Open Images - Object Detection Track

Detect objects in varied and complex images.


snakers4 (Alexander), July 03, 07:15

A cool article from Ben Evans about how to think about ML

www.ben-evans.com/benedictevans/2018/06/22/ways-to-think-about-machine-learning-8nefy

Ways to think about machine learning

We're now four or five years into the current explosion of machine learning, and pretty much everyone has heard of it, and every big company is working on projects around ‘AI’. We know this is a Next Big Thing. I don't think, though, that we yet have a settled sense of quite what machine learning m


My recent PyTorch 0.4 Dockerfile for CV

gist.github.com/snakers4/72ccc3d936f04a3307d20f1810b2fa81

#deep_learning

My PyTorch 0.4 Dockerfile


snakers4 (Alexander), July 02, 04:51

2018 DS/ML digest 14

Amazing article - why you do not need ML

- cyberomin.github.io/startup/2018/07/01/sql-ml-ai.html

- I personally love plain-vanilla SQL and in 90% of cases people under-use it

- I even wrote 90% of my JSON API on our blog in pure PostgreSQL xD

Practice / papers

(0) Interesting papers from CVPR towardsdatascience.com/the-10-coolest-papers-from-cvpr-2018-11cb48585a49

(1) Some down-to-earth obstacles to ML deploy habr.com/company/hh/blog/415437/

(2) Using synthetic data for CNNs (by Nvidia) - arxiv.org/pdf/1804.06516.pdf

(3) This puzzles me - so much effort and engineering spent on something ... strange and useless - taskonomy.stanford.edu/index.html

On paper they do a cool thing - investigate transfer learning between different domains, but in practice it is done on TF and there is no clear conclusion of any kind

(4) VAE + real datasets siavashk.github.io/2016/02/22/autoencoder-imagenet/ - only small Imagenet (64x64)

(5) Understanding the speed of models deployed on mobile - machinethink.net/blog/how-fast-is-my-model/

(6) A brief overview of multi-modal methods medium.com/mlreview/multi-modal-methods-image-captioning-from-translation-to-attention-895b6444256e

Visualizations / explanations

(0) Amazing website with ML explanations explained.ai/

(1) PCA and linear VAEs are close pvirie.wordpress.com/2016/03/29/linear-autoencoders-do-pca/

#deep_learning

#digest

#data_science

No, you don't need ML/AI. You need SQL

A while ago, I did a Twitter thread about the need to use traditional and existing tools to solve everyday business problems other than jumping on new buzzwords, sexy and often times complicated technologies.


snakers4 (Alexander), July 01, 11:48

Measuring feature importance properly

explained.ai/rf-importance/index.html

Once again stumbled upon an amazing article about measuring feature importance for any ML algorithms:

(0) Permutation importance - if re-training your ML algorithm is costly, you can just shuffle a column and check how much the score drops

(1) Drop column importance - drop a column, re-train a model, check performance metrics
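
Both measures can be sketched with NumPy alone. The toy data, the least-squares model standing in for "any ML algorithm" and all function names below are assumptions for illustration; this is not the code from the explained.ai article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption): column 0 drives the target, column 1 is pure noise
n = 500
X = rng.normal(size=(n, 2))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=n)
X_train, X_val = X[:400], X[400:]
y_train, y_val = y[:400], y[400:]


def fit(X, y):
    """Least-squares linear model standing in for any estimator."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def r2(coef, X, y):
    """Coefficient of determination on held-out data."""
    residual = y - X @ coef
    return 1.0 - np.mean(residual ** 2) / np.var(y)


def permutation_importance(coef, X, y, col):
    """Score drop after shuffling one column; the model is NOT re-trained."""
    X_perm = X.copy()
    X_perm[:, col] = rng.permutation(X_perm[:, col])
    return r2(coef, X, y) - r2(coef, X_perm, y)


def drop_column_importance(X_tr, y_tr, X_va, y_va, col):
    """Score drop after removing one column and re-training from scratch."""
    keep = [c for c in range(X_tr.shape[1]) if c != col]
    full = r2(fit(X_tr, y_tr), X_va, y_va)
    reduced = r2(fit(X_tr[:, keep], y_tr), X_va[:, keep], y_va)
    return full - reduced


coef = fit(X_train, y_train)
perm = [permutation_importance(coef, X_val, y_val, c) for c in range(2)]
drop = [drop_column_importance(X_train, y_train, X_val, y_val, c) for c in range(2)]
```

Both lists should rank column 0 far above column 1. Permutation importance needs one fitted model per dataset, while drop-column importance needs one re-fit per feature, which is the cost difference noted above.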

Why it is useful / caveats

(0) If you really care about understanding your domain - feature importances are a must have

(1) All of this works only for powerful models

(2) Landmines include - correlated or duplicate variables, data normalization

Correlated variables

(0) For RF - correlated variables share permutation importance roughly proportionally to their correlation

(1) Drop column importance can behave unpredictably

I personally like engineering different kinds of features and doing ablation tests:

(0) Among feature sets, sharing similar purpose

(1) Within feature sets

#data_science

snakers4 (Alexander), June 28, 15:22

Playing with PyTorch 0.4

It was released some time ago

If you are not aware - this is the best summary

pytorch.org/2018/04/22/0_4_0-migration-guide.html

My first-hand experiences

- Multi-GPU support works strangely

- If you just launch your 0.3 code it will work on 0.4 with warnings - not a really breaking change

- All the new features are really cool, useful and make using PyTorch even more delightful

- I especially liked how they added context managers and cleaned up the device mess
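
The context managers and device handling mentioned in the last point look roughly like this in PyTorch 0.4 and later. A minimal sketch: the tiny linear model and the tensor shapes are placeholders, not the author's code.

```python
import torch

# Pick the device once and pass it around instead of scattering .cuda() calls
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(4, 2).to(device)
x = torch.randn(8, 4, device=device)   # tensors can be created directly on the device

# Context manager that disables gradient tracking (replaces volatile=True)
with torch.no_grad():
    y_eval = model(x)

# Outside the context manager autograd tracks operations as usual
y_train = model(x)
```

The same script then runs unchanged on a machine with or without a GPU.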

#deep_learning

snakers4 (Alexander), June 28, 11:18

DL Framework choice - 2018

If you are still new to DL / DS / ML and have not yet chosen your framework, consider reading this before proceeding

- deepsense.ai/keras-or-pytorch/

#deep_learning

snakers4 (Alexander), June 28, 07:59

Forwarded from Hacker News:

Python 3.7 released (Score: 100+ in 2 hours)

Link: readhacker.news/s/3MawZ

Comments: readhacker.news/c/3MawZ

Python Release Python 3.7.0

The official home of the Python Programming Language


snakers4 (Alexander), June 28, 07:43

2018 DS/ML digest 13

Blog posts / articles:

(0) Google notes on CNN generalization - goo.gl/XS4KAw

(1) Google on teaching robots in a virtual environment and then transferring the models to reality - goo.gl/aAYCqE

(2) Google's object tracking via image colorization - goo.gl/xchvBQ

(3) Interesting articles about VAEs:

- A small intro into VAEs habr.com/company/otus/blog/358946/

- A small intuitive intro (super super cool and intuitive)

towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf

- KL divergence explained

www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained

- A more formal write-up arxiv.org/abs/1606.05908

- In (RU) habr.com/company/otus/blog/358946/

- Converting a FC layer into a conv layer cs231n.github.io/convolutional-networks/#convert

- A post by Fchollet blog.keras.io/building-autoencoders-in-keras.html

A good in-depth write-up on object detection:

- machinethink.net/blog/object-detection/

- finally a decent explanation of YOLO parametrization machinethink.net/images/object-detection/[email protected]

- best comparison of YOLO and SSD ever - machinethink.net/images/object-detection/[email protected]

Papers with interesting abstracts (just good to know such things exist)

- Low-bit CNNs - ai.intel.com/nervana/wp-content/uploads/sites/53/2018/06/ELQ_CameraReady_CVPR2018.pdf

- Automated Meta ML - arxiv.org/abs/1806.06927

- Idea - use ResNet blocks for boosting - arxiv.org/abs/1706.04964

- 2D-discrete-Fourier transform (2D-DFT) to encode rotational invariance in neural networks - arxiv.org/abs/1805.12301

- Smallify the CNNs - arxiv.org/abs/1806.03723

- BLEU review as a metric - conclusion - it is good on average to measure MT performance - www.mitpressjournals.org/doi/abs/10.1162/COLI_a_00322

"New" ideas in SemSeg:

- UNET + conditional VAE arxiv.org/abs/1806.05034

- Dilated convolutions for large satellite images arxiv.org/abs/1709.00179 - looks like this works only if you have high resolution with small objects

#digest

#deep_learning

How Can Neural Network Similarity Help Us Understand Training and Generalization?

Posted by Maithra Raghu, Google Brain Team and Ari S. Morcos, DeepMind In order to solve tasks, deep neural networks (DNNs) progressively...


snakers4 (Alexander), June 26, 19:25

www.youtube.com/watch?v=Te0L5_u_wIg

This AI Detects DeepFakes | Two Minute Papers #259
The paper "FaceForensics: A Large-scale Video Dataset for Forgery Detection in Human Faces " is available here: niessnerlab.org/projects/roessler2018f...

snakers4 (Alexander), June 26, 07:02

If someone needs a dataset, Kaggle launched ImageNet object detection

- www.kaggle.com/c/imagenet-object-localization-challenge#description

There is an open images dataset, which I guess is bigger though

#deep_learning

ImageNet Object Localization Challenge

Identify the objects in images


snakers4 (Alexander), June 25, 15:57

Forwarded from Just links:

blog.openai.com/openai-five/

OpenAI Five

Our team of five neural networks, OpenAI Five, has started to defeat amateur human teams at Dota 2.


snakers4 (Alexander), June 25, 10:53

A subscriber sent a really decent CS university scientific ranking

csrankings.org/#/index?all&worldpu

Useful, if you want to apply for CS/ML based Ph.D. there

#deep_learning

Transformer in PyTorch

Looks like somebody implemented OpenAI's recent transformer fine-tuning in PyTorch

github.com/huggingface/pytorch-openai-transformer-lm

Nice!

#nlp

#deep_learning

huggingface/pytorch-openai-transformer-lm

pytorch-openai-transformer-lm - A PyTorch implementation of OpenAI's finetuned transformer language model with a script to import the weights pre-trained by OpenAI


snakers4 (Alexander), June 23, 16:46

youtu.be/Nq2xvsVojVo

Better Video Impersonations with AI | Two Minute Papers #258
The paper "Deep Video Portraits" is available here: gvv.mpi-inf.mpg.de/projects/DeepVideoPortraits/ Pick up cool perks on our Patreon page: ww...

snakers4 (Alexander), June 23, 12:10

Interesting links about the Internet

- Ben Evans' digest - goo.gl/t9zG4y

- China plans to track cars - goo.gl/jeroFW

- Ben Evans - content is not king anymore - distribution / eco-system are goo.gl/ms2tQd

- Google opens AI center in Ghana - goo.gl/PRHBjq

- (RU) A funny case on censorship in Russia - funny article deleted from habr - sohabr.net/habr/post/414595/

-- It kind of clearly shows that you cannot safely post anything to habr

- India + WhatsApp + lynch mobs - goo.gl/tSBUCp

- Tor foundation about web-tracking and Facebook - goo.gl/H9DSuL

- Docker image jacking for crypto-mining - goo.gl/KrLLuQ

- Ethereum - 75% transactions automated bots - goo.gl/Q9BSNL

- (RU) - analyzing fake elections in Russia - 3-10M votes are fake - habr.com/post/358790/

#internet

2018 DS/ML digest 12

As usual, this is whatever I found really interesting / worth reading.

Implementations / papers / ideas

(0)

You can count bees well with UNet - matpalm.com/blog/counting_bees/

(1)

A really super cool idea - use affine transformations in 3D to stack augmentations on the level of transformation matrices

(3D augs are costly)

- gist.github.com/ematvey/5ca7df5d37c2f6a674390d42ef9e7d59

- both for rotation and scaling

- note a couple of things for easier understanding:

-- there is an offset in the transformations - because the coordinate origin is not in the "center"

-- zoom essentially scales unit vectors after applying the offset

- 3Blue1Brown videos about linear algebra - www.youtube.com/watch?v=fNk_zzaMoSs
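
The idea in item (1) can be sketched with 4x4 homogeneous matrices. This is an illustration of the general technique, not the code from the gist; the helper names and the 32x32x32 volume shape are assumptions.

```python
import numpy as np


def translation(offset):
    m = np.eye(4)
    m[:3, 3] = offset
    return m


def rotation_z(angle):
    c, s = np.cos(angle), np.sin(angle)
    m = np.eye(4)
    m[:2, :2] = [[c, -s], [s, c]]
    return m


def zoom(factor):
    m = np.eye(4)
    m[:3, :3] *= factor
    return m


def about_center(transform, shape):
    """Shift the origin to the volume center, transform, shift back."""
    center = (np.array(shape) - 1) / 2.0
    return translation(center) @ transform @ translation(-center)


shape = (32, 32, 32)
center = (np.array(shape) - 1) / 2.0

# Stack the augmentations by multiplying matrices; the volume itself
# would then be resampled only once, with the combined matrix.
combined = about_center(zoom(1.2) @ rotation_z(np.pi / 6), shape)

center_h = np.append(center, 1.0)             # homogeneous coordinates
corner_h = np.array([0.0, 0.0, 0.0, 1.0])
moved_center = combined @ center_h
moved_corner = combined @ corner_h
```

The center stays put while the corner is rotated and pushed 1.2x further out. When handing such a matrix to a resampler, check which direction it expects: `scipy.ndimage.affine_transform`, for one, takes the output-to-input mapping, i.e. the inverse.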

(2)

A top solution from Google's Landmark Challenge - goo.gl/pkZULZ

Essentially

- ensemble of features / skip connections from a CNN (ResNeXt)

- KNN

- use KNN + augment the extracted features by averaging with similar images

- query expansion (use the fact that different crops of the same landmark remain the same landmark)
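
A toy version of the KNN, feature-averaging and query-expansion steps is sketched below. It assumes L2-normalized descriptors and cosine similarity; the synthetic "landmarks", the function names and the choice of k are made up for the example and are not taken from the winning solution.

```python
import numpy as np

rng = np.random.default_rng(0)


def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def knn(queries, database, k):
    """Indices of the k most cosine-similar database rows for every query."""
    sims = queries @ database.T
    return np.argsort(-sims, axis=1)[:, :k]


def augment_database(database, k):
    """Replace every descriptor by the mean of itself and its k nearest neighbours."""
    idx = knn(database, database, k + 1)      # the top hit is the row itself
    return l2_normalize(database[idx].mean(axis=1))


def expand_query(query, database, k):
    """Query expansion: average the query with its top-k hits before searching again."""
    idx = knn(query[None, :], database, k)[0]
    stacked = np.vstack([query[None, :], database[idx]])
    return l2_normalize(stacked.mean(axis=0))


# Toy descriptors (assumption): 2 "landmarks", 20 noisy views of each
centers = np.eye(16)[:2]
labels = np.repeat([0, 1], 20)
database = l2_normalize(centers[labels] + 0.1 * rng.normal(size=(40, 16)))

db_aug = augment_database(database, k=5)
query = l2_normalize(centers[0] + 0.1 * rng.normal(size=16))
query_exp = expand_query(query, db_aug, k=5)
top_hits = knn(query_exp[None, :], db_aug, k=5)[0]
```

Averaging with neighbours pulls descriptors of the same landmark closer together, so both the database and the query become less sensitive to the noise of a single crop.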

(3)

(RU) A super cool series about interesting clustering algorithms

- Affinity propagation

-- habr.com/post/321216/

-- www.icmla-conference.org/icmla07/FreyDueckScience07.pdf

- DBSCAN habrahabr.ru/post/322034/

- (spoiler - in practice use awesome HDBSCAN library)

(4)

Brief review of image super-resolution techniques

- habr.com/post/359016/

- In a nutshell - try in this order: FCN CNNs, auto-encoders with skip connections, or GANs

(5)

SOTA NLP by open-ai

blog.openai.com/language-unsupervised/

Key ideas

- Train a transformer language model on a large corpus in an unsupervised way

- Fine-tune on a smaller task

- Profit

Caveats

- "Our approach requires an expensive pre-training step - 1 month on 8 GPUs" (probably this should be discounted somewhat)

- TF and unreadable enterprise code

(6)

One more claimed SOTA word embedding set

allennlp.org/elmo

(7)

A cool github page by Sebastian Ruder to track major NLP tasks

github.com/sebastianruder/NLP-progress

Visualizations

(0)

Amazing visual explanations of how decision trees work

- www.r2d3.us/visual-intro-to-machine-learning-part-2/

- it explains visually how overfitting occurs in decision tree models

(1)

CIFAR T-SNE can be done in real-time on the GPU + tensorflow.js integration

- Blog goo.gl/Pk5Lq3

- Website goo.gl/1vpeFf

- Arxiv - arxiv.org/abs/1802.03680

- Demo - nicola17.github.io/tfjs-tsne-demo/

(2) Why people fail to use d3.js - goo.gl/hSt5dL

Datasets

(0) Nice idea - use available tools and videos to collect datasets

- goo.gl/HULsyH

- goo.gl/7AfRZZ

#digest