Spark in me - Internet, data science, math, deep learning, philo

snakers4 @ telegram, 1369 members, 1636 posts since 2016

All this - lost like tears in rain.

Data science, deep learning, sometimes a bit of philosophy and math. No bs.

Our website
- spark-in.me
Our chat
- goo.gl/WRm93d
DS courses review
- goo.gl/5VGU5A
- goo.gl/YzVUKf

Posts by tag «data_science»:

snakers4 (Alexander), November 15, 08:09

DS/ML digest 29

spark-in.me/post/2018_ds_ml_digest_29

#digest

#deep_learning

#data_science

2018 DS/ML digest 29

2018 DS/ML digest 29 Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), November 10, 18:32

Towards Data Science

Our article was accepted to their publication:

- towardsdatascience.com/building-client-routing-semantic-search-in-the-wild-14db04687c7e

Also, once you have published there, you can publish your work on TDS on a recurring basis =)

I doubt that this will be properly distributed to all 130k of their subs, but nevertheless this is a milestone.

#data_science

Building client routing / semantic search in the wild

A comparison of novel NLP techniques within an applied business setting


snakers4 (Alexander), November 06, 13:45

DS/ML digest 28

Google open sources pre-trained BERT ... with 102 languages ...

spark-in.me/post/2018_ds_ml_digest_28

#digest

#deep_learning

#data_science

2018 DS/ML digest 28



snakers4 (Alexander), November 03, 09:40

Building client routing / semantic search and clustering arbitrary external corpuses at Profi.ru

A brief executive summary about what we achieved at Profi.ru.

If you have similar experience or have anything similar to share - please do not hesitate to contact me.

Also we are planning to extend this article into a small series, if it gains momentum. So please like / share the article if you like it.

spark-in.me/post/profi-ru-semantic-search-project

#nlp

#data_science

#deep_learning

Building client routing / semantic search and clustering arbitrary external corpuses at Profi.ru



snakers4 (Alexander), October 23, 06:28

DS/ML digest 27

NLP in the focus again!

spark-in.me/post/2018_ds_ml_digest_27

Also your humble servant learned how to do proper NMT =)

#digest

#deep_learning

#data_science

2018 DS/ML digest 27



snakers4 (Alexander), October 22, 05:43

Amazing articles about image hashing

Also a python library

- Library github.com/JohannesBuchner/imagehash

- Articles:

fullstackml.com/wavelet-image-hash-in-python-3504fdd282b5

www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html

www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html
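
For intuition, here is a minimal pure-Python sketch of the average-hash idea. It is not how the imagehash library is implemented, just an illustration under simple assumptions: the image is already shrunk to a tiny grayscale grid, the pixel values below are made up, and `average_hash` / `hamming_distance` are hypothetical helper names.

```python
def average_hash(pixels):
    """Return a bit string: 1 where a pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return "".join("1" if value > mean else "0" for value in flat)


def hamming_distance(hash_a, hash_b):
    """Count positions where two equal-length hashes differ."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(hash_a, hash_b))


# Made-up 4x4 grayscale "images" (0-255); a real pipeline would first
# resize an actual image and convert it to grayscale.
original = [
    [200, 200, 10, 10],
    [200, 200, 10, 10],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]
# The same picture, slightly brightened.
brightened = [[min(255, value + 20) for value in row] for row in original]
# A different picture: left half bright, right half dark.
different = [[200, 200, 10, 10] for _ in range(4)]

hash_original = average_hash(original)
hash_brightened = average_hash(brightened)
hash_different = average_hash(different)
```

Similar images should end up with a small Hamming distance between their hashes, while different images should end up far apart.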

#data_science

#computer_vision

JohannesBuchner/imagehash

A Python Perceptual Image Hashing Module. Contribute to JohannesBuchner/imagehash development by creating an account on GitHub.


Text iterators in PyTorch

Looks like PyTorch has some handy data-processing / loading tools for text models - torchtext.readthedocs.io.

It is explained here - bastings.github.io/annotated_encoder_decoder/ - how to use them with pack_padded_sequence and pad_packed_sequence to boost PyTorch NLP models substantially.

#nlp

#deep_learning

snakers4 (Alexander), October 15, 16:56

An Open source alternative to Mendeley

Looks like Zotero is also cross-platform and open-source.

Also you can import the whole Mendeley library with 1 button push:

www.zotero.org/support/kb/mendeley_import

#data_science

kb:mendeley import [Zotero Documentation]

Zotero is a free, easy-to-use tool to help you collect, organize, cite, and share research.


www.youtube.com/watch?v=KJAnSyB6mME

PyTorch developer conference part 1
Sessions on Applied Research in Industry & Developer Education. Talks from @karpathy (@Tesla), @ctnzr (@nvidia), @ftzo (Pyro/@UberEng), @MarkNeumannnn (@alle...

snakers4 (Alexander), October 15, 09:33

DS/ML digest 26

More interesting NLP papers / material ...

spark-in.me/post/2018_ds_ml_digest_26

#digest

#deep_learning

#data_science

2018 DS/ML digest 26



snakers4 (Alexander), October 10, 18:31

Another set of links for common crawl for NLP

Looks like we were not the first, ofc.

Below are some projects dedicated to NLP corpus retrieval at scale:

- Java + license detection + boilerplate removal: zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/DKPro%20C4Corpus/org.dkpro.c4corpus$dkpro-c4corpus-doc/doclinks/1/

- Prepared deduplicated CC text archives data.statmt.org/ngrams/deduped/

- Google group

groups.google.com/forum/#!topic/common-crawl/6F-yXsC35xM

Wow!

#nlp

Downloading 200GB files in literally hours

(1) Order 500 Mbit/s Internet connection from your ISP

(2) Use aria2 - aria2.github.io/ with -x (several connections per server)

(3) Profit
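
A small sketch of step (2), wrapped in Python so it can be scripted. The URL is a placeholder and `build_aria2_command` is a hypothetical helper; the flags (`-x`, `-s`, `-d`) are the standard aria2c ones, but the exact values are my assumption, not something from the original recipe.

```python
import shlex


def build_aria2_command(url, connections=16, output_dir="."):
    """Build an aria2c command line that downloads one file in parallel."""
    if not 1 <= connections <= 16:
        raise ValueError("aria2 accepts 1-16 connections per server")
    return [
        "aria2c",
        "-x", str(connections),  # max connections per server
        "-s", str(connections),  # split the download into this many pieces
        "-d", output_dir,        # directory to store the file in
        url,
    ]


# Placeholder URL, only for illustration.
command = build_aria2_command("https://example.com/big-archive.tar.gz")
print(shlex.join(command))
```

To actually start the download, pass the list to `subprocess.run(command, check=True)` on a machine with aria2 installed.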

#data_science

snakers4 (Alexander), October 08, 06:04

Going from millions of points of data to billions on a single machine

In my experience pandas works fine with tables up to 50-100m rows.

Ofc plain indexing/caching (i.e. pre-process all of your data in chunks and index it somehow) and / or clever map/reduce-style optimizations work.

But sometimes it is just good to know that such things exist:

- vaex.io/ for large data-frames + some nice visualizations;

- Datashader.org for large visualizations;

- Also, I guess you can use Dask for these purposes: jakevdp.github.io/blog/2015/08/14/out-of-core-dataframes-in-python/;
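
The "process in chunks, then merge" idea mentioned above can be sketched with the standard library alone. This is only an illustration: the data is made up, and `iter_chunks` / `count_by_key` are hypothetical helper names. With pandas the same pattern is usually written around its chunked CSV reader.

```python
import csv
import io
from collections import Counter
from itertools import islice


def iter_chunks(reader, chunk_size):
    """Yield lists of at most chunk_size rows from a csv reader."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def count_by_key(csv_file, key, chunk_size=2):
    """Map: count keys inside each chunk. Reduce: merge the partial counts."""
    reader = csv.DictReader(csv_file)
    total = Counter()
    for chunk in iter_chunks(reader, chunk_size):
        partial = Counter(row[key] for row in chunk)  # map step
        total.update(partial)  # reduce step
    return total


# Made-up data; in practice csv_file would be a large file opened from disk.
sample = io.StringIO("user,amount\na,1\nb,2\na,3\nc,4\na,5\n")
counts = count_by_key(sample, "user")
```

Only one chunk is held in memory at a time, so the table itself can be much larger than RAM as long as the aggregated result is small.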

#data_science

Python3 nvidia driver bindings in glances

They used to have only python2 ones.

If you update your drivers and glances, you will get a nice GPU memory / load indicator within glances.

So convenient.

#linux

snakers4 (Alexander), October 08, 05:38

Wiki graph database

Just found out that there is also a graph database built from Wikipedia data (DBpedia)

- wiki.dbpedia.org/OnlineAccess

- wiki.dbpedia.org/downloads-2016-10#p10608-2

May be useful for research in the future.

Seems very theoretical and probably works only for English, but it is best to keep such things on the radar.

Example queries:

People who were born in Berlin before 1900

German musicians with German and English descriptions

Musicians who were born in Berlin

Games
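
A hedged sketch of how the first example query could look in SPARQL, wrapped in Python. The endpoint URL, the property names (`dbo:birthPlace`, `dbo:birthDate`) and the `build_query_url` helper are my assumptions and are not taken from the pages linked above, so check them against the DBpedia docs before relying on them.

```python
from urllib.parse import urlencode

# Assumed public SPARQL endpoint of DBpedia.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"

# "People who were born in Berlin before 1900" (assumed property names).
BORN_IN_BERLIN_BEFORE_1900 = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?name ?birth WHERE {
    ?person dbo:birthPlace dbr:Berlin .
    ?person dbo:birthDate ?birth .
    ?person foaf:name ?name .
    FILTER (?birth < "1900-01-01"^^xsd:date)
}
ORDER BY ?name
LIMIT 20
"""


def build_query_url(query, endpoint=DBPEDIA_ENDPOINT):
    """Return a GET URL that asks the endpoint for JSON results."""
    params = {
        "query": query.strip(),
        "format": "application/sparql-results+json",
    }
    return endpoint + "?" + urlencode(params)


url = build_query_url(BORN_IN_BERLIN_BEFORE_1900)
```

The sketch only builds the request URL; fetching it (e.g. with `urllib.request.urlopen`) is left out on purpose.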

#data_science

snakers4 (Alexander), October 03, 18:16

Parsing Wikipedia in 4 plain commands in Python

Wrote a small take on using Wikipedia as corpus for NLP.

spark-in.me/post/parsing-wikipedia-in-four-commands-for-nlp

medium.com/@aveysov/parsing-wikipedia-in-4-simple-commands-for-plain-nlp-corpus-retrieval-eee66b3ba3ee

Please like / share / repost the article =)

#nlp

#data_science

Parsing Wikipedia in 4 simple commands for plain NLP corpus retrieval



snakers4 (Alexander), October 02, 09:59

PyTorch 1.0 PRE-RELEASE

github.com/pytorch/pytorch/releases/tag/v1.0rc0

Looks like it features tools to deploy PyTorch models...

#data_science

pytorch/pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration - pytorch/pytorch


snakers4 (Alexander), September 29, 10:53

Andrew Ng book

Looks like its draft is finished.

It describes in plain terms how to build ML pipelines:

- drive.google.com/open?id=1aHVZ9pcsGtIcgarZxV-Qfkb0JEvtSLDK

#data_science

snakers4 (Alexander), September 28, 11:12

DS/ML digest 25

spark-in.me/post/2018_ds_ml_digest_25

#digest

#deep_learning

#data_science

2018 DS/ML digest 25



snakers4 (Alexander), September 28, 05:40

New fast.ai course

Mainly decision tree practice.

A lot about decision tree visualization

- www.fast.ai/2018/09/26/ml-launch/

I personally would check out the visualization bits.

At least it looks like they are not pushing their crappy library =)

The problem with any such visualizations is that they work only for toy datasets.

Drop / shuffle method seems to be more robust.
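
A minimal sketch of the shuffle variant (often called permutation importance), using only the standard library. The toy "model" and data are made up, and `permutation_importance` is a hypothetical helper, not the API of any particular library.

```python
import random


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, targets, n_features, seed=0):
    """Error increase after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = mean_absolute_error(targets, [predict(row) for row in rows])
    importances = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        rng.shuffle(column)  # break the link between feature j and the target
        shuffled_rows = [
            row[:j] + [value] + row[j + 1:] for row, value in zip(rows, column)
        ]
        score = mean_absolute_error(
            targets, [predict(row) for row in shuffled_rows]
        )
        importances.append(score - baseline)
    return importances


# Toy setup: the "model" uses only feature 0 and ignores feature 1.
def predict(row):
    return 3 * row[0]


rows = [[i, (i * 7) % 5] for i in range(20)]
targets = [3 * row[0] for row in rows]

importances = permutation_importance(predict, rows, targets, n_features=2)
```

Shuffling a feature the model relies on increases the error, while shuffling an ignored feature changes nothing. The drop variant does the same thing but retrains the model without the column instead of shuffling it.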

#data_science

snakers4 (Alexander), September 24, 12:53

(RU) most popular ML algorithms explained in simple terms

vas3k.ru/blog/machine_learning/

#data_science

Machine learning for people

Figuring it out in plain words


snakers4 (Alexander), September 20, 16:06

DS/ML digest 24

Key topics of this one:

- New method to calculate phrase/n-gram/sentence embeddings for rare and OOV words;

- So many releases from Google;

spark-in.me/post/2018_ds_ml_digest_24

If you like our digests, you can support the channel via:

- Sharing / reposting;

- Giving an article a decent comment / a thumbs-up;

- Buying me a coffee (links on the digest);

#digest

#deep_learning

#data_science

2018 DS/ML digest 24



snakers4 (Alexander), September 06, 05:48

DS/ML digest 23

The key topic of this one - this is insanity:

- vid2vid

- unsupervised NMT

spark-in.me/post/2018_ds_ml_digest_23

If you like our digests, you can support the channel via:

- Sharing / reposting;

- Giving an article a decent comment / a thumbs-up;

- Buying me a coffee (links on the digest);

Let's spread the right DS/ML ideas together.

#digest

#deep_learning

#data_science

2018 DS/ML digest 23



snakers4 (Alexander), September 05, 06:40

MySQL - replacing window functions

Older versions of MySQL (before 8.0, which added window functions) do not have all the goodness you can find in PostgreSQL. Ofc you can do plain session matching in Python, but sometimes you just need to do it in plain SQL.

In Postgres you usually use window functions for this purpose if you need PLAIN SQL (ofc there are stored procedures / views / mat views etc).

In MySQL it can be elegantly solved like this:

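SET @session_number = 0, @last_uid = '0', @current_uid = '0', @dif = 0;

SELECT
    t1.some_field,
    t2.some_field,
    ...
    -- remember the uid of the previous row, then read the current one
    @last_uid := @current_uid,
    @current_uid := t1.uid,
    -- minutes between two consecutive events
    @dif := TIMESTAMPDIFF(MINUTE, t2.session_ts, t1.session_ts),
    -- same user: start a new session after a gap of more than 30 minutes;
    -- new user: reset the counter
    IF(@current_uid = @last_uid,
       IF(@dif > 30, @session_number := @session_number + 1, @session_number),
       @session_number := 0) AS session
FROM
    table1 t1
    JOIN table2 t2 ON t1.id = t2.id + 1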
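
For comparison, the same session logic as a minimal Python sketch. The event data is made up and `assign_sessions` is a hypothetical helper. It follows the rule used in the query: a gap of more than 30 minutes starts a new session, and the counter resets for each new user.

```python
from datetime import datetime, timedelta


def assign_sessions(events, gap=timedelta(minutes=30)):
    """events: iterable of (uid, timestamp) pairs.

    Returns a list of (uid, timestamp, session_number) sorted by uid and time.
    """
    result = []
    last_uid = None
    last_ts = None
    session = 0
    for uid, ts in sorted(events):
        if uid != last_uid:
            session = 0  # new user: reset the counter
        elif ts - last_ts > gap:
            session += 1  # same user, long pause: new session
        result.append((uid, ts, session))
        last_uid, last_ts = uid, ts
    return result


# Made-up events for two users.
events = [
    ("u1", datetime(2018, 9, 5, 10, 0)),
    ("u1", datetime(2018, 9, 5, 10, 10)),
    ("u1", datetime(2018, 9, 5, 11, 0)),
    ("u2", datetime(2018, 9, 5, 10, 5)),
]
sessions = [session for _, _, session in assign_sessions(events)]
```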

#data_science

snakers4 (Alexander), August 31, 13:59

DS/ML digest 22

spark-in.me/post/2018_ds_ml_digest_22

#digest

#deep_learning

#data_science

2018 DS/ML digest 22



snakers4 (Alexander), August 29, 05:45

Venn diagrams in python

Compare sets as easy as:

# Import the libraries
import matplotlib.pyplot as plt
from matplotlib_venn import venn3

# s1, s2, s3 are the Python sets you want to compare
# Make the diagram
plt.figure(figsize=(10, 10))
venn3(subsets=[s1, s2, s3], set_labels=['synonyms', 'our_tree', 'add_syns'])
plt.show()

Very simple and useful.
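
If you only need the numbers behind such a diagram, the seven region sizes can be computed with plain set operations. A small sketch with made-up sets; `venn3_region_sizes` is a hypothetical helper, and the region order is the one I believe matplotlib_venn uses for explicit subset sizes.

```python
def venn3_region_sizes(a, b, c):
    """Sizes of the 7 regions in the order
    (Abc, aBc, ABc, abC, AbC, aBC, ABC), upper case = inside that set."""
    return (
        len(a - b - c),
        len(b - a - c),
        len((a & b) - c),
        len(c - a - b),
        len((a & c) - b),
        len((b & c) - a),
        len(a & b & c),
    )


# Made-up example sets.
s1 = {"car", "auto", "vehicle"}
s2 = {"auto", "vehicle", "truck"}
s3 = {"vehicle", "bike"}

sizes = venn3_region_sizes(s1, s2, s3)
```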

#data_science

snakers4 (Alexander), August 28, 14:42

Garbage collection and encapsulation in python

Usually this is not an issue.

But when doing a batch-job / loop / something with 100k+ objects it suddenly becomes an issue.

It also does not help that the environment where you test your code and the actual runtime environment are different.

The solution - simply encapsulate everything you can (e.g. in functions), so that garbage is collected automatically once objects go out of scope.
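
A small sketch of what this means in practice, assuming CPython (where reference counting frees an object as soon as the last reference disappears). `Payload` and `process_item` are made-up names; the weak references are there only to check that the big objects are really gone.

```python
import gc
import weakref


class Payload:
    """Stand-in for a large temporary object."""

    def __init__(self, size):
        self.data = [0] * size


def process_item(index):
    # Everything heavy lives only inside this function.
    payload = Payload(1000)
    tracker = weakref.ref(payload)  # does not keep the payload alive
    result = len(payload.data) + index
    return result, tracker


results = []
trackers = []
for index in range(3):
    result, tracker = process_item(index)
    results.append(result)
    trackers.append(tracker)

gc.collect()  # not strictly needed in CPython, just to be explicit
alive = [tracker() is not None for tracker in trackers]
```

Only the small `result` values survive the loop; the payloads are released as soon as each call returns, instead of piling up in the enclosing scope.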

#data_science

A case against Kaggle

If you thought that Kaggle is the home of Data Science - think again.

www.kaggle.com/c/airbus-ship-detection/discussion/64355

This is official - they do not know what the hell they are doing.

There have been several appalling cases already, but this takes the prize.

Following this thread, I wrote a small petition to Kaggle:

www.kaggle.com/c/airbus-ship-detection/discussion/64393

I doubt that they will hear, but why not.

#data_science

snakers4 (Alexander), August 24, 07:04

Played with FAISS

Played with FAISS on the GPU a bit. Their docs also cover simplistic cases really well, but more sophisticated cases are better inferred from examples, because their Python docs are not really intended for heavy use.

Anyway, I managed to build a KNN graph with FAISS on the GPU for 10m points in 2-3 hours.

It does the following:

- KNN graph;

- PCA, K-Means;

- Queries;

- VERY sophisticated indexing with many options;

It supports:

- GPU;

- Multi-GPU (I had to use env. variables to limit GPU list, because there is no option for python);

Also their docs are awesome for such a low-level project.

github.com/facebookresearch/faiss/wiki/Faiss-building-blocks:-clustering,-PCA,-quantization
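
To make it clear what a KNN graph is, here is a brute-force pure-Python sketch. It does not use FAISS at all and is O(n^2), so it only works for toy data; FAISS gets the same kind of result with optimized (GPU) indexes. The points are made up and `knn_graph` is a hypothetical helper.

```python
import heapq
import math


def knn_graph(points, k):
    """For every point, return the indices of its k nearest neighbours."""
    graph = []
    for i, point in enumerate(points):
        candidates = (
            (math.dist(point, other), j)
            for j, other in enumerate(points)
            if j != i  # a point is not its own neighbour
        )
        nearest = heapq.nsmallest(k, candidates)
        graph.append([j for _, j in nearest])
    return graph


# Made-up 2D points: two loose clusters.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (0.5, 0.5)]
graph = knn_graph(points, k=2)
```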

#data_science

#similarity

facebookresearch/faiss

A library for efficient similarity search and clustering of dense vectors. - facebookresearch/faiss


Playing with Atom and Hydrogen

TLDR - it delivers on its promise to turn the Atom editor into something like an interactive notebook, but its auto-completion options are very scarce. You can also connect to a running ipython kernel as well as to a docker container.

nteract.gitbooks.io/hydrogen/docs/Installation.html

blog.nteract.io/hydrogen-interactive-computing-in-atom-89d291bcc4dd

But it is no match for either a normal notebook or a python IDE.

#data_science

snakers4 (Alexander), August 12, 11:15

2018 DS/ML digest 20

spark-in.me/post/2018_ds_ml_digest_20

#deep_learning

#digest

#data_science

2018 DS/ML digest 20



snakers4 (Alexander), August 12, 05:21

New publication format

Decided to try a new, more streamlined, fast and automated approach to publishing slightly longer posts.

(0) Write a note in md format

(1) Transform to HTML automatically => post on spark-in.me

(2) Repost to medium via automated import

(3) Repost to Reddit / Habr.com (if they start accepting English articles) via md

...

(4) Profit - 4 publications at the cost and time of one)

Decided to start with porting the 2 latest articles to Medium

- medium.com/@aveysov/playing-with-crowd-ai-mapping-challenge-or-how-to-improve-your-cnn-performance-with-109684f95dcd

- medium.com/@aveysov/solving-class-imbalance-on-google-open-images-cf9e890bb146

Please tell me what you think in the comments!

Also md can be transformed to almost any format using pandoc)

#data_science

Playing with Crowd-AI mapping challenge — or how to improve your CNN performance with self-supervised techniques

Originally published at spark-in.me on July 15, 2018.


snakers4 (Alexander), August 07, 04:04

UMAP

github.com/lmcinnes/umap

Wrote a couple of posts about UMAP before.

Since last time, they extended their docs and published a paper:

- How it works umap-learn.readthedocs.io/en/latest/how_umap_works.html (topology) - I kind of understand 50% of this

- Paper arxiv.org/abs/1802.03426 (have not read yet)

What I really like about the UMAP author - he answers questions on the forums / invested a lot of time into explaining how UMAP and HDBSCAN work / built stellar docs and is overall a nice guy.

What I really like in practice - this combination works really well:

- PCA => UMAP => HDBSCAN

#data_science

lmcinnes/umap

Uniform Manifold Approximation and Projection. Contribute to lmcinnes/umap development by creating an account on GitHub.


snakers4 (Alexander), July 31, 04:42

Airbus ship detection challenge

On the surface this looks like a challenging and interesting competition:

- www.kaggle.com/c/airbus-ship-detection

- Train / test sets - 14G / 12G

- Downside - Kaggle and very fragile metric

- Upside - a separate significant prize for fast algorithms!

- 768x768 images seem reasonable

#deep_learning

#data_science

Airbus Ship Detection Challenge

Find ships on satellite images as quickly as possible


snakers4 (Alexander), July 23, 06:26

My post on open images stage 1

For posterity

Please comment

spark-in.me/post/playing-with-google-open-images

#deep_learning

#data_science

Solving class imbalance on Google open images

In this article I propose an approach to solve a severe class imbalance on Google open images


snakers4 (Alexander), July 23, 05:15

2018 DS/ML digest 18

Highlights of the week

(0) RL flaws

thegradient.pub/why-rl-is-flawed/

thegradient.pub/how-to-fix-rl/

(1) An intro to AUTO-ML

www.fast.ai/2018/07/16/auto-ml2/

(2) Overview of advances in ML in last 12 months

www.stateof.ai/

Market / applied stuff / papers

(0) New Nvidia Jetson released

www.phoronix.com/scan.php?page=news_item&px=NVIDIA-Jetson-Xavier-Dev-Kit

(1) Medical CV project in Russia - 90% is data gathering

cv-blog.ru/?p=217

(2) Differentiable architecture search

arxiv.org/pdf/1806.09055.pdf

-- 1800 GPU days of reinforcement learning (RL) (Zoph et al., 2017)

-- 3150 GPU days of evolution (Real et al., 2018)

-- 4 GPU days to achieve SOTA on CIFAR => transferable to ImageNet with 26.9% top-1 error

(3) Some basic thoughts about hyper-param tuning

engineering.taboola.com/hitchhikers-guide-hyperparameter-tuning/

(4) FB extending fact checking to mark similar articles

www.poynter.org/news/rome-facebook-announces-new-strategies-combat-misinformation

(5) Architecture behind Alexa choosing skills goo.gl/dWmXZf

- Char-level RNN + Word-level RNN

- Shared encoder, but attention is personalized

(6) An overview of contemporary NLP techniques

medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e

(7) RNNs in particle physics?

indico.cern.ch/event/722319/contributions/3001310/attachments/1661268/2661638/IML-Sequence.pdf?utm_campaign=Revue%20newsletter&utm_medium=Newsletter&utm_source=NLP%20News

(8) Google cloud provides PyTorch images

twitter.com/i/web/status/1016515749517582338

NLP

(0) Use embeddings for positions - no brainer

twitter.com/i/web/status/1018789622103633921

(1) Chatbots were a hype train - lol

medium.com/swlh/chatbots-were-the-next-big-thing-what-happened-5fc49dd6fa61

The vast majority of bots are built using decision-tree logic, where the bot’s canned response relies on spotting specific keywords in the user input.

Interesting links

(0) Reasons to use OpenStreetMap

www.openstreetmap.org/user/jbelien/diary/44356

(1) Google deploys its internet balloons

goo.gl/d5cv6U

(2) Amazing problem solving

nevalalee.wordpress.com/2015/11/27/the-hotel-bathroom-puzzle/

(3) Nice flame thread about whether CS / ML is science or just engineering, etc.

twitter.com/RandomlyWalking/status/1017899452378550273

#deep_learning

#data_science

#digest

RL’s foundational flaw

RL as classically formulated has lately accomplished many things - but that formulation is unlikely to tackle problems beyond games. Read on to see why!