Spark in me - Internet, data science, math, deep learning, philo

snakers4 @ telegram, 1356 members, 1614 posts since 2016

All this - lost like tears in rain.

Data science, deep learning, sometimes a bit of philosophy and math. No bs.

Our website
- spark-in.me
Our chat
- goo.gl/WRm93d
DS courses review
- goo.gl/5VGU5A
- goo.gl/YzVUKf

snakers4 (Alexander), October 19, 18:34

youtu.be/uEJ71VlUmMQ

Detecting Faces (Viola Jones Algorithm) - Computerphile
Deep learning is used for everything these days, but this face detection algorithm is so neat it's still in use today. Dr Mike Pound on the Viola/Jones algori...

snakers4 (Alexander), October 17, 14:41


twitter.com/fchollet/status/1052228463300493312

François Chollet

Here is the same dynamic RNN implemented in 4 different frameworks (TensorFlow/Keras, MXNet/Gluon, Chainer, PyTorch). Can you tell which is which?


I guess PyTorch is in the bottom left corner, but realistically the author of this snippet did a lot of import A as B

snakers4 (Alexander), October 16, 05:15

Google's super resolution zoom

Finally Google made something interesting

www.youtube.com/watch?v=z-ZJqd4eQrc

ai.googleblog.com/2018/10/see-better-and-further-with-super-res.html

Super Res Zoom

snakers4 (Alexander), October 16, 03:47

Mixed precision distributed training ImageNet example in PyTorch

github.com/NVIDIA/apex/blob/master/examples/imagenet/main.py

#deep_learning

NVIDIA/apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch - NVIDIA/apex


snakers4 (Alexander), October 15, 17:11

Looks like mixed precision training ... is solved in PyTorch

Lol - and I could not find it

github.com/NVIDIA/apex/tree/master/apex/amp

#deep_learning

NVIDIA/apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch - NVIDIA/apex


snakers4 (Alexander), October 15, 16:56

An open-source alternative to Mendeley

Looks like Zotero is also cross-platform and open-source.

Also you can import the whole Mendeley library with 1 button push:

www.zotero.org/support/kb/mendeley_import

#data_science

kb:mendeley import [Zotero Documentation]

Zotero is a free, easy-to-use tool to help you collect, organize, cite, and share research.


www.youtube.com/watch?v=KJAnSyB6mME

PyTorch developer conference part 1
Sessions on Applied Research in Industry & Developer Education. Talks from @karpathy (@Tesla), @ctnzr (@nvidia), @ftzo (Pyro/@UberEng), @MarkNeumannnn (@alle...

snakers4 (Alexander), October 15, 09:33

DS/ML digest 26

More interesting NLP papers / material ...

spark-in.me/post/2018_ds_ml_digest_26

#digest

#deep_learning

#data_science

2018 DS/ML digest 26

2018 DS/ML digest 26. Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), October 12, 19:11

www.youtube.com/watch?v=kBFMsY5ZP0o

This AI Senses Humans Through Walls
Pick up cool perks on our Patreon page: www.patreon.com/TwoMinutePapers Crypto and PayPal links are available below. Thank you very much for your g...

snakers4 (Alexander), October 10, 18:31

Another set of links for common crawl for NLP

Looks like we were not the first, ofc.

Below are some projects dedicated to NLP corpus retrieval on scale:

- Java + license detection + boilerplate removal: zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/DKPro%20C4Corpus/org.dkpro.c4corpus$dkpro-c4corpus-doc/doclinks/1/

- Prepared deduplicated CC text archives data.statmt.org/ngrams/deduped/

- Google group

groups.google.com/forum/#!topic/common-crawl/6F-yXsC35xM

Wow!

#nlp

Downloading 200GB files in literally hours

(1) Order 500 Mbit/s Internet connection from your ISP

(2) Use aria2 - aria2.github.io/ with -x

(3) Profit

#data_science

snakers4 (Alexander), October 08, 10:11

A small continuation of the crawling saga

2 takes on the Common Crawl

spark-in.me/post/parsing-common-crawl-in-four-simple-commands

spark-in.me/post/parsing-common-crawl-in-two-simple-commands

It turned out to be a bit tougher than expected

But doable

#nlp

Parsing Common Crawl in 4 plain scripts in python

Parsing Common Crawl in 4 plain scripts in python. Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), October 08, 06:04

Going from millions of points of data to billions on a single machine

In my experience pandas works fine with tables up to 50-100m rows.

Of course plain indexing/caching (i.e. pre-processing all of your data in chunks and indexing it somehow) and/or clever map/reduce-style optimizations also work.

But sometimes it is just good to know that such things exist:

- vaex.io/ for large data-frames + some nice visualizations;

- Datashader.org for large visualizations;

- You can probably also use Dask for these purposes: jakevdp.github.io/blog/2015/08/14/out-of-core-dataframes-in-python/;
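The chunk-and-aggregate idea can be sketched with the standard library alone; in pandas the same pattern is `pd.read_csv(..., chunksize=N)`. The column layout and chunk size below are made up purely for illustration:

```python
import csv
from collections import Counter

def aggregate_in_chunks(lines, chunk_size=100_000):
    """Map/reduce-style pass over a CSV stream: only one chunk and a
    small running aggregate are ever held in memory."""
    counts = Counter()
    chunk = []
    for row in csv.reader(lines):
        chunk.append(row)
        if len(chunk) == chunk_size:
            counts.update(r[0] for r in chunk)  # "map" step on the chunk
            chunk = []
    if chunk:                                   # flush the last partial chunk
        counts.update(r[0] for r in chunk)
    return counts

# works the same on a 100m-row file object or a small list of lines
counts = aggregate_in_chunks(["a,1", "b,2", "a,3"], chunk_size=2)
```

The point is that peak memory depends on chunk_size plus the aggregate, not on the input size.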

#data_science

Python3 nvidia driver bindings in glances

They used to have only python2 ones.

If you update your drivers and glances, you will get a nice GPU memory / load indicator within glances.

So convenient.

#linux

snakers4 (Alexander), October 08, 05:38

Wiki graph database

Just found out that Wikipedia also provides this

- wiki.dbpedia.org/OnlineAccess

- wiki.dbpedia.org/downloads-2016-10#p10608-2

May be useful for research in future.

Seems quite theoretical and probably works only for English, but it is best to keep such things on the radar.

Example queries:

People who were born in Berlin before 1900

German musicians with German and English descriptions

Musicians who were born in Berlin

Games
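The example queries above translate into SPARQL against the public DBpedia endpoint; here is a rough sketch of the first one (the dbo:/dbr: property names follow the DBpedia ontology, and the LIMIT is arbitrary):

```python
def people_born_in_berlin_before_1900(limit=10):
    """Build a SPARQL query string for the DBpedia endpoint."""
    return f"""
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?person ?date WHERE {{
    ?person dbo:birthPlace dbr:Berlin .
    ?person dbo:birthDate ?date .
    FILTER (?date < "1900-01-01"^^xsd:date)
}}
LIMIT {limit}
"""

query = people_born_in_berlin_before_1900()
# send this to https://dbpedia.org/sparql to get actual results
```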

#data_science

snakers4 (Alexander), October 06, 13:04

PCIE risers that REALLY WORK for DL

Thermaltake TT Premium PCIE 3.0 extender.

All the others I tried were crap.

#deep_learning

snakers4 (Alexander), October 06, 07:24

Monkey patching a PyTorch model

Well, ideally you should not do this.

But sometimes you just need to quickly test something and amend your model on the fly.

This helps:

import torch
import functools

def rsetattr(obj, attr, val):
    pre, _, post = attr.rpartition('.')
    return setattr(rgetattr(obj, pre) if pre else obj, post, val)

def rgetattr(obj, attr, *args):
    def _getattr(obj, attr):
        return getattr(obj, attr, *args)
    return functools.reduce(_getattr, [obj] + attr.split('.'))

for module_path, module_object in model.named_modules():
    # replace an old module with a new one,
    # copying some settings and its state
    if isinstance(module_object, torch.nn.SomeClass):
        new_module = SomeOtherClass(module_object.some_settings,
                                    module_object.some_other_settings)
        new_module.load_state_dict(module_object.state_dict())
        rsetattr(model, module_path, new_module)

The above code essentially does the same as:

model.path.to.some.block = some_other_block
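The helpers are plain Python, so they can be tried without PyTorch at all; here is a minimal check on a dummy nested object (the Block class below is just a stand-in for an nn.Module):

```python
import functools

def rsetattr(obj, attr, val):
    # resolve the parent via rgetattr, then set the leaf attribute
    pre, _, post = attr.rpartition('.')
    return setattr(rgetattr(obj, pre) if pre else obj, post, val)

def rgetattr(obj, attr, *args):
    # walk a dotted path with repeated getattr calls
    def _getattr(obj, attr):
        return getattr(obj, attr, *args)
    return functools.reduce(_getattr, [obj] + attr.split('.'))

class Block:
    """Stand-in for an nn.Module with attribute children."""
    pass

model = Block()
model.path = Block()
model.path.to = Block()
model.path.to.block = 'old_block'

rsetattr(model, 'path.to.block', 'new_block')
```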

#python

#pytorch

#deep_learning

#oop

snakers4 (Alexander), October 05, 16:47

Russian post on Habr

habr.com/post/425507/

Please support if you have an account.

#nlp

Parsing Wikipedia for NLP tasks in 4 commands

The gist: it turns out all you need to run is this set of commands: git clone https://github.com/attardi/wikiextractor.git cd wikiextractor wget http...


snakers4 (Alexander), October 05, 16:29

Head of DS in Ostrovok (Moscow)

Please contact @eshneyderman (Евгений Шнейдерман) if you are up to the challenge.

#jobs

Forwarded from Felix Shpilman:

Evgeny Shneyderman:

hh.ru/vacancy/27723418

Head Of Data Science vacancy in Moscow, at Ostrovok.ru

Head Of Data Science vacancy. Salary: not specified. Moscow. Required experience: 3-6 years. Full-time. Published: 07.10.2018.


snakers4 (Alexander), October 04, 06:03

Amazingly simple code to mimic fastText's n-gram subword routine

Nuff said. Try it for yourself.

from fastText import load_model

model = load_model('official_fasttext_wiki_200_model')

def find_ngrams(string, n):
    ngrams = zip(*[string[i:] for i in range(n)])
    ngrams = [''.join(_) for _ in ngrams]
    return ngrams

string = 'грёзоблаженствующий'

ngrams = []
for i in range(3, 7):
    ngrams.extend(find_ngrams('<' + string + '>', i))

ft_ngrams, ft_indexes = model.get_subwords(string)

ngrams = set(ngrams)
ft_ngrams = set(ft_ngrams)

print(sorted(ngrams), sorted(ft_ngrams))
print(ft_ngrams.difference(ngrams), ngrams.difference(ft_ngrams))

#nlp
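Without the fastText model at hand, the n-gram helper alone can be sanity-checked on a toy word; the angle brackets mimic fastText's word-boundary markers:

```python
def find_ngrams(string, n):
    # all character n-grams of length n, via n shifted copies of the string
    ngrams = zip(*[string[i:] for i in range(n)])
    return [''.join(g) for g in ngrams]

trigrams = find_ngrams('<cat>', 3)  # ['<ca', 'cat', 'at>']
```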

snakers4 (Alexander), October 03, 18:16

Parsing Wikipedia in 4 plain commands in Python

Wrote a small take on using Wikipedia as corpus for NLP.

spark-in.me/post/parsing-wikipedia-in-four-commands-for-nlp

medium.com/@aveysov/parsing-wikipedia-in-4-simple-commands-for-plain-nlp-corpus-retrieval-eee66b3ba3ee

Please like / share / repost the article =)

#nlp

#data_science

Parsing Wikipedia in 4 simple commands for plain NLP corpus retrieval

Parsing Wikipedia in 4 simple commands for plain NLP corpus retrieval. Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), October 03, 15:15

Forwarded from Админим с Буквой:

GitHub and SSH keys

I learned about this GitHub feature: you can fetch an account's public keys via a link like the one below. Handy for passing a key around as just a link.

github.com/bykvaadm.keys

#github

snakers4 (Alexander), October 02, 09:59

PyTorch 1.0 PRE-RELEASE

github.com/pytorch/pytorch/releases/tag/v1.0rc0

Looks like it features tools to deploy PyTorch models...

#data_science

pytorch/pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration - pytorch/pytorch


snakers4 (Alexander), October 02, 03:01

New release of keras

github.com/keras-team/keras/releases/tag/2.2.3

#deep_learning

keras-team/keras

Deep Learning for humans. Contribute to keras-team/keras development by creating an account on GitHub.


snakers4 (Alexander), September 29, 10:53

Andrew Ng book

Looks like its draft is finished.

It describes in plain terms how to build ML pipelines:

- drive.google.com/open?id=1aHVZ9pcsGtIcgarZxV-Qfkb0JEvtSLDK

#data_science

snakers4 (Alexander), September 29, 10:48

If you are mining for a large web-corpus

... for any language other than English, and you do not want to scrape anything, buy proxies, or learn this somewhat shady "stack".

For Russian, the araneum link posted above has already-processed data, which may be useless for certain domains.

What to do?

(1)

In case of Russian you can write here

tatianashavrina.github.io/taiga_site/

The author will share her 90+GB RAW corpus with you

(2)

In case of any other language there is a second way

- Go to common crawl website;

- Download the index (200 GB);

- Choose domains in your country / language (now they also have language detection);

- Download only plain-text files you need;

Links to start with

- commoncrawl.org/connect/blog/

- commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/

- www.slideshare.net/RobertMeusel/mining-a-large-web-corpus
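Steps (2)-(4) can be automated against the Common Crawl CDX index server; a sketch of building such a query (the crawl id is just an example - check index.commoncrawl.org for the current list, and treat the parameter names as assumptions to verify against their docs):

```python
from urllib.parse import urlencode

def cc_index_query(crawl_id, url_pattern, page=0):
    """Build a query URL for the Common Crawl CDX index server."""
    params = urlencode({'url': url_pattern, 'output': 'json', 'page': page})
    return f'https://index.commoncrawl.org/{crawl_id}-index?{params}'

# e.g. all captures of .ru domains in one 2018 crawl
query = cc_index_query('CC-MAIN-2018-39', '*.ru')
```

The JSON response lists WARC file offsets, so you then range-request only the records you need instead of whole archives.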

#nlp

Taiga is a corpus where text sources and their meta-information are collected according to popular ML tasks.

An open-source corpus for machine learning.


snakers4 (Alexander), September 28, 11:12

DS/ML digest 25

spark-in.me/post/2018_ds_ml_digest_25

#digest

#deep_learning

#data_science

2018 DS/ML digest 25

2018 DS/ML digest 25. Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), September 28, 05:40

New fast.ai course

Mainly decision tree practice.

A lot about decision tree visualization

- www.fast.ai/2018/09/26/ml-launch/

I personally would check out the visualization bits.

At least it looks like they are not pushing their crappy library =)

The problem with any such visualizations is that they work only for toy datasets.

Drop / shuffle method seems to be more robust.

#data_science

snakers4 (Alexander), September 27, 01:47

youtu.be/dyzn3Fmtw-E

This Painter AI Fools Art Historians 39% of the Time
Pick up cool perks on our Patreon page: www.patreon.com/TwoMinutePapers Crypto and PayPal links are available below. Thank you very much for your gen...

snakers4 (Alexander), September 26, 13:22

Araneum russicum maximum

TLDR - largest corpus for Russian Internet. Fast-text embeddings pre-trained on this corpus work best for broad internet related domains.

Pre-processed version can be downloaded from rusvectores.

Afaik, this link is not yet on their website (?)

wget rusvectores.org/static/rus_araneum_maxicum.txt.gz

#nlp

snakers4 (Alexander), September 24, 12:53

(RU) most popular ML algorithms explained in simple terms

vas3k.ru/blog/machine_learning/

#data_science

Machine Learning for People

Explained in simple words


snakers4 (Alexander), September 20, 16:06

DS/ML digest 24

Key topics of this one:

- New method to calculate phrase/n-gram/sentence embeddings for rare and OOV words;

- So many releases from Google;

spark-in.me/post/2018_ds_ml_digest_24

If you like our digests, you can support the channel via:

- Sharing / reposting;

- Giving an article a decent comment / a thumbs-up;

- Buying me a coffee (links on the digest);

#digest

#deep_learning

#data_science

2018 DS/ML digest 24

2018 DS/ML digest 24. Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), September 19, 09:59

Using sklearn pairwise cosine similarity

scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html#sklearn.metrics.pairwise.cosine_similarity

On a 7k x 7k example with 300-dimensional vectors, it turned out to be MUCH faster than doing the same:

- In 10 processes;

- Using numba;

The more you know.
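The speedup makes sense: sklearn normalizes the rows once and then computes one big matrix product, instead of looping over pairs at the Python level. The per-pair formula it vectorizes is just this (pure-stdlib sketch):

```python
import math

def cosine_similarity(a, b):
    """cos(a, b) = a.b / (|a| * |b|), the quantity sklearn computes for all pairs at once."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

sim = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors -> 1.0
```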

If you have used it - please PM me.

#nlp