Spark in me - Internet, data science, math, deep learning, philo

snakers4 @ telegram, 1361 members, 1660 posts since 2016

All this - lost like tears in rain.

Data science, deep learning, sometimes a bit of philosophy and math. No bs.

Our website
- spark-in.me
Our chat
- goo.gl/WRm93d
DS courses review
- goo.gl/5VGU5A
- goo.gl/YzVUKf

snakers4 (Alexander), November 10, 10:30

Playing with Transformer

TLDR - use only pre-trained.

On classification tasks it performed the same as classic models.

On seq2seq it is much worse time- and memory-wise. Inference is faster, though.

#nlp

snakers4 (Alexander), November 09, 13:48

fastText trained on a random mix of Russian Wikipedia / Taiga / Common Crawl

On our benchmarks it was marginally better than fastText trained on Araneum from Rusvectors.

Download link

goo.gl/g6HmLU

Params

Standard params - (3,6) character n-grams, vector dimensionality of 300.

Usage:

import fastText as ft

ft_model_big = ft.load_model('model')

And then just refer to

github.com/facebookresearch/fastText/blob/master/python/fastText/FastText.py
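
For example, a quick sanity check (get_word_vector is from the Python bindings linked above; the word is arbitrary):

vec = ft_model_big.get_word_vector('облако')  # works for out-of-vocabulary words too, via character n-grams
print(vec.shape)  # (300,) for the 300-dimensional model above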

#nlp

snakers4 (Alexander), November 06, 13:45

DS/ML digest 28

Google open sources pre-trained BERT ... with 102 languages ...

spark-in.me/post/2018_ds_ml_digest_28

#digest

#deep_learning

#data_science



snakers4 (Alexander), November 06, 12:38

A small saga about keeping GPUs cool

(1) 1-2 GPUs with blower fans (or turbo fans) in a full tower

-- idle 40-45C

-- full load - 80-85C

(2) 3-4 GPUs with blower fans (or turbo fans) in a full tower

-- idle - 45-55C

-- full load - 85-95C

Also with 3-4+ GPUs your room starts to heat up significantly + even without full fan speed / overclocking the sound is not very pleasant.

Solutions:

(0) Add a corrugated air duct to dump heat outside - minus 3-5C under load;

(1) Add a high-pressure fan to blow between the GPUs - minus 3-5C under load;

(2) Place the tower on the balcony - minus 3-5C under load.

In the end it is possible to achieve <75C under full load on 4 or even 6 GPUs.

#deep_learning

snakers4 (Alexander), November 05, 14:59

Forwarded from Just links:

DropBlock: A regularization method for convolutional networks arxiv.org/abs/1810.12890

Forwarded from Just links:

github.com/Randl/DropBlock-pytorch

Randl/DropBlock-pytorch

Implementation of DropBlock in PyTorch.


snakers4 (Alexander), November 03, 10:04

Also reposted on additional platforms

- Habr - habr.com/post/428674/

Please support us if you have an account.

Building client routing / semantic search at Profi.ru

Building client routing / semantic search and clustering arbitrary external corpuses at Profi.ru TLDR This is a very short executive summary (or a teaser) about...


snakers4 (Alexander), November 03, 09:40

Building client routing / semantic search and clustering arbitrary external corpuses at Profi.ru

A brief executive summary of what we achieved at Profi.ru.

If you have similar experience or anything similar to share, please do not hesitate to contact me.

Also, we are planning to extend this article into a small series if it gains momentum, so please like / share the article if you like it.

spark-in.me/post/profi-ru-semantic-search-project

#nlp

#data_science

#deep_learning



snakers4 (Alexander), October 30, 08:04

Forwarded from Админим с Буквой:

Google launches reCAPTCHA v3

youtu.be/tbvxFW4UJdU

#news

Introducing reCAPTCHA v3
reCAPTCHA v3 is a new version that detects abusive traffic on your website without user friction. It returns a score for each request you send to reCAPTCHA a...

snakers4 (Alexander), October 27, 17:58

www.youtube.com/watch?v=F-00NhYUnH4

This AI Learned How To Generate Human Appearance
Pick up cool perks on our Patreon page: › www.patreon.com/TwoMinutePapers The paper "A Variational U-Net for Conditional Appearance and Shape Generat...

snakers4 (Alexander), October 27, 10:16

Canonical one-hot encoding one-liner in PyTorch

Or 2 liner, whatever)

# trg - the index tensor to one-hot encode, shape (batch, seq_len)
trg_oh = torch.FloatTensor(trg.size(0), trg.size(1), self.tgt_vocab).zero_().to(self.device)
# scatter_ expects the index to have the same number of dims as the output, hence the unsqueeze
trg_oh.scatter_(2, trg.unsqueeze(2), 1)

#deep_learning

snakers4 (Alexander), October 26, 12:31

A sticker pack for our channel / group

We decided to draw a sticker pack for our telegram channel / group with @birdborn

Please help select the best stickers!

Please vote here:

goo.gl/forms/dPPUADKEM4Zq1YkI2

(The poll is in Russian)

Which stickers should we start with?

Choosing the top stickers we will start drawing with!


snakers4 (Alexander), October 24, 09:11

Concurrent Spatial and Channel Squeeze & Excitation in Fully Convolutional Networks

- Essentially an attention mechanism for semseg models - channel-wise, spatial and mixed (concurrent) attention; a rough PyTorch sketch is below

- Paper arxiv.org/abs/1803.02579

- Implementation www.kaggle.com/c/tgs-salt-identification-challenge/discussion/66178
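
Below is a rough sketch of an scSE block as I understand it from the paper (the class and argument names are mine, not from the linked implementation):

import torch
import torch.nn as nn

class SCSEBlock(nn.Module):
    # concurrent spatial + channel squeeze & excitation for an (N, C, H, W) feature map
    def __init__(self, channels, reduction=16):
        super().__init__()
        # cSE: global average pool -> bottleneck -> per-channel sigmoid gates
        self.cse = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # sSE: 1x1 conv -> per-pixel sigmoid gate
        self.sse = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # the paper combines the two recalibrated maps by summation
        return x * self.cse(x) + x * self.sse(x)

x = torch.randn(2, 64, 32, 32)
print(SCSEBlock(64)(x).shape)  # torch.Size([2, 64, 32, 32])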

#deep_learning

snakers4 (Alexander), October 23, 07:04

Do you read digests?

anonymous poll

Yes – 37 (61%)

Love them – 12 (20%)

No – 11 (18%)

I have an idea how to improve them (PM me) – 1 (2%)

👥 61 people voted so far.

snakers4 (Alexander), October 23, 06:28

DS/ML digest 27

NLP in the focus again!

spark-in.me/post/2018_ds_ml_digest_27

Also your humble servant learned how to do proper NMT =)

#digest

#deep_learning

#data_science



snakers4 (Alexander), October 22, 11:26

In case of a GitHub failure

They have a blog with current statuses

status.github.com/messages

snakers4 (Alexander), October 22, 05:43

Amazing articles about image hashing

Also a Python library (a short usage sketch follows the links below)

- Library github.com/JohannesBuchner/imagehash

- Articles:

fullstackml.com/wavelet-image-hash-in-python-3504fdd282b5

www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html

www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html
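
A minimal usage sketch with the library above (file names are placeholders):

from PIL import Image
import imagehash

h1 = imagehash.phash(Image.open('cat1.jpg'))  # perceptual (DCT-based) hash
h2 = imagehash.phash(Image.open('cat2.jpg'))

# hashes support subtraction -> Hamming distance; a small distance means visually similar images
print(h1, h2, h1 - h2)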

#data_science

#computer_vision

JohannesBuchner/imagehash

A Python Perceptual Image Hashing Module.


Text iterators in PyTorch

Looks like PyTorch has some handy data-processing / loading tools for text models - torchtext.readthedocs.io.

This post - bastings.github.io/annotated_encoder_decoder/ - explains how to use them together with pack_padded_sequence and pad_packed_sequence to substantially boost PyTorch NLP models; a minimal packing sketch is below.
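
A minimal sketch of the packing part (toy tensors, arbitrary sizes; sequences must be sorted by length in descending order):

import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

embedded = torch.randn(3, 5, 16)   # (batch, max_len, emb_dim), already padded
lengths = [5, 3, 2]                # true lengths, descending

rnn = torch.nn.GRU(input_size=16, hidden_size=32, batch_first=True)

# pack so that the RNN skips the padded time steps
packed = pack_padded_sequence(embedded, lengths, batch_first=True)
packed_out, hidden = rnn(packed)

# unpack back into a regular padded (batch, max_len, hidden) tensor
output, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
print(output.shape)  # torch.Size([3, 5, 32])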

#nlp

#deep_learning

snakers4 (Alexander), October 19, 18:34

youtu.be/uEJ71VlUmMQ

Detecting Faces (Viola Jones Algorithm) - Computerphile
Deep learning is used for everything these days, but this face detection algorithm is so neat its still in use today. Dr Mike Pound on the Viola/Jones algori...

snakers4 (Alexander), October 17, 14:41

Forwarded from Sava Kalbachou:

twitter.com/fchollet/status/1052228463300493312

François Chollet

Here is the same dynamic RNN implemented in 4 different frameworks (TensorFlow/Keras, MXNet/Gluon, Chainer, PyTorch). Can you tell which is which?


I guess PyTorch is in the bottom-left corner, but realistically the author of these snippets did a lot of import A as B aliasing.

snakers4 (Alexander), October 16, 05:15

Google's super resolution zoom

Finally Google made something interesting

www.youtube.com/watch?v=z-ZJqd4eQrc

ai.googleblog.com/2018/10/see-better-and-further-with-super-res.html

Super Res Zoom

snakers4 (Alexander), October 16, 03:47

Mixed precision distributed training ImageNet example in PyTorch

github.com/NVIDIA/apex/blob/master/examples/imagenet/main.py

#deep_learning

NVIDIA/apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch - NVIDIA/apex


snakers4 (Alexander), October 15, 17:11

Looks like mixed precision training ... is solved in PyTorch

Lol - and I could not find it

github.com/NVIDIA/apex/tree/master/apex/amp
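
A rough sketch of the amp workflow (entry points taken from the repo; the exact API may differ between apex versions):

import torch
from apex import amp

model = torch.nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
criterion = torch.nn.CrossEntropyLoss()

# wrap model + optimizer; "O1" = mixed precision with dynamic loss scaling
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

inputs = torch.randn(32, 128).cuda()
targets = torch.randint(0, 10, (32,)).cuda()

loss = criterion(model(inputs), targets)
optimizer.zero_grad()
# scale the loss so that fp16 gradients do not underflow
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()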

#deep_learning

NVIDIA/apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch - NVIDIA/apex


snakers4 (Alexander), October 15, 16:56

An open-source alternative to Mendeley

Looks like Zotero is also cross-platform and open-source.

Also you can import the whole Mendeley library with one button push:

www.zotero.org/support/kb/mendeley_import

#data_science

kb:mendeley import [Zotero Documentation]

Zotero is a free, easy-to-use tool to help you collect, organize, cite, and share research.


www.youtube.com/watch?v=KJAnSyB6mME

PyTorch developer conference part 1
Sessions on Applied Research in Industry & Developer Education. Talks from @karpathy (@Tesla), @ctnzr (@nvidia), @ftzo (Pyro/@UberEng), @MarkNeumannnn (@alle...

snakers4 (Alexander), October 15, 09:33

DS/ML digest 26

More interesting NLP papers / material ...

spark-in.me/post/2018_ds_ml_digest_26

#digest

#deep_learning

#data_science



snakers4 (Alexander), October 12, 19:11

www.youtube.com/watch?v=kBFMsY5ZP0o

This AI Senses Humans Through Walls
Pick up cool perks on our Patreon page: › www.patreon.com/TwoMinutePapers Crypto and PayPal links are available below. Thank you very much for your g...

snakers4 (Alexander), October 10, 18:31

Another set of links for common crawl for NLP

Looks like we were not the first, ofc.

Below are some projects dedicated to NLP corpus retrieval at scale:

- Java + license detection + boilerplate removal: zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/DKPro%20C4Corpus/org.dkpro.c4corpus$dkpro-c4corpus-doc/doclinks/1/

- Prepared deduplicated CC text archives data.statmt.org/ngrams/deduped/

- Google group

groups.google.com/forum/#!topic/common-crawl/6F-yXsC35xM

Wow!

#nlp

Downloading 200GB files in literally hours

(1) Order 500 Mbit/s Internet connection from your ISP

(2) Use aria2 - aria2.github.io/ - with the -x flag (multiple connections per server), e.g. aria2c -x 16 <file_url>

(3) Profit

#data_science

snakers4 (Alexander), October 08, 10:11

A small continuation of the crawling saga

2 takes on the Common Crawl

spark-in.me/post/parsing-common-crawl-in-four-simple-commands

spark-in.me/post/parsing-common-crawl-in-two-simple-commands

It turned out to be a bit tougher than expected

But doable

#nlp

Parsing Common Crawl in 4 plain scripts in python



snakers4 (Alexander), October 08, 06:04

Going from millions of points of data to billions on a single machine

In my experience pandas works fine with tables up to 50-100m rows.

Ofc plain indexing/caching (i.e. pre-process all of your data in chunks and index it somehow) and / or clever map/reduce-style optimizations work - see the sketch below.
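
For example, a minimal sketch of the chunk-and-aggregate idea with plain pandas (the file and column names are made up):

import pandas as pd

partial = []
# stream the file in chunks so that only one chunk is in memory at a time
for chunk in pd.read_csv('big_table.csv', chunksize=1_000_000):
    partial.append(chunk.groupby('user_id')['amount'].sum())

# combine the per-chunk aggregates into the final result
result = pd.concat(partial).groupby(level=0).sum()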

But sometimes it is just good to know that such things exist:

- vaex.io/ for large data-frames + some nice visualizations;

- Datashader.org for large visualizations;

- Also you can use Dask for these purposes, I guess - jakevdp.github.io/blog/2015/08/14/out-of-core-dataframes-in-python/;

#data_science

Python3 nvidia driver bindings in glances

They used to have only python2 ones.

If you update your drivers and glances, you will get a nice GPU memory / load indicator within glances.

So convenient.

#linux

snakers4 (Alexander), October 08, 05:38

Wiki graph database

Just found out that Wikipedia data is also available in this form (via DBpedia)

- wiki.dbpedia.org/OnlineAccess

- wiki.dbpedia.org/downloads-2016-10#p10608-2

May be useful for research in the future.

Seems quite theoretical and probably works only for English, but it is best to keep such things on the radar.

Example queries (a rough query sketch follows this list):

People who were born in Berlin before 1900

German musicians with German and English descriptions

Musicians who were born in Berlin

Games
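
For illustration, a rough sketch of running the first example query from Python against DBpedia's public SPARQL endpoint (assuming the SPARQLWrapper package; the exact predicates are from memory and may need tweaking):

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper('http://dbpedia.org/sparql')
# people born in Berlin before 1900
sparql.setQuery("""
    SELECT ?name ?birth WHERE {
        ?person dbo:birthPlace dbr:Berlin .
        ?person dbo:birthDate ?birth .
        ?person foaf:name ?name .
        FILTER (?birth < "1900-01-01"^^xsd:date)
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for row in results['results']['bindings']:
    print(row['name']['value'], row['birth']['value'])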

#data_science

snakers4 (Alexander), October 06, 13:04

PCIE risers that REALLY WORK for DL

Thermaltake TT Premium PCIE 3.0 extender.

All the others I tried were crap.

#deep_learning

snakers4 (Alexander), October 06, 07:24

Monkey patching a PyTorch model

Well, ideally you should not do this.

But sometimes you just need to quickly test something and amend your model on the fly.

This helps:

import torch
import functools

def rsetattr(obj, attr, val):
    # setattr that understands dotted paths like 'encoder.block1.conv'
    pre, _, post = attr.rpartition('.')
    return setattr(rgetattr(obj, pre) if pre else obj, post, val)

def rgetattr(obj, attr, *args):
    # getattr that understands dotted paths
    def _getattr(obj, attr):
        return getattr(obj, attr, *args)
    return functools.reduce(_getattr, [obj] + attr.split('.'))

for old_module_path, old_module_object in model.named_modules():
    # replace an old module with a new one,
    # copying some settings and its state
    # (torch.nn.SomeClass / SomeOtherClass are placeholders for your actual classes)
    if isinstance(old_module_object, torch.nn.SomeClass):
        new_module = SomeOtherClass(old_module_object.some_settings,
                                    old_module_object.some_other_settings)
        new_module.load_state_dict(old_module_object.state_dict())
        rsetattr(model, old_module_path, new_module)

The above code essentially does the same as:

model.path.to.some.block = some_other_block

#python

#pytorch

#deep_learning

#oop