Spark in me - Internet, data science, math, deep learning, philo

snakers4 @ telegram, 1797 members, 1726 posts since 2016

All this - lost like tears in rain.

Data science, ML, a bit of philosophy and math. No bs.

Our website
- spark-in.me
Our chat
- t.me/joinchat/Bv9tjkH9JHYvOr92hi5LxQ
DS courses review
- goo.gl/5VGU5A
- goo.gl/YzVUKf

snakers4 (Alexander), February 27, 12:39

We tried it

... yeah we tried it on a real task

just plain Adam is a bit better

snakers4 (Alexander), February 27, 07:50

New variation of Adam?

- [Website](www.luolc.com/publications/adabound/);

- [Code](github.com/Luolc/AdaBound);

- Claims to eliminate the generalization gap between adaptive methods and SGD;

- TL;DR: A Faster And Better Optimizer with Highly Robust Performance;

- Dynamic bound on learning rates. Inspired by gradient clipping;

- Not very sensitive to the hyperparameters, especially compared with SGD(M);

- Tested on MNIST, CIFAR, Penn Treebank - no serious datasets;
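
If you want to try it, it is a drop-in Adam replacement. A minimal sketch, assuming the pip package from the repo above:

import torch
import adabound  # pip install adabound

model = torch.nn.Linear(10, 2)  # stand-in for a real network
# final_lr is the SGD-like bound that the learning rate anneals to
optimizer = adabound.AdaBound(model.parameters(), lr=1e-3, final_lr=0.1)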

#deep_learning

Adaptive Gradient Methods with Dynamic Bound of Learning Rate

Abstract: Adaptive optimization methods such as AdaGrad, RMSProp and Adam have been proposed to achieve a rapid training process with an element-wise scaling term on learning rates. Though prevailing, they are observed to generalize poorly compared with SGD or even fail to converge due to unstable and extreme learning rates. Recent work has put forward some algorithms such as AMSGrad to tackle this issue but they failed to achieve considerable improvement over existing methods.


snakers4 (Alexander), February 18, 09:24

4th 2019 DS / ML digest

Highlights of the week

- OpenAI controversy;

- BERT pre-training;

- Using transformer for conversational challenges;

spark-in.me/post/2019_ds_ml_digest_04

#digest

#data_science

#deep_learning

2019 DS/ML digest 04

2019 DS/ML digest 04 Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), February 17, 10:22

A bit of lazy Sunday admin stuff

Monitoring your CPU temperature with email notifications

- Change CPU temp to any metric you like

- Rolling log

- Sends an email only once, when the metric becomes critical (you can add another email for when the metric becomes non-critical again)

gist.github.com/snakers4/cf0ffd57c3ef7f4e2e25f6b3347dcdec
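
Not the gist itself - just a minimal sketch of the idea (psutil for the metric, smtplib for the mail; the 'coretemp' key, threshold and addresses are placeholders):

import smtplib
import psutil  # assuming psutil is installed

CRITICAL_TEMP = 75.0  # placeholder threshold, degrees C
already_alerted = False  # persist this flag to send the email only once

def cpu_temp():
    # psutil exposes lm-sensors readings; 'coretemp' is typical on Intel CPUs
    readings = psutil.sensors_temperatures()['coretemp']
    return max(r.current for r in readings)

def send_alert(temp):
    with smtplib.SMTP('localhost') as server:
        server.sendmail('[email protected]', '[email protected]',
                        'Subject: CPU temp alert\n\nCPU at {:.1f} C'.format(temp))

temp = cpu_temp()
if temp > CRITICAL_TEMP and not already_alerted:
    send_alert(temp)
    already_alerted = True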

Setting up a GPU box on Ubuntu 18.04 from scratch

github.com/snakers4/gpu-box-setup/

#deep_learning

#linux

Plain temperature monitoring in Ubuntu 18.04

Plain temperature monitoring in Ubuntu 18.04. GitHub Gist: instantly share code, notes, and snippets.


snakers4 (Alexander), February 17, 08:49

Pinned post

What is this channel about?

(0)

This channel is a practitioner's channel on the following topics: Internet, Data Science, Deep Learning, Python, NLP

(1)

Don't get your knickers in a twist if your opinion differs.

You are welcome to contact me via telegram @snakers41 and email - [email protected]

(2)

No BS and ads - I already rejected 3-4 crappy ad deals

(3)

DS ML digests - in the RSS or via URLs like this

spark-in.me/post/2019_ds_ml_digest_01

Donations

(0)

Buy me a coffee 🤟 buymeacoff.ee/8oneCIN

Give us a rating:

(0)

telegram.me/tchannelsbot?start=snakers4

Our chat

(0)

t.me/joinchat/Bv9tjkH9JHYvOr92hi5LxQ

More links

(0)

Our website spark-in.me

(1)

Our chat t.me/joinchat/Bv9tjkH9JHYvOr92hi5LxQ

(2)

DS courses review (RU) - very old

goo.gl/5VGU5A

spark-in.me/post/learn-data-science

(3)

2017 - 2018 SpaceNet Challenge

spark-in.me/post/spacenet-three-challenge

(4)

DS Bowl 2018

spark-in.me/post/playing-with-dwt-and-ds-bowl-2018

(5)

Data Science tag on the website

spark-in.me/tag/data-science

(6)

Profi.ru project

towardsdatascience.com/building-client-routing-semantic-search-in-the-wild-14db04687c7e

(7)

CFT 2018 competition

spark-in.me/post/cft-spelling-2018

(8)

2018 retrospective

spark-in.me/post/2018

More amazing NLP-related articles incoming!

Maybe we will finally make podcasts?

2019 DS/ML digest 01

2019 DS/ML digest 01 Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), February 14, 06:20

Which type of content do you / would you like most on the channel?

  • Weekly / bi-weekly digests; (33)
  • Full articles; (13)
  • Podcasts with actual ML practitioners; (12)
  • Practical bits on real applied NLP; (27)
  • Pre-trained BERT with Embedding Bags for Russian; (11)
  • Paper reviews; (21)
  • Jokes / memes / cats; (9)

126 votes

snakers4 (Alexander), February 13, 09:56

*

(2) is valid for models with a complex forward pass and for models with large embedding layers

snakers4 (Alexander), February 13, 09:02

PyTorch NLP best practices

Very simple ideas, actually.

(1) Multi GPU parallelization and FP16 training

Do not bother reinventing the wheel.

Just use NVIDIA's apex, DistributedDataParallel, DataParallel.

Best examples [here](github.com/huggingface/pytorch-pretrained-BERT).
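
A minimal sketch of a training step with both (assuming apex is installed and exposes the unified amp.initialize API; the tiny model and random data are stand-ins):

import torch
import torch.nn as nn
from apex import amp

model = nn.Linear(512, 10).cuda()  # stand-in for a real network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# mixed precision first (O1 = conservative mixed precision) ...
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')
# ... then plain single-machine multi-GPU
model = nn.DataParallel(model)

# one training step - scale the loss so FP16 gradients do not underflow
inputs = torch.randn(32, 512).cuda()
targets = torch.randint(0, 10, (32,)).cuda()
loss = nn.functional.cross_entropy(model(inputs), targets)
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()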

(2) Put as much as possible INSIDE of the model

Implement as much of your logic as possible inside of nn.Module.

Why?

So that you can seamlessly use all the abstractions from (1) with ease.

Also models are more abstract and reusable in general.

(3) Why have a separate train/val loop?

PyTorch 0.4 introduced context managers.

You can simplify your train / val / test loops, and merge them into one simple function.

import torch
from tqdm import tqdm

def run_loop(model, loader, loop_type='Train'):
    context = torch.no_grad() if loop_type == 'Val' else torch.enable_grad()

    if loop_type == 'Train':
        model.train()
    elif loop_type == 'Val':
        model.eval()

    with context:
        for i, some_tensor in enumerate(tqdm(loader)):
            # do your stuff here
            pass

(4) EmbeddingBag

Use EmbeddingBag layer for morphologically rich languages. Seriously!
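
A minimal illustration (all sizes made up): you feed one flat tensor of token / n-gram indices plus offsets that mark where each sentence starts, and get one pooled vector per sentence - no padding needed:

import torch
import torch.nn as nn

# e.g. 10k char n-grams for a morphologically rich language, 300-dim vectors
bag = nn.EmbeddingBag(10000, 300, mode='mean')

indices = torch.tensor([1, 2, 4, 5, 4, 3, 2, 9])  # two sentences, flattened
offsets = torch.tensor([0, 4])                    # start of each sentence

out = bag(indices, offsets)  # shape: (2, 300), one averaged vector per sentence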

(5) Writing trainers / training abstractions

This is a waste of time imho if you follow (1), (2) and (3).

(6) Nice bonus

If you follow most of these, you can train on as many GPUs and machines as you want, for any language =)

(7) Using tensorboard for logging

This goes without saying.

#nlp

#deep_learning

huggingface/pytorch-pretrained-BERT

📖The Big-&-Extending-Repository-of-Transformers: Pretrained PyTorch models for Google's BERT, OpenAI GPT & GPT-2, Google/CMU Transformer-XL. - huggingface/pytorch-pretrained-BERT


PyTorch DataLoader, GIL thrashing and CNNs

Well all of this seems a bit like magic to me, but hear me out.

I abused my GPU box for weeks running CNNs on 2-4 GPUs.

Nothing broke.

And then my GPU box started shutting down for no apparent reason.

No, this was not:

- CPU overheating (I have a massive cooler, I checked - it works);

- PSU;

- Overclocking;

- It also adds to confusion that AMD has weird temperature readings;

To cut the story short - if you have a very fast Dataset class and you use PyTorch's DataLoader with workers > 0, it can lead to system instability instead of a speedup.

It is obvious in retrospect, but it is not when you face this issue.
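
So the practical fix is just a config change - a sketch with a toy dataset:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(10000, 3, 32, 32))  # very fast __getitem__

# if __getitem__ is already nearly free, worker processes only add
# inter-process overhead - keep the loading in the main process
loader = DataLoader(dataset, batch_size=512, num_workers=0, pin_memory=True)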

#deep_learning

#pytorch

snakers4 (Alexander), February 12, 05:13

Russian thesaurus that really works

nlpub.ru/Russian_Distributional_Thesaurus#.D0.93.D1.80.D0.B0.D1.84_.D0.BF.D0.BE.D0.B4.D0.BE.D0.B1.D0.B8.D1.8F_.D1.81.D0.BB.D0.BE.D0.B2

It knows so many peculiar / old-fashioned and cheeky synonyms for obscene words!

#nlp

Russian Distributional Thesaurus

Russian Distributional Thesaurus (RDT) is a project to build an open distributional thesaurus of the Russian language. At the moment the resource contains several components: word vectors (word embeddings), a word similarity graph (the distributional thesaurus itself), a set of hypernyms and an inventory of word senses. All resources were built automatically from a corpus of Russian-language books (12.9 billion tokens). Future versions are planned to also include word sense vectors for Russian, obtained from the same text corpus. The project is developed by representatives of UrFU, Lomonosov Moscow State University and the University of Hamburg. In the past, researchers from South Ural State University, Darmstadt University of Technology, the University of Wolverhampton and the University of Trento contributed to the project.


snakers4 (Alexander), February 11, 06:29

Forwarded from Sava Kalbachou:

towardsdatascience.com/these-are-the-easiest-data-augmentation-techniques-in-natural-language-processing-you-can-think-of-88e393fd610

These are the Easiest Data Augmentation Techniques in Natural Language Processing you can think of — and they work.

Data augmentation is commonly used in computer vision. In vision, you can almost certainly flip, rotate, or mirror an image without risk…


snakers4 (Alexander), February 11, 06:22

Old news ... but Attention works

Funnily enough, in the past my models:

- Either did not need attention;

- Or attention was implemented by @thinline72 ;

- Or the domain (NMT) was so complicated that I had to resort to boilerplate with key-value attention;

This was the first time I / we tried manually building a model with plain self-attention from scratch.

And you know - it really adds 5-10% to all of the tracked metrics.

Best plain attention layer in PyTorch - simple, well documented ... and it works in real life applications:

gist.github.com/cbaziotis/94e53bdd6e4852756e0395560ff38aa4
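
Not the gist verbatim - a minimal sketch of the same idea (attention pooling over a sequence of hidden states):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)  # one score per timestep

    def forward(self, x, mask=None):
        # x: (batch, seq_len, hidden_dim), e.g. RNN outputs
        scores = self.scorer(x).squeeze(-1)               # (batch, seq_len)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        weights = F.softmax(scores, dim=-1)               # attention distribution
        pooled = torch.bmm(weights.unsqueeze(1), x).squeeze(1)  # (batch, hidden_dim)
        return pooled, weights

pooled, weights = SelfAttention(128)(torch.randn(4, 20, 128))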

#nlp

#deep_learning

SelfAttention implementation in PyTorch

SelfAttention implementation in PyTorch. GitHub Gist: instantly share code, notes, and snippets.


snakers4 (Alexander), February 08, 16:20

youtu.be/DMXvkbAtHNY

DeepMind’s AlphaStar Beats Humans 10-0 (or 1)
DeepMind's #AlphaStar blog post: deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/ Full event: www.youtube.com/watc...

snakers4 (Alexander), February 08, 10:11

Third 2019 DS / ML digest

Highlights of the week

- quaternions;

- ODEs;

spark-in.me/post/2019_ds_ml_digest_03

#digest

#data_science

#deep_learning

2019 DS/ML digest 03

2019 DS/ML digest 03 Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), February 04, 08:37

A new paradigm in ML?

jontysinai.github.io/jekyll/update/2019/01/18/understanding-neural-odes.html

#deep_learning

#odes

Understanding Neural ODE's

In this blogpost I explore how ODE’s can be used to solve data modelling problems. I take a deep dive into the data modelling problem at hand and present ODE’s (which model rates of change) as an a...


snakers4 (Alexander), January 31, 13:33

Forwarded from Anna:

Checked out sentence embeddings in LASER:

- installation guide is a bit messy

- works on the FAISS lib, performance is pretty fast (<1 minute to encode 250k sentences on a 1080Ti)

- better generalization compared to the ft baseline. The difference is clear even for short sentences: 'добрый день!' ('good afternoon!') and 'здравствуйте!' ('hello!') embeddings are much closer in LASER's space than in ft

- looks like LASER embeddings are more about similarity, not only substitutability, and are better at synonym recognition

- seems to work better on short sentences

snakers4 (Alexander), January 31, 09:41

Second 2019 DS / ML digest

Highlight of the week - Facebook's LASER.

spark-in.me/post/2019_ds_ml_digest_02

#digest

#data_science

#deep_learning

2019 DS/ML digest 02

2019 DS/ML digest 02 Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), January 31, 08:38

Jupyter widgets + pandas

towardsdatascience.com/interactive-controls-for-jupyter-notebooks-f5c94829aee6

With the @interact decorator, the IPywidgets library automatically gives us a text box and a slider for choosing a column and number! It looks at the inputs

Amazing.
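
A minimal sketch of the pattern for a notebook cell (toy DataFrame; a list argument becomes a dropdown, an int tuple becomes a slider):

import pandas as pd
from ipywidgets import interact

df = pd.DataFrame({'a': range(10), 'b': range(10, 20)})

@interact(column=list(df.columns), rows=(1, 10))
def show_head(column, rows):
    return df[column].head(rows)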

#data_science

Interactive Controls in Jupyter Notebooks

How to use IPywidgets to enhance your data exploration and analysis


snakers4 (Alexander), January 30, 12:06

Serialization of large objects in Python

So far I have found no sane single-object way to do this with 1M chunks / 10GB+ object sizes.

Of course, chunking / plain txt works.

Feather / parquet fail at 2+ GB sizes.

Pickle works, but it is kind of slow.

=(
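
So chunking it is. A minimal sketch of the workaround (names and sizes are placeholders):

import pickle

def save_chunks(items, path_prefix, chunk_size=1_000_000):
    # split a huge list into files small enough for any serializer
    for n, i in enumerate(range(0, len(items), chunk_size)):
        with open('{}_{:04d}.pkl'.format(path_prefix, n), 'wb') as f:
            pickle.dump(items[i:i + chunk_size], f,
                        protocol=pickle.HIGHEST_PROTOCOL)

save_chunks(list(range(3_000_000)), 'huge_object')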

#data_science

snakers4 (Alexander), January 27, 17:28

youtu.be/-cOYwZ2XcAc

None of These Faces Are Real
The paper "A Style-Based Generator Architecture for Generative Adversarial Networks", i.e., #StyleGan and its video available here: arxiv.org/abs/181...

snakers4 (Alexander), January 25, 12:31

Downsides of using Common Crawl

Took a look at the Common Crawl data I myself pre-processed last year and could not find abstracts - only sentences.

Took a look at these archives - data.statmt.org/ngrams/deduped/ - also only sentences, though they sometimes seem to be in logical order.

You can use any form of CC - but only to learn word representations. Not sentences.

Sad.

#nlp

snakers4 (Alexander), January 24, 14:48

Neat PyTorch hack

(1) If possible, implement your complex loss / logic within your model.forward()

(2) Enjoy the multi-GPU / multi-node training wrappers from APEX, PyTorch DataParallel, DistributedDataParallel etc

=)
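
A minimal sketch of the pattern (the backbone and data are stand-ins): the loss lives inside forward(), so each DataParallel replica computes it on its own shard of the batch:

import torch
import torch.nn as nn

class ModelWithLoss(nn.Module):
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, x, targets=None):
        logits = self.backbone(x)
        if targets is None:
            return logits                        # inference path
        return self.criterion(logits, targets)   # training path

model = nn.DataParallel(ModelWithLoss(nn.Linear(512, 10)).cuda())
loss = model(torch.randn(32, 512).cuda(),
             torch.randint(0, 10, (32,)).cuda()).mean()  # average over replicas
loss.backward()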

#deep_learning

snakers4 (Alexander), January 23, 11:26

NLP - Highlight of the week - LASER

- Hm, a new sentence embedding tool?

- Plain PyTorch 1.0 / numpy / FAISS based;

- [Release](code.fb.com/ai-research/laser-multilingual-sentence-embeddings/), [library](github.com/facebookresearch/LASER);

- Looks like an off-shoot of their "unsupervised" NMT project;

LASER’s vector representations of sentences are generic with respect to both the input language and the NLP task. The tool maps a sentence in any language to a point in a high-dimensional space with the goal that the same statement in any language will end up in the same neighborhood. This representation could be seen as a universal language in a semantic vector space. We have observed that the distance in that space correlates very well to the semantic closeness of the sentences.
- Alleged pros:

  - It delivers extremely fast performance, processing up to 2,000 sentences per second on GPU;
  - The sentence encoder is implemented in PyTorch with minimal external dependencies;
  - Languages with limited resources can benefit from joint training over many languages;
  - The model supports the use of multiple languages in one sentence;
  - Performance improves as new languages are added, as the system learns to recognize characteristics of language families;

They essentially trained an NMT model with a shared encoder for many languages.

I tried training sth similar - but it quickly over-fitted into just memorizing the indexes of words.

#nlp

#deep_learning


LASER natural language processing toolkit - Facebook Code

Our natural language processing toolkit, LASER, performs zero-shot cross-lingual transfer with more than 90 languages and is now open source.


snakers4 (Alexander), January 23, 08:12

Pre-trained BERT in PyTorch

github.com/huggingface/pytorch-pretrained-BERT

(1)

Model code here is just awesome.

Integrated DataParallel / DDP wrappers / FP16 wrappers are also awesome.

FP16 precision training from APEX just works (no idea about convergence though yet).

(2)

As for model weights - I cannot really tell, there is no dedicated Russian model.

The only problem I am facing now - with large embedding bags, batch size is literally 1-4 even for smaller models.

And training models with sentencepiece is kind of feasible for rich languages, but you will always worry about generalization.

(3)

Did not try the generative pre-training (or the sentence prediction pre-training); I hope that properly initializing embeddings will also work for a closed domain with a smaller model (they pre-train for 4 days on 4+ TPUs, lol).

(4)

Why even tackle such models?

Chat / dialogue / machine comprehension models are complex / require one-off feature engineering.

Being able to tune something like BERT on publicly available benchmarks and then on your domain can provide a good way to embed complex situations (like questions in dialogues).
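
A minimal usage sketch for the repo above (using the published bert-base-uncased weights, since there is no Russian model):

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

tokens = tokenizer.tokenize('[CLS] hello world [SEP]')
ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
with torch.no_grad():
    # a list of per-layer hidden states and the pooled [CLS] output
    encoded_layers, pooled = model(ids)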

#nlp

#deep_learning

huggingface/pytorch-pretrained-BERT

📖The Big-&-Extending-Repository-of-Transformers: Pretrained PyTorch models for Google's BERT, OpenAI GPT & GPT-2, Google/CMU Transformer-XL. - huggingface/pytorch-pretrained-BERT


snakers4 (Alexander), January 21, 04:01

New amazing video by 3B1B

youtu.be/jsYwFizhncE

snakers4 (Alexander), January 15, 08:33

First 2019 DS / ML digest

No particular highlights - maybe the ML industrialization trend is here to stay?

spark-in.me/post/2019_ds_ml_digest_01

#digest

#deep_learning

#data_science

2019 DS/ML digest 01

2019 DS/ML digest 01 Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), January 10, 09:49

Someone implemented instance weighted CE loss for PyTorch

gist.github.com/nasimrahaman/a5fb23f096d7b0c3880e1622938d0901
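
The trick in a nutshell - a sketch of the same idea, not the gist verbatim:

import torch
import torch.nn.functional as F

def instance_weighted_ce(logits, targets, weights):
    # per-sample CE first, then an instance-level weight
    losses = F.cross_entropy(logits, targets, reduction='none')
    return (losses * weights).mean()

logits = torch.randn(8, 5)
targets = torch.randint(0, 5, (8,))
weights = torch.rand(8)  # e.g. per-sample importance
loss = instance_weighted_ce(logits, targets, weights)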

#deep_learning

Pytorch instance-wise weighted cross-entropy loss

Pytorch instance-wise weighted cross-entropy loss. GitHub Gist: instantly share code, notes, and snippets.


snakers4 (Alexander), January 09, 09:15

Using nargs

Wrote about this a year ago.

Forgot about it, a friend reminded me.

You can pass lists as Python command-line arguments.

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--classifier_conf', default=[512, 2048, 5005], nargs='+', type=int)
args = parser.parse_args()  # args.classifier_conf is a list of ints

and then just pass the values in your call as follows

--classifier_conf 512 2048 5005

#deep_learning

snakers4 (Alexander), January 08, 03:12

Forwarded from Sava Kalbachou:

techcrunch.com/2019/01/07/github-free-users-now-get-unlimited-private-repositories/?guccounter=1

GitHub Free users now get unlimited private repositories

If you’re a GitHub user, but you don’t pay, this is a good week. Historically, GitHub always offered free accounts but the caveat was that your code had to be public. To get private repositories, you had to pay. Starting tomorrow, that limitation is gone. Free GitHub users now get unlimited private projects with up […]


snakers4 (Alexander), January 04, 04:08

Linux subsystem in Windows 10

It works and installs in literally 2 clicks (run one command in PowerShell and then just one-click install your Linux distro of choice from the Windows Store - yes, this is very funny indeed)!

Why would you need this?

To make and back up files with one command, for example =)

Something like this becomes reality on Windows:

cd /mnt/d/ && \
TIME=`date +%b-%d-%y` && \
FILENAME=working_files_tar-$TIME.tar.gz && \
INCREMENTAL_FILE=backup_data.snar && \
FOLDERS=$(<folders_backup.txt) && \
echo 'Using folder list' $FOLDERS && \
tar -cz --listed-incremental=$INCREMENTAL_FILE --verbose -f $FILENAME $FOLDERS

Also, you may add rsync or scp and you are good to go!

Also other potential use cases:

- You are somehow vendor-locked (I depend on proprietary drivers for my thunderbolt port to attach an external GPU) or are just used to Windows' windows (or are just too lazy to install Linux);

- You need one particular Linux program or you need to quickly test something / do not want to bother replicating your environment under Windows (yes, you can also run Docker, but there will be some learning curve);

- You run all of your programs remotely, and use your Windows machine as a thin client, but sometimes you need git / bash / rsync - i.e. to download movies from your personal NAS;

#linux

snakers4 (Alexander), December 31, 2018

ML trends in 2019?

  • New TF API (6)
  • Pytorch 1.0+ mobile deploy (25)
  • NLP explosion (32)
  • GANs become mainstream (23)
  • CNNs / TCNs for tabular - mainstream (5)
  • RL becomes less fragile (16)
  • Finally a competitor to Nvidia? (9)
  • Nvlink or sth similar
  • End of hype (30)
  • Real deploy of cars (13)

159 votes
