Spark in me - Internet, data science, math, deep learning, philo

snakers4 @ telegram, 1812 members, 1759 posts since 2016

All this - lost like tears in rain.

Data science, ML, a bit of philosophy and math. No bs.

Our website
- http://spark-in.me
Our chat
- https://t.me/joinchat/Bv9tjkH9JHbxiV5hr91a0w
DS courses review
- http://goo.gl/5VGU5A
- https://goo.gl/YzVUKf

Posts by tag «nlp»:

snakers4 (Alexander), August 11, 04:42

Extreme NLP network miniaturization

Tried some plain RNNs on a custom in-the-wild NER task.

The dataset is huge - literally infinite - but manually generated to mimic in-the-wild data.

I use EmbeddingBag + 1m n-grams (an optimal cut-off). Yeah, on NER / classification it is a handy trick that makes your pipeline totally misprint / error / OOV agnostic. FAIR themselves just arrived at the same idea (see MOE below). Very cool! Just add PyTorch and you are golden.
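
A minimal sketch of the trick - hash character n-grams into a fixed number of buckets and feed them to an EmbeddingBag. The bucket count, n-gram range and toy tokens below are illustrative, not my exact setup:

import torch
import torch.nn as nn

N_BUCKETS = 1_000_000   # ~1m n-gram buckets, an illustrative cut-off
EMB_DIM = 50

def char_ngrams(token, n_min=3, n_max=6):
    token = '<' + token + '>'
    return [token[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(token) - n + 1)]

def to_bucket_ids(tokens):
    # hash every n-gram of every token into a fixed bucket range, so
    # misprints / OOV words still hit mostly familiar buckets
    # (in practice use a stable hash - Python's hash() is salted per process)
    ids = [hash(ng) % N_BUCKETS for t in tokens for ng in char_ngrams(t)]
    return torch.tensor(ids, dtype=torch.long)

emb = nn.EmbeddingBag(N_BUCKETS, EMB_DIM, mode='mean')

ids = to_bucket_ids(['some', 'mispelled', 'sentense'])
sentence_vector = emb(ids.unsqueeze(0))   # 2D input: one row = one bag, shape (1, EMB_DIM)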

What is interesting:

- The model works with embedding sizes of 300, 100, 50 and even 5! An embedding size of 5 is dangerously close to OHE, but doing OHE on 1m n-grams does not really make sense;

- The model works with various hidden sizes;

- Naturally, all of the models run very fast on CPU, and the smallest model is also very light in terms of its weights;

- The only difference is convergence time. It scales roughly as a log of the model size, i.e. the model with embedding size 5 takes 5-7x more time to converge than the one with 50. I wonder what happens if I use an embedding size of 1;

As an added bonus - you can store such a miniature model in git w/o lfs.

And what about training transformers on US$250k worth of compute credits, you say?)

#nlp

#data_science

#deep_learning

A new model for word embeddings that are resilient to misspellings

Misspelling Oblivious Embeddings (MOE) is a new model for word embeddings that are resilient to misspellings, improving the ability to apply word embeddings to real-world situations, where misspellings are common.


snakers4 (Alexander), April 17, 08:55

Archive team ... makes monthly Twitter archives

With all the BS around politics / "Russian hackers" / the Arab spring - Twitter has now closed its developer API.

No problem.

Just pay a visit to the Archive Team page

archive.org/details/twitterstream?and[]=year%3A%222018%22

Donate them here

archive.org/donate/

#data_science

#nlp


Archive Team: The Twitter Stream Grab : Free Web : Free Download, Borrow and Streaming : Internet Archive

A simple collection of JSON grabbed from the general twitter stream, for the purposes of research, history, testing and memory. This is the Spritzer version, the most light and shallow of Twitter grabs. Unfortunately, we do not currently have access to the Sprinkler or Garden Hose versions of the...


snakers4 (Alexander), March 31, 12:44

Miniaturize / optimize your ... NLP models?

For CV applications there are literally dozens of ways to make your models smaller.

And yeah, I do not mean some "moonshots" or special limited libraries (matrix decompositions, some custom pruning, etc etc).

I mean cheap and dirty hacks, that work in 95% of cases regardless of your stack / device / framework:

- Smaller images (x3-x4 easy);

- FP16 inference (30-40% maybe);

- Knowledge distillation into smaller networks (x3-x10);

- Naïve cascade optimizations (feed only Nth frame using some heuristic);

But what can you do with NLP networks?

Turns out not much.

But here are my ideas:

- Use a simpler model - embedding bag + plain self-attention + LSTM can solve 90% of tasks;

- Decrease the embedding size from 300 to 50 (or maybe even lower). Tried and tested, works like a charm. For harder tasks you lose just 1-3pp of your target metric, for simpler tasks - it is just the same;

- FP16 inference is supported in PyTorch for nn.Embedding, but not for nn.EmbeddingBag - you just get the error below (see also the snippet after this list). But you get the idea;

_embedding_bag is not implemented for type torch.HalfTensor

- You can try distilling your vocabulary / embedding-bag model into a char level model. If it works, you can trade model size vs. inference time;

- If you have very long sentences or large batches - try distilling / swapping your recurrent network with a CNN / TCN. This way you can also trade model size vs. inference time but probably in a different direction;
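
To make the FP16 caveat concrete - a quick check you can run yourself. Behaviour depends on your PyTorch version and device; at the time of writing nn.EmbeddingBag simply raised the error quoted above:

import torch
import torch.nn as nn

ids = torch.randint(0, 1000, (4, 10)).cuda()

emb = nn.Embedding(1000, 50).cuda().half()
print(emb(ids).dtype)   # torch.float16 - plain embeddings work in half precision

bag = nn.EmbeddingBag(1000, 50).cuda().half()
try:
    bag(ids)            # 2D input: each row is treated as one bag
except RuntimeError as e:
    print(e)            # the "_embedding_bag is not implemented for ..." error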

#nlp

#deep_learning

snakers4 (Alexander), March 26, 04:44

Russian sentiment dataset

In typical Russian fashion - one of these datasets was deleted at the request of bad people, whom I shall not name.

Luckily, someone anonymous backed the dataset up.

Anyway - use it.

Yeah, it is small. But it is free, so whatever.

#nlp

#data_science

Download Dataset.tar.gz 1.57 MB

snakers4 (Alexander), March 21, 11:15

Normalization techniques other than batch norm:

(pics.spark-in.me/upload/aecc2c5fb356b6d803b4218fcb0bc3ec.png)

Weight normalization (used in TCN arxiv.org/abs/1602.07868):

- Decouples length of weight vectors from their direction;

- Does not introduce any dependencies between the examples in a minibatch;

- Can be applied successfully to recurrent models such as LSTMs;

- Tested only on small datasets (CIFAR + VAEs + DQN);

Instance norm (used in [style transfer](arxiv.org/abs/1607.08022))

- Proposed for style transfer;

- Essentially is batch-norm for one image;

- The mean and standard-deviation are calculated per-dimension separately for each object in a mini-batch;

Layer norm (used in Transformers, [paper](arxiv.org/abs/1607.06450))

- Designed especially for sequential networks;

- Computes the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case;

- The mean and standard-deviation are calculated over the last D dimensions, where D is given by the normalized shape;

- Unlike Batch Normalization and Instance Normalization, which apply a scalar scale and bias for each entire channel/plane with the affine option, Layer Normalization applies a per-element scale and bias;
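
A quick PyTorch reference for the three of them - shapes are just an example (a batch of 20 sequences, 100 channels / features, length 50):

import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

x = torch.randn(20, 100, 50)   # (batch, channels, seq_len)

# weight normalization: reparametrizes a layer's weights as direction * length
conv = weight_norm(nn.Conv1d(100, 100, kernel_size=3, padding=1))
y = conv(x)

# instance norm: per-object, per-channel statistics
inorm = nn.InstanceNorm1d(100)
y = inorm(x)

# layer norm: statistics over the feature dimension of each time step
lnorm = nn.LayerNorm(100)
y = lnorm(x.transpose(1, 2))   # LayerNorm expects the normalized dim last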

#deep_learning

#nlp

snakers4 (Alexander), March 12, 15:45

Our Transformer post was featured by Towards Data Science

medium.com/p/complexity-generalization-computational-cost-in-nlp-modeling-of-morphologically-rich-languages-7fa2c0b45909?source=email-f29885e9bef3--writer.postDistributed&sk=a56711f1436d60283d4b672466ba258b

#nlp

Comparing complex NLP models for complex languages on a set of real tasks

Transformer is not yet really usable in practice for languages with rich morphology, but we take the first step in this direction


snakers4 (Alexander), February 27, 13:02

Dependency parsing and POS tagging in Russian

Less popular set of NLP tasks.

Popular tools reviewed

habr.com/ru/company/sberbank/blog/418701/

Only morphology:

(0) Well known pymorphy2 package;

Only POS tags and morphology:

(0) github.com/IlyaGusev/rnnmorph (easy to use);

(1) github.com/nlpub/pymystem3 (easy to use);

Full dependency parsing

(0) Russian spacy plugin:

- github.com/buriy/spacy-ru - installation

- github.com/buriy/spacy-ru/blob/master/examples/POS_and_syntax.ipynb - usage with examples

(1) Malt parser based solution (drawback - no examples)

- github.com/oxaoo/mp4ru

(2) Google's syntaxnet

- github.com/tensorflow/models/tree/master/research/syntaxnet
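
For the morphology-only case, a minimal pymorphy2 example (rnnmorph and the spacy plugin have their own APIs, see the links above):

import pymorphy2

morph = pymorphy2.MorphAnalyzer()

# the most probable parse of an ambiguous word form
p = morph.parse('стали')[0]
print(p.normal_form)   # lemma, e.g. 'стать' or 'сталь' depending on the scored parse
print(p.tag)           # POS + morphological features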

#nlp

Exploring syntactic parsers for the Russian language

Hi! My name is Denis Kiryanov, I work at Sberbank on natural language processing (NLP) problems. One day we needed to choose a syntactic...


snakers4 (Alexander), February 13, 09:02

PyTorch NLP best practices

Very simple ideas, actually.

(1) Multi GPU parallelization and FP16 training

Do not bother reinventing the wheel.

Just use nvidia's apex, DistributedDataParallel, DataParallel.

Best examples [here](github.com/huggingface/pytorch-pretrained-BERT).
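
A hedged wiring sketch for point (1). The DataParallel part is core PyTorch; the commented-out lines assume the amp API of a recent apex build, so treat them as an assumption rather than the exact recipe from the examples above:

import torch
import torch.nn as nn

model = nn.Linear(512, 2)                    # stand-in for your real nn.Module
if torch.cuda.is_available():
    model = model.cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# single-node multi-GPU data parallelism - one line, no wheel reinvented
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

# optional FP16 via nvidia's apex (assumes apex is installed and exposes amp)
# from apex import amp
# model, optimizer = amp.initialize(model, optimizer, opt_level='O1')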

(2) Put as much as possible INSIDE of the model

Implement as much of your logic as possible inside of nn.Module.

Why?

So that you can seamlessly use all the abstractions from (1) with ease.

Also models are more abstract and reusable in general.

(3) Why have a separate train/val loop?

PyTorch 0.4 introduced context handlers.

You can simplify your train / val / test loops, and merge them into one simple function.

context = torch.no_grad() if loop_type == 'Val' else torch.enable_grad()

if loop_type == 'Train':
    model.train()
elif loop_type == 'Val':
    model.eval()

with context:
    for i, (some_tensor) in enumerate(tqdm(train_loader)):
        # do your stuff here
        pass

(4) EmbeddingBag

Use EmbeddingBag layer for morphologically rich languages. Seriously!

(5) Writing trainers / training abstractions

This is a waste of time imho if you follow (1), (2) and (3).

(6) Nice bonus

If you follow most of these, you can train on as many GPUs and machines as you want for any language)

(7) Using tensorboard for logging

This goes without saying.

#nlp

#deep_learning

huggingface/pytorch-transformers

👾 A library of state-of-the-art pretrained models for Natural Language Processing (NLP) - huggingface/pytorch-transformers


PyTorch DataLoader, GIL thrashing and CNNs

Well all of this seems a bit like magic to me, but hear me out.

I abused my GPU box for weeks running CNNs on 2-4 GPUs.

Nothing broke.

And then my GPU box started shutting down for no apparent reason.

No, this was not:

- CPU overheating (I have a massive cooler, I checked - it works);

- PSU;

- Overclocking;

- It also added to the confusion that AMD CPUs have weird temperature readings;

To cut a long story short - if you have a very fast Dataset class and use PyTorch's DataLoader with num_workers > 0, it can lead to system instability instead of a speed-up.

It is obvious in retrospect, but it is not when you face this issue.
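
A hedged illustration of the takeaway - the numbers are made up, the point is only that num_workers is a knob worth turning down when the Dataset itself is nearly free:

import torch
from torch.utils.data import DataLoader, Dataset

class FastDataset(Dataset):
    # a Dataset whose __getitem__ is almost free (e.g. everything pre-cached in RAM)
    def __init__(self, n=100000):
        self.data = torch.randn(n, 300)
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data[idx]

# counter-intuitively, workers > 0 here mostly adds IPC overhead
# (and in my case contributed to system instability)
loader = DataLoader(FastDataset(), batch_size=512, num_workers=0)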

#deep_learning

#pytorch

snakers4 (Alexander), February 12, 05:13

Russian thesaurus that really works

nlpub.ru/Russian_Distributional_Thesaurus#.D0.93.D1.80.D0.B0.D1.84_.D0.BF.D0.BE.D0.B4.D0.BE.D0.B1.D0.B8.D1.8F_.D1.81.D0.BB.D0.BE.D0.B2

It knows so many peculiar / old-fashioned and cheeky synonyms for obscene words!

#nlp

Russian Distributional Thesaurus

Russian Distributional Thesaurus (RDT for short) is a project to build an open distributional thesaurus of the Russian language. At the moment the resource contains several components: word vectors (word embeddings), a word similarity graph (the distributional thesaurus itself), a set of hypernyms and an inventory of word senses. All resources were built automatically from a corpus of Russian-language books (12.9 billion tokens). Future versions are planned to include word sense vectors for Russian obtained from the same text corpus. The project is developed by representatives of UrFU, Lomonosov Moscow State University and the University of Hamburg. In the past, researchers from South Ural State University, TU Darmstadt, the University of Wolverhampton and the University of Trento contributed to the project.


snakers4 (Alexander), February 11, 06:22

Old news ... but Attention works

Funnily enough, in the past my models:

- Either did not need attention;

- Or attention was implemented by @thinline72 ;

- Or the domain (NMT) was so complicated that I had to resort to boilerplate with key-value attention;

It was the first time I / we tried manually building a model with plain self-attention from scratch.

And you know what - it really adds 5-10% to all of the tracked metrics.

Best plain attention layer in PyTorch - simple, well documented ... and it works in real life applications:

gist.github.com/cbaziotis/94e53bdd6e4852756e0395560ff38aa4
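
Not the gist verbatim, but a minimal sketch in the same spirit - additive scoring of each timestep, softmax over time, weighted sum:

import torch
import torch.nn as nn
import torch.nn.functional as F

class PlainSelfAttention(nn.Module):
    """Scores every timestep, softmaxes over time, returns a weighted sum."""
    def __init__(self, hidden_size):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, 1)

    def forward(self, x, mask=None):
        # x: (batch, seq_len, hidden), mask: (batch, seq_len), 1 for real tokens
        scores = self.scorer(torch.tanh(x)).squeeze(-1)      # (batch, seq_len)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        weights = F.softmax(scores, dim=-1)                   # attention over time
        return (x * weights.unsqueeze(-1)).sum(dim=1), weights

attn = PlainSelfAttention(128)
out, w = attn(torch.randn(4, 20, 128))   # out: (4, 128), w: (4, 20)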

#nlp

#deep_learning

SelfAttention implementation in PyTorch

SelfAttention implementation in PyTorch. GitHub Gist: instantly share code, notes, and snippets.


snakers4 (Alexander), January 25, 12:31

Downsides of using Common Crawl

Took a look at the Common Crawl data I myself pre-processed last year and could not find abstracts - only sentences.

Took a look at these archives as well - data.statmt.org/ngrams/deduped/ - also only sentences, though they sometimes seem to be in logical order.

You can use any form of CC - but only to learn word representations. Not sentences.

Sad.

#nlp

snakers4 (Alexander), January 23, 11:26

NLP - Highlight of the week - LASER

- Hm, a new sentence embedding tool?

- Plain PyTorch 1.0 / numpy / FAISS based;

- [Release](code.fb.com/ai-research/laser-multilingual-sentence-embeddings/), [library](github.com/facebookresearch/LASER);

- Looks like an off-shoot of their "unsupervised" NMT project;

LASER’s vector representations of sentences are generic with respect to both the input language and the NLP task. The tool maps a sentence in any language to a point in a high-dimensional space with the goal that the same statement in any language will end up in the same neighborhood. This representation could be seen as a universal language in a semantic vector space. We have observed that the distance in that space correlates very well to the semantic closeness of the sentences.

- Alleged pros:

- It delivers extremely fast performance, processing up to 2,000 sentences per second on GPU;

- The sentence encoder is implemented in PyTorch with minimal external dependencies;

- Languages with limited resources can benefit from joint training over many languages;

- The model supports the use of multiple languages in one sentence;

- Performance improves as new languages are added, as the system learns to recognize characteristics of language families;

They essentially trained an NMT model with a shared encoder for many languages.

I tried training something similar - but it quickly over-fitted into just memorizing the indexes of words.
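
LASER ships its own encoder scripts, so the following is only a sketch of the retrieval part - nearest-neighbour search over already-computed sentence embeddings with FAISS; random vectors stand in for real LASER outputs:

import numpy as np
import faiss

dim = 1024                                             # LASER sentence embeddings are 1024-d
embs = np.random.rand(10000, dim).astype('float32')    # stand-in for an encoded corpus
queries = np.random.rand(5, dim).astype('float32')

# cosine similarity == inner product on L2-normalized vectors
faiss.normalize_L2(embs)
faiss.normalize_L2(queries)

index = faiss.IndexFlatIP(dim)
index.add(embs)
scores, ids = index.search(queries, k=5)   # top-5 neighbours per query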

#nlp

#deep_learning


LASER natural language processing toolkit - Facebook Code

Our natural language processing toolkit, LASER, performs zero-shot cross-lingual transfer with more than 90 languages and is now open source.


snakers4 (Alexander), January 23, 08:12

Pre-trained BERT in PyTorch

github.com/huggingface/pytorch-pretrained-BERT

(1)

Model code here is just awesome.

Integrated DataParallel / DDP wrappers / FP16 wrappers also are awesome.

FP16 precision training from APEX just works (no idea about convergence though yet).

(2)

As for model weights - I cannot really tell, there is no dedicated Russian model.

The only problem I am facing now - with large embedding bags the batch size is literally 1-4 even for smaller models.

And training models with sentence piece is kind of feasible for rich languages, but you will always worry about generalization.

(3)

Did not try the generative pre-training (and sentence prediction pre-training), I hope that properly initializing embeddings will also work for a closed domain with a smaller model (they pre-train 4 days on 4+ TPUs, lol).

(5)

Why even tackle such models?

Chat / dialogue / machine comprehension models are complex / require one-off feature engineering.

Being able to tune something like BERT on publicly available benchmarks and then on your domain can provide a good way to embed complex situations (like questions in dialogues).
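
For reference, roughly how feature extraction looked with this library at the time (a hedged sketch against the pytorch-pretrained-BERT 0.x API; the multilingual checkpoint is the closest thing to a Russian model):

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertModel.from_pretrained('bert-base-multilingual-cased')
model.eval()

tokens = ['[CLS]'] + tokenizer.tokenize('Привет, мир!') + ['[SEP]']
ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    # encoded_layers: a list with one tensor per transformer layer
    encoded_layers, pooled = model(ids)

sentence_embedding = encoded_layers[-1].mean(dim=1)   # crude sentence vector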

#nlp

#deep_learning

huggingface/pytorch-pretrained-BERT

📖The Big-&-Extending-Repository-of-Transformers: Pretrained PyTorch models for Google's BERT, OpenAI GPT & GPT-2, Google/CMU Transformer-XL. - huggingface/pytorch-pretrained-BERT


snakers4 (Alexander), December 27, 2018

snakers4 (Alexander), December 20, 2018

Spell-checking on various scales in Russian

Bayes + n-gram rules = spell-checker for words / sentences

habr.com/company/joom/blog/433554/
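
The article describes a full production pipeline; below is only the textbook noisy-channel toy version of the same idea (word prior from corpus counts x a crude one-edit error model), just to make the "Bayes" part concrete. The corpus path is a placeholder:

import re
from collections import Counter

# word prior P(w) from any large corpus (path is a placeholder)
WORDS = Counter(re.findall(r'\w+', open('corpus.txt', encoding='utf-8').read().lower()))

ALPHABET = 'абвгдеёжзийклмнопрстуфхцчшщъыьэюя'

def edits1(word):
    # all strings one edit away: deletes, transposes, replaces, inserts
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in ALPHABET]
    inserts = [L + c + R for L, R in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def correction(word):
    # pick the known candidate with the highest corpus count
    candidates = ({word} & WORDS.keys()) or (edits1(word) & WORDS.keys()) or {word}
    return max(candidates, key=WORDS.get)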

#nlp

Fixing typos in search queries

Probably any service that has search at all sooner or later needs to learn how to correct errors in user queries. Errare...


www.facebook.com/nipsfoundation/videos/203530960558001

Neural Information Processing Systems

Welcome to NeurIPS 2018 Tutorial Sessions. This tutorial on Visualization for Machine Learning will provide an introduction to the landscape of ML visualizations, organized by types of users and their...


snakers4 (Alexander), December 17, 2018

PyText

- PyText github.com/facebookresearch/pytext from Facebook:

- TLDR - FastText meets PyTorch;

- Very similar to AllenNLP in nature;

- Will be useful if you can afford to write modules for their framework to solve 100 identical tasks (i.e. like Facebook with 200 languages);

- In itself - seems to be too high maintenance to use;

I will not use it.

#nlp

#deep_learning

facebookresearch/pytext

A natural language modeling framework based on PyTorch - facebookresearch/pytext


snakers4 (Alexander), November 23, 2018

TDS article follow-up

TDS also accepted a reprint of the article

towardsdatascience.com/winning-a-cft-2018-spelling-correction-competition-b771d0c1b9f6

#nlp

Winning a CFT 2018 spelling correction competition

Or building a task-agnostic seq2seq pipeline on a challenging domain


snakers4 (Alexander), November 22, 2018

Our victory in CFT-2018 competition

TLDR

- Multi-task learning + seq2seq models rule;

- The domain seems to be easy, but it is not;

- You can also build a pipeline based on manual features, but it will not be task agnostic;

- Loss weighting is crucial for such tasks;

- Transformer trains 10x longer;

spark-in.me/post/cft-spelling-2018

#nlp

#deep_learning

#data_science

Winning a CFT 2018 spelling correction competition

Building a task-agnostic seq2seq pipeline on a challenging domain. Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), November 10, 2018

Playing with Transformer

TLDR - use only pre-trained.

On classification tasks it performed the same as classic models.

On seq2seq - much worse time / memory wise. Inference is faster though.

#nlp

snakers4 (Alexander), November 09, 2018

Fast-text trained on a random mix of Russian Wikipedia / Taiga / Common Crawl

On our benchmarks it was marginally better than fast-text trained on Araneum from RusVectores.

Download link

goo.gl/g6HmLU

Params

Standard params - (3,6) n-grams + vector dimensionality is 300.

Usage:

import fastText as ft
ft_model_big = ft.load_model('model')
And then just refer to

github.com/facebookresearch/fastText/blob/master/python/fastText/FastText.py
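
Once loaded, a couple of calls you will actually use - the words here are arbitrary examples, 'model' is the file from the link above; note that a misspelled word still gets a vector from its n-grams:

import numpy as np
import fastText as ft

model = ft.load_model('model')

v1 = model.get_word_vector('ремонт')   # 300-d vector
v2 = model.get_word_vector('римонт')   # misspelling - still gets a vector via n-grams

cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cosine)

subwords, ids = model.get_subwords('ремонт')   # the (3,6) char n-grams behind the vector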

#nlp

snakers4 (Alexander), November 03, 2018

Building client routing / semantic search and clustering arbitrary external corpuses at Profi.ru

A brief executive summary about what we achieved at Profi.ru.

If you have similar experience or have anything similar to share - please do not hesitate to contact me.

Also we are planning to extend this article into a small series, if it gains momentum. So please like / share the article if you like it.

spark-in.me/post/profi-ru-semantic-search-project

#nlp

#data_science

#deep_learning

Building client routing / semantic search and clustering arbitrary external corpuses at Profi.ru

Building client routing / semantic search and clustering arbitrary external corpuses at Profi.ru. Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), October 22, 2018

Amazing articles about image hashing

Also a python library

- Library github.com/JohannesBuchner/imagehash

- Articles:

fullstackml.com/wavelet-image-hash-in-python-3504fdd282b5

www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html

www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html
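
Basic usage of the library is tiny (file names are placeholders):

from PIL import Image
import imagehash

h1 = imagehash.phash(Image.open('img_1.jpg'))   # perceptual hash
h2 = imagehash.phash(Image.open('img_2.jpg'))

# Hamming distance between hashes - a small distance means near-duplicate images
print(h1 - h2)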

#data_science

#computer_vision

JohannesBuchner/imagehash

A Python Perceptual Image Hashing Module. Contribute to JohannesBuchner/imagehash development by creating an account on GitHub.


Text iterators in PyTorch

Looks like PyTorch has some handy data-processing / loading tools for text models - torchtext.readthedocs.io.

It is explained here - bastings.github.io/annotated_encoder_decoder/ - how to use them with pack_padded_sequence and pad_packed_sequence to boost PyTorch NLP models substantially.
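
A minimal sketch of the pack/pad dance with an LSTM (in PyTorch versions of that era sequences had to be sorted by length before packing):

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

batch, max_len, emb_dim, hidden = 4, 10, 50, 128
x = torch.randn(batch, max_len, emb_dim)    # padded batch
lengths = torch.tensor([10, 8, 5, 3])       # real lengths, sorted descending

lstm = nn.LSTM(emb_dim, hidden, batch_first=True)

packed = pack_padded_sequence(x, lengths, batch_first=True)
packed_out, (h_n, c_n) = lstm(packed)
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
# out: (batch, max_len, hidden); padded positions are never fed through the LSTM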

#nlp

#deep_learning

snakers4 (Alexander), October 10, 2018

Another set of links for common crawl for NLP

Looks like we were not the first, ofc.

Below are some projects dedicated to NLP corpus retrieval at scale:

- Java + license detection + boilerplate removal: zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/DKPro%20C4Corpus/org.dkpro.c4corpus$dkpro-c4corpus-doc/doclinks/1/

- Prepared deduplicated CC text archives data.statmt.org/ngrams/deduped/

- Google group

groups.google.com/forum/#!topic/common-crawl/6F-yXsC35xM

Wow!

#nlp

Downloading 200GB files in literally hours

(1) Order a 500 Mbit/s Internet connection from your ISP

(2) Use aria2 - aria2.github.io/ - with -x (see the example below)

(3) Profit
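
E.g. something like this - the URL is just a placeholder, -x / -s set the number of connections / splits per download:

aria2c -x 16 -s 16 "https://example.org/some_200GB_archive.warc.gz"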

#data_science

snakers4 (Alexander), October 08, 2018

A small continuation of the crawling saga

2 takes on the Common Crawl

spark-in.me/post/parsing-common-crawl-in-four-simple-commands

spark-in.me/post/parsing-common-crawl-in-two-simple-commands

It turned out to be a bit tougher than expected

But doable

#nlp

Parsing Common Crawl in 4 plain scripts in python

Parsing Common Crawl in 4 plain scripts in python. Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), October 05, 2018

Russian post on Habr

habr.com/post/425507/

Please support if you have an account.

#nlp

Parsing Wikipedia for NLP tasks in 4 commands

The gist: it turns out that all you need is to run just this set of commands: git clone https://github.com/attardi/wikiextractor.git cd wikiextractor wget http...


snakers4 (Alexander), October 04, 2018

Amazingly simple code to mimic fast-text's n-gram subword routine

Nuff said. Try it for yourself.

from fastText import load_model

model = load_model('official_fasttext_wiki_200_model')

def find_ngrams(string, n):
    # all character n-grams of length n
    ngrams = zip(*[string[i:] for i in range(n)])
    ngrams = [''.join(_) for _ in ngrams]
    return ngrams

string = 'грёзоблаженствующий'

# fastText pads each word with '<' and '>' and uses 3-6 char n-grams by default
ngrams = []
for i in range(3, 7):
    ngrams.extend(find_ngrams('<' + string + '>', i))

ft_ngrams, ft_indexes = model.get_subwords(string)

ngrams = set(ngrams)
ft_ngrams = set(ft_ngrams)

print(sorted(ngrams), sorted(ft_ngrams))
print(ft_ngrams.difference(ngrams), ngrams.difference(ft_ngrams))
#nlp

snakers4 (Alexander), October 03, 2018

Parsing Wikipedia in 4 plain commands in Python

Wrote a small take on using Wikipedia as a corpus for NLP.

spark-in.me/post/parsing-wikipedia-in-four-commands-for-nlp

medium.com/@aveysov/parsing-wikipedia-in-4-simple-commands-for-plain-nlp-corpus-retrieval-eee66b3ba3ee

Please like / share / repost the article =)

#nlp

#data_science

Parsing Wikipedia in 4 simple commands for plain NLP corpus retrieval

Parsing Wikipedia in 4 simple commands for plain NLP corpus retrieval. Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), September 29, 2018

If you are mining for a large web-corpus

... for any language other than English. And you do not want to scrape anything, buy proxies or learn this somewhat shady "stack".

In case of Russian the link to Araneum posted above has already-processed data - which may be useless for certain domains.

What to do?

(1)

In case of Russian you can write here

tatianashavrina.github.io/taiga_site/

The author will share her 90+GB RAW corpus with you

(2)

In case of any other language there is a second way

- Go to common crawl website;

- Download the index (200 GB);

- Choose domains in your country / language (now they also have language detection);

- Download only plain-text files you need;

Links to start with

- commoncrawl.org/connect/blog/

- commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/

- www.slideshare.net/RobertMeusel/mining-a-large-web-corpus

#nlp

Taiga is a corpus where text sources and their meta-information are collected according to popular ML tasks.

An open-source corpus for machine learning.


snakers4 (Alexander), September 26, 2018

Araneum russicum maximum

TLDR - the largest corpus of the Russian Internet. Fast-text embeddings pre-trained on this corpus work best for broad internet-related domains.

A pre-processed version can be downloaded from RusVectores.

Afaik, this link is not yet on their website (?)

wget http://rusvectores.org/static/rus_araneum_maxicum.txt.gz

#nlp

snakers4 (Alexander), September 19, 2018

Using sklearn pairwise cosine similarity

scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html#sklearn.metrics.pairwise.cosine_similarity

On a 7k x 7k example with 300-dimensional vectors it turned out to be MUCH faster than doing the same:

- In 10 processes;

- Using numba;

The more you know.
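
For reference, the whole thing is a couple of lines (sizes mimic the 7k x 300 case, the data is random):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.random.rand(7000, 300)
b = np.random.rand(7000, 300)

# one vectorized call instead of multiprocessing / numba loops
sims = cosine_similarity(a, b)   # (7000, 7000) matrix of pairwise cosine similarities
print(sims.shape)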

If you have used it - please PM me.

#nlp
