Spark in me - Internet, data science, math, deep learning, philo

snakers4 @ telegram, 1444 members, 1691 posts since 2016

All this - lost like tears in rain.

Data science, deep learning, sometimes a bit of philosophy and math. No bs.


Posts by tag «nlp»:

snakers4 (Alexander), February 13, 09:02

PyTorch NLP best practices

Very simple ideas, actually.

(1) Multi GPU parallelization and FP16 training

Do not bother reinventing the wheel.

Just use NVIDIA's apex, DistributedDataParallel, DataParallel.

Best examples here.

(2) Put as much as possible INSIDE of the model

Implement as much of your logic as possible inside of nn.Module.


That way you can seamlessly use all the abstractions from (1).

Also models are more abstract and reusable in general.

(3) Why have a separate train/val loop?

PyTorch 0.4 introduced context managers (torch.no_grad / torch.enable_grad).

You can simplify your train / val / test loops, and merge them into one simple function.

context = torch.no_grad() if loop_type == 'Val' else torch.enable_grad()

if loop_type == 'Train':
    model.train()  # assumption: the lines lost here switched the model mode
elif loop_type == 'Val':
    model.eval()

with context:
    for i, (some_tensor) in enumerate(tqdm(train_loader)):
        # do your stuff here
        pass
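
A minimal sketch of what such a merged loop might look like end to end. The names `run_epoch`, `loss_fn` and the toy linear model are made up for illustration and are not from the original post; only `torch.no_grad()` / `torch.enable_grad()` and `model.train()` are the actual PyTorch pieces the idea relies on.

```python
import torch
import torch.nn as nn


def run_epoch(model, loader, loss_fn, optimizer=None, loop_type='Train'):
    """One pass over `loader`; weights are updated only when loop_type == 'Train'."""
    is_train = loop_type == 'Train'
    # Track gradients only while training
    context = torch.enable_grad() if is_train else torch.no_grad()
    # Switch dropout / batch-norm behaviour (train(False) is the same as eval())
    model.train(is_train)

    total_loss, n_samples = 0.0, 0
    with context:
        for inputs, targets in loader:
            loss = loss_fn(model(inputs), targets)
            if is_train:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total_loss += loss.item() * inputs.size(0)
            n_samples += inputs.size(0)
    return total_loss / n_samples


# Toy data: three batches of 8 samples with 4 features each (hypothetical)
torch.manual_seed(0)
model = nn.Linear(4, 1)
data = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(3)]
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

train_loss = run_epoch(model, data, loss_fn, optimizer, loop_type='Train')
val_loss = run_epoch(model, data, loss_fn, loop_type='Val')
```

The same function serves train and val, so there is only one place to keep metrics and logging in sync.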


(4) EmbeddingBag

Use EmbeddingBag layer for morphologically rich languages. Seriously!
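
A small sketch of the idea, assuming each word has already been split into sub-word (e.g. character n-gram) ids. The ids, sizes and variable names below are hypothetical; `nn.EmbeddingBag` with a flat id tensor plus `offsets` is the real API.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sub-word ids for three words of different length
word_ngram_ids = [[1, 2, 3, 4], [1, 2, 5], [6, 7]]

# EmbeddingBag takes one flat tensor of ids plus the start offset of each bag
flat_ids = torch.tensor([i for ids in word_ngram_ids for i in ids])
offsets, position = [], 0
for ids in word_ngram_ids:
    offsets.append(position)
    position += len(ids)
offsets = torch.tensor(offsets)

# One vector per word: the mean of its sub-word embeddings
bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=16, mode='mean')
vectors = bag(flat_ids, offsets)
print(vectors.shape)
```

Words that share most of their n-grams (different inflections of the same stem) end up with similar vectors, which is the point for morphologically rich languages.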

(5) Writing trainers / training abstractions

This is a waste of time imho if you follow (1), (2) and (3).

(6) Nice bonus

If you follow most of these, you can train on as many GPUs and machines as you want, for any language.

(7) Using tensorboard for logging

This goes without saying.




A PyTorch implementation of Google AI's BERT model provided with Google's pre-trained models, examples and utilities. - huggingface/pytorch-pretrained-BERT

PyTorch DataLoader, GIL thrashing and CNNs

Well all of this seems a bit like magic to me, but hear me out.

I abused my GPU box for weeks running CNNs on 2-4 GPUs.

Nothing broke.

And then my GPU box started shutting down for no apparent reason.

No, this was not:

- CPU overheating (I have a massive cooler, I checked - it works);

- PSU;

- Overclocking;

It also adds to the confusion that AMD has weird temperature readings.

To cut the story short - if you have a very fast Dataset class and you use PyTorch's DataLoader with num_workers > 0, it can lead to system instability instead of a speed-up.

It is obvious in retrospect, but it is not when you face this issue.



snakers4 (Alexander), February 12, 05:13

Russian thesaurus that really works

It knows so many peculiar / old-fashioned and cheeky synonyms for obscene words!


Russian Distributional Thesaurus

Russian Distributional Thesaurus (RDT for short) is a project to build an open distributional thesaurus of the Russian language. At the moment the resource contains several components: word vectors (word embeddings), a word similarity graph (the distributional thesaurus), a set of hypernyms and an inventory of word senses. All resources were built automatically from a corpus of Russian-language books (12.9 billion tokens). Future versions are planned to add word sense vectors for Russian, obtained from the same corpus. The project is developed by people from Ural Federal University, Lomonosov Moscow State University and the University of Hamburg. In the past, researchers from South Ural State University, the Technical University of Darmstadt, the University of Wolverhampton and the University of Trento also contributed to the project.

snakers4 (Alexander), February 11, 06:22

Old news ... but Attention works

Funnily enough, in the past my models:

- Either did not need attention;

- Or attention was implemented by @thinline72;

- Or the domain was so complicated (NMT) that I had to resort to boilerplate with key-value attention;

It was the first time I / we tried manually building a model with plain self attention from scratch.

And you know - it really adds 5-10% to all of the tracked metrics.

Best plain attention layer in PyTorch - simple, well documented ... and it works in real life applications:



SelfAttention implementation in PyTorch

SelfAttention implementation in PyTorch. GitHub Gist: instantly share code, notes, and snippets.

snakers4 (Alexander), January 25, 12:31

Downsides of using Common Crawl

Took a look at the Common Crawl data I myself pre-processed last year and could not find abstracts - only sentences.

Took a look at these archives - also only sentences, though they sometimes seem to be in logical order.

You can use any form of CC - but only to learn word representations. Not sentences.



snakers4 (Alexander), January 23, 11:26

NLP - Highlight of the week - LASER

- Hm, a new sentence embedding tool?

- Plain PyTorch 1.0 / numpy / FAISS based;

- Release, library;

- Looks like an off-shoot of their "unsupervised" NMT project;

LASER’s vector representations of sentences are generic with respect to both the input language and the NLP task. The tool maps a sentence in any language to point in a high-dimensional space with the goal that the same statement in any language will end up in the same neighborhood. This representation could be seen as a universal language in a semantic vector space. We have observed that the distance in that space correlates very well to the semantic closeness of the sentences.

- Alleged pros:

It delivers extremely fast performance, processing up to 2,000 sentences per second on GPU.

The sentence encoder is implemented in PyTorch with minimal external dependencies.

Languages with limited resources can benefit from joint training over many languages.

The model supports the use of multiple languages in one sentence.

Performance improves as new languages are added, as the system learns to recognize characteristics of language families.

They essentially trained an NMT model with a shared encoder for many languages.

I tried training something similar - but it quickly over-fitted into just memorizing the indexes of words.




LASER natural language processing toolkit - Facebook Code

Our natural language processing toolkit, LASER, performs zero-shot cross-lingual transfer with more than 90 languages and is now open source.

snakers4 (Alexander), January 23, 08:12

Pre-trained BERT in PyTorch


Model code here is just awesome.

Integrated DataParallel / DDP wrappers / FP16 wrappers also are awesome.

FP16 precision training from APEX just works (no idea about convergence yet, though).


As for model weights - I cannot really tell, there is no dedicated Russian model.

The only problem I am facing now - with large embedding bags the batch size is literally 1-4 even for smaller models.

And training models with SentencePiece is kind of feasible for morphologically rich languages, but you will always worry about generalization.


I did not try the generative pre-training (and sentence prediction pre-training). I hope that properly initializing embeddings will also work for a closed domain with a smaller model (they pre-train 4 days on 4+ TPUs, lol).


Why even tackle such models?

Chat / dialogue / machine comprehension models are complex / require one-off feature engineering.

Being able to tune something like BERT on publicly available benchmarks and then on your domain can provide a good way to embed complex situations (like questions in dialogues).




A PyTorch implementation of Google AI's BERT model provided with Google's pre-trained models, examples and utilities. - huggingface/pytorch-pretrained-BERT

snakers4 (Alexander), December 27, 04:54

snakers4 (Alexander), December 20, 12:12

Spell-checking on various scales in Russian

Bayes + n-gram rules = spell-checker for words / sentences
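
One way to read "Bayes + n-gram rules" is the classic noisy-channel recipe: generate candidate corrections, then pick the one with the highest probability under a language model. Below is a toy sketch in the spirit of Peter Norvig's well-known spell-corrector, not the method from the linked article. The corpus, the `ALPHABET` constant and the function names are invented here, and it uses plain word counts (unigrams) as the prior instead of longer n-grams.

```python
from collections import Counter

ALPHABET = 'абвгдеёжзийклмнопрстуфхцчшщъыьэюя'

# Toy corpus; a real prior would be estimated from far more text
CORPUS = 'привет мир привет кот кот кот молоко'.split()
WORD_COUNTS = Counter(CORPUS)


def edits1(word):
    """All strings that are one edit (delete, swap, replace, insert) away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    swaps = [left + right[1] + right[0] + right[2:] for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in ALPHABET]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return set(deletes + swaps + replaces + inserts)


def correct(word):
    """Return the most frequent known word within one edit, or the word itself."""
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    # Uniform error model: all single edits are equally likely, so the prior P(c) decides
    return max(candidates, key=WORD_COUNTS.get) if candidates else word


print(correct('превет'), correct('кто'), correct('молко'))
```

For sentences, the same scheme would score candidates with word n-gram probabilities instead of single-word counts.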


Fixing typos in search queries

Probably any service that has search at all sooner or later needs to learn how to correct errors in user queries. Errare...

Neural Information Processing Systems

Welcome to NeurIPS 2018 Tutorial Sessions. This tutorial on Visualization for Machine Learning will provide an introduction to the landscape of ML visualizations, organized by types of users and their...

snakers4 (Alexander), December 17, 09:24


- PyText from Facebook:

- TLDR - FastText meets PyTorch;

- Very similar to AllenNLP in nature;

- Will be useful if you can afford to write modules for their framework to solve 100 identical tasks (i.e. like Facebook with 200 languages);

- In itself - seems to be too high maintenance to use;

I will not use it.




A natural language modeling framework based on PyTorch - facebookresearch/pytext

snakers4 (Alexander), November 23, 08:21

TDS article follow-up

TDS also accepted a reprint of the article


Winning a CFT 2018 spelling correction competition

Or building a task-agnostic seq2seq pipeline on a challenging domain

snakers4 (Alexander), November 22, 11:56

Our victory in CFT-2018 competition


- Multi-task learning + seq2seq models rule;

- The domain seems to be easy, but it is not;

- You can also build a pipeline based on manual features, but it will not be task agnostic;

- Loss weighting is crucial for such tasks;

- Transformer trains 10x longer;




Winning a CFT 2018 spelling correction competition

Building a task-agnostic seq2seq pipeline on a challenging domain

snakers4 (Alexander), November 10, 10:30

Playing with Transformer

TLDR - use only pre-trained.

On classification tasks performed the same as classic models.

On seq2seq - much worse time / memory wise. Inference is faster though.


snakers4 (Alexander), November 09, 13:48

Fast-text trained on a random mix of Russian Wikipedia / Taiga / Common Crawl

On our benchmarks was marginally better than fast-text trained on Araneum from Rusvectors.

Download link


Standard params - (3,6) n-grams + vector dimensionality is 300.


import fastText as ft

ft_model_big = ft.load_model('model')

And then just refer to


snakers4 (Alexander), November 03, 09:40

Building client routing / semantic search and clustering arbitrary external corpuses at

A brief executive summary about what we achieved at

If you have similar experience or have anything similar to share - please do not hesitate to contact me.

Also we are planning to extend this article into a small series, if it gains momentum. So please like / share the article if you like it.




Building client routing / semantic search and clustering arbitrary external corpuses at


snakers4 (Alexander), October 22, 05:43

Amazing articles about image hashing

Also a python library

- Library

- Articles:




A Python Perceptual Image Hashing Module. Contribute to JohannesBuchner/imagehash development by creating an account on GitHub.

Text iterators in PyTorch

Looks like PyTorch has some handy data-processing / loading tools for text models.

It is explained here how to use them with pack_padded_sequence and pad_packed_sequence to boost PyTorch NLP models substantially.
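
A minimal sketch of the packing round trip, assuming a batch that is already padded with 0 and sorted by length in descending order. The toy vocabulary, sizes and variable names are hypothetical; `pack_padded_sequence` and `pad_packed_sequence` are the real functions from `torch.nn.utils.rnn`.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Three sequences of token ids, padded with 0 to the longest length
batch = torch.tensor([[5, 2, 7, 1],
                      [3, 9, 0, 0],
                      [4, 0, 0, 0]])
lengths = torch.tensor([4, 2, 1])  # true lengths, longest first

embedding = nn.Embedding(10, 8, padding_idx=0)
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

# Pack, so that the LSTM never runs over the padding positions
packed = pack_padded_sequence(embedding(batch), lengths, batch_first=True)
packed_out, (h_n, c_n) = lstm(packed)

# Unpack back into a regular padded tensor
output, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
print(output.shape, out_lengths)
```

With packing, `h_n` holds the state at the last real token of each sequence rather than at the last padded position, which is usually what you want for classification.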



snakers4 (Alexander), October 10, 18:31

Another set of links for common crawl for NLP

Looks like we were not the first, ofc.

Below are some projects dedicated to NLP corpus retrieval on scale:

- Java + license detection + boilerplate removal:$dkpro-c4corpus-doc/doclinks/1/

- Prepared deduplicated CC text archives

- Google group!topic/common-crawl/6F-yXsC35xM



Downloading 200GB files in literally hours

(1) Order 500 Mbit/s Internet connection from your ISP

(2) Use aria2 - with -x

(3) Profit


snakers4 (Alexander), October 08, 10:11

A small continuation of the crawling saga

2 takes on the Common Crawl

It turned out to be a bit tougher than expected

But doable


Parsing Common Crawl in 4 plain scripts in python


snakers4 (Alexander), October 05, 16:47

Russian post on Habr

Please support if you have an account.


Parsing Wikipedia for NLP tasks in 4 commands

The gist: it turns out that it is enough to run just this set of commands: git clone cd wikiextractor wget http...

snakers4 (Alexander), October 04, 06:03

Amazingly simple code to mimic fastText's n-gram subword routine

Nuff said. Try it for yourself.

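from fastText import load_model

model = load_model('official_fasttext_wiki_200_model')

def find_ngrams(string, n):
    # all character n-grams of length n
    ngrams = zip(*[string[i:] for i in range(n)])
    ngrams = [''.join(_) for _ in ngrams]
    return ngrams

string = 'грёзоблаженствующий'

ngrams = []
for i in range(3, 7):
    # assumption: the loop body was lost; fastText wraps the word in '<' and '>' before taking n-grams
    ngrams.extend(find_ngrams('<' + string + '>', i))

ft_ngrams, ft_indexes = model.get_subwords(string)

ngrams = set(ngrams)
ft_ngrams = set(ft_ngrams)

# compare the two sets
print(ft_ngrams - ngrams, ngrams - ft_ngrams)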
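
If you do not have a fastText model at hand, the n-gram part can be checked in plain Python. This is a sketch under the assumption of the default-style (3,6) n-gram range used above; the function name `char_ngrams` is made up, and the `<` / `>` wrapping follows fastText's word boundary markers.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of `word` wrapped in '<' and '>' boundary markers."""
    wrapped = '<' + word + '>'
    return [wrapped[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(wrapped) - n + 1)]


ngrams = char_ngrams('кот')
print(ngrams)
# ['<ко', 'кот', 'от>', '<кот', 'кот>', '<кот>']
```

Note that the n-gram 'кот' (inside the word) and the sequence '<кот>' (the whole word) are different entries, which is why the boundary markers matter.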



snakers4 (Alexander), October 03, 18:16

Parsing Wikipedia in 4 plain commands in Python

Wrote a small take on using Wikipedia as corpus for NLP.

Please like / share / repost the article =)



Parsing Wikipedia in 4 simple commands for plain NLP corpus retrieval


snakers4 (Alexander), September 29, 10:48

If you are mining for a large web-corpus

... for any language other than English. And you do not want to scrape anything, buy proxies or learn this somewhat shady "stack".

In case of Russian the above posted link to araneum has already processed data - which may be useless for certain domains.

What to do?


In case of Russian you can write here

The author will share her 90+GB RAW corpus with you


In case of any other language there is a second way

- Go to common crawl website;

- Download the index (200 GB);

- Choose domains in your country / language (now they also have language detection);

- Download only plain-text files you need;

Links to start with





Taiga is a corpus, where text sources and their meta-information are collected according to popular ML tasks.

An open-source corpus for machine learning.

snakers4 (Alexander), September 26, 13:22

Araneum russicum maximum

TLDR - largest corpus for Russian Internet. Fast-text embeddings pre-trained on this corpus work best for broad internet related domains.

Pre-processed version can be downloaded from rusvectores.

Afaik, this link is not yet on their website (?)



snakers4 (Alexander), September 19, 09:59

Using sklearn pairwise cosine similarity

On a 7k * 7k example with 300-dimensional vectors it turned out to be MUCH faster than doing the same:

- In 10 processes;

- Using numba;

The more you know.

If you have used it - please PM me.
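
My guess is that the speed-up comes from doing everything as one vectorized matrix product instead of looping over pairs. Below is a NumPy-only sketch of that computation (normalize the rows, then one matrix multiplication). It is not sklearn's actual code; the function name and the scaled-down random data are assumptions for illustration.

```python
import numpy as np


def pairwise_cosine(a, b):
    """Cosine similarity between every row of `a` and every row of `b`."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    # One matrix product gives all pairwise dot products of unit vectors
    return a_unit @ b_unit.T


# Scaled-down stand-in for the 7k * 7k case: 700 vectors with 300 dimensions
rng = np.random.default_rng(0)
x = rng.normal(size=(700, 300))
sims = pairwise_cosine(x, x)
print(sims.shape)
```

This version assumes there are no all-zero rows; those would need special handling to avoid division by zero.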


snakers4 (Alexander), September 14, 05:32

Understanding the current SOTA NMT / NLP model - transformer

A list of articles that really help to do so:

- Understanding attention

- Annotated transformer

- Illustrated transformer

Playing with transformer in practice

This repo turned out to be really helpful

It features:

- Decent, well-encapsulated model and loss;

- Several heads for different tasks;

- It works;

- Ofc their data-loading scheme is crappy and over-engineered;

My impressions on actually training the transformer model for classification:

- It works;

- It is high capacity;

- Inference time is ~`5x` higher than char-level or plain RNNs;

- It serves as a classifier as well as an LM;

- Capacity is enough to tackle most challenging tasks;

- It can be deployed on CPU for small texts (!);

- On smaller tasks there is no clear difference between plain RNNs and Transformer;


Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)

May 25th update: New graphics (RNN animation, word embedding graph), color coding, elaborated on the final attention example. Note: The animations below are videos. Touch or hover on them (if you’re using a mouse) to get play controls so you can pause if needed. Sequence-to-sequence models are deep learning models that have achieved a lot of success in tasks like machine translation, text summarization, and image captioning. Google Translate started using such a model in production in late 2016. These models are explained in the two pioneering papers (Sutskever et al., 2014, Cho et al., 2014). I found, however, that understanding the model well enough to implement it requires unraveling a series of concepts that build on top of each other. I thought that a bunch of these ideas would be more accessible if expressed visually. That’s what I aim to do in this post. You’ll need some previous understanding of deep learning to get through this post. I hope it can be a useful companion to reading the papers mentioned…

snakers4 (Alexander), August 21, 13:31

2018 DS/ML digest 21




2018 DS/ML digest 21


snakers4 (Alexander), August 16, 06:00

Google updates its transformer


Moving Beyond Translation with the Universal Transformer

Posted by Stephan Gouws, Research Scientist, Google Brain Team and Mostafa Dehghani, University of Amsterdam PhD student and Google Research...

snakers4 (Alexander), August 10, 11:39

Using numba

Looks like ... it just works when it works.

For example this cosine distance calculation function works ca 10x faster.

import numba
import numpy as np

@numba.jit(nopython=True)
def fast_cosine(u, v):
    m = u.shape[0]
    udotv = 0.0
    u_norm = 0.0
    v_norm = 0.0
    for i in range(m):
        # skip positions where either vector has a NaN
        if (np.isnan(u[i])) or (np.isnan(v[i])):
            continue
        udotv += u[i] * v[i]
        u_norm += u[i] * u[i]
        v_norm += v[i] * v[i]
    u_norm = np.sqrt(u_norm)
    v_norm = np.sqrt(v_norm)
    if (u_norm == 0) or (v_norm == 0):
        ratio = 1.0
    else:
        ratio = udotv / (u_norm * v_norm)
    return 1 - ratio

Also looks like they recently were supported by NumFocus


Sponsored Projects | pandas, NumPy, Matplotlib, Jupyter, + more - NumFOCUS

Explore NumFOCUS Sponsored Projects, including: pandas, NumPy, Matplotlib, Jupyter, rOpenSci, Julia, Bokeh, PyMC3, Stan, nteract, SymPy, FEniCS, PyTables...

snakers4 (Alexander), August 06, 10:44

NLP - naive preprocessing

A friend has sent me a couple of gists



Useful boilerplate



GitHub Gist: instantly share code, notes, and snippets.

snakers4 (Alexander), July 31, 05:18

Some interesting NLP related ideas from ACL 2018


- bag-of-embeddings is surprisingly good at capturing sentence-level properties, among other results

- language models are bad at modelling numerals; the authors propose several strategies to improve them

- current state-of-the-art models fail to capture many simple inferences

- LSTM representations, even though they have been trained on one task, are not task-specific. They are often predictive of unintended aspects such as demographics in the data

- Word embedding-based methods exhibit competitive or even superior performance

Four common ways to introduce linguistic information into models:

- Via a pipeline-based approach, where linguistic categories are used as features;

- Via data augmentation, where the data is augmented with linguistic categories;

- Via multi-task learning;


ACL 2018 Highlights: Understanding Representations

This post reviews two themes of ACL 2018: 1) gaining a better understanding what models capture and 2) to expose them to more challenging settings.

snakers4 (Alexander), June 25, 10:53

A subscriber sent a really decent CS university scientific ranking

Useful, if you want to apply for CS/ML based Ph.D. there


Transformer in PyTorch

Looks like somebody implemented OpenAI's recent transformer fine-tuning in PyTorch





pytorch-openai-transformer-lm - A PyTorch implementation of OpenAI's finetuned transformer language model with a script to import the weights pre-trained by OpenAI