Spark in me - Internet, data science, math, deep learning, philo

snakers4 @ telegram, 1361 members, 1660 posts since 2016

All this - lost like tears in rain.

Data science, deep learning, sometimes a bit of philosophy and math. No bs.

Our website
Our chat
DS courses review

Posts by tag «nlp»:

snakers4 (Alexander), December 17, 09:24


- PyText from Facebook:

- TLDR - FastText meets PyTorch;

- Very similar to AllenNLP in nature;

- Will be useful if you can afford to write modules for their framework to solve 100 identical tasks (i.e. like Facebook with 200 languages);

- In itself - seems to be too high maintenance to use;

I will not use it.




A natural language modeling framework based on PyTorch - facebookresearch/pytext

snakers4 (Alexander), November 23, 08:21

TDS article follow-up

TDS also accepted a reprint of the article


Winning a CFT 2018 spelling correction competition

Or building a task-agnostic seq2seq pipeline on a challenging domain

snakers4 (Alexander), November 22, 11:56

Our victory in CFT-2018 competition


- Multi-task learning + seq2seq models rule;

- The domain seems to be easy, but it is not;

- You can also build a pipeline based on manual features, but it will not be task agnostic;

- Loss weighting is crucial for such tasks;

- Transformer trains 10x longer;




Winning a CFT 2018 spelling correction competition

Building a task-agnostic seq2seq pipeline on a challenging domain

snakers4 (Alexander), November 10, 10:30

Playing with Transformer

TLDR - use only pre-trained.

On classification tasks it performed the same as classic models.

On seq2seq - much worse time / memory wise. Inference is faster though.


snakers4 (Alexander), November 09, 13:48

Fast-text trained on a random mix of Russian Wikipedia / Taiga / Common Crawl

On our benchmarks it was marginally better than fastText trained on Araneum from RusVectores.

Download link


Standard params - (3,6) n-grams + vector dimensionality is 300.


import fastText as ft

ft_model_big = ft.load_model('model')

And then just refer to


snakers4 (Alexander), November 03, 09:40

Building client routing / semantic search and clustering arbitrary external corpuses at

A brief executive summary about what we achieved at

If you have similar experience or have anything similar to share - please do not hesitate to contact me.

Also we are planning to extend this article into a small series, if it gains momentum. So please like / share the article if you like it.





snakers4 (Alexander), October 22, 05:43

Amazing articles about image hashing

Also a python library

- Library

- Articles:




A Python Perceptual Image Hashing Module - JohannesBuchner/imagehash
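To build intuition for what such a library computes, here is a minimal sketch of the average-hash (aHash) idea in plain Python - downscale an image to a tiny grayscale grid, compare each pixel to the mean, and pack the bits. The 4x4 grids and helper names below are made up for illustration; this is not the imagehash library's code.

```python
# Minimal average-hash sketch: one bit per pixel, "is it brighter than the mean?"

def average_hash(pixels):
    """pixels: 2D list of grayscale values (e.g. a tiny downscaled image)."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    # Each bit says whether the pixel is brighter than the mean.
    bits = ''.join('1' if p > mean else '0' for p in flat)
    return int(bits, 2)

def hamming(h1, h2):
    """Number of differing bits - a small distance means similar images."""
    return bin(h1 ^ h2).count('1')

a = [[10, 10, 200, 200],
     [10, 10, 200, 200],
     [10, 10, 200, 200],
     [10, 10, 200, 200]]
b = [[12, 11, 198, 205],   # slightly perturbed version of a
     [ 9, 13, 199, 201],
     [11, 10, 202, 200],
     [10, 12, 197, 203]]

print(hamming(average_hash(a), average_hash(b)))  # 0 - the hashes match
```

Small perturbations leave the hash unchanged, which is exactly why perceptual hashes work for near-duplicate detection where cryptographic hashes do not.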

Text iterators in PyTorch

Looks like PyTorch has some handy data-processing / loading tools for text models -

It is explained here - - how to use them with pack_padded_sequence and pad_packed_sequence to boost PyTorch NLP models substantially.
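A minimal sketch of that pattern (the tensor sizes, the one-layer GRU, and the variable names are illustrative, not taken from the linked post):

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Batch of 3 sequences with lengths 4, 2, 1, padded to length 4,
# each timestep being a 5-dim embedding (sizes are arbitrary).
batch = torch.randn(3, 4, 5)
lengths = torch.tensor([4, 2, 1])  # must be sorted in decreasing order

rnn = torch.nn.GRU(input_size=5, hidden_size=8, batch_first=True)

# Packing lets the RNN skip the padded timesteps entirely.
packed = pack_padded_sequence(batch, lengths, batch_first=True)
packed_out, h_n = rnn(packed)

# Unpack back to a padded (batch, time, features) tensor.
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
print(out.shape)  # torch.Size([3, 4, 8])
```

The speedup comes from the RNN doing no work on padding positions, which matters a lot when sequence lengths in a batch vary widely.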



snakers4 (Alexander), October 10, 18:31

Another set of links for common crawl for NLP

Looks like we were not the first, ofc.

Below are some projects dedicated to NLP corpus retrieval on scale:

- Java + license detection + boilerplate removal:$dkpro-c4corpus-doc/doclinks/1/

- Prepared deduplicated CC text archives

- Google group!topic/common-crawl/6F-yXsC35xM



Downloading 200GB files in literally hours

(1) Order 500 Mbit/s Internet connection from your ISP

(2) Use aria2 - with -x

(3) Profit
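For step (2), a hypothetical invocation might look like this (the URL is a placeholder; `-x` caps connections per server, `-s` sets the number of file segments, both max out at 16):

```shell
# 16 parallel connections to one server, file split into 16 segments
aria2c -x 16 -s 16 "https://example.com/big-archive.warc.gz"
```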


snakers4 (Alexander), October 08, 10:11

A small continuation of the crawling saga

2 takes on the Common Crawl

It turned out to be a bit tougher than expected

But doable


Parsing Common Crawl in 4 plain scripts in python


snakers4 (Alexander), October 05, 16:47

Russian post on Habr

Please support if you have an account.


Parsing Wikipedia for NLP tasks in 4 commands

The gist: it turns out all you need is to run just this set of commands: git clone cd wikiextractor wget http...

snakers4 (Alexander), October 04, 06:03

Amazingly simple code to mimic fast-texts n-gram subword routine

Nuff said. Try it for yourself.

from fastText import load_model

model = load_model('official_fasttext_wiki_200_model')

def find_ngrams(string, n):
    ngrams = zip(*[string[i:] for i in range(n)])
    ngrams = [''.join(_) for _ in ngrams]
    return ngrams

# fastText wraps each word in '<' and '>' before extracting its 3-6 char n-grams
string = 'грёзоблаженствующий'
ngrams = []
for i in range(3, 7):
    ngrams.extend(find_ngrams('<' + string + '>', i))

ft_ngrams, ft_indexes = model.get_subwords(string)

# compare our n-grams with the ones fastText actually uses
ngrams = set(ngrams)
ft_ngrams = set(ft_ngrams)



snakers4 (Alexander), October 03, 18:16

Parsing Wikipedia in 4 plain commands in Python

Wrote a small take on using Wikipedia as corpus for NLP.

Please like / share / repost the article =)



Parsing Wikipedia in 4 simple commands for plain NLP corpus retrieval


snakers4 (Alexander), September 29, 10:48

If you are mining for a large web-corpus

... for any language other than English. And you do not want to scrape anything, buy proxies, or learn this somewhat shady "stack".

For Russian, the Araneum link posted above already contains processed data - which may be useless for certain domains.

What to do?


In case of Russian you can write here

The author will share her 90+GB RAW corpus with you


In case of any other language there is a second way

- Go to common crawl website;

- Download the index (200 GB);

- Choose domains in your country / language (now they also have language detection);

- Download only plain-text files you need;

Links to start with





Taiga is a corpus where text sources and their meta-information are collected according to popular ML tasks.

An open-source corpus for machine learning.

snakers4 (Alexander), September 26, 13:22

Araneum russicum maximum

TLDR - the largest corpus for the Russian Internet. fastText embeddings pre-trained on this corpus work best for broad internet-related domains.

Pre-processed version can be downloaded from rusvectores.

Afaik, this link is not yet on their website (?)



snakers4 (Alexander), September 19, 09:59

Using sklearn pairwise cosine similarity

On a 7k x 7k example with 300-dimensional vectors it turned out to be MUCH faster than doing the same:

- In 10 processes;

- Using numba;
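For reference, the sklearn call itself is a one-liner; the 3x3 matrix below is a toy stand-in for the 7k x 300 one:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows are vectors; with a single argument, X is compared against itself.
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0]])

sims = cosine_similarity(X)  # shape (3, 3), sims[i, j] in [-1, 1]
print(sims.shape)                    # (3, 3)
print(round(float(sims[0, 2]), 4))   # 0.7071 - cosine of 45 degrees
```

Under the hood it normalizes the rows once and then does a single dense matrix multiply, which is why it beats per-pair loops by such a wide margin.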

The more you know.

If you have used it - please PM me.


snakers4 (Alexander), September 14, 05:32

Understanding the current SOTA NMT / NLP model - transformer

A list of articles that really help to do so:

- Understanding attention

- Annotated transformer

- Illustrated transformer

Playing with transformer in practice

This repo turned out to be really helpful

It features:

- Decent well encapsulated model and loss;

- Several heads for different tasks;

- It works;

- Ofc their data-loading scheme is crappy and over-engineered;

My impressions on actually training the transformer model for classification:

- It works;

- It is high capacity;

- Inference time is ~5x higher than for char-level or plain RNNs;

- It serves as a classifier as well as an LM;

- Capacity is enough to tackle most challenging tasks;

- It can be deployed on CPU for small texts (!);

- On smaller tasks there is no clear difference between plain RNNs and Transformer;


Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)

May 25th update: New graphics (RNN animation, word embedding graph), color coding, elaborated on the final attention example. Note: The animations below are videos. Touch or hover on them (if you’re using a mouse) to get play controls so you can pause if needed. Sequence-to-sequence models are deep learning models that have achieved a lot of success in tasks like machine translation, text summarization, and image captioning. Google Translate started using such a model in production in late 2016. These models are explained in the two pioneering papers (Sutskever et al., 2014, Cho et al., 2014). I found, however, that understanding the model well enough to implement it requires unraveling a series of concepts that build on top of each other. I thought that a bunch of these ideas would be more accessible if expressed visually. That’s what I aim to do in this post. You’ll need some previous understanding of deep learning to get through this post. I hope it can be a useful companion to reading the papers mentioned…

snakers4 (Alexander), August 21, 13:31

2018 DS/ML digest 21





snakers4 (Alexander), August 16, 06:00

Google updates its transformer


Moving Beyond Translation with the Universal Transformer

Posted by Stephan Gouws, Research Scientist, Google Brain Team and Mostafa Dehghani, University of Amsterdam PhD student and Google Research...

snakers4 (Alexander), August 10, 11:39

Using numba

Looks like ... it just works when it works.

For example this cosine distance calculation function works ca 10x faster.

import numba
import numpy as np

@numba.jit(nopython=True)
def fast_cosine(u, v):
    m = u.shape[0]
    udotv = 0
    u_norm = 0
    v_norm = 0
    for i in range(m):
        # skip dimensions with missing values
        if np.isnan(u[i]) or np.isnan(v[i]):
            continue
        udotv += u[i] * v[i]
        u_norm += u[i] * u[i]
        v_norm += v[i] * v[i]
    u_norm = np.sqrt(u_norm)
    v_norm = np.sqrt(v_norm)
    if (u_norm == 0) or (v_norm == 0):
        ratio = 1.0
    else:
        ratio = udotv / (u_norm * v_norm)
    return 1 - ratio

Also looks like they were recently supported by NumFOCUS


Sponsored Projects | pandas, NumPy, Matplotlib, Jupyter, + more - NumFOCUS

Explore NumFOCUS Sponsored Projects, including: pandas, NumPy, Matplotlib, Jupyter, rOpenSci, Julia, Bokeh, PyMC3, Stan, nteract, SymPy, FEniCS, PyTables...

snakers4 (Alexander), August 06, 10:44

NLP - naive preprocessing

A friend has sent me a couple of gists



Useful boilerplate



GitHub Gist: instantly share code, notes, and snippets.

snakers4 (Alexander), July 31, 05:18

Some interesting NLP related ideas from ACL 2018


- bag-of-embeddings is surprisingly good at capturing sentence-level properties, among other results

- language models are bad at modelling numerals; the authors propose several strategies to improve them

- current state-of-the-art models fail to capture many simple inferences

- LSTM representations, even though they have been trained on one task, are not task-specific. They are often predictive of unintended aspects such as demographics in the data

- Word embedding-based methods exhibit competitive or even superior performance

Four common ways to introduce linguistic information into models:

- Via a pipeline-based approach, where linguistic categories are used as features;

- Via data augmentation, where the data is augmented with linguistic categories;

- Via multi-task learning;


ACL 2018 Highlights: Understanding Representations

This post reviews two themes of ACL 2018: 1) gaining a better understanding what models capture and 2) to expose them to more challenging settings.

snakers4 (Alexander), June 25, 10:53

A subscriber sent a really decent CS university scientific ranking

Useful if you want to apply for a CS/ML-based Ph.D. there


Transformer in PyTorch

Looks like somebody implemented OpenAI's recent transformer fine-tuning in PyTorch





pytorch-openai-transformer-lm - A PyTorch implementation of OpenAI's finetuned transformer language model with a script to import the weights pre-trained by OpenAI

snakers4 (Alexander), April 17, 07:39

Nice realistic article about bias in embeddings by Google



Text Embedding Models Contain Bias. Here's Why That Matters.

Human data encodes human biases by default. Being aware of this is a good start, and the conversation around how to handle it is ongoing. At Google, we are actively researching unintended bias analysis and mitigation strategies because we are committed to making products that work well for everyone. In this post, we'll examine a few text embedding models, suggest some tools for evaluating certain forms of bias, and discuss how these issues matter when building applications.

snakers4 (Alexander), April 01, 07:57

Novel topic modelling techniques



Looks interesting.

If anyone knows about this - please ping in PM.


snakers4 (Alexander), March 26, 13:26

NLP project peculiarities

(0) Always handle new words somehow

(1) Easy evaluation of test results - you can just look at it

(2) Key difference is always in the domain - short or long sequences / sentences / whole documents - require different features / models / transfer learning

Basic Approaches to modern NLP projects

(0) Basic pipeline

(1) Basic preprocessing

- Stemming / lemmatization

- Regular expressions

(2) Naive / old school approaches that can just work

- Bag of Words => simple model

- Bag of Words => tf-idf => SVD / PCA / NMF => simple model

(3) Embeddings

- Average / sum of Word2Vec embeddings

- Word2Vec * tf-idf >> Doc2Vec

- Small documents => embeddings work better

- Big documents => bag of features / high level features

(4) Sentiment analysis features


- n-chars => won several Kaggle competitions

(5) Also a couple of articles for developing intuition for sentence2vec



(6) Transfer learning in NLP - looks like it may become more popular / prominent

- Jeremy Howard's preprint on NLP transfer learning -
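The "Bag of Words => tf-idf => SVD => simple model" recipe from (2) maps directly onto sklearn. A minimal sketch - the toy texts, labels, and hyperparameters below are made up:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

texts = ["good movie", "great film", "bad movie", "awful film",
         "great plot", "bad acting"]
labels = [1, 1, 0, 0, 1, 0]

# tf-idf features -> SVD ("LSA") into a small dense space -> simple model
clf = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
    LogisticRegression(),
)
clf.fit(texts, labels)
print(clf.predict(["good film"]))
```

On real data you would tune the vectorizer (n-gram range, min_df) and the number of SVD components; the structure of the pipeline stays the same.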



ML tutorial for NLP, Алексей Натекин

snakers4 (Alexander), March 26, 09:56

So, I have briefly watched Andrew Ng's series on RNNs.

It's super cool if you do not know much about RNNs and / or want to refresh your memories and / or want to jump start your knowledge about NLP.

Also he explains stuff with really simple and clear illustrations.

Tasks in the course are also cool (notebook + submit button), but they are far removed from real practice, as they imply coding gradients and forward passes from scratch in python.

(which I did enough during his classic course)

Also no GPU tricks / no research / production boilerplate ofc.

Below are ideas and references that may be useful to know about NLP / RNNs for everyone.

Also for NLP:

(0) Key NLP sota achievements in 2017



(1) Consider courses and notebooks

(2) Consider NLP newsletter

(3) Consider excellent PyTorch tutorials

(4) There is a lot of quality code in the PyTorch community (e.g. 1-2 page GloVe implementations!)

(5) Brief 1-hour intro to practical NLP

Also related posts on the channel / libraries:

(1) Pre-trained vectors in Russian - //

(2) How to learn about CTC loss // (when our seq2seq )

(3) Most popular NLP libraries for English - //

(4) NER in Russian -

(5) Lemmatization library in Russian - - recommended by a friend

Basic tasks considered more or less solved by RNNs

(1) Speech recognition / trigger word detection

(2) Music generation

(3) Sentiment analysis

(4) Machine translation

(5) Video activity recognition / tagging

(6) Named entity recognition (NER)

Problems with standard CNN when modelling sequences:

(1) Different length of input and output

(2) Features for different positions in the sequence are not shared

(3) Enormous number of params

Typical word representations

(1) One-hot encoded vectors (10-50k typical solutions, 100k-1m commercial / sota solutions)

(2) Learned embeddings - reduce computation burden to ~300-500 dimensions instead of 10k+
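The point of (2) can be seen in a few lines of numpy (vocabulary size and dimensions are arbitrary illustrative values): multiplying a one-hot vector by an embedding matrix is just a row lookup, so models store and train the small dense matrix instead of pushing 10k+-dim one-hot vectors through every layer.

```python
import numpy as np

vocab_size, emb_dim = 10_000, 300  # arbitrary illustrative sizes
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, emb_dim))  # learned embedding matrix

word_id = 42
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

# One-hot @ embedding matrix == plain row lookup, but far cheaper to index.
via_matmul = one_hot @ E
via_lookup = E[word_id]
print(np.allclose(via_matmul, via_lookup))  # True
```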

Typical rules of thumb / hacks / practical approaches for RNNs

(0) Typical architectures - deep GRU (lighter) and LSTM cells

(1) Tanh or ReLU for hidden layer activation

(2) Sigmoid for output when classifying

(3) Usage of EndOfSentence, UNKnown word, StartOfSentence, etc tokens

(4) Usually word level models are used (not character level)

(5) Passing hidden state in encoder-decoder architectures

(6) Vanishing gradients - typically GRUs / LSTMs are used

(7) Very long sequences for time series - AR features are used instead of long windows, typically GRUs and LSTMs are good for sequence of 200-300 (easier and more straightforward than attention in this case)

(8) Exploding gradients - standard solution - clipping, though it may lead to inferior results (from practice)

(9) Teacher forcing - substitute predicted y_t+1 with real value when training seq2seq model during forward pass

(10) Peephole connections - let GRU or LSTM see the c_t-1 from the previous hidden state

(11) Finetune imported embeddings for smaller tasks with smaller datasets

(12) On big datasets - may make sense to learn embeddings from scratch

(13) Usage of bidirectional LSTMs / GRUs / Attention where applicable

Typical similarity functions for high-dim vectors

(0) Cosine (angle)

(1) Euclidean
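Both are a few lines of numpy; written out here for clarity rather than taken from any particular library:

```python
import numpy as np

def cosine_similarity(u, v):
    # Angle-based: ignores vector magnitude, only direction matters.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_distance(u, v):
    # Magnitude-sensitive straight-line distance.
    return float(np.linalg.norm(u - v))

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
print(cosine_similarity(u, v))   # 0.0 - orthogonal vectors
print(euclidean_distance(u, v))  # 1.4142135623730951 - sqrt(2)
```

Cosine is usually preferred for embeddings because document or word frequency inflates vector norms without changing direction.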

Seminal papers / constructs / ideas:

(1) Training embeddings - the later the methods came out - the simpler they are

- Matrix factorization techniques

- Naive approach using language model + softmax (not tractable for large corpora)

- Negative sampling + skip gram + logistic regression = Word2Vec (2013)


-- useful ideas

-- if there is information - a simple model (i.e. logistic regression) will work

-- negative subsampling - sample words with frequency between uniform and 3/4 power of frequency in corpus to alleviate frequent words

-- train only a limited number of classifiers (i.e. 5-15, 1 positive sample + k negative) on each update

-- skip-gram model in a nutshell -

- GloVe - Global Vectors (2014)


-- supposedly GloVe is better given same resources than Word2Vec -

-- in practice word vectors with 200 dimensions are enough for applied tasks

-- considered to be one of sota solutions now (afaik)

(2) BLEU score for translation

- essentially the exp of the mean of logs of modified n-gram precisions over 4 n-gram orders, times a brevity penalty



(3) Attention is all you need


To be continued.
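The BLEU computation in (2) can be sketched in plain Python - clipped ("modified") n-gram precisions for n = 1..4, combined as the exp of the mean of their logs, times a brevity penalty. A toy single-reference version for intuition, not a reference implementation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Toy single-reference BLEU score."""
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        # "Modified" precision: clip each n-gram count by the reference count,
        # so repeating one matching word cannot inflate the score.
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        log_prec_sum += math.log(max(overlap, 1e-9) / total)
    # Brevity penalty punishes candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_prec_sum / max_n)

ref = "the cat sat on the mat".split()
print(round(bleu(ref, ref), 4))                # 1.0 - perfect match
print(bleu("the the the".split(), ref) < 0.5)  # True - clipping kicks in
```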






snakers4 (Alexander), March 20, 13:40

A video about realistic state of chat-bots (RU)




Neural Networks and Deep Learning lab at MIPT

When you call a service or a bank, the dialogue with support follows the same scenario with minor variations. The scripted answers can be given not by a human but by a chat-bot. About how neural...

snakers4 (Alexander), February 28, 10:40

Forwarded from Data Science:

Most common libraries for Natural Language Processing:

CoreNLP from Stanford group:

NLTK, the most widely-mentioned NLP library for Python:

TextBlob, a user-friendly and intuitive NLTK interface:

Gensim, a library for document similarity analysis:

SpaCy, an industrial-strength NLP library built for performance:


#nlp #digest #libs

Stanford CoreNLP

High-performance human language analysis tools. Widely used, available open source; written in Java.

snakers4 (Alexander), January 19, 12:34

Just found out about Facebook's fastText


Seems to be really promising




fastText - Library for fast text representation and classification.

snakers4 (Alexander), December 27, 11:00

Recently I raised the question of working with pre-trained embeddings.

I have not gotten around to it yet, but some useful links have accumulated:

- Working with pre-trained text vectors in PyTorch




- And one more link to a post with vectors for the Russian language

-- //





CNN_Sentence_Classification - pytorch Convolutional Networks for Sentence Classification -