Spark in me - Internet, data science, math, deep learning, philo

snakers4 @ telegram, 1812 members, 1759 posts since 2016

All this - lost like tears in rain.

Data science, ML, a bit of philosophy and math. No bs.

Our website
Our chat
DS courses review

Posts by tag «nlp»:

snakers4 (Alexander), September 14, 2018

Understanding the current SOTA NMT / NLP model - transformer

A list of articles that really help to do so:

- Understanding attention

- Annotated transformer

- Illustrated transformer

Playing with transformer in practice

This repo turned out to be really helpful

It features:

- Decent, well-encapsulated model and loss;

- Several heads for different tasks;

- It works;

- Ofc their data-loading scheme is crappy and over-engineered;

My impressions on actually training the transformer model for classification:

- It works;

- It is high capacity;

- Inference time is ~`5x` higher than for char-level or plain RNNs;

- It serves as a classifier as well as an LM;

- Its capacity is enough to tackle the most challenging tasks;

- It can be deployed on CPU for small texts (!);

- On smaller tasks there is no clear difference between plain RNNs and Transformer;


Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)

Sequence-to-sequence models are deep learning models that have achieved a lot of success in tasks like machine translation, text summarization, and image captioning. Google Translate started using such a model in production in late 2016. These models are explained in the two pioneering papers (Sutskever et al., 2014, Cho et al., 2014). I found, however, that understanding the model well enough to implement it requires unraveling a series of concepts that build on top of each other. I thought that a bunch of these ideas would be more accessible if expressed visually. That’s what I aim to do in this post. You’ll need some previous understanding…

snakers4 (Alexander), August 21, 2018

2018 DS/ML digest 21





snakers4 (Alexander), August 16, 2018

Google updates its transformer


Moving Beyond Translation with the Universal Transformer

Posted by Stephan Gouws, Research Scientist, Google Brain Team and Mostafa Dehghani, University of Amsterdam PhD student and Google Research...

snakers4 (Alexander), August 10, 2018

Using numba

Looks like ... it just works when it works.

For example, this cosine distance calculation function runs ca. 10x faster.

import numba
import numpy as np

@numba.jit(target='cpu', nopython=True)
def fast_cosine(u, v):
    # cosine distance between two 1-D arrays, skipping NaN components
    m = u.shape[0]
    udotv = 0
    u_norm = 0
    v_norm = 0
    for i in range(m):
        if (np.isnan(u[i])) or (np.isnan(v[i])):
            continue

        udotv += u[i] * v[i]
        u_norm += u[i] * u[i]
        v_norm += v[i] * v[i]

    u_norm = np.sqrt(u_norm)
    v_norm = np.sqrt(v_norm)

    if (u_norm == 0) or (v_norm == 0):
        ratio = 1.0
    else:
        ratio = udotv / (u_norm * v_norm)
    return 1 - ratio
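A quick sanity check of the jitted function above (hypothetical vectors; the first call includes JIT compilation time, so time the second call if benchmarking):

import numpy as np

u = np.random.rand(300)
v = np.random.rand(300)
print(fast_cosine(u, v))  # cosine distance, NaN components ignored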
Also, it looks like they recently became a NumFOCUS sponsored project


Sponsored Projects | pandas, NumPy, Matplotlib, Jupyter, + more - NumFOCUS

Explore NumFOCUS Sponsored Projects, including: pandas, NumPy, Matplotlib, Jupyter, rOpenSci, Julia, Bokeh, PyMC3, Stan, nteract, SymPy, FEniCS, PyTables...

snakers4 (Alexander), August 06, 2018

NLP - naive preprocessing

A friend sent me a couple of gists



Useful boilerplate



GitHub Gist: instantly share code, notes, and snippets.

snakers4 (Alexander), July 31, 2018

Some interesting NLP related ideas from ACL 2018


- bag-of-embeddings is surprisingly good at capturing sentence-level properties, among other results

- language models are bad at modelling numerals; the authors propose several strategies to improve them

- current state-of-the-art models fail to capture many simple inferences

- LSTM representations, even though they have been trained on one task, are not task-specific. They are often predictive of unintended aspects such as demographics in the data

- Word embedding-based methods exhibit competitive or even superior performance

Four common ways to introduce linguistic information into models:

- Via a pipeline-based approach, where linguistic categories are used as features;

- Via data augmentation, where the data is augmented with linguistic categories;

- Via multi-task learning;


ACL 2018 Highlights: Understanding Representations

This post discusses highlights of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018). It focuses on understanding representations and evaluating in more challenging scenarios.

snakers4 (Alexander), June 25, 2018

A subscriber sent a really decent CS university scientific ranking

Useful if you want to apply for a CS/ML-based Ph.D. there


Transformer in PyTorch

Looks like somebody implemented OpenAI's recent transformer fine-tuning in PyTorch





🐥A PyTorch implementation of OpenAI's finetuned transformer language model with a script to import the weights pre-trained by OpenAI - huggingface/pytorch-openai-transformer-lm

snakers4 (Alexander), April 17, 2018

Nice realistic article about bias in embeddings by Google



Text Embedding Models Contain Bias. Here's Why That Matters.

Human data encodes human biases by default. Being aware of this is a good start, and the conversation around how to handle it is ongoing. At Google, we are actively researching unintended bias analysis and mitigation strategies because we are committed to making products that work well for everyone. In this post, we'll examine a few text embedding models, suggest some tools for evaluating certain forms of bias, and discuss how these issues matter when building applications.

snakers4 (Alexander), April 01, 2018

Novel topic modelling techniques



Looks interesting.

If anyone knows about this - please ping in PM.


snakers4 (Alexander), March 26, 2018

NLP project peculiarities

(0) Always handle new words somehow

(1) Easy evaluation of test results - you can just look at them

(2) The key difference is always the domain - short or long sequences / sentences / whole documents require different features / models / transfer learning

Basic Approaches to modern NLP projects

(0) Basic pipeline

(1) Basic preprocessing

- Stemming / lemmatization

- Regular expressions

(2) Naive / old school approaches that can just work (see the sketch after this list)

- Bag of Words => simple model

- Bag of Words => tf-idf => SVD / PCA / NMF => simple model

(3) Embeddings

- Average / sum of Word2Vec embeddings

- Word2Vec * tf-idf >> Doc2Vec

- Small documents => embeddings work better

- Big documents => bag of features / high level features

(4) Sentiment analysis features


- Character n-grams => won several Kaggle competitions

(5) Also a couple of articles for developing intuition for sentence2vec



(6) Transfer learning in NLP - looks like it may become more popular / prominent

- Jeremy Howard's preprint on NLP transfer learning -
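A minimal sketch of the "Bag of Words => tf-idf => SVD => simple model" route from (2) above (toy corpus; scikit-learn assumed):

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

texts = ['good film', 'terrible movie', 'great movie', 'bad film']  # toy corpus
labels = [1, 0, 1, 0]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),  # bag of words + tf-idf
    TruncatedSVD(n_components=2),                   # SVD / LSA dimensionality reduction
    LogisticRegression(),                           # simple model on top
)
clf.fit(texts, labels)
print(clf.predict(['awful film']))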



ML tutorial for NLP, Алексей Натекин
Open Data Science December Meetup – 20.12.2017. Meetup partners – Юрий Мельничек, Microsoft and Mapbox. Text analysis tasks have become firmly embedded in our lives…

snakers4 (Alexander), March 26, 2018

So, I have briefly watched Andrew Ng's series on RNNs.

It's super cool if you do not know much about RNNs and / or want to refresh your memory and / or want to jump-start your knowledge of NLP.

He also explains stuff with really simple and clear illustrations.

Tasks in the course are also cool (notebook + submit button), but they are far removed from real practice, as they imply coding gradients and forward passes from scratch in Python.

(which I did enough during his classic course)

Also no GPU tricks / no research / production boilerplate ofc.

Below are ideas and references about NLP / RNNs that may be useful for everyone to know.

Also for NLP:

(0) Key NLP sota achievements in 2017



(1) Consider courses and notebooks

(2) Consider NLP newsletter

(3) Consider excellent PyTorch tutorials

(4) There is a lot of quality code in the PyTorch community (e.g. 1-2 page GloVe implementations!)

(5) Brief 1-hour intro to practical NLP

Also related posts on the channel / libraries:

(1) Pre-trained vectors in Russian -

(2) How to learn about CTC loss (when our seq2seq )

(3) Most popular NLP libraries for English -

(4) NER in Russian -

(5) Lemmatization library in Russian - - recommended by a friend

Basic tasks considered more or less solved by RNNs

(1) Speech recognition / trigger word detection

(2) Music generation

(3) Sentiment analysis

(4) Machine translation

(5) Video activity recognition / tagging

(6) Named entity recognition (NER)

Problems with standard feed-forward networks when modelling sequences:

(1) Different length of input and output

(2) Features for different positions in the sequence are not shared

(3) Enormous number of params

Typical word representations

(1) One-hot encoded vectors (vocabularies of 10-50k in typical solutions, 100k-1m in commercial / SOTA solutions)

(2) Learned embeddings - reduce the computational burden to ~300-500 dimensions instead of 10k+
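A tiny illustration of the difference in dimensionality (hypothetical sizes; PyTorch assumed):

import torch
import torch.nn.functional as F

vocab_size, emb_dim = 50_000, 300
token_ids = torch.tensor([42, 1337])

one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()  # shape (2, 50000)
dense = torch.nn.Embedding(vocab_size, emb_dim)(token_ids)      # shape (2, 300)
print(one_hot.shape, dense.shape)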

Typical rules of thumb / hacks / practical approaches for RNNs

(0) Typical architectures - deep GRU (lighter) and LSTM cells

(1) Tanh or ReLU for hidden-layer activation

(2) Sigmoid for output when classifying

(3) Usage of EndOfSentence, UNKnown word, StartOfSentence, etc tokens

(4) Usually word level models are used (not character level)

(5) Passing hidden state in encoder-decoder architectures

(6) Vanishing gradients - typically GRUs / LSTMs are used

(7) Very long sequences for time series - AR features are used instead of long windows; typically GRUs and LSTMs are good for sequences of 200-300 steps (easier and more straightforward than attention in this case)

(8) Exploding gradients - the standard solution is clipping, though it may lead to inferior results (from practice)

(9) Teacher forcing - when training a seq2seq model, substitute the predicted y_t+1 with the real value during the forward pass (see the sketch after this list)

(10) Peephole connections - let the GRU or LSTM see the previous cell state c_t-1

(11) Finetune imported embeddings for smaller tasks with smaller datasets

(12) On big datasets - may make sense to learn embeddings from scratch

(13) Usage of bidirectional LSTMs / GRUs / Attention where applicable
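A minimal sketch of teacher forcing from (9) above (hypothetical PyTorch decoder and shapes, not code from the course):

import random
import torch
import torch.nn as nn

vocab_size, emb_dim, hid_dim, batch = 1000, 128, 256, 4
embed = nn.Embedding(vocab_size, emb_dim)
decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
proj = nn.Linear(hid_dim, vocab_size)

def decode(targets, hidden, teacher_forcing_ratio=0.5):
    # targets: (batch, seq_len) ground-truth token ids; hidden: (1, batch, hid_dim) from the encoder
    inp = targets[:, :1]                          # first input token
    logits = []
    for t in range(1, targets.size(1)):
        out, hidden = decoder(embed(inp), hidden)
        step_logits = proj(out)                   # (batch, 1, vocab_size)
        logits.append(step_logits)
        if random.random() < teacher_forcing_ratio:
            inp = targets[:, t:t + 1]             # teacher forcing: feed the real token
        else:
            inp = step_logits.argmax(dim=-1)      # feed the model's own prediction
    return torch.cat(logits, dim=1)

targets = torch.randint(0, vocab_size, (batch, 10))
hidden = torch.zeros(1, batch, hid_dim)           # stands in for the encoder state
print(decode(targets, hidden).shape)              # (4, 9, 1000)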

Typical similarity functions for high-dim vectors

(0) Cosine (angle)

(1) Euclidean

Seminal papers / constructs / ideas:

(1) Training embeddings - the later the method came out, the simpler it is

- Matrix factorization techniques

- Naive approach using a language model + softmax (not tractable for large corpora)

- Negative sampling + skip-gram + logistic regression = Word2Vec (2013) (see the gensim sketch right after this list)


-- useful ideas

-- if there is information - a simple model (i.e. logistic regression) will work

-- negative subsampling - sample words with a frequency between uniform and the 3/4 power of the corpus frequency to downweight very frequent words

-- train only a limited number of classifiers (i.e. 5-15: 1 positive sample + k negative ones) on each update

-- skip-gram model in a nutshell -

- GloVe - Global Vectors (2014)


-- supposedly GloVe is better than Word2Vec given the same resources -

-- in practice word vectors with 200 dimensions are enough for applied tasks

-- considered to be one of sota solutions now (afaik)

(2) BLEU score for translation

- essentially the exp of the mean of the logs of the modified precisions for 1- to 4-grams (a quick NLTK example is at the end of this post)



(3) Attention is all you need
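Referring back to (1) above, a minimal gensim sketch of skip-gram with negative sampling (toy corpus; parameter names assume gensim 4.x):

from gensim.models import Word2Vec

sentences = [
    ['the', 'cat', 'sits', 'on', 'the', 'mat'],
    ['the', 'dog', 'chases', 'the', 'cat'],
]
model = Word2Vec(
    sentences,
    vector_size=100,   # embedding dimensionality
    sg=1,              # 1 = skip-gram, 0 = CBOW
    negative=5,        # negative samples per positive pair
    ns_exponent=0.75,  # the 3/4-power smoothing of the unigram distribution
    window=2,
    min_count=1,
    epochs=10,
)
print(model.wv.most_similar('cat', topn=3))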


To be continued.
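And a quick check of the BLEU score from (2) above (hypothetical sentences; NLTK assumed, smoothing is needed for such short examples):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [['the', 'cat', 'sits', 'on', 'the', 'mat']]
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']

score = sentence_bleu(
    reference, candidate,
    weights=(0.25, 0.25, 0.25, 0.25),          # uniform weights over 1- to 4-grams
    smoothing_function=SmoothingFunction().method1,
)
print(score)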






snakers4 (Alexander), March 20, 2018

A video about the realistic state of chat-bots (RU)




Neural Networks and Deep Learning lab at MIPT

When you call a service or a bank, the dialogue with support follows the same script with small variations. The one answering questions "by the book" may well be not a human but a chat-bot. On how neural...

snakers4 (Alexander), February 28, 2018

Forwarded from Data Science:

Most common libraries for Natural Language Processing:

CoreNLP from Stanford group:

NLTK, the most widely-mentioned NLP library for Python:

TextBlob, a user-friendly and intuitive NLTK interface:

Gensim, a library for document similarity analysis:

SpaCy, an industrial-strength NLP library built for performance:


#nlp #digest #libs

Stanford CoreNLP

High-performance human language analysis tools. Widely used, available open source; written in Java.

snakers4 (Alexander), January 19, 2018

Just found out about Facebook's fastText


Seems to be really promising




Library for fast text representation and classification. - facebookresearch/fastText

snakers4 (Alexander), December 27, 2017

Recently I raised the topic of working with pre-trained embeddings.

I have not gotten around to it yet, but I have collected some useful links

- Working with pre-trained text vectors in PyTorch




- And one more link to a post with vectors for the Russian language






pytorch Convolutional Networks for Sentence Classification - - A-Jacobson/CNN_Sentence_Classification

snakers4 (Alexander), December 15, 2017

A friend recommended a huge repository of corpora and word-vector models for the Russian language.

Stylish, trendy, hip





I used to think that nothing like this really existed anywhere.



RusVectōrēs: semantic models for the Russian language

RusVectōrēs: distributional semantics for the Russian language, a web interface and models for download

snakers4 (Alexander), November 02, 2017

New AI Grant Fellows


AI Grant is a decentralized AI lab. We fund brilliant minds around the world to work on AI research.

An open-source library for extracting vectors from text



gensim: topic modelling for humans

Efficient topic modelling in Python

snakers4 (Alexander), September 18, 2017

While searching for a solution to task A here - - I stumbled upon a number of useful links:

- Tuning XGBoost parameters -

- XGBoost on GPU out of the box -

So far, using embeddings has given +5% over the baseline and 18th place out of 50 people. Intuition suggests that the optimal solution lies somewhere in the area of a mix of auto-encoders/decoders and LSTMs.

- ipynb -

- html -


