Spark in me - Internet, data science, math, deep learning, philosophy

snakers4 @ telegram, 1356 members, 1614 posts since 2016

All this - lost like tears in rain.

Data science, deep learning, sometimes a bit of philosophy and math. No bs.

Our website
- spark-in.me
Our chat
- goo.gl/WRm93d
DS courses review
- goo.gl/5VGU5A
- goo.gl/YzVUKf

Posts by tag «nlp»:

snakers4 (Alexander), October 10, 18:31

Another set of links for common crawl for NLP

Looks like we were not the first, ofc.

Below are some projects dedicated to NLP corpus retrieval at scale:

- Java + license detection + boilerplate removal: zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/DKPro%20C4Corpus/org.dkpro.c4corpus$dkpro-c4corpus-doc/doclinks/1/

- Prepared deduplicated CC text archives data.statmt.org/ngrams/deduped/

- Google group

groups.google.com/forum/#!topic/common-crawl/6F-yXsC35xM

Wow!

#nlp

Downloading 200GB files in literally hours

(1) Order 500 Mbit/s Internet connection from your ISP

(2) Use aria2 - aria2.github.io/ - with the -x flag (multiple connections per server)

(3) Profit
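If you want to script step (2), here is a minimal sketch (it assumes aria2c is installed and on PATH; the URL is a placeholder):

import subprocess

# aria2c's -x flag opens multiple connections per server (up to 16) for one download
url = 'https://example.com/some_200GB_dump.warc.gz'  # placeholder URL
subprocess.run(['aria2c', '-x', '16', url], check=True)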

#data_science

snakers4 (Alexander), October 08, 10:11

A small continuation of the crawling saga

2 takes on the Common Crawl

spark-in.me/post/parsing-common-crawl-in-four-simple-commands

spark-in.me/post/parsing-common-crawl-in-two-simple-commands

It turned out to be a bit tougher than expected

But doable

#nlp

Parsing Common Crawl in 4 plain scripts in python



snakers4 (Alexander), October 05, 16:47

Russian post on Habr

habr.com/post/425507/

Please support if you have an account.

#nlp

Parsing Wikipedia for NLP tasks in 4 commands

The gist: it turns out all it takes is running this small set of commands: git clone https://github.com/attardi/wikiextractor.git cd wikiextractor wget http...


snakers4 (Alexander), October 04, 06:03

Amazingly simple code to mimic fastText's n-gram subword routine

Nuff said. Try it for yourself.

from fastText import load_model

# official fastText Python bindings; the model path is from the original post
model = load_model('official_fasttext_wiki_200_model')

def find_ngrams(string, n):
    # all contiguous character n-grams of length n
    ngrams = zip(*[string[i:] for i in range(n)])
    ngrams = [''.join(_) for _ in ngrams]
    return ngrams

string = 'грёзоблаженствующий'

# fastText pads each word with '<' and '>' and uses n-grams of length 3..6 by default
ngrams = []
for i in range(3, 7):
    ngrams.extend(find_ngrams('<' + string + '>', i))

ft_ngrams, ft_indexes = model.get_subwords(string)

ngrams = set(ngrams)
ft_ngrams = set(ft_ngrams)

print(sorted(ngrams), sorted(ft_ngrams))
print(set(ft_ngrams).difference(set(ngrams)), set(ngrams).difference(set(ft_ngrams)))

#nlp

snakers4 (Alexander), October 03, 18:16

Parsing Wikipedia in 4 plain commands in Python

Wrote a small take on using Wikipedia as corpus for NLP.

spark-in.me/post/parsing-wikipedia-in-four-commands-for-nlp

medium.com/@aveysov/parsing-wikipedia-in-4-simple-commands-for-plain-nlp-corpus-retrieval-eee66b3ba3ee

Please like / share / repost the article =)

#nlp

#data_science

Parsing Wikipedia in 4 simple commands for plain NLP corpus retrieval



snakers4 (Alexander), September 29, 10:48

If you are mining for a large web corpus

... for any language other than English, and you do not want to scrape anything yourself, buy proxies, or learn that somewhat shady "stack".

For Russian, the Araneum link posted above offers already processed data, which may be useless for certain domains.

What to do?

(1)

In case of Russian you can write here

tatianashavrina.github.io/taiga_site/

The author will share her 90+GB RAW corpus with you

(2)

In case of any other language there is a second way

- Go to common crawl website;

- Download the index (200 GB);

- Choose domains in your country / language (now they also have language detection);

- Download only the plain-text files you need (see the sketch after the links below);

Links to start with

- commoncrawl.org/connect/blog/

- commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/

- www.slideshare.net/RobertMeusel/mining-a-large-web-corpus
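A hedged sketch of steps (2)-(4) via the CDX index server; the crawl id, the URL pattern and the S3 path here are assumptions - check the links above for the current ones:

import json
import requests

# query the index for one domain pattern (the crawl id is an assumption)
INDEX = 'http://index.commoncrawl.org/CC-MAIN-2018-39-index'
resp = requests.get(INDEX, params={'url': '*.example.ru', 'output': 'json'})

for line in resp.text.splitlines()[:5]:
    record = json.loads(line)
    offset, length = int(record['offset']), int(record['length'])
    warc_url = 'https://commoncrawl.s3.amazonaws.com/' + record['filename']
    # fetch only this record's bytes instead of the whole ~1 GB WARC file
    headers = {'Range': 'bytes={}-{}'.format(offset, offset + length - 1)}
    chunk = requests.get(warc_url, headers=headers).content
    print(record['url'], len(chunk))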

#nlp

Taiga is a corpus where text sources and their meta-information are collected according to popular ML tasks.

An open-source corpus for machine learning.


snakers4 (Alexander), September 26, 13:22

Araneum russicum maximum

TLDR - the largest corpus for the Russian Internet. fastText embeddings pre-trained on this corpus work best for broad internet-related domains.

Pre-processed version can be downloaded from rusvectores.

Afaik, this link is not yet on their website (?)

wget rusvectores.org/static/rus_araneum_maxicum.txt.gz

#nlp

snakers4 (Alexander), September 19, 09:59

Using sklearn pairwise cosine similarity

scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html#sklearn.metrics.pairwise.cosine_similarity

On a 7k x 7k example with 300-dimensional vectors it turned out to be MUCH faster than doing the same:

- In 10 processes;

- Using numba;
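For reference, a minimal sketch of that vectorized call on random data with the shapes from above:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.random.rand(7000, 300)
b = np.random.rand(7000, 300)

# one vectorized call returns the full 7000 x 7000 similarity matrix
sim = cosine_similarity(a, b)
print(sim.shape)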

The more you know.

If you have used it - please PM me.

#nlp

snakers4 (Alexander), September 14, 05:32

Understanding the current SOTA NMT / NLP model - transformer

A list of articles that really help to do so:

- Understanding attention jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/

- Annotated transformer nlp.seas.harvard.edu/2018/04/03/attention.html

- Illustrated transformer jalammar.github.io/illustrated-transformer/

Playing with transformer in practice

This repo turned out to be really helpful

github.com/huggingface/pytorch-openai-transformer-lm

It features:

- Decent, well-encapsulated model and loss;

- Several heads for different tasks;

- It works;

- Ofc their data-loading scheme is crappy and over-engineered;

My impressions on actually training the transformer model for classification:

- It works;

- It is high capacity;

- Inference time is ~5x higher than for char-level or plain RNNs;

- It serves as a classifier as well as an LM;

- Capacity is enough to tackle most challenging tasks;

- It can be deployed on CPU for small texts (!);

- On smaller tasks there is no clear difference between plain RNNs and Transformer;

#nlp

Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)

May 25th update: New graphics (RNN animation, word embedding graph), color coding, elaborated on the final attention example. Note: The animations below are videos. Touch or hover on them (if you’re using a mouse) to get play controls so you can pause if needed. Sequence-to-sequence models are deep learning models that have achieved a lot of success in tasks like machine translation, text summarization, and image captioning. Google Translate started using such a model in production in late 2016. These models are explained in the two pioneering papers (Sutskever et al., 2014, Cho et al., 2014). I found, however, that understanding the model well enough to implement it requires unraveling a series of concepts that build on top of each other. I thought that a bunch of these ideas would be more accessible if expressed visually. That’s what I aim to do in this post. You’ll need some previous understanding of deep learning to get through this post. I hope it can be a useful companion to reading the papers mentioned…


snakers4 (Alexander), August 21, 13:31

2018 DS/ML digest 21

spark-in.me/post/2018_ds_ml_digest_21

#digest

#deep_learning

#nlp



snakers4 (Alexander), August 16, 06:00

Google updates its transformer

ai.googleblog.com/2018/08/moving-beyond-translation-with.html

#nlp

Moving Beyond Translation with the Universal Transformer

Posted by Stephan Gouws, Research Scientist, Google Brain Team and Mostafa Dehghani, University of Amsterdam PhD student and Google Research...


snakers4 (Alexander), August 10, 11:39

Using numba

Looks like ... it just works when it works.

For example, this cosine distance calculation function runs ca 10x faster.

import numba
import numpy as np

@numba.jit(target='cpu', nopython=True)
def fast_cosine(u, v):
    # cosine distance that skips NaN components and guards against zero norms
    m = u.shape[0]
    udotv = 0
    u_norm = 0
    v_norm = 0
    for i in range(m):
        if np.isnan(u[i]) or np.isnan(v[i]):
            continue
        udotv += u[i] * v[i]
        u_norm += u[i] * u[i]
        v_norm += v[i] * v[i]
    u_norm = np.sqrt(u_norm)
    v_norm = np.sqrt(v_norm)
    if (u_norm == 0) or (v_norm == 0):
        ratio = 1.0
    else:
        ratio = udotv / (u_norm * v_norm)
    return 1 - ratio

Also, looks like numba was recently added as a NumFOCUS sponsored project

numfocus.org/sponsored-projects

#nlp

Sponsored Projects | pandas, NumPy, Matplotlib, Jupyter, + more - NumFOCUS

Explore NumFOCUS Sponsored Projects, including: pandas, NumPy, Matplotlib, Jupyter, rOpenSci, Julia, Bokeh, PyMC3, Stan, nteract, SymPy, FEniCS, PyTables...


snakers4 (Alexander), August 06, 10:44

NLP - naive preprocessing

A friend has sent me a couple of gists

- gist.github.com/thinline72/e35e1aaa09bd5519b7f07663152778e7

- gist.github.com/thinline72/29d3976e434572ef3ee68ab7a473b400

Useful boilerplate

#nlp

quora_vecs_l2_test.ipynb



snakers4 (Alexander), July 31, 05:18

Some interesting NLP related ideas from ACL 2018

ruder.io/acl-2018-highlights/

Overall

- bag-of-embeddings is surprisingly good at capturing sentence-level properties, among other results

- language models are bad at modelling numerals; the authors propose several strategies to improve them

- current state-of-the-art models fail to capture many simple inferences

- LSTM representations, even though they have been trained on one task, are not task-specific. They are often predictive of unintended aspects such as demographics in the data

- Word embedding-based methods exhibit competitive or even superior performance

Four common ways to introduce linguistic information into models:

- Via a pipeline-based approach, where linguistic categories are used as features;

- Via data augmentation, where the data is augmented with linguistic categories;

- Via multi-task learning;

#nlp

ACL 2018 Highlights: Understanding Representations

This post reviews two themes of ACL 2018: 1) gaining a better understanding of what models capture and 2) exposing them to more challenging settings.


snakers4 (Alexander), June 25, 10:53

A subscriber sent a really decent CS university scientific ranking

csrankings.org/#/index?all&worldpu

Useful if you want to apply for a CS/ML-based Ph.D. there

#deep_learning

Transformer in PyTorch

Looks like somebody implemented OpenAI's recent Transformer fine-tuning (built on Google's Transformer architecture) in PyTorch

github.com/huggingface/pytorch-openai-transformer-lm

Nice!

#nlp

#deep_learning

huggingface/pytorch-openai-transformer-lm

pytorch-openai-transformer-lm - A PyTorch implementation of OpenAI's finetuned transformer language model with a script to import the weights pre-trained by OpenAI


snakers4 (Alexander), April 17, 07:39

Nice realistic article about bias in embeddings by Google

developers.googleblog.com/2018/04/text-embedding-models-contain-bias.html

#google

#nlp

Text Embedding Models Contain Bias. Here's Why That Matters.

Human data encodes human biases by default. Being aware of this is a good start, and the conversation around how to handle it is ongoing. At Google, we are actively researching unintended bias analysis and mitigation strategies because we are committed to making products that work well for everyone. In this post, we'll examine a few text embedding models, suggest some tools for evaluating certain forms of bias, and discuss how these issues matter when building applications.


snakers4 (Alexander), April 01, 07:57

Novel topic modelling techniques

- bigartm.readthedocs.io/en/stable/intro.html

- github.com/bigartm/bigartm

Looks interesting.

If anyone knows about this - please ping in PM.

#nlp

snakers4 (Alexander), March 26, 13:26

NLP project peculiarities

(0) Always handle new words somehow

(1) Easy evaluation of test results - you can just look at it

(2) Key difference is always in the domain - short or long sequences / sentences / whole documents - require different features / models / transfer learning

Basic Approaches to modern NLP projects

www.youtube.com/watch?v=Ozm0bEi5KaI

(0) Basic pipeline

prntscr.com/iwhlsx

(1) Basic preprocessing

- Stemming / lemmatization

- Regular expressions

(2) Naive / old school approaches that can just work

- Bag of Words => simple model

- Bag of Words => tf-idf => SVD / PCA / NMF => simple model (see the sketch after this list)

(3) Embeddings

- Average / sum of Word2Vec embeddings

- Word2Vec * tf-idf >> Doc2Vec

- Small documents => embeddings work better

- Big documents => bag of features / high level features

(4) Sentiment analysis features

- prntscr.com/iwhzqk

- char n-grams => won several Kaggle competitions

(5) Also a couple of articles for developing intuition for sentence2vec

- medium.com/@premrajnarkhede/sentence2vec-evaluation-of-popular-theories-part-i-simple-average-of-word-vectors-3399f1183afe

- medium.com/scaleabout/a-gentle-introduction-to-doc2vec-db3e8c0cce5e

(6) Transfer learning in NLP - looks like it may become more popular / prominent

- Jeremy Howard's preprint on NLP transfer learning - arxiv.org/abs/1801.06146
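A minimal sketch of the pipeline from point (2) on toy data; the concrete components (tf-idf, TruncatedSVD, logistic regression) are just one reasonable choice:

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

texts = ['good movie', 'bad movie', 'great film', 'awful film']
labels = [1, 0, 1, 0]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # bag of words + tf-idf weighting
    TruncatedSVD(n_components=2),          # dimensionality reduction (LSA)
    LogisticRegression(),                  # simple model on top
)
clf.fit(texts, labels)
print(clf.predict(['great movie']))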

#data_science

#nlp

ML tutorial for NLP, Alexey Natekin

snakers4 (Alexander), March 26, 09:56

So, I have briefly watched Andrew Ng's series on RNNs.

It's super cool if you do not know much about RNNs and / or want to refresh your memories and / or want to jump start your knowledge about NLP.

Also he explains stuff with really simple and clear illustrations.

Tasks in the course are also cool (notebook + submit button), but they are very removed from real practice, as they imply coding gradients and forward passes from scratch in python.

(which I did enough during his classic course)

Also no GPU tricks / no research / production boilerplate ofc.

Below are ideas and references that may be useful to know about NLP / RNNs for everyone.

Also for NLP:

(0) Key NLP sota achievements in 2017

-- medium.com/@madrugado/advances-in-nlp-in-2017-b00e927fcc57

-- medium.com/@madrugado/advances-in-nlp-in-2017-part-ii-d8da391a3f01

(1) Consider fast.ai courses and notebooks github.com/fastai/courses/tree/master/deeplearning2

(2) Consider NLP newsletter newsletter.ruder.io

(3) Consider excellent PyTorch tutorials pytorch.org/tutorials/

(4) There is a lot of quality code in the PyTorch community (e.g. 1-2 page GloVe implementations!)

(5) Brief 1-hour intro to practical NLP www.youtube.com/watch?v=Ozm0bEi5KaI

Also related posts on the channel / libraries:

(1) Pre-trained vectors in Russian - //snakers41.spark-in.me/1623

(2) How to learn about CTC loss //snakers41.spark-in.me/1690 (when our seq2seq )

(3) Most popular NLP libraries for English - //snakers41.spark-in.me/1832

(4) NER in Russian - habrahabr.ru/post/349864/

(5) Lemmatization library in Russian - pymorphy2.readthedocs.io/en/latest/user/guide.html - recommended by a friend
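A tiny hedged example of pymorphy2 from point (5); the expected output in the comment is my assumption:

import pymorphy2

morph = pymorphy2.MorphAnalyzer()
# lemmatize one inflected Russian word; expect the dictionary form 'кошка'
print(morph.parse('кошками')[0].normal_form)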

Basic tasks considered more or less solved by RNNs

(1) Speech recognition / trigger word detection

(2) Music generation

(3) Sentiment analysis

(4) Machine translation

(5) Video activity recognition / tagging

(6) Named entity recognition (NER)

Problems with standard CNN when modelling sequences:

(1) Different length of input and output

(2) Features for different positions in the sequence are not shared

(3) Enormous number of params

Typical word representations

(1) One-hot encoded vectors (10-50k words for typical solutions, 100k-1m for commercial / sota solutions)

(2) Learned embeddings - reduce computation burden to ~300-500 dimensions instead of 10k+

Typical rules of thumb / hacks / practical approaches for RNNs

(0) Typical architectures - deep GRU (lighter) and LSTM cells

(1) Tanh or RELU for hidden layer activation

(2) Sigmoid for output when classifying

(3) Usage of EndOfSentence, UNKnown word, StartOfSentence, etc tokens

(4) Usually word level models are used (not character level)

(5) Passing hidden state in encoder-decoder architectures

(6) Vanishing gradients - typically GRUs / LSTMs are used

(7) Very long sequences for time series - AR features are used instead of long windows; typically GRUs and LSTMs are good for sequences of 200-300 steps (easier and more straightforward than attention in this case)

(8) Exploding gradients - standard solution - clipping, though it may lead to inferior results (from practice)

(9) Teacher forcing - when training a seq2seq model, feed the ground-truth previous token into the decoder at each step of the forward pass instead of the model's own prediction

(10) Peephole connections - let the GRU or LSTM gates see the previous cell state c_t-1

(11) Finetune imported embeddings for smaller tasks with smaller datasets

(12) On big datasets - may make sense to learn embeddings from scratch

(13) Usage of bidirectional LSTMs / GRUs / Attention where applicable

Typical similarity functions for high-dim vectors

(0) Cosine (angle)

(1) Euclidean

Seminal papers / constructs / ideas:

(1) Training embeddings - the later the methods came out - the simpler they are

- Matrix factorization techniques

- Naive approach using a language model + softmax (intractable for large corpora)

- Negative sampling + skip gram + logistic regression = Word2Vec (2013)

-- arxiv.org/abs/1310.4546

-- useful ideas

-- if there is information in the data, a simple model (e.g. logistic regression) will work

-- negative sampling - sample negative words with a frequency between the uniform distribution and the 3/4 power of their corpus frequency, to downweight very frequent words

-- train only a limited number of classifiers (i.e. 5-15, 1 positive sample + k negative) on each update

-- skip-gram model in a nutshell - prntscr.com/iwfwb2

- GloVe - Global Vectors (2014)

-- aclweb.org/anthology/D14-1162

-- supposedly GloVe is better than Word2Vec given the same resources - prntscr.com/iwf9bx

-- in practice word vectors with 200 dimensions are enough for applied tasks

-- considered to be one of sota solutions now (afaik)

(2) BLEU score for translation

- essentially a brevity penalty times the exponent of the average log of the modified n-gram precisions for n = 1..4 (see the sketch after this list)

- prntscr.com/iwe3v2

- dl.acm.org/citation.cfm?id=1073135

(3) Attention is all you need

- arxiv.org/abs/1706.03762
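A quick sanity check of the BLEU description in (2), as a minimal sketch with nltk on toy token lists:

from nltk.translate.bleu_score import sentence_bleu

reference = ['the', 'cat', 'is', 'on', 'the', 'mat']
candidate = ['the', 'cat', 'is', 'on', 'the', 'mat', 'today']

# default weights average the logs of modified 1- to 4-gram precisions;
# expect a score a bit below 1.0 because of the extra word
print(sentence_bleu([reference], candidate))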

To be continued.

#data_science

#nlp

#rnns



snakers4 (Alexander), March 20, 13:40

A video about realistic state of chat-bots (RU)

www.facebook.com/deepmipt/videos/vb.576961855845861/868578240017553/?type=2&theater

#nlp

#data_science

#deep_learning

Neural Networks and Deep Learning lab at MIPT

When you call a service or a bank, the dialogue with support follows one and the same script with small variations. Answering questions "by the book" can be done not by a human but by a chat-bot. On how neural...


snakers4 (Alexander), February 28, 10:40

Forwarded from Data Science:

Most common libraries for Natural Language Processing:

CoreNLP from Stanford group:

stanfordnlp.github.io/CoreNLP/index.html

NLTK, the most widely-mentioned NLP library for Python:

www.nltk.org/

TextBlob, a user-friendly and intuitive NLTK interface:

textblob.readthedocs.io/en/dev/index.html

Gensim, a library for document similarity analysis:

radimrehurek.com/gensim/

SpaCy, an industrial-strength NLP library built for performance:

spacy.io/docs/

Source: itsvit.com/blog/5-heroic-tools-natural-language-processing/

#nlp #digest #libs

Stanford CoreNLP

High-performance human language analysis tools. Widely used, available open source; written in Java.


snakers4 (Alexander), January 19, 12:34

Just found out about Facebook's fastText

- github.com/facebookresearch/fastText

Seems to be really promising

#data_science

#nlp

facebookresearch/fastText

fastText - Library for fast text representation and classification.


snakers4 (Alexander), December 27, 11:00

I recently raised the question of working with pre-trained embeddings.

I have not gotten around to it yet, but here are the useful links I have collected

- Working with pre-trained text vectors in PyTorch

-- github.com/A-Jacobson/CNN_Sentence_Classification/blob/master/WordVectors.ipynb

-- discuss.pytorch.org/t/can-we-use-pre-trained-word-embeddings-for-weight-initialization-in-nn-embedding/1222/11

-- github.com/pytorch/text/blob/master/torchtext/vocab.py

- And one more link to the post with vectors for Russian

-- //snakers41.spark-in.me/1623
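The gist of the links above, as a hedged sketch; the vocabulary and vector matrix are made up, and nn.Embedding.from_pretrained needs PyTorch >= 0.4:

import torch
import torch.nn as nn

# pretend we loaded a (vocab_size x dim) matrix of pre-trained vectors
vocab = ['<pad>', 'привет', 'мир']
vectors = torch.randn(len(vocab), 300)

# initialize an embedding layer with them; freeze=False lets them be fine-tuned
emb = nn.Embedding.from_pretrained(vectors, freeze=False)
print(emb(torch.tensor([1, 2])).shape)  # -> torch.Size([2, 300])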

#data_science

#deep_learning

#nlp

A-Jacobson/CNN_Sentence_Classification

CNN_Sentence_Classification - pytorch Convolutional Networks for Sentence Classification - http://www.aclweb.org/anthology/D14-1181


snakers4 (Alexander), December 15, 08:34

A friend recommended a huge collection of corpora and vector models for the Russian language.

Stylish, trendy, hip

- rusvectores.org/ru/models/

- nlpub.ru/Russian_Distributional_Thesaurus

- opencorpora.org/?page=downloads

- vectors.nlpl.eu/repository/

I used to think nothing like this really existed anywhere.

#data_science

#nlp

RusVectōrēs: models

RusVectōrēs: distributional semantics for the Russian language, a web interface and downloadable models


snakers4 (Alexander), November 02, 02:17

New AI Grant Fellows

blog.aigrant.org/new-ai-grant-fellows-43f1c26c13d9


AI Grant is a decentralized AI lab. We fund brilliant minds around the world to work on AI research.


An open-source library for extracting vectors from text

radimrehurek.com/gensim/intro.html
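A minimal hedged sketch of training such vectors with gensim on toy sentences; note that the vector size parameter is called size in gensim < 4.0 and vector_size in 4.x:

from gensim.models import Word2Vec

sentences = [['кошка', 'сидит', 'на', 'ковре'],
             ['собака', 'сидит', 'на', 'траве']]

# train tiny word vectors on the toy corpus and look one of them up
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)
print(model.wv['кошка'][:5])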

#nlp

#data_science

gensim: topic modelling for humans

Efficient topic modelling in Python


snakers4 (Alexander), September 18, 13:25

While looking for a solution to problem A here - contest.sdsj.ru - I stumbled upon a number of useful links:

- Tuning XGBoost parameters - goo.gl/Av7D1q

- XGBoost on GPU out of the box - goo.gl/TWuauv

So far, using embeddings has given +5% over the baseline and 18th place out of 50 participants. Intuition suggests the optimal solution lies somewhere around a mix of auto-encoders/decoders and LSTMs.

- ipynb - resources.spark-in.me/baseline.ipynb

- html - resources.spark-in.me/baseline.html
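For reference, a hedged sketch of the GPU part; it needs a CUDA-enabled xgboost build, and the data here is random:

import numpy as np
import xgboost as xgb

X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=1000)
dtrain = xgb.DMatrix(X, label=y)

# tree_method='gpu_hist' moves histogram building to the GPU
params = {'objective': 'binary:logistic', 'tree_method': 'gpu_hist', 'max_depth': 6}
model = xgb.train(params, dtrain, num_boost_round=50)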

#data_science

#nlp

Sberbank Data Science Contest

Machine learning event series from Sberbank