Spark in me - Internet, data science, math, deep learning, philosophy

snakers4 @ telegram, 1319 members, 1513 posts since 2016

All this - lost like tears in rain.

Data science, deep learning, sometimes a bit of philosophy and math. No bs.

Our website
- spark-in.me
Our chat
- goo.gl/WRm93d
DS courses review
- goo.gl/5VGU5A
- goo.gl/YzVUKf

Posts by tag «nlp»:

snakers4 (Alexander), June 25, 10:53

A subscriber sent a really decent scientific ranking of CS universities

csrankings.org/#/index?all&worldpu

Useful if you want to apply for a CS/ML-based Ph.D. there

#deep_learning

Transformer in PyTorch

Looks like somebody implemented OpenAI's recent transformer fine-tuning (built on Google's Transformer architecture) in PyTorch

github.com/huggingface/pytorch-openai-transformer-lm

Nice!

#nlp

#deep_learning

huggingface/pytorch-openai-transformer-lm

pytorch-openai-transformer-lm - A PyTorch implementation of OpenAI's finetuned transformer language model with a script to import the weights pre-trained by OpenAI


snakers4 (Alexander), April 17, 07:39

A nice, realistic article by Google about bias in embeddings

developers.googleblog.com/2018/04/text-embedding-models-contain-bias.html

#google

#nlp

Text Embedding Models Contain Bias. Here's Why That Matters.

Human data encodes human biases by default. Being aware of this is a good start, and the conversation around how to handle it is ongoing. At Google, we are actively researching unintended bias analysis and mitigation strategies because we are committed to making products that work well for everyone. In this post, we'll examine a few text embedding models, suggest some tools for evaluating certain forms of bias, and discuss how these issues matter when building applications.


snakers4 (Alexander), April 01, 07:57

Novel topic modelling techniques

- bigartm.readthedocs.io/en/stable/intro.html

- github.com/bigartm/bigartm

Looks interesting.

If anyone knows about this - please ping me in PM.

#nlp

snakers4 (Alexander), March 26, 13:26

NLP project peculiarities

(0) Always handle new words somehow

(1) Easy evaluation of test results - you can just look at them

(2) The key difference is always the domain - short or long sequences / sentences / whole documents require different features / models / transfer learning

Basic Approaches to modern NLP projects

www.youtube.com/watch?v=Ozm0bEi5KaI

(0) Basic pipeline

prntscr.com/iwhlsx

(1) Basic preprocessing

- Stemming / lemmatization

- Regular expressions

(2) Naive / old school approaches that can just work

- Bag of Words => simple model

- Bag of Words => tf-idf => SVD / PCA / NMF => simple model (see the code sketch after this list)

(3) Embeddings

- Average / sum of Word2Vec embeddings

- Word2Vec * tf-idf >> Doc2Vec

- Small documents => embeddings work better

- Big documents => bag of features / high level features

(4) Sentiment analysis features

- prntscr.com/iwhzqk

- character n-grams => won several Kaggle competitions

(5) Also a couple of articles for developing intuition for sentence2vec

- medium.com/@premrajnarkhede/sentence2vec-evaluation-of-popular-theories-part-i-simple-average-of-word-vectors-3399f1183afe

- medium.com/scaleabout/a-gentle-introduction-to-doc2vec-db3e8c0cce5e

(6) Transfer learning in NLP - looks like it may become more popular / prominent

- Jeremy Howard's preprint on NLP transfer learning - arxiv.org/abs/1801.06146
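
A minimal sketch of items (2) and (3) in Python (scikit-learn / gensim); the corpus, labels and hyperparameters below are placeholders for illustration only:

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["a toy positive document", "another toy negative document"]  # placeholder corpus
labels = [1, 0]

# (2) Bag of Words => tf-idf => SVD => simple model
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # switch to analyzer="char_wb" for char n-grams, item (4)
    TruncatedSVD(n_components=2),          # use ~100-300 components on a real corpus
    LogisticRegression(),
)
clf.fit(texts, labels)

# (3) Average of Word2Vec embeddings as a document vector
tokenized = [t.split() for t in texts]
w2v = Word2Vec(tokenized, vector_size=50, min_count=1)  # the argument is `size` in gensim < 4.0

def doc_vector(tokens):
    vectors = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(w2v.vector_size)

print(doc_vector("a toy document".split()).shape)  # (50,)
```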

#data_science

#nlp

ML tutorial for NLP, Алексей Натекин

snakers4 (Alexander), March 26, 09:56

So, I have briefly watched Andrew Ng's series on RNNs.

It's super cool if you do not know much about RNNs and / or want to refresh your memory and / or want to jump-start your knowledge of NLP.

Also he explains stuff with really simple and clear illustrations.

Tasks in the course are also cool (notebook + submit button), but they are quite removed from real practice, as they require coding gradients and forward passes from scratch in Python.

(which I did enough during his classic course)

Also no GPU tricks / no research / production boilerplate ofc.

Below are ideas and references that may be useful to know about NLP / RNNs for everyone.

Also for NLP:

(0) Key NLP sota achievements in 2017

-- medium.com/@madrugado/advances-in-nlp-in-2017-b00e927fcc57

-- medium.com/@madrugado/advances-in-nlp-in-2017-part-ii-d8da391a3f01

(1) Consider fast.ai courses and notebooks github.com/fastai/courses/tree/master/deeplearning2

(2) Consider NLP newsletter newsletter.ruder.io

(3) Consider excellent PyTorch tutorials pytorch.org/tutorials/

(4) There is a lot of quality code in the PyTorch community (e.g. 1-2 page GloVe implementations!)

(5) Brief 1-hour intro to practical NLP www.youtube.com/watch?v=Ozm0bEi5KaI

Also related posts on the channel / libraries:

(1) Pre-trained vectors in Russian - snakers41.spark-in.me/1623

(2) How to learn about CTC loss - snakers41.spark-in.me/1690

(3) Most popular NLP libraries for English - snakers41.spark-in.me/1832

(4) NER in Russian - habrahabr.ru/post/349864/

(5) Lemmatization library in Russian - pymorphy2.readthedocs.io/en/latest/user/guide.html - recommended by a friend

Basic tasks considered more or less solved by RNNs

(1) Speech recognition / trigger word detection

(2) Music generation

(3) Sentiment analysis

(4) Machine translation

(5) Video activity recognition / tagging

(6) Named entity recognition (NER)

Problems with standard CNN when modelling sequences:

(1) Different length of input and output

(2) Features for different positions in the sequence are not shared

(3) Enormous number of params

Typical word representations

(1) One-hot encoded vectors (10-50k for typical solutions, 100k-1m for commercial / sota solutions)

(2) Learned embeddings - reduce computation burden to ~300-500 dimensions instead of 10k+
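
A tiny PyTorch illustration of the difference (vocabulary size and dimensions are made up): instead of feeding huge one-hot vectors, you keep integer token ids and let an embedding layer map them into a dense ~300-dim space.

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 10_000, 300           # hypothetical sizes
embedding = nn.Embedding(vocab_size, emb_dim)

tokens = torch.tensor([[1, 42, 9999],        # a batch of token indices,
                       [7, 0, 3]])           # shape: (batch, seq_len)
vectors = embedding(tokens)                  # dense vectors, shape: (batch, seq_len, 300)
print(vectors.shape)                         # torch.Size([2, 3, 300])
```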

Typical rules of thumb / hacks / practical approaches for RNNs

(0) Typical architectures - deep GRU (lighter) and LSTM cells

(1) Tanh or RELU for hidden layer activation

(2) Sigmoid for output when classifying

(3) Usage of EndOfSentence, UNKnown word, StartOfSentence, etc tokens

(4) Usually word level models are used (not character level)

(5) Passing hidden state in encoder-decoder architectures

(6) Vanishing gradients - typically GRUs / LSTMs are used

(7) Very long sequences for time series - AR features are used instead of long windows; typically GRUs and LSTMs handle sequences of 200-300 steps well (easier and more straightforward than attention in this case)

(8) Exploding gradients - standard solution - clipping, though it may lead to inferior results (from practice)

(9) Teacher forcing - during the forward pass of seq2seq training, substitute the predicted y_{t+1} with the real value

(10) Peephole connections - let the LSTM / GRU gates see the cell state c_{t-1} from the previous step

(11) Finetune imported embeddings for smaller tasks with smaller datasets

(12) On big datasets - may make sense to learn embeddings from scratch

(13) Usage of bidirectional LSTMs / GRUs / Attention where applicable
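
A minimal PyTorch sketch putting a few of these rules of thumb together - a deep bidirectional GRU with a sigmoid output and gradient clipping in the training step; all sizes, names and data below are illustrative placeholders:

```python
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    def __init__(self, vocab_size=10_000, emb_dim=300, hidden=256, num_layers=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden, num_layers=num_layers,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)            # 2x because of bidirectionality

    def forward(self, tokens):
        x = self.emb(tokens)                            # (batch, seq_len, emb_dim)
        _, h = self.rnn(x)                              # h: (num_layers * 2, batch, hidden)
        last = torch.cat([h[-2], h[-1]], dim=-1)        # forward + backward states of the top layer
        return torch.sigmoid(self.head(last)).squeeze(-1)

model = GRUClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

tokens = torch.randint(0, 10_000, (8, 50))              # fake batch: 8 sequences of 50 token ids
labels = torch.randint(0, 2, (8,)).float()

optimizer.zero_grad()
loss = loss_fn(model(tokens), labels)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # (8) clip exploding gradients
optimizer.step()
```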

Typical similarity functions for high-dim vectors

(0) Cosine (angle)

(1) Euclidean
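
For reference, both similarities in a couple of lines of PyTorch (random vectors, just for illustration):

```python
import torch
import torch.nn.functional as F

a, b = torch.randn(300), torch.randn(300)        # two word / sentence vectors
cosine = F.cosine_similarity(a, b, dim=0)        # angle-based, in [-1, 1], higher = closer
euclidean = torch.dist(a, b)                     # L2 distance, lower = closer
print(cosine.item(), euclidean.item())
```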

Seminal papers / consctructs / ideas:

(1) Training embeddings - the later a method came out, the simpler it is

- Matrix factorization techniques

- Naive approach using a language model + full softmax (not tractable for large corpora)

- Negative sampling + skip gram + logistic regression = Word2Vec (2013)

-- arxiv.org/abs/1310.4546

-- useful ideas

-- if the information is there, a simple model (e.g. logistic regression) will work

-- negative sampling - sample negative words with probability proportional to their corpus frequency raised to the 3/4 power (a compromise between uniform and raw frequency), to keep very frequent words from dominating

-- train only a limited number of classifiers per update (e.g. 5-15: 1 positive sample + k negative)

-- skip-gram model in a nutshell - prntscr.com/iwfwb2

- GloVe - Global Vectors (2014)

-- aclweb.org/anthology/D14-1162

-- supposedly GloVe is better than Word2Vec given the same resources - prntscr.com/iwf9bx

-- in practice word vectors with 200 dimensions are enough for applied tasks

-- considered to be one of the sota solutions now (afaik)

(2) BLEU score for translation

- essentially a brevity penalty times the exponential of the averaged log modified n-gram precisions for n = 1..4 (a toy implementation follows this list)

- prntscr.com/iwe3v2

- dl.acm.org/citation.cfm?id=1073135

(3) Attention is all you need

- arxiv.org/abs/1706.03762
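
To make the BLEU definition above concrete, here is a toy single-reference implementation (the real metric also handles multiple references and smoothing); the example sentences are made up:

```python
from collections import Counter
from math import exp, log

def bleu(candidate, reference, max_n=4):
    # modified n-gram precisions, clipped by reference counts
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))

    if min(precisions) == 0:                      # geometric mean collapses to zero
        return 0.0
    brevity_penalty = min(1.0, exp(1 - len(reference) / len(candidate)))
    return brevity_penalty * exp(sum(log(p) for p in precisions) / max_n)

print(bleu("the cat sat on the mat".split(),
           "the cat sat on the red mat".split()))   # ~0.67
```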

To be continued.

#data_science

#nlp

#rnns



snakers4 (Alexander), March 20, 13:40

A video about the realistic state of chat-bots (RU)

www.facebook.com/deepmipt/videos/vb.576961855845861/868578240017553/?type=2&theater

#nlp

#data_science

#deep_learning

Neural Networks and Deep Learning lab at MIPT

When you call a service or a bank, the dialogue with support follows the same script with minor variations. Those scripted answers can come not from a human but from a chat-bot. About how neural...


snakers4 (Alexander), February 28, 10:40

Forwarded from Data Science:

Most common libraries for Natural Language Processing:

CoreNLP from Stanford group:

stanfordnlp.github.io/CoreNLP/index.html

NLTK, the most widely-mentioned NLP library for Python:

www.nltk.org/

TextBlob, a user-friendly and intuitive NLTK interface:

textblob.readthedocs.io/en/dev/index.html

Gensim, a library for document similarity analysis:

radimrehurek.com/gensim/

SpaCy, an industrial-strength NLP library built for performance:

spacy.io/docs/

Source: itsvit.com/blog/5-heroic-tools-natural-language-processing/
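
Tiny usage sketches for two of the libraries above (assuming the usual one-off downloads, e.g. nltk.download('punkt') and python -m spacy download en_core_web_sm, have been run):

```python
import nltk
print(nltk.word_tokenize("NLTK splits this sentence into tokens."))

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Stanford released CoreNLP, written in Java.")
print([(ent.text, ent.label_) for ent in doc.ents])   # named entities
```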

#nlp #digest #libs

Stanford CoreNLP

High-performance human language analysis tools. Widely used, available open source; written in Java.


snakers4 (Alexander), January 19, 12:34

Just found out about Facebook's fast text

- github.com/facebookresearch/fastText

Seems to be really promising
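
A minimal sketch with the official fasttext Python bindings (pip install fasttext); the file names are placeholders, and the exact API may differ between versions:

```python
import fasttext

# unsupervised word vectors (skip-gram)
model = fasttext.train_unsupervised("corpus.txt", model="skipgram")
print(model.get_word_vector("example")[:5])

# supervised text classification; train.txt lines look like "__label__positive some text"
clf = fasttext.train_supervised("train.txt")
print(clf.predict("this is a test sentence"))
```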

#data_science

#nlp

facebookresearch/fastText

fastText - Library for fast text representation and classification.


snakers4 (Alexander), December 27, 11:00

I recently raised the question of working with pre-trained embeddings.

I have not gotten around to it yet, but I have collected some useful links.

- Working with pre-trained text vectors in PyTorch

-- github.com/A-Jacobson/CNN_Sentence_Classification/blob/master/WordVectors.ipynb

-- discuss.pytorch.org/t/can-we-use-pre-trained-word-embeddings-for-weight-initialization-in-nn-embedding/1222/11

-- github.com/pytorch/text/blob/master/torchtext/vocab.py

- And one more link to a post with vectors for the Russian language

-- snakers41.spark-in.me/1623
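
The gist of those links in a few lines of PyTorch - loading a pre-trained matrix into nn.Embedding (the weights here are random placeholders; in practice they come from a word2vec / GloVe / fastText file):

```python
import torch
import torch.nn as nn

weights = torch.randn(10_000, 300)          # placeholder for real pre-trained vectors

# one-liner in recent PyTorch versions
emb = nn.Embedding.from_pretrained(weights, freeze=True)   # freeze=False to fine-tune

# equivalent manual initialization
emb_manual = nn.Embedding(10_000, 300)
emb_manual.weight.data.copy_(weights)
emb_manual.weight.requires_grad = False
```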

#data_science

#deep_learning

#nlp

A-Jacobson/CNN_Sentence_Classification

CNN_Sentence_Classification - pytorch Convolutional Networks for Sentence Classification - http://www.aclweb.org/anthology/D14-1181


snakers4 (Alexander), December 15, 08:34

A friend recommended a huge collection of corpora and vector models for the Russian language.

Stylish, trendy, hip.

- rusvectores.org/ru/models/

- nlpub.ru/Russian_Distributional_Thesaurus

- opencorpora.org/?page=downloads

- vectors.nlpl.eu/repository/

I used to think that nothing like this really existed anywhere.

#data_science

#nlp

RusVectōrēs: models

RusVectores: distributional semantics for the Russian language - a web interface and downloadable models


snakers4 (Alexander), November 02, 02:17

New AI Grant Fellows

blog.aigrant.org/new-ai-grant-fellows-43f1c26c13d9

New AI Grant Fellows

AI Grant is a decentralized AI lab. We fund brilliant minds around the world to work on AI research.


An open-source library for extracting vectors from text

radimrehurek.com/gensim/intro.html
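
A minimal gensim sketch of turning toy documents into sparse vectors (the documents here are placeholders):

```python
from gensim import corpora, models

docs = [["open", "source", "library", "for", "text"],
        ["vectors", "from", "text", "documents"]]

dictionary = corpora.Dictionary(docs)            # word <-> id mapping
bow = [dictionary.doc2bow(d) for d in docs]      # sparse bag-of-words vectors
tfidf = models.TfidfModel(bow)                   # re-weight them with tf-idf
print(tfidf[bow[0]])
```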

#nlp

#data_science

gensim: topic modelling for humans

Efficient topic modelling in Python


snakers4 (Alexander), September 18, 13:25

While looking for a solution to task A here - contest.sdsj.ru - I came across a number of useful links:

- XGBoost parameter tuning - goo.gl/Av7D1q

- XGBoost on GPU out of the box - goo.gl/TWuauv

So far, using embeddings has given +5% over the baseline and 18th place out of 50 participants. Intuition suggests that the optimal solution lies somewhere in a mix of auto-encoders/decoders and LSTMs.

- ipynb - resources.spark-in.me/baseline.ipynb

- html - resources.spark-in.me/baseline.html
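
A hedged sketch of the out-of-the-box GPU training mentioned above (random placeholder data; the gpu_hist tree method exists in recent XGBoost versions, though parameter names have shifted in later releases):

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, 1000)

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "binary:logistic",
    "max_depth": 6,
    "eta": 0.1,
    "tree_method": "gpu_hist",   # run the histogram algorithm on the GPU
}
model = xgb.train(params, dtrain, num_boost_round=100)
```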

#data_science

#nlp

Sberbank Data Science Contest

Machine learning event series from Sberbank