New video from 3B1B
Which is kind of relevant
Our Transformer post was featured by Towards Data Science
New tricks for training CNNs
Our experiments with Transformers, BERT and generative language pre-training
For morphologically rich languages, pre-trained Transformers are not a silver bullet. From a layman's perspective they are not feasible unless someone invests huge computational resources both into sub-word tokenization methods that work well and into actually training these large networks.
On the other hand we have definitively shown that:
- Starting a Transformer with an EmbeddingBag layer initialized via FastText works and is relatively feasible;
- On complicated tasks, such a Transformer significantly outperforms training from scratch (as well as naive models) and shows decent results compared to state-of-the-art specialized models;
- Pre-training worked, but it overfitted more than FastText initialization, and given the complexity required for such pre-training, it is not useful;
All in all this was a relatively large gamble that did not pay off - on a more down-to-earth task we hoped the Transformer would excel at, it did not.
Complexity / generalization / computational cost in modern applied NLP for morphologically rich languages. Towards a new state of the art? Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me
An approach to ranking search results with no annotation
Just a small article with a novel idea:
- Instead of training a network with CE, just train it with BCE (see the sketch after this list);
- Source additional supervision from the inner structure of your domain (tags, matrix decomposition methods, heuristics, etc);
Works best if your ontology is relatively simple.
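Just to make the idea concrete - a minimal sketch (all names and sizes below are mine, not from the article): per-tag sigmoid outputs trained with BCE, where the weak multi-label targets come from the domain structure instead of manual annotation.

import torch
import torch.nn as nn

# hypothetical sizes: 300-dim document encodings, 50 domain tags used as weak labels
encoder_dim, num_tags = 300, 50

model = nn.Sequential(
    nn.Linear(encoder_dim, 256),
    nn.ReLU(),
    nn.Linear(256, num_tags),               # one logit per tag, no softmax
)
criterion = nn.BCEWithLogitsLoss()          # instead of nn.CrossEntropyLoss()

doc_vectors = torch.randn(32, encoder_dim)                 # stand-in for your document encoder
tag_targets = torch.randint(0, 2, (32, num_tags)).float()  # weak labels mined from tags / heuristics

loss = criterion(model(doc_vectors), tag_targets)
loss.backward()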
Inception v1 layers visualized on a map
A joint work by Google and OpenAI:
- Take 1M random images;
- Feed to a CNN, collect some spatial activation;
- Produce a corresponding idealized image that would result in such an activation;
- Plot in 2D (via UMAP), add a grid, averaging, etc (a toy sketch of the collection steps is below the quote);
By using feature inversion to visualize millions of activations from an image classification network, we create an explorable activation atlas of features the network has learned and what concepts it typically represents.
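Roughly, the data-collection half of this pipeline looks as follows - a toy sketch only (the layer choice, the umap-learn usage and the dummy image loader are my assumptions; the feature-inversion / idealized-image step is omitted):

import torch
import torchvision.models as models
import umap                                        # umap-learn package

model = models.googlenet(pretrained=True).eval()   # torchvision's Inception v1, downloads weights
activations = []

def hook(module, inp, out):
    # grab one spatial activation vector per image (here - the central position)
    b, c, h, w = out.shape
    activations.append(out[:, :, h // 2, w // 2].detach())

model.inception4d.register_forward_hook(hook)      # an arbitrary mid-level layer

image_batches = [torch.randn(8, 3, 224, 224) for _ in range(4)]  # stand-in for 1M random images
with torch.no_grad():
    for batch in image_batches:
        model(batch)

acts = torch.cat(activations).numpy()
coords = umap.UMAP(n_components=2).fit_transform(acts)           # 2D layout for the atlas grid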
Russian STT datasets
Does anyone know of more proper datasets?
I found this (60 hours), but I could not find the link to the dataset:
Anyway, here is the list I found:
- 20 hours of the Bible github.com/festvox/datasets-CMU_
- And some disappointment here voice.mozilla.org/ru/languages
5th 2019 DS / ML digest
Highlights of the week
- New Adam version;
- POS tagging and semantic parsing in Russian;
- ML industrialization again;
Does anyone know anyone from TopCoder?
As usual with competition platforms, the organization sometimes has its issues
Tracking your hardware ... for data science
For a long time I thought that if you really want to track all of your servers' metrics you need Zabbix (which is very complicated).
A friend recommended an amazing tool to me
It installs and runs literally in minutes.
If you want to auto-start it properly, there are even Ubuntu packages (a bit older) and systemd examples
Dockerized metric exporters for GPUs by Nvidia
It also has extensive alerting features, but they are difficult to get started with - there is no minimal example
An open-source monitoring system with a dimensional data model, flexible query language, efficient time series database and modern alerting approach.
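The quoted description above is Prometheus' own tagline, so assuming that is the tool - exposing a custom metric from Python takes a few lines (a sketch using the prometheus_client package; the metric name and port are made up):

import random
import time

from prometheus_client import Gauge, start_http_server

gpu_temp = Gauge('gpu_temperature_celsius', 'GPU temperature reported by the box')

if __name__ == '__main__':
    start_http_server(8000)                  # Prometheus then scrapes http://host:8000/metrics
    while True:
        gpu_temp.set(random.uniform(40, 80)) # replace with a real reading, e.g. parsed from nvidia-smi
        time.sleep(15)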
LSTM vs TCN vs Trellis network
- Did not try the Trellis network - decided it was too complex;
- All the TCN properties from the digest spark-in.me/post/2018_ds_ml_dige
- Looks like a really simple and reasonable alternative to RNNs for modeling and ensembling;
- On a sensible benchmark it performs mostly the same as an LSTM from a practical standpoint (a minimal TCN block sketch is below);
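For reference, the core TCN building block is just a dilated causal 1D convolution with a residual connection - my own simplified sketch, not the benchmark code:

import torch
import torch.nn as nn

class CausalConvBlock(nn.Module):
    """One dilated causal conv layer: the output at time t only sees inputs up to t."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation                 # left-pad only => causal
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x):                                       # x: (batch, channels, time)
        out = self.conv(nn.functional.pad(x, (self.pad, 0)))
        return self.relu(out) + x                               # residual connection

# stack blocks with exponentially growing dilation to get a long receptive field
tcn = nn.Sequential(*[CausalConvBlock(64, dilation=2 ** i) for i in range(4)])
y = tcn(torch.randn(8, 64, 100))                                # output has the same length as the input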
Dependency parsing and POS tagging in Russian
Less popular set of NLP tasks.
Popular tools reviewed
(0) Well known
Only POS tags and morphology:
(0) github.com/IlyaGusev/rnnmorph (easy to use);
(1) github.com/nlpub/pymystem3 (easy to use);
Full dependency parsing
(0) Russian spacy plugin:
- github.com/buriy/spacy-ru - installation
(1) Malt parser based solution (drawback - no examples)
(2) Google's syntaxnet
We tried it
... yeah we tried it on a real task
plain Adam is a bit better
New variation of Adam?
Eliminate the generalization gap between adaptive methods and SGD;
TL;DR: A Faster And Better Optimizer with Highly Robust Performance;
- Dynamic bound on learning rates, inspired by gradient clipping (see the sketch below the abstract);
- Not very sensitive to the hyperparameters, especially compared with SGD(M);
- Tested on MNIST, CIFAR, Penn Treebank - no serious datasets;
Abstract Adaptive optimization methods such as AdaGrad, RMSProp and Adam have been proposed to achieve a rapid training process with an element-wise scaling term on learning rates. Though prevailing, they are observed to generalize poorly compared with SGD or even fail to converge due to unstable and extreme learning rates. Recent work has put forward some algorithms such as AMSGrad to tackle this issue but they failed to achieve considerable improvement over existing methods.
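The gist of the bound, as I understand it - a schematic sketch of the idea only (not the authors' code; hyperparameter names follow the paper loosely): do an Adam-like step, but clamp the element-wise learning rate into a band that slowly tightens around a final SGD-like rate.

import torch

def bounded_step(exp_avg, exp_avg_sq, step, lr=1e-3, final_lr=0.1, gamma=1e-3, eps=1e-8):
    # lower / upper bounds both converge to final_lr as step grows
    lower = final_lr * (1 - 1 / (gamma * step + 1))
    upper = final_lr * (1 + 1 / (gamma * step))
    adaptive_lr = lr / (exp_avg_sq.sqrt() + eps)                # element-wise Adam-like rate
    clipped_lr = adaptive_lr.clamp(min=lower, max=upper)        # the "dynamic bound"
    return clipped_lr * exp_avg                                 # parameter update delta

delta = bounded_step(torch.randn(10), torch.rand(10), step=100)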
4th 2019 DS / ML digest
Highlights of the week
- OpenAI controversy;
- BERT pre-training;
- Using transformer for conversational challenges;
A bit of lazy Sunday admin stuff
Monitoring your CPU temperature with email notifications
- Change CPU temp to any metric you like
- Rolling log
- The email is sent only once, when the metric becomes critical (you can also add an email for when the metric becomes non-critical again) - a minimal sketch is below
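A minimal sketch of the same idea (thresholds, addresses and the psutil-based temperature source are my choices, not the original script):

import logging
import smtplib
import time
from email.message import EmailMessage
from logging.handlers import RotatingFileHandler

import psutil                                   # assumed source of the CPU temperature

THRESHOLD_C = 85
logger = logging.getLogger('temp_watch')
logger.addHandler(RotatingFileHandler('temp_watch.log', maxBytes=1_000_000, backupCount=3))

def send_email(subject):
    msg = EmailMessage()
    msg['Subject'], msg['From'], msg['To'] = subject, 'box@example.com', 'me@example.com'
    with smtplib.SMTP('localhost') as s:        # replace with your SMTP server / credentials
        s.send_message(msg)

already_alerted = False
while True:
    temps = psutil.sensors_temperatures().get('coretemp', [])
    current = max((t.current for t in temps), default=0)
    logger.warning('CPU temp: %s', current)
    if current > THRESHOLD_C and not already_alerted:
        send_email(f'CPU temperature critical: {current} C')
        already_alerted = True                  # send the email only once per incident
    elif current <= THRESHOLD_C:
        already_alerted = False                 # re-arm once the metric is back to normal
    time.sleep(60)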
Setting up a GPU box on Ubuntu 18.04 from scratch
What is this channel about?
This channel is a practitioner's channel on the following topics: Internet, Data Science, Deep Learning, Python, NLP
Don't get worked up if your opinion differs.
No BS and ads - I already rejected 3-4 crappy ad deals
DS ML digests - in the RSS or via URLs like this
Buy me a coffee 🤟 buymeacoff.ee/8oneCIN
Give us a rating:
Our website spark-in.me
Our chat t.me/joinchat/Bv9tjkH9JHYvOr92hi
DS courses review (RU) - very old
2017 - 2018 SpaceNet Challenge
DS Bowl 2018
Data Science tag on the website
CFT 2018 competition
More amazing NLP-related articles incoming!
Maybe finally we will make podcasts?
Which type of content do you / would you like most on the channel?
- Weekly / bi-weekly digests; (34)
- Full articles; (13)
- Podcasts with actual ML practitioners; (12)
- Practical bits on real applied NLP; (28)
- Pre-trained BERT with Embedding Bags for Russian; (11)
- Paper reviews; (21)
- Jokes / memes / cats; (9)
(2) is valid both for models with a complex forward pass and for models with large embedding layers
PyTorch NLP best practices
Very simple ideas, actually.
(1) Multi GPU parallelization and FP16 training
Do not bother reinventing the wheel.
Just use nvidia's
Best examples [here](github.com/huggingface/pytorch-p
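Assuming the NVIDIA tool meant here is apex (the link above got truncated, so this is my guess), the FP16 part boils down to a couple of lines - a sketch of the apex.amp API as of early 2019:

import torch
from apex import amp                            # apex is built from source, not pip-installed

model = torch.nn.Linear(512, 2).cuda()          # requires a CUDA device
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# opt_level 'O1' = mixed precision with automatic casting
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

loss = model(torch.randn(8, 512).cuda()).mean()
with amp.scale_loss(loss, optimizer) as scaled_loss:   # loss scaling for FP16 stability
    scaled_loss.backward()
optimizer.step()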
(2) Put as much as possible INSIDE of the model
Implement as much of your logic as possible inside of the model
So that you can seamlessly use all the abstractions from (1) with ease (see the sketch below).
Also models are more abstract and reusable in general.
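A toy example of what "inside the model" means in practice (my own sketch): compute the loss in forward(), so nn.DataParallel runs the heavy part on each GPU and only small loss tensors come back.

import torch
import torch.nn as nn

class ModelWithLoss(nn.Module):
    def __init__(self, vocab_size=50_000, dim=300, num_classes=10):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, num_classes)
        self.loss = nn.CrossEntropyLoss()

    def forward(self, tokens, targets):                 # tokens: (batch, seq_len)
        logits = self.head(self.emb(tokens).mean(dim=1))
        return self.loss(logits, targets)               # loss is computed inside forward()

model = ModelWithLoss()
tokens = torch.randint(0, 50_000, (32, 20))
targets = torch.randint(0, 10, (32,))
loss = model(tokens, targets)       # with nn.DataParallel, .mean() the gathered per-GPU losses
loss.backward()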
(3) Why have a separate train/val loop?
PyTorch 0.4 introduced context managers.
You can simplify your train / val / test loops, and merge them into one simple function.
# assumes: import torch; from tqdm import tqdm
context = torch.no_grad() if loop_type == 'Val' else torch.enable_grad()
with context:
    for i, some_tensor in enumerate(tqdm(loader)):   # train / val / test loader, depending on loop_type
        pass  # do your stuff here
(4) Use the EmbeddingBag layer for morphologically rich languages. Seriously!
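A tiny usage sketch (the sub-word split and ids below are made up): nn.EmbeddingBag averages the vectors of all sub-word pieces of a word in a single call, which is exactly what you want for FastText-style sub-word handling.

import torch
import torch.nn as nn

emb = nn.EmbeddingBag(num_embeddings=100_000, embedding_dim=300, mode='mean')

# two words, each split into sub-word / n-gram ids
pieces = torch.tensor([3, 17, 42, 7, 7000])     # flat list of piece ids for both words
offsets = torch.tensor([0, 3])                  # word 0 = pieces[0:3], word 1 = pieces[3:]

word_vectors = emb(pieces, offsets)             # shape (2, 300) - one averaged vector per word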
(5) Writing trainers / training abstractions
This is a waste of time imho if you follow (1), (2) and (3).
(6) Nice bonus
If you follow most of these, you can train on as many GPUs and machines as you want, for any language.
(7) Using tensorboard for logging
This goes without saying.
PyTorch DataLoader, GIL thrashing and CNNs
Well all of this seems a bit like magic to me, but hear me out.
I abused my GPU box for weeks running CNNs on 2-4 GPUs.
And then my GPU box started shutting down for no apparent reason.
No, this was not:
- CPU overheating (I have a massive cooler, I checked - it works);
- What adds to the confusion is that AMD has weird temperature readings;
To cut the story short - if you have a very fast Dataset class and you use PyTorch's DataLoader with
num_workers > 0, it can lead to system instability instead of a speed-up (see the snippet below).
It is obvious in retrospect, but it is not when you face this issue.
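The boring fix, for the record (a sketch; the dataset and numbers are placeholders): keep num_workers at 0 for a cheap Dataset and only raise it when __getitem__ actually does heavy preprocessing.

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(10_000, 3, 64, 64))   # stand-in for a very fast Dataset

# cheap __getitem__: extra worker processes mostly add IPC / GIL churn
fast_loader = DataLoader(dataset, batch_size=256, num_workers=0, pin_memory=True)

# only reach for num_workers > 0 when __getitem__ does heavy preprocessing
heavy_loader = DataLoader(dataset, batch_size=256, num_workers=4, pin_memory=True)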
Russian thesaurus that really works
It knows so many peculiar / old-fashioned and cheeky synonyms for obscene words!
Russian Distributional Thesaurus (RDT for short) is a project to build an open distributional thesaurus of the Russian language. At the moment the resource contains several components: word vectors (word embeddings), a word similarity graph (the distributional thesaurus), a set of hypernyms and an inventory of word senses. All resources were built automatically from a corpus of Russian-language books (12.9 billion word tokens). Future versions of the resource are planned to also include word sense vectors for Russian, derived from the same text corpus. The project is developed by representatives of UrFU, Lomonosov Moscow State University and the University of Hamburg. In the past, researchers from South Ural State University, TU Darmstadt, the University of Wolverhampton and the University of Trento contributed to the project.
Old news ... but Attention works
Funnily enough, in the past my models:
- Either did not need attention;
- Attention was implemented by @thinline72 ;
- The domain was so complicated (NMT) that I had to resort to boilerplate with key-value attention;
It was the first time I / we tried manually building a model with plain self-attention from scratch.
And you know - it really adds 5-10% to all of the tracked metrics.
Best plain attention layer in PyTorch - simple, well documented ... and it works in real life applications:
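Not the layer linked above (the link did not survive into this digest), but a bare-bones plain self-attention layer in PyTorch looks roughly like this:

import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class PlainSelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention over a sequence."""
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, x, mask=None):                    # x: (batch, seq_len, dim)
        q, k, v = self.query(x), self.key(x), self.value(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(x.size(-1))
        if mask is not None:                            # mask: (batch, 1, seq_len), True = keep
            scores = scores.masked_fill(~mask, float('-inf'))
        return F.softmax(scores, dim=-1) @ v            # (batch, seq_len, dim)

attn = PlainSelfAttention(dim=128)
out = attn(torch.randn(4, 20, 128))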