Spark in me - Internet, data science, math, deep learning, philo

snakers4 @ telegram, 1797 members, 1726 posts since 2016

All this - lost like tears in rain.

Data science, ML, a bit of philosophy and math. No bs.

Our website
Our chat
DS courses review

Posts by tag «data_science»:

snakers4 (Alexander), April 17, 08:55

Archive team ... makes monthly Twitter archives

With all the BS with politics / "Russian hackers" / Arab Spring - Twitter has now closed its developer API.

No problem.

Just pay a visit to the Archive Team page (filtered by year 2018).

Donate to them here




Archive Team: The Twitter Stream Grab : Free Web : Free Download, Borrow and Streaming : Internet Archive

A simple collection of JSON grabbed from the general twitter stream, for the purposes of research, history, testing and memory. This is the Spritzer version, the most light and shallow of Twitter grabs. Unfortunately, we do not currently have access to the Sprinkler or Garden Hose versions of the...
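A rough sketch of how such a grab could be read, assuming (not confirmed by the post) that each archive file is bz2-compressed with one JSON object per line. The file name and the sample records are made up so the snippet runs without downloading anything.

```python
import bz2
import json
import os
import tempfile


def iter_records(path):
    """Yield one parsed JSON object per non-empty line of a bz2-compressed file."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Hypothetical stand-in for one file from the archive.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.json.bz2")
sample_records = [
    {"id": 1, "text": "hello", "lang": "en"},
    {"delete": {"status": {"id": 2}}},  # assumed: the stream also carries deletion notices
    {"id": 3, "text": "privet", "lang": "ru"},
]
with bz2.open(sample_path, mode="wt", encoding="utf-8") as handle:
    for record in sample_records:
        handle.write(json.dumps(record) + "\n")

# Keep only records that look like tweets, i.e. have a "text" field.
tweets = [record for record in iter_records(sample_path) if "text" in record]
print(len(tweets), [tweet["lang"] for tweet in tweets])
```

Reading line by line keeps memory flat, which matters once the files are monthly dumps rather than a three-line sample.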

snakers4 (Alexander), April 17, 08:47

Using snakeviz for profiling Python code


To profile complicated and convoluted code.

Snakeviz is a cool GUI tool to analyze cProfile profile files.

Just launch your code like this

python3 -m cProfile -o profile_file.cprofile your_script.py

And then just analyze with snakeviz.


They have a server GUI and a Jupyter notebook plugin.

Also you can launch their tool from within a Docker container (-s starts only the server, -H sets the host to bind to):

snakeviz -s -H 0.0.0.0 profile_file.cprofile

Do not forget to EXPOSE necessary ports. SSH tunnel to a host is also an option.
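If you would rather profile just one piece of code instead of the whole script, the same kind of profile file can be written from Python itself. A minimal sketch using only the standard library; slow_sum is a made-up workload and the file goes to a temp directory.

```python
import cProfile
import os
import pstats
import tempfile


def slow_sum(n):
    """Deliberately naive workload so the profiler has something to measure."""
    total = 0
    for i in range(n):
        total += i * i
    return total


profile_path = os.path.join(tempfile.mkdtemp(), "profile_file.cprofile")

profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(200_000)
profiler.disable()

# Same binary format that `python3 -m cProfile -o ...` writes.
profiler.dump_stats(profile_path)

# Quick text summary in the terminal; snakeviz can open the same file.
stats = pstats.Stats(profile_path)
stats.sort_stats("cumulative").print_stats(5)
```

The resulting file can then be passed to snakeviz exactly like the one produced on the command line.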



SnakeViz is a browser based graphical viewer for the output of Python's cProfile module.

snakers4 (Alexander), March 26, 04:44

Russian sentiment dataset

In a typical Russian fashion - one of these datasets was deleted at the request of bad people, whom I shall not name.

Luckily, someone anonymous backed the dataset up.

Anyway - use it.

Yeah, it is small. But it is free, so whatever.



Download Dataset.tar.gz 1.57 MB

snakers4 (Alexander), March 25, 05:31

Good old OLS regression

I needed some quick boilerplate to create an OLS regression with confidence intervals for a very plain task.

Found some nice statsmodels examples here:


2019 DS / ML digest number 7

Highlights of the week

- NN normalization techniques (not batch norm);

- Jetson nano for US$99 released;

- A bitter lesson in AI;



2019 DS/ML digest 07


snakers4 (Alexander), March 18, 06:18

6th 2019 DS / ML digest

Highlights of the week

- Cool python features;

- Google's on-device STT;

- Why Facebook invested so much in PyTorch 1.0;




2019 DS/ML digest 06


snakers4 (Alexander), March 06, 10:31

5th 2019 DS / ML digest

Highlights of the week

- New Adam version;

- POS tagging and semantic parsing in Russian;

- ML industrialization again;




2019 DS/ML digest 05


snakers4 (Alexander), February 18, 09:24

4th 2019 DS / ML digest

Highlights of the week

- OpenAI controversy;

- BERT pre-training;

- Using transformer for conversational challenges;




2019 DS/ML digest 04


snakers4 (Alexander), February 08, 10:11

Third 2019 DS / ML digest

Highlights of the week

- quaternions;

- ODEs;




2019 DS/ML digest 03


snakers4 (Alexander), January 31, 09:41

Second 2019 DS / ML digest

Highlight of the week - Facebook's LASER.




2019 DS/ML digest 02


snakers4 (Alexander), January 31, 08:38

Jupyter widgets + pandas

With the @interact decorator, the IPywidgets library automatically gives us a text box and a slider for choosing a column and number! It looks at the inputs
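A small sketch of what the quote describes, assuming pandas and ipywidgets are installed and the code runs in a notebook. The DataFrame, its columns and the function name are made up for illustration.

```python
import pandas as pd
from ipywidgets import interact

# Hypothetical toy data; the post does not show the original DataFrame.
df = pd.DataFrame({"score": [10, 250, 4000], "count": [100, 900, 20000]})


@interact
def show_rows_above(column="score", threshold=100):
    # A str default becomes a text box, an int default becomes a slider.
    return df.loc[df[column] > threshold]
```

The decorated function stays an ordinary function, so it can still be called directly, e.g. show_rows_above("count", 1000).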



Interactive Controls in Jupyter Notebooks

How to use IPywidgets to enhance your data exploration and analysis

snakers4 (Alexander), January 30, 12:06

Serialization of large objects in Python

So far I have found no sane way to do this with 1M chunks / 10GB+ object size.

Of course, chunking / plain txt works.

Feather / parquet fail at 2+GB size.

Pickle works, but it is kind of slow.
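One way to combine the two options that do work (chunking and pickle) is to write each chunk as its own pickle record into a single file and stream the records back. A minimal stdlib sketch; the helper names and the toy chunk sizes are made up, and real chunks would be much larger.

```python
import os
import pickle
import tempfile


def dump_chunks(chunks, path):
    """Write each chunk to one file as its own pickle record; return the count."""
    count = 0
    with open(path, "wb") as handle:
        for chunk in chunks:
            pickle.dump(chunk, handle, protocol=pickle.HIGHEST_PROTOCOL)
            count += 1
    return count


def load_chunks(path):
    """Lazily yield chunks back in the order they were written."""
    with open(path, "rb") as handle:
        while True:
            try:
                yield pickle.load(handle)
            except EOFError:
                break


def make_chunks(n_chunks, chunk_size):
    """Toy generator standing in for the real data source."""
    for i in range(n_chunks):
        start = i * chunk_size
        yield list(range(start, start + chunk_size))


path = os.path.join(tempfile.mkdtemp(), "big_object.pkl")
written = dump_chunks(make_chunks(n_chunks=100, chunk_size=1_000), path)
total = sum(sum(chunk) for chunk in load_chunks(path))
print(written, total)
```

Only one chunk is held in memory at a time on both the write and the read side, so the object never has to fit in RAM as a whole.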



snakers4 (Alexander), January 15, 08:33

First 2019 DS / ML digest

No particular highlights - though maybe the ML industrialization trend is here to stay?




2019 DS/ML digest 01


snakers4 (Alexander), December 30, 2018

Spark in me 2018 annual retrospective


- My personal progress and some views;

- ML is still amazing, but there are no illusions anymore;

- Telegram is still amazing, but commercialization looms;

- FAIR is an inspiration;

- lmcinnes (Leland McInnes) with UMAP and HDBSCAN as well;


I also wrote a bit in Russian, with some local specifics, if that is more convenient for you



Spark in me - annual retrospective 2018


snakers4 (Alexander), December 19, 2018

DS/ML digest 32


- A way to replace softmax in NMT;

- Large visual reasoning dataset;

- PyText;




2018 DS/ML digest 32


snakers4 (Alexander), December 10, 2018

Simpson's paradox

Nice explanation


Simpson’s Paradox and Interpreting Data

The challenge of finding the right view through data

snakers4 (Alexander), December 09, 2018

DS/ML digest 31

Highlights of the week:

- PyTorch 1.0 released;

- Drawing with GANs;

- BERT explained;




2018 DS/ML digest 31


snakers4 (Alexander), December 02, 2018

A cheeky ML/DS themed sticker pack for our channel

Thanks to @birdborn for his art.

You are welcome to use it:

If you would like to contribute / create your own stickers - please ask around in our channel chat.


snakers4 (Alexander), November 28, 2018

DS/ML digest 30




2018 DS/ML digest 30


snakers4 (Alexander), November 23, 2018

Jupyter extensions

Looks like they are nearing the end of their support.


On a fresh build you will need this

conda install notebook=5.6

To use them.

Will need to invest some time into making Jupyter Lab actually usable.


snakers4 (Alexander), November 22, 2018

Our victory in the CFT-2018 competition


- Multi-task learning + seq2seq models rule;

- The domain seems to be easy, but it is not;

- You can also build a pipeline based on manual features, but it will not be task agnostic;

- Loss weighting is crucial for such tasks;

- Transformer trains 10x longer;




Winning a CFT 2018 spelling correction competition

Building a task-agnostic seq2seq pipeline on a challenging domain

snakers4 (Alexander), November 15, 2018

DS/ML digest 29




2018 DS/ML digest 29


snakers4 (Alexander), November 10, 2018

Towards Data Science

Our article was accepted to their publication:


Also, once you have published there, you can publish your work on TDS on a recurring basis =)

I doubt that this will be properly distributed to all 130k of their subs, but nevertheless this is a milestone.


Building client routing / semantic search in the wild

A comparison of novel NLP techniques within an applied business setting

snakers4 (Alexander), November 06, 2018

DS/ML digest 28

Google open sources pre-trained BERT ... with 102 languages ...




2018 DS/ML digest 28


snakers4 (Alexander), November 03, 2018

Building client routing / semantic search and clustering arbitrary external corpuses at

A brief executive summary about what we achieved at

If you have similar experience or have anything similar to share - please do not hesitate to contact me.

Also we are planning to extend this article into a small series, if it gains momentum. So please like / share the article if you like it.




Building client routing / semantic search and clustering arbitrary external corpuses at


snakers4 (Alexander), October 23, 2018

DS/ML digest 27

NLP is in focus again!

Also your humble servant learned how to do proper NMT =)




2018 DS/ML digest 27


snakers4 (Alexander), October 22, 2018

Amazing articles about image hashing

Also a python library

- Library

- Articles:




A Python Perceptual Image Hashing Module. Contribute to JohannesBuchner/imagehash development by creating an account on GitHub.
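To give a feel for what a perceptual hash does, here is a hand-rolled sketch of the simplest variant (average hash) in plain Python. It is not the imagehash API: a real implementation first converts the image to grayscale and downscales it (typically to 8x8), which is skipped here by starting from tiny made-up pixel grids.

```python
def average_hash(pixels):
    """Return a list of bits: 1 where a pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def hamming_distance(hash_a, hash_b):
    """Number of bit positions where the two hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


# 4x4 grayscale "images" standing in for already downscaled pictures.
original = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]
brighter = [[min(255, value + 20) for value in row] for row in original]
inverted = [[255 - value for value in row] for row in original]

h_original = average_hash(original)
print(hamming_distance(h_original, average_hash(brighter)))  # 0
print(hamming_distance(h_original, average_hash(inverted)))  # 16
```

A uniform brightness shift leaves the hash unchanged, while a structurally different image flips every bit, which is why small Hamming distances are used as a near-duplicate signal.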

Text iterators in PyTorch

Looks like PyTorch has some handy data-processing / loading tools for text models.

The linked explanation shows how to use them with pack_padded_sequence and pad_packed_sequence to boost PyTorch NLP models substantially.
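A minimal sketch of the packing part only (not the text iterators), assuming a reasonably recent PyTorch (1.1+ for enforce_sorted). The token ids, vocabulary size and layer sizes are made up.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence, pad_sequence

torch.manual_seed(0)

# Three sequences of token ids with different lengths; 0 is reserved for padding.
sequences = [torch.tensor([4, 2, 7, 1]), torch.tensor([3, 5]), torch.tensor([6, 1, 2])]
lengths = torch.tensor([len(sequence) for sequence in sequences])

# Pad to a rectangular batch of shape (batch, max_len).
padded = pad_sequence(sequences, batch_first=True, padding_value=0)

embedding = nn.Embedding(num_embeddings=10, embedding_dim=8, padding_idx=0)
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

# Packing tells the LSTM the true lengths, so it skips the padded steps.
packed = pack_padded_sequence(
    embedding(padded), lengths, batch_first=True, enforce_sorted=False
)
packed_output, (hidden, cell) = lstm(packed)

# Unpack back to a padded tensor; padded positions come back as zeros.
output, output_lengths = pad_packed_sequence(packed_output, batch_first=True)
print(output.shape, output_lengths.tolist())
```

The speed-up comes from not running the recurrent cell over padding, and the final hidden state then corresponds to the last real token of each sequence rather than to a pad token.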



snakers4 (Alexander), October 15, 2018

An Open source alternative to Mendeley

Looks like Zotero is also cross-platform and open source.

Also you can import the whole Mendeley library with 1 button push:


kb:mendeley import [Zotero Documentation]

Zotero is a free, easy-to-use tool to help you collect, organize, cite, and share research.

PyTorch developer conference part 1
Sessions on Applied Research in Industry & Developer Education. Talks from @karpathy (@Tesla), @ctnzr (@nvidia), @ftzo (Pyro/@UberEng), @MarkNeumannnn (@alle...

snakers4 (Alexander), October 15, 2018

DS/ML digest 26

More interesting NLP papers / material ...




2018 DS/ML digest 26


snakers4 (Alexander), October 10, 2018

Another set of links for common crawl for NLP

Looks like we were not the first, ofc.

Below are some projects dedicated to NLP corpus retrieval at scale:

- Java + license detection + boilerplate removal: dkpro-c4corpus;

- Prepared deduplicated CC text archives;

- Google group topic: common-crawl/6F-yXsC35xM;



Downloading 200GB files in literally hours

(1) Order 500 Mbit/s Internet connection from your ISP

(2) Use aria2 with the -x option (several connections per server)

(3) Profit


snakers4 (Alexander), October 08, 2018

Going from millions of points of data to billions on a single machine

In my experience pandas works fine with tables up to 50-100M rows.

Ofc plain indexing / caching (i.e. pre-processing all of your data in chunks and indexing it somehow) and / or clever map/reduce-style optimizations work.

But sometimes it is just good to know that such things exist:

- for large data-frames + some nice visualizations;

- for large visualizations;

- Also you can use Dask for these purposes I guess;
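The plain chunking + map/reduce approach mentioned above can be sketched with pandas alone, assuming pandas is installed. The in-memory CSV and its columns are made up and stand in for a file too large to load at once.

```python
import io

import pandas as pd

# Stand-in for a huge CSV file (hypothetical columns).
csv_data = io.StringIO(
    "user_id,amount\n"
    "1,10\n"
    "2,5\n"
    "1,7\n"
    "3,1\n"
    "2,20\n"
    "1,3\n"
)

partial_sums = []
for chunk in pd.read_csv(csv_data, chunksize=2):
    # "Map" step: aggregate inside each chunk.
    partial_sums.append(chunk.groupby("user_id")["amount"].sum())

# "Reduce" step: combine the per-chunk aggregates into final totals.
totals = pd.concat(partial_sums).groupby(level=0).sum()
print(totals.to_dict())
```

This only works directly for aggregates that can be combined from partial results (sums, counts, min / max); something like a median needs a different strategy.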


Python3 nvidia driver bindings in glances

They used to have only python2 ones.

If you update your drivers and glances, you will get a nice GPU memory / load indicator within glances.

So convenient.

