Spark in me - Internet, data science, math, deep learning, philo

snakers4 @ telegram, 1810 members, 1762 posts since 2016

All this - lost like tears in rain.

Data science, ML, a bit of philosophy and math. No bs.


Posts by tag «data_science»:

snakers4 (Alexander), August 15, 14:42

My foray into the STT Dark Forest

My tongue-in-cheek article on ML in general, and on how to make your STT model train 3-4x faster with 4-5x fewer weights at the same quality




Navigating the Speech to Text Dark Forest

A tongue-in-cheek description of our STT path

snakers4 (Alexander), August 11, 04:42

Extreme NLP network miniaturization

Tried some plain RNNs on a custom in-the-wild NER task.

The dataset is huge - literally infinite, but manually generated to mimic in-the-wild data.

I use EmbeddingBag + 1m n-grams (an optimal cut-off). Yeah, on NER / classification it is a handy trick that makes your pipeline totally misprint / error / OOV agnostic. FAIR themselves recently arrived at the same idea too. Very cool! Just add PyTorch and you are golden.

What is interesting:

- Model works with embedding sizes 300, 100, 50 and even 5! 5 is dangerously close to OHE, but doing OHE on 1m n-grams kind of does not make sense;

- Model works with various hidden sizes;

- Naturally all of the models run on CPU very fast, but the smallest model is also very light in terms of its weights;

- The only difference is convergence time. It kind of scales as a log of model size, i.e. the model with embedding size 5 takes 5-7x more time to converge compared to the model with 50. I wonder what happens if I use an embedding size of 1?;

As an added bonus - you can just store such a miniature model in git w/o lfs.

What is with training transformers on US$250k worth of compute credits you say?)
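A minimal sketch of the n-gram trick above, assuming character n-grams hashed into a fixed number of buckets. The names `char_ngrams`, `ngram_ids`, `to_bag_inputs` and `embedding_table_mb`, the n-gram lengths (2 to 4) and the CRC32 hash are all assumptions for illustration, not the actual pipeline. The output is a flat list of indices plus per-token offsets, which is the input layout `torch.nn.EmbeddingBag` accepts.

```python
import zlib

NUM_BUCKETS = 1_000_000  # assumption: mirrors the "1m n-grams" cut-off


def char_ngrams(token, n_min=2, n_max=4):
    """Character n-grams of a token padded with boundary markers."""
    padded = f"<{token}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def ngram_ids(token, num_buckets=NUM_BUCKETS):
    """Hash every n-gram into a bucket; no vocabulary, so nothing is OOV."""
    return [
        zlib.crc32(gram.encode("utf-8")) % num_buckets
        for gram in char_ngrams(token)
    ]


def to_bag_inputs(tokens):
    """Flat indices + offsets (start of each token's bag of n-grams)."""
    indices, offsets = [], []
    for token in tokens:
        offsets.append(len(indices))
        indices.extend(ngram_ids(token))
    return indices, offsets


def embedding_table_mb(num_buckets, dim, bytes_per_weight=4):
    """Size of the embedding table in MB (1 MB = 10**6 bytes), float32 by default."""
    return num_buckets * dim * bytes_per_weight / 1e6
```

A misspelled token like "modle" still shares n-grams such as "<m" and "mo" with "model", which is where the misprint tolerance comes from. Under these assumptions a 1m x 5 float32 table is about 20 MB, versus about 1200 MB at embedding size 300.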




A new model for word embeddings that are resilient to misspellings

Misspelling Oblivious Embeddings (MOE) is a new model for word embeddings that are resilient to misspellings, improving the ability to apply word embeddings to real-world situations, where misspellings are common.

snakers4 (Alexander), August 02, 11:19

Managing your DS / ML environment neatly and in style

If you need a sophisticated environment to do DS / ML / DL, then using a set of Docker images may be a good idea.

You can also tap into a vast community of amazing and well-maintained Dockerhub repositories (e.g. nvidia, pytorch).

But what if you have to do this for several people? And use it with a proper IDE via ssh?

Well-known features of Docker include copy-on-write and user "forwarding". If you approach this naively, each user will store their own images, which takes quite some space.

And you also have to make your ssh daemon work inside of a container as a second service.

So I solved these "challenges" and created 2 public layers so far:

- Basic DS / ML layer - FROM aveysov/ml_images:layer-0 - from dockerfile;

- DS / ML libraries - FROM aveysov/ml_images:layer-0- from dockerfile;

Your final dockerfile may look something like this, just pulling from any of those layers.

Note that when building this, you will need to pass your UID as a variable, e.g.:

docker build --build-arg NB_UID=1000 -t av_final_layer -f Layer_final.dockerfile .

When launched, this starts a notebook with extensions. You can just exec into the machine itself to run scripts or use an ssh daemon inside (do not forget to add your ssh key and run service ssh start).




Contribute to snakers4/gpu-box-setup development by creating an account on GitHub.

Using public Dockerhub account for your private small scale deploy

Also a lifehack - you can just use Dockerhub for your private stuff, just separate the public part and the private part.

Push the public part (i.e. libraries and frameworks) to Dockerhub.

Your private Dockerfile will then be something like:

FROM your_user/your_repo:latest

COPY your_app_folder your_app_folder


CMD ["python3", ""]

snakers4 (Alexander), July 18, 04:55

An ideal remote IDE?


No, looks like VScode recently got its remote development extensions (they were only in the Insiders build a couple of months ago) working just right.

I tried the remote-ssh extension and it looks quite polished. No syncing your large data folders and loading all python dependencies locally for hours.

The problem? It took me an hour just to open an ssh session under Windows properly (permissions and Linux folder path substitution are hell on Windows). When I opened it - it worked like a charm.

So for now (it is personal) - the best tools, in my opinion, are:

- Notebooks - for exploration and testing;

- VScode for codebase;

- Atom - for local scripts;


Visual Studio Code Remote Development

snakers4 (Alexander), July 16, 12:53

Full IDE in a browser?


You all know all the pros and cons of:

- IDEs (PyCharm);

- Advanced text editors (Atom, Sublime Text);

- Interactive environments (notebook / lab, Atom + Hydrogen);

I personally dislike local IDEs - not because connecting to a remote / remote kernel / remote interpreter is a bit of a chore. Setting up is easy, but always thinking about what is synced and what is not is just pain. Also, when your daily driver machine is on Windows, using the Linux subsystem all the time with Windows paths is just pain. (Also I dislike bulky interfaces, but this is just a habit and it depends).

But what if I told you there is a third option? =)

If you work as a team on a remote machine / set of machines?

TLDR - you can run a modern web "IDE" (it is something between Atom and a real IDE - less bulky, but with fewer functions) in a browser.

Now you can just run it with one command.


- It is open source (though shipped as a part of some enterprise packages like Eclipse Che);

- Pre-built images available;

- It is extensible - new modules get released - you can build it yourself or just find a build;

- It has extensive linting, python language server (just a standard library though);

- It has full text search ... kind of;

- Follow definition in your code;

- Docstrings and auto-complete work for your modules and standard library (not for your packages);

Looks cool af!

If they ship a build with a remote python kernel, then it will be a perfect option for teams!

I hope it will not follow the path taken by another crowd favourite, a similar web editor (it was purchased by Amazon).


- Website;

- Pre-built apps for python;

- Language server they are using;


Theia - Cloud and Desktop IDE

Theia is an open-source cloud & desktop IDE framework implemented in TypeScript.

If you know how to add your python kernel to Theia - please ping me)

snakers4 (Alexander), July 15, 04:48

Trying to migrate to JupyterLab from Jupyter Notebook?

Some time ago I noticed that the Jupyter extensions project was more or less frozen => JupyterLab obviously is trying to shift community attention to npm / nodejs plugins.

Again (like 6-12 months ago) I tried to do this.

This time Lab is more mature:

- Now at version >1;

- Now they have a built-in package manager;

- They have some of the most necessary extensions (i.e. git, toc, google drive, etc);

- UI got polished a bit, but window in a window still produces a bit of mental friction. Only the most popular file formats are supported. Text editor inherited the best features, but it is still a bit rudimentary;

- Full screen width by default;

- Some useful things (like code folding) are now turned on in the settings json file;

- Using these extensions is a bit of a chore in edge cases (i.e. some user permission problems / you have to re-build the app each time you add an extension);

But I could not switch, mostly for one reason - this one.


If you have a Jupyter environment it is very easy to switch. For me, before it was:

# 5.6 because otherwise I have a bug with installing extensions
RUN conda install notebook=5.6

RUN pip install git+ && \
jupyter contrib nbextension install --user

CMD jupyter notebook --port=8888 --ip= --no-browser

And it just became:

RUN conda install -c conda-forge jupyterlab

CMD jupyter lab --port=8888 --ip= --no-browser


Support collapsible hierarchy of sections · Issue #2275 · jupyterlab/jupyterlab

Allow users to toggle open/close sections by clicking on some kind of UI element. This helps with navigating and organizing large notebooks.

snakers4 (Alexander), July 02, 07:34

New version of our open STT dataset - 0.5, now in beta

Please share and repost!

What is new?

- A new domain - radio (1000+ new hours);

- A larger YouTube dataset with 1000+ additional hours;

- A small (300 hours) YouTube dataset downloaded in maximum quality;

- Ground truth validation sets for YouTube / books / public calls manually annotated;

- Now we will start to focus on actually cleaning and distilling the dataset. We have published a second list of "bad" data;

I'm back from vacation)





Russian open STT dataset. Contribute to snakers4/open_stt development by creating an account on GitHub.

snakers4 (Alexander), May 20, 06:21

New in our Open STT dataset

- An mp3 version of the dataset;

- A torrent for mp3 dataset;

- A torrent for the original wav dataset;

- Benchmarks on the public dataset / files with "poor" annotation marked;






snakers4 (Alexander), May 09, 11:28

TowardsDataScience post for our dataset

In addition to a github release and a medium post, we also made a post:


Also our post was accepted to an editor's pick part of TDS:


Share / give us a star / clap if you have not already!

Original release




A huge open dataset of Russian speech

Speech recognition specialists have long lacked a large open corpus of spoken Russian, so only large companies could afford to work on...

snakers4 (Alexander), May 02, 06:02

Russian Open Speech To Text (STT/ASR) Dataset

4000 hours of STT data in Russian

Made by us. Yes, really. I am not joking.

It was a lot of work.

The dataset:

Accompanying post:


- On the third release, we have ~4000 hours;

- Contributors and help wanted;

- Let's bring the Imagenet moment in STT closer together!;

Please repost this as much as you can.







snakers4 (Alexander), April 17, 08:55

Archive team ... makes monthly Twitter archives

With all the BS with politics / "Russian hackers" / Arab spring - Twitter has now closed its developer API.

No problem.

Just pay a visit to the Archive Team page.

Donate to them here




Archive Team: The Twitter Stream Grab : Free Web : Free Download, Borrow and Streaming : Internet Archive

A simple collection of JSON grabbed from the general twitter stream, for the purposes of research, history, testing and memory. This is the Spritzer version, the most light and shallow of Twitter grabs. Unfortunately, we do not currently have access to the Sprinkler or Garden Hose versions of the...

snakers4 (Alexander), April 17, 08:47

Using snakeviz for profiling Python code


To profile complicated and convoluted code.

Snakeviz is a cool GUI tool to analyze cProfile profile files.

Just launch your code like this

python3 -m cProfile -o profile_file.cprofile

And then just analyze with snakeviz.


They have a server GUI and a jupyter notebook plugin.

Also you can launch their tool from within a docker container:

snakeviz -s -H profile_file.cprofile

Do not forget to EXPOSE necessary ports. SSH tunnel to a host is also an option.
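The same kind of profile file can also be written from inside Python, which is handy when you only want to profile one function instead of a whole script. A minimal sketch using the standard `cProfile` and `pstats` modules; `slow_sum` and `profile_to_file` are hypothetical names made up for this example.

```python
import cProfile
import pstats


def slow_sum(n):
    """Toy workload to have something to profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_to_file(func, path, *args, **kwargs):
    """Run func under cProfile and dump the stats to path."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    profiler.dump_stats(path)  # same binary format that `-m cProfile -o` writes
    return result


def top_functions(path, limit=5):
    """Print the most expensive calls by cumulative time, without any GUI."""
    pstats.Stats(path).sort_stats("cumulative").print_stats(limit)
```

Calling `profile_to_file(slow_sum, "profile_file.cprofile", 100_000)` produces a file that can then be opened with `snakeviz profile_file.cprofile`.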



SnakeViz is a browser based graphical viewer for the output of Python's cProfile module.

snakers4 (Alexander), March 26, 04:44

Russian sentiment dataset

In a typical Russian fashion - one of these datasets was deleted by the request of bad people, whom I shall not name.

Luckily, someone anonymous backed the dataset up.

Anyway - use it.

Yeah, it is small. But it is free, so whatever.



Download Dataset.tar.gz 1.57 MB

snakers4 (Alexander), March 25, 05:31

Good old OLS regression

I needed some quick boilerplate to create an OLS regression with confidence intervals for a very plain task.

Found some nice statsmodels examples here:


2019 DS / ML digest number 7

Highlights of the week

- NN normalization techniques (not batch norm);

- Jetson nano for US$99 released;

- A bitter lesson in AI;



2019 DS/ML digest 07


snakers4 (Alexander), March 18, 06:18

6th 2019 DS / ML digest

Highlights of the week

- Cool python features;

- Google's on-device STT;

- Why Facebook invested so much in PyTorch 1.0;




2019 DS/ML digest 06


snakers4 (Alexander), March 06, 10:31

5th 2019 DS / ML digest

Highlights of the week

- New Adam version;

- POS tagging and semantic parsing in Russian;

- ML industrialization again;




2019 DS/ML digest 05


snakers4 (Alexander), February 18, 09:24

4th 2019 DS / ML digest

Highlights of the week

- OpenAI controversy;

- BERT pre-training;

- Using transformer for conversational challenges;




2019 DS/ML digest 04


snakers4 (Alexander), February 08, 10:11

Third 2019 DS / ML digest

Highlights of the week

- quaternions;

- ODEs;




2019 DS/ML digest 03


snakers4 (Alexander), January 31, 09:41

Second 2019 DS / ML digest

Highlight of the week - Facebook's LASER.




2019 DS/ML digest 02


snakers4 (Alexander), January 31, 08:38

Jupyter widgets + pandas

With the @interact decorator, the IPywidgets library automatically gives us a text box and a slider for choosing a column and number! It looks at the inputs



Interactive Controls in Jupyter Notebooks

How to use IPywidgets to enhance your data exploration and analysis

snakers4 (Alexander), January 30, 12:06

Serialization of large objects in Python

So far I have found no sane way to do this with 1M chunks / 10GB+ object size.

Of course, chunking / plain txt works.

Feather / parquet - fail with 2+GB size.

Pickle works, but it is kind of slow.
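A minimal sketch of the chunking option with plain pickle, assuming the object can be expressed as an iterable of items. The helper names `dump_chunks` and `load_chunks` and the default chunk size are made up for illustration. Chunks are pickled one after another into a single file and read back lazily, so the whole object never has to be serialized or loaded in one piece.

```python
import pickle


def dump_chunks(items, path, chunk_size=10_000):
    """Pickle an iterable into one file as a sequence of list chunks."""
    with open(path, "wb") as f:
        chunk = []
        for item in items:
            chunk.append(item)
            if len(chunk) == chunk_size:
                pickle.dump(chunk, f, protocol=pickle.HIGHEST_PROTOCOL)
                chunk = []
        if chunk:  # flush the last, possibly shorter chunk
            pickle.dump(chunk, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_chunks(path):
    """Yield the chunks back one at a time until the file is exhausted."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return
```

This does not make pickle any faster, it only keeps the peak memory bounded by the chunk size.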



snakers4 (Alexander), January 15, 08:33

First 2019 DS / ML digest

No particular highlights - just maybe ML industrialization vector is here to stay?




2019 DS/ML digest 01


snakers4 (Alexander), December 30, 2018

Spark in me 2018 annual retrospective


- My personal progress and some views;

- ML is still amazing, but there are no illusions anymore;

- Telegram is still amazing, but commercialization looms;

- FAIR is an inspiration;

- lmcinnes with UMAP and HDBSCAN as well;


I also wrote a bit in Russian, with some local specifics, if that is more convenient for you



Spark in me - annual retrospective 2018


snakers4 (Alexander), December 19, 2018

DS/ML digest 32


- A way to replace softmax in NMT;

- Large visual reasoning dataset;

- PyText;




2018 DS/ML digest 32


snakers4 (Alexander), December 10, 2018

Simpson's paradox

Nice explanation


Simpson’s Paradox and Interpreting Data

The challenge of finding the right view through data
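A small worked example of the paradox, using the kidney stone treatment counts that are commonly quoted to illustrate it (this is not necessarily the example from the linked article, and the helper names are made up). Treatment A wins inside every subgroup, yet loses once the subgroups are pooled, because A was mostly given to the harder cases.

```python
# (successes, patients) per treatment within each subgroup
GROUPS = {
    "small stones": {"A": (81, 87), "B": (234, 270)},
    "large stones": {"A": (192, 263), "B": (55, 80)},
}


def rate(successes, patients):
    return successes / patients


def pooled(groups, arm):
    """Sum successes and patients for one treatment across all subgroups."""
    successes = sum(group[arm][0] for group in groups.values())
    patients = sum(group[arm][1] for group in groups.values())
    return successes, patients


def report(groups):
    """Print per-subgroup and pooled success rates for both treatments."""
    for name, group in groups.items():
        print(f"{name}: A={rate(*group['A']):.3f} B={rate(*group['B']):.3f}")
    print(f"pooled: A={rate(*pooled(groups, 'A')):.3f} "
          f"B={rate(*pooled(groups, 'B')):.3f}")
```

The takeaway is the usual one: check how the groups were formed before trusting an aggregate.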

snakers4 (Alexander), December 09, 2018

DS/ML digest 31

Highlights of the week:

- PyTorch 1.0 released;

- Drawing with GANs;

- BERT explained;




2018 DS/ML digest 31


snakers4 (Alexander), December 02, 2018

A cheeky ML/DS themed sticker pack for our channel

Thanks to @birdborn for his art.

You are welcome to use it:

If you would like to contribute / create your own stickers - please ask around in our channel chat.


snakers4 (Alexander), November 28, 2018

DS/ML digest 30




2018 DS/ML digest 30


snakers4 (Alexander), November 23, 2018

Jupyter extensions

Looks like they are near the end of their support.


On a fresh build you will need this

conda install notebook=5.6

To use them.

Will need to invest some time into making Jupyter Lab actually usable.


snakers4 (Alexander), November 22, 2018

Our victory in CFT-2018 competition


- Multi-task learning + seq2seq models rule;

- The domain seems to be easy, but it is not;

- You can also build a pipeline based on manual features, but it will not be task agnostic;

- Loss weighting is crucial for such tasks;

- Transformer trains 10x longer;




Winning a CFT 2018 spelling correction competition

Building a task-agnostic seq2seq pipeline on a challenging domain
