Spark in me - Internet, data science, math, deep learning, philo

snakers4 @ telegram, 1812 members, 1753 posts since 2016

All this - lost like tears in rain.

Data science, ML, a bit of philosophy and math. No bs.

Our website
- http://spark-in.me
Our chat
- https://t.me/joinchat/Bv9tjkH9JHbxiV5hr91a0w
DS courses review
- http://goo.gl/5VGU5A
- https://goo.gl/YzVUKf

snakers4 (Alexander), July 18, 04:55

An ideal remote IDE?

Joking?

No, it looks like VSCode recently got its remote development extensions (they were only in the Insiders build a couple of months ago) working just right.

I tried the Remote - SSH extension and it looks quite polished. No more syncing your large data folders and loading all python dependencies locally for hours.

The problem? It took me an hour just to open an SSH session properly under Windows (key permissions and Linux folder path substitution are hell on Windows). Once I opened it - it worked like a charm.
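For reference, the extension simply reads your regular SSH config, so one clean entry is all it needs (the host alias, IP, user and key path below are made up for illustration; on Windows the same entry goes into %USERPROFILE%\.ssh\config and the key file must have owner-only permissions):

cat >> ~/.ssh/config <<'EOF'
Host gpu-box
    HostName 192.168.1.100
    User ubuntu
    IdentityFile ~/.ssh/id_rsa
EOF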

So for now (this is a personal view) the best tools in my opinion are:

- Notebooks - for exploration and testing;

- VSCode - for the codebase;

- Atom - for local scripts;

#data_science

Visual Studio Code Remote Development


snakers4 (Alexander), July 16, 12:53

Full IDE in a browser?

Almost)

You all know all the pros and cons of:

- IDEs (PyCharm);

- Advanced text editors (Atom, Sublime Text);

- Interactive environments (notebook / lab, Atom + Hydrogen);

I personally dislike local IDEs - not because connecting to a remote machine / remote kernel / remote interpreter is a bit of a chore. Setting up is easy, but constantly thinking about what is synced and what is not is just pain. Also, when your daily driver machine is on Windows, using the Linux subsystem all the time with Windows paths is just pain. (I also dislike bulky interfaces, but that is just a habit and it depends).

But what if I told you there is a third option? =)

And what if you work as a team on a remote machine / set of machines?

TLDR - you can run a modern web "IDE" (something between Atom and a real IDE - less bulky, but with fewer features) in a browser.

Now you can just run it with one command.
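To be concrete - at the time of writing there are pre-built Docker images, and a sketch of the one-liner looks like this (theiaide/theia is the generic image, and there are language-specific builds like theiaide/theia-python; check their docs for the current image names):

docker run -it --init -p 3000:3000 -v "$(pwd):/home/project" theiaide/theia

Then open http://localhost:3000 and the mounted folder becomes your workspace.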

Pros:

- It is open source (though shipped as a part of some enterprise packages like Eclipse Che);

- Pre-built images available;

- It is extensible - new modules get released, and you can build them yourself or just find a pre-built one;

- It has extensive linting and a Python language server (covering just the standard library, though);

- It has full text search ... kind of;

- Go-to-definition works in your code;

- Docstrings and auto-complete work for your own modules and the standard library (but not for installed packages);

Looks cool af!

If they ship a build with a remote python kernel, then it will be a perfect option for teams!

I hope it will not follow the path taken by another crowd-favourite web editor (which was purchased by Amazon).

Links

- Website;

- Pre-built apps for python;

- Language server they are using;

#data_science

Theia - Cloud and Desktop IDE

Theia is an open-source cloud & desktop IDE framework implemented in TypeScript.


If you know how to add your python kernel to Theia - please ping me)

snakers4 (Alexander), July 15, 04:48

Trying to migrate to JupyterLab from Jupyter Notebook?

Some time ago I noticed that the Jupyter extensions project was more or less frozen => JupyterLab is obviously trying to shift community attention to npm / nodejs plugins.

So, like 6-12 months ago, I tried to do this again.

This time Lab is more mature:

- Now at version >1;

- Now they have a built-in package manager;

- They have some of the most necessary extensions (e.g. git, toc, Google Drive, etc.);

- The UI got polished a bit, but window-in-a-window still produces a bit of mental friction. Only the most popular file formats are supported. The text editor inherited the best features, but it is still a bit rudimentary;

- Full screen width by default;

- Some useful things (like code folding) are now turned on via the settings JSON file;

- Using these extensions is a bit of a chore in edge cases (e.g. some user permission problems / you have to re-build the app each time you add an extension - see the example after the Dockerfile snippets below);

But I could not switch mostly for one reason - this one

- github.com/jupyterlab/jupyterlab/issues/2275#issuecomment-498323475

If you have a Jupyter environment it is very easy to switch. For me, before it was:

# 5.6 because otherwise I have a bug with installing extensions
RUN conda install notebook=5.6

RUN pip install git+https://github.com/ipython-contrib/jupyter_contrib_nbextensions && \
    jupyter contrib nbextension install --user

CMD jupyter notebook --port=8888 --ip=0.0.0.0 --no-browser

And it just became:

RUN conda install -c conda-forge jupyterlab

CMD jupyter lab --port=8888 --ip=0.0.0.0 --no-browser
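And adding an extension in Lab 1.x looks roughly like this (the toc extension is just an example; note that every added extension triggers a rebuild of the app - the chore mentioned above):

jupyter labextension install @jupyterlab/toc
jupyter lab build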

#data_science

Support collapsible hierarchy of sections · Issue #2275 · jupyterlab/jupyterlab

Allow users to toggle open/close sections by clicking on some kind of UI element. This helps with navigating and organizing large notebooks.


snakers4 (Alexander), July 12, 05:25

Installing apex ... in style )

Sometimes you just need to try fp16 training (GANs, large networks, rare cases).

There is no better way to do this than use Nvidia's APEX library.

Luckily - they have very nice examples:

- github.com/NVIDIA/apex/tree/master/examples/docker

Well ... it installs on a clean machine, but I want my environment to always work with it)

So, I ploughed through all the conda / environment setup mumbo-jumbo and created a version of our deep learning / DS dockerfile, but now installing from the PyTorch image (PyTorch GPU / CUDA / CUDNN + APEX).

- github.com/snakers4/gpu-box-setup/blob/master/dockerfile/Dockerfile_apex

It was kind of painful, because PyTorch images already contain conda / pip, and it was not apparent at first, causing all sorts of problems with my miniconda installation.

So use it and please report if it is still buggy.
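The gist of it (not the exact Dockerfile from the repo - see the link above for that) is just basing the image on an official pytorch/pytorch devel tag and running apex's standard pip install with the CUDA extensions; a minimal sketch, with the tag as an example:

FROM pytorch/pytorch:1.1.0-cuda10.0-cudnn7.5-devel
RUN apt-get update && apt-get install -y git
RUN git clone https://github.com/NVIDIA/apex /opt/apex && \
    cd /opt/apex && \
    pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .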

#deep_learning

#pytorch

NVIDIA/apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch - NVIDIA/apex


Monitoring your hardware - with logs, charts and alerts - in style

TLDR - we have been looking for THE software to do this easily, with charts / alerts / easy install.

We found Prometheus. Configuring alerts was a bit of a problem, but enjoy:

- github.com/snakers4/gpu-box-setup#prometheus
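Not our exact setup (the repo above has the full config with alerts), but to give a flavour - a node-exporter + Prometheus pair can be brought up from stock images roughly like this, with prometheus.yml pointing its scrape config at port 9100 (add a GPU exporter of your choice on top):

docker run -d --name node-exporter -p 9100:9100 prom/node-exporter
docker run -d --name prometheus -p 9090:9090 \
  -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus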

#deep_learning

#hardware

snakers4/gpu-box-setup

Contribute to snakers4/gpu-box-setup development by creating an account on GitHub.


snakers4 (Alexander), July 09, 11:29

Yeah, scraping image labels from Google / other social networks is a really cool idea ...

Forwarded from Just links:

twitter.com/wightmanr/status/1148312748121419776

Ross Wightman

@facebookai ResNeXt models pre-trained on Instagram hashtags stand out in their ability to generalize to the 'ImageNetV2' test set. Thanks @beccaroelofs @vaishaal @beenwrekt @lschmidt3 for a useful dataset. #PyTorch https://t.co/5gWx6CgEVg


snakers4 (Alexander), July 03, 13:57

2019 DS / ML digest 12

Highlights of the week(s)

- Cool STT papers;

- End of AI hype?

- How to download tons of images from Google;

spark-in.me/post/2019_ds_ml_digest_12

#digest

#deep_learning

2019 DS/ML digest 12

2019 DS/ML digest 12. Author's articles - http://spark-in.me/author/snakers41. Blog - http://spark-in.me


snakers4 (Alexander), July 02, 09:15

A cool old paper - FCN text detector

They were using multi-layer masks for better semantic segmentation supervision before it was mainstream.

Very cool!

arxiv.org/abs/1704.03155

Too bad such models are a commodity now - you can just use a pre-trained one)

#deep_learning

EAST: An Efficient and Accurate Scene Text Detector

Previous approaches for scene text detection have already achieved promising performances across various benchmarks. However, they usually fall short when dealing with challenging scenarios, even...


snakers4 (Alexander), July 02, 07:34

New version of our open STT dataset - 0.5, now in beta

Please share and repost!

github.com/snakers4/open_stt/releases/tag/v0.5-beta

What is new?

- A new domain - radio (1000+ new hours);

- A larger YouTube dataset with 1000+ additional hours;

- A small (300 hours) YouTube dataset downloaded in maximum quality;

- Manually annotated ground-truth validation sets for YouTube / books / public calls;

- Now we will start to focus on actually cleaning and distilling the dataset. We have published a second list of "bad" data;

I'm back from vacation)

#deep_learning

#data_science

#dataset

snakers4/open_stt

Russian open STT dataset. Contribute to snakers4/open_stt development by creating an account on GitHub.


snakers4 (Alexander), June 21, 06:47

Forwarded from Just links:

I've uploaded the weights I've got so far (73.53% top-1)

snakers4 (Alexander), June 17, 12:35

youtu.be/aJq6ygTWdao

This AI Makes Amazing DeepFakes…and More
Check out Lambda Labs here: https://lambdalabs.com/papers 📝 The paper "Deferred Neural Rendering: Image Synthesis using Neural Textures" is available here: h...

snakers4 (Alexander), June 02, 05:20

Forwarded from Just links:

github.com/lukemelas/EfficientNet-PyTorch

lukemelas/EfficientNet-PyTorch

A PyTorch implementation of EfficientNet. Contribute to lukemelas/EfficientNet-PyTorch development by creating an account on GitHub.


snakers4 (Alexander), May 27, 08:43

2019 DS / ML digest 11

Highlights of the week(s)

- New attention block for CV;

- Reducing the amount of data for CV 10x?;

- Brain-to-CNN interfaces start popping up in the mainstream;

spark-in.me/post/2019_ds_ml_digest_11

#digest

#deep_learning

2019 DS/ML digest 11

2019 DS/ML digest 11. Author's articles - http://spark-in.me/author/snakers41. Blog - http://spark-in.me


snakers4 (Alexander), May 27, 08:17

Do not use AllenNLP though

Forwarded from Neural Networks Engineering:

I have finished building a demo and landing page for my project on mention classification. The idea of this project is to create a model that can assign labels to objects based on their mentions in context. Right now it works only for mentions of people, but if there is interest in this work, I will extend the model to other types like organizations or events. For now, you can check out the online demo of the neural network.

The current implementation can take into account several mentions at a time, so it can pick out the relevant parts of the context instead of just averaging predictions.

It is also open-sourced and built with the AllenNLP framework, from training to serving. Take a look at it.

More technical details of implementation coming later.

snakers4 (Alexander), May 24, 09:20

Audio noise reduction libraries that really work in the wild

Spectral gating

github.com/timsainb/noisereduce

It works, but you need a sample of your noise.

It will work well out of the box for larger files / files with gaps, where you can pay attention to each file and select a part of it to act as the noise example.

RNNoise: Learning Noise Suppression

Works with arbitrary noise - just feed it your file.

It works more like an adaptive equalizer.

It filters noise when there is no speech.

But it mostly does not change audio when speech is present.

As the authors explain, it improves the SNR overall and makes the sound less "tiring" to listen to.

Description / blog posts

- people.xiph.org/~jm/demo/rnnoise/

- github.com/xiph/rnnoise

Step-by-step instructions in python

- github.com/xiph/rnnoise/issues/69
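For RNNoise the try-it-out loop is roughly the following (a sketch, assuming autotools are installed; rnnoise_demo operates on raw 48 kHz 16-bit mono PCM, so convert your audio first, e.g. with sox or ffmpeg):

git clone https://github.com/xiph/rnnoise
cd rnnoise
./autogen.sh && ./configure && make
./examples/rnnoise_demo noisy.raw denoised.raw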

#audio

#deep_learning

timsainb/noisereduce

Noise reduction / speech enhancement for python using spectral gating - timsainb/noisereduce


snakers4 (Alexander), May 22, 15:06

www.youtube.com/watch?v=p1b5aiTrGzY&feature=youtu.be

Few-Shot Adversarial Learning of Realistic Neural Talking Head Models
Statement regarding the purpose and effect of the technology (NB: this statement reflects personal opinions of the authors and not of their organizations) We...

snakers4 (Alexander), May 20, 06:21

New in our Open STT dataset

github.com/snakers4/open_stt#updates

- An mp3 version of the dataset;

- A torrent for mp3 dataset;

- A torrent for the original wav dataset;

- Benchmarks on the public dataset / files with "poor" annotation marked;

#deep_learning

#data_science

#dataset

snakers4/open_stt

Russian open STT dataset. Contribute to snakers4/open_stt development by creating an account on GitHub.


snakers4 (Alexander), May 19, 16:03

Forwarded from Just links:

pytorch.org/blog/stochastic-weight-averaging-in-pytorch/

An open source deep learning platform that provides a seamless path from research prototyping to production deployment.


SWA is in the contrib repo of PyTorch )

snakers4 (Alexander), May 14, 03:40

2019 DS / ML digest 10

Highlights of the week(s)

- New MobileNet;

- New PyTorch release;

- Practical GANs?;

spark-in.me/post/2019_ds_ml_digest_10

#digest

#deep_learning

2019 DS/ML digest 10

2019 DS/ML digest 10. Author's articles - http://spark-in.me/author/snakers41. Blog - http://spark-in.me


snakers4 (Alexander), May 09, 11:28

Habr.com / TowardsDataScience post for our dataset

In addition to the GitHub release and the Medium post, we also made a habr.com post:

- habr.com/ru/post/450760/

Our post was also accepted into the editor's picks section of TDS:

- bit.ly/ru_open_stt

Share / give us a star / clap if you have not already!

Original release

github.com/snakers4/open_stt/

#deep_learning

#data_science

#dataset

A huge open dataset of Russian speech

Speech recognition specialists have long lacked a large open corpus of spoken Russian, so only large companies could afford to work on...


snakers4 (Alexander), May 09, 10:51

PyTorch DP / DDP / model parallel

Finally they made proper tutorials:

- pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html

- pytorch.org/tutorials/intermediate/model_parallel_tutorial.html

- pytorch.org/tutorials/intermediate/ddp_tutorial.html

Model parallel = have parts of the same model on different devices

Data Parallel (DP) = a wrapper to use multiple GPUs within a single parent process

Distributed Data Parallel (DDP) = multiple processes are spawned across a cluster / on the same machine
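For DDP on a single machine, the usual entry point (as of PyTorch 1.x) is the launch utility, which spawns one process per GPU - train.py here is a placeholder for your own DDP-enabled script:

python -m torch.distributed.launch --nproc_per_node=4 train.py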

#deep_learning

The State of ML, end of 2018, in Russian

Quite a down-to-earth and clever lecture

www.youtube.com/watch?v=l6djLCYnOKw

Some nice examples for TTS and some interesting forecasts (some of which have already come true).

#deep_learning

Sergey Markov: "Artificial intelligence and machine learning: the results of 2018"
The lecture took place at the popular science lecture hall of the Arkhe centre (http://arhe.msk.ru) on January 16, 2019. Lecturer: Sergey Markov, the author of one of the strongest Russian...

snakers4 (Alexander), May 03, 08:58

PyTorch

PyTorch 1.1

github.com/pytorch/pytorch/releases/tag/v1.1.0

- Tensorboard (beta);

- DistributedDataParallel new functionality and tutorials;

- Multi-headed attention;

- EmbeddingBag enhancements;

- Other cool, but more niche features:

- nn.SyncBatchNorm;

- optim.lr_scheduler.CyclicLR;

#deep_learning

pytorch/pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration - pytorch/pytorch


snakers4 (Alexander), May 02, 06:02

Russian Open Speech To Text (STT/ASR) Dataset

4000 hours of STT data in Russian

Made by us. Yes, really. I am not joking.

It was a lot of work.

The dataset:

github.com/snakers4/open_stt/

Accompanying post:

spark-in.me/post/russian-open-stt-part1

TLDR:

- As of the third release, we have ~4,000 hours;

- Contributors and help wanted;

- Let's bring the ImageNet moment in STT closer, together!;

Please repost this as much as you can.

#stt

#asr

#data_science

#deep_learning

snakers4/open_stt

Russian open STT dataset. Contribute to snakers4/open_stt development by creating an account on GitHub.


snakers4 (Alexander), May 02, 05:41

Poor man's computing cluster

So, when I last checked, Amazon's p3.4xlarge instances cost around US$12 per hour (unless you reserve them for a year). A tower supercomputer from Nvidia costs probably US$40-50k or more (it was announced at around US$69k).

It is not difficult to crunch the numbers and see that one month of renting such a machine would cost at least US$8-10k (US$12/hour × 24 hours × 30 days ≈ US$8.6k). Also there will be the additional cost / problem of actually storing your large datasets. When I last used Amazon, their cheap storage was sloooooow, and the fast storage was prohibitively expensive.

So, why am I saying this?

Let's assume (based on my miner friends' experience) that consumer Nvidia GPUs can work 2-3 years non-stop given proper cooling and care (test before buying!). Also let's assume that 4x Tesla V100 is roughly the same as 7-8x 1080Ti.

Yeah, I know you will point out at least one reason why this does not hold, but for practical purposes it is fine (yes, I know that Teslas have some cool features like NVLink).

Now here is the kicker - modern professional motherboards often boast 2-3 Ethernet ports, and sometimes you can even get 2x 10Gbit/s ports (!!!).

This means you can actually connect at least 2 machines (or maybe you can even daisy-chain more?) into a computing cluster.

Now let's crunch the numbers

According to quotes I have collected over the years, you can build a cluster roughly equivalent to Amazon's p3.4xlarge for US$10k (but with storage!) using used GPUs (miners are selling them like crazy now). If you buy second-hand drives, motherboards and CPUs, you can lower the cost to US$5k or less.

So a US$10k cluster that would serve you for at least a year (if you test everything properly and take care of it) is roughly equivalent to:

- 20-25% of DGX desktop;

- 1 month of renting on Amazon;

Assuming that all the hardware will just break in a year:

- It is 4-5x cheaper than buying from Nvidia;

- It is 10x cheaper than renting;

If you buy everything used, then it is 10x and 20x cheaper!

I would buy that for a dollar!

Ofc you have to invest your free time.

See my calculations here:

bit.ly/spark00001

#deep_learning

#hardware

computing_cluster

A spreadsheet with the parts list and quotes per server (part, approximate quote, quote date, price in USD, comments; exchange rate assumed at 65 RUR/USD).


snakers4 (Alexander), May 01, 05:49

correct link

the STT lecture is streaming now

www.youtube.com/watch?v=JpS0LzEWr-4

Deep Learning на пальцах 11 - Audio and Speech Recognition (Yuri Baburov)
Course: http://dlcourse.ai Slides: https://www.dropbox.com/s/tv3cv0ihq2l0u9f/Lecture%2011%20-%20Audio%20and%20Speech.pdf?dl=0

snakers4 (Alexander), April 30, 09:27

Tricky rsync flags

Rsync is the best program ever.

I find these flags the most useful

--ignore-existing (ignores existing files)
--update (updates to newer versions of files based on ts)
--size-only (uses file-size to compare files)
-e 'ssh -p 22 -i /path/to/private/key' (use custom ssh identity)

Sometimes the first three flags get confusing.
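A typical invocation combining them (the paths, host and key are placeholders):

rsync -avP --ignore-existing -e 'ssh -p 22 -i ~/.ssh/my_key' ./data/ user@host:/mnt/data/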

#linux

More about STT - from us as well ... soon)

Forwarded from Yuri Baburov:

The second experimental guest lecture of the course.

One of the course's seminar instructors, Yuri Baburov, will talk about speech recognition and working with audio.

May 1st at 8:40 Moscow time (12:40 Novosibirsk time, 10:40 pm on April 30th PST).

Deep Learning на пальцах 11 - Audio and Speech Recognition (Yuri Baburov)

www.youtube.com/watch?v=wm4H2Ym33Io

Deep Learning на пальцах 11 - Audio and Speech Recognition (Yuri Baburov)
Course: http://dlcourse.ai

snakers4 (Alexander), April 22, 11:44

A cool docker feature

View aggregate load stats by container

docs.docker.com/engine/reference/commandline/stats/
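For example, a compact live view with just the columns you care about (the Go-template fields should be listed on the page above):

docker stats --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}"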

#linux

docker stats

Description: Display a live stream of container(s) resource usage statistics. Usage: docker stats [OPTIONS] [CONTAINER...]


2019 DS / ML digest 9

Highlights of the week

- Stack Overflow survey;

- Unsupervised STT (ofc not!);

- A mix between detection and semseg?;

spark-in.me/post/2019_ds_ml_digest_09

#digest

#deep_learning

2019 DS/ML digest 09

2019 DS/ML digest 09. Author's articles - http://spark-in.me/author/snakers41. Blog - http://spark-in.me


snakers4 (Alexander), April 17, 08:55

The Archive Team ... makes monthly Twitter archives

With all the BS around politics / "Russian hackers" / the Arab Spring, Twitter has now closed its developer API.

No problem.

Just pay a visit to the Archive Team page:

archive.org/details/twitterstream?and[]=year%3A%222018%22

Donate to them here:

archive.org/donate/

#data_science

#nlp

Archive Team: The Twitter Stream Grab : Free Web : Free Download, Borrow and Streaming : Internet Archive

A simple collection of JSON grabbed from the general twitter stream, for the purposes of research, history, testing and memory. This is the Spritzer version, the most light and shallow of Twitter grabs. Unfortunately, we do not currently have access to the Sprinkler or Garden Hose versions of the...


snakers4 (Alexander), April 17, 08:47

Using snakeviz for profiling Python code

Why

To profile complicated and convoluted code.

Snakeviz is a cool GUI tool to analyze cProfile profile files.

jiffyclub.github.io/snakeviz/

Just launch your code like this (your_script.py here stands for your own entry point):

python3 -m cProfile -o profile_file.cprofile your_script.py

And then just analyze with snakeviz.

GUI

They have a server GUI and a jupyter notebook plugin.

Also you can launch their tool from within a docker container:

snakeviz -s -H 0.0.0.0 profile_file.cprofile

Do not forget to EXPOSE the necessary ports. An SSH tunnel to the host is also an option.

#data_science

SnakeViz

SnakeViz is a browser based graphical viewer for the output of Python's cProfile module.


snakers4 (Alexander), April 14, 06:59

PyTorch DataParallel scalability

TLDR - it works fine for 2-3 GPUs.

For more GPUs - use DDP.

github.com/NVIDIA/sentiment-discovery/blob/master/analysis/scale.md

github.com/SeanNaren/deepspeech.pytorch/issues/211

#deep_learning

NVIDIA/sentiment-discovery

Unsupervised Language Modeling at scale for robust sentiment classification - NVIDIA/sentiment-discovery


snakers4 (Alexander), April 09, 06:00

2019 DS / ML digest number 8

Highlights of the week

- Transformer from Facebook with sub-word information;

- How to generate endless sentiment annotation;

- 1M breast cancer images;

spark-in.me/post/2019_ds_ml_digest_08

#digest

#deep_learning

2019 DS/ML digest 08

2019 DS/ML digest 08. Author's articles - http://spark-in.me/author/snakers41. Blog - http://spark-in.me

