My foray into the STT Dark Forest
My tongue-in-cheek article on ML in general, and how to make your STT model train 3-4x faster with 4-5x fewer weights at the same quality
Extreme NLP network miniaturization
Tried some plain RNNs on a custom in-the-wild NER task.
The dataset is huge - literally infinite, but manually generated to mimic in-the-wild data.
I use EmbeddingBag + 1m n-grams (an optimal cut-off). On NER / classification it is a handy trick that makes your pipeline totally misprint / error / OOV agnostic. Also, FAIR themselves just arrived at the same idea. Very cool! Just add PyTorch and you are golden.
What is interesting:
- Model works with embedding sizes 300, 100, 50 and even 5! 5 is dangerously close to OHE, but doing OHE on 1m n-grams kind of does not make sense;
- Model works with various hidden sizes
- Naturally, all of the models run very fast on CPU, but the smallest model is also very light in terms of its weights;
- The only difference is convergence time. It roughly scales as a log of model size, i.e. the model with embedding size 5 takes 5-7x more time to converge than the one with size 50. I wonder what happens with an embedding size of 1?;
As an added bonus - you can just store such a miniature model in git w/o LFS.
What was that about training transformers on US$250k worth of compute credits, you say? )
Misspelling Oblivious Embeddings (MOE) is a new model for word embeddings that are resilient to misspellings, improving the ability to apply word embeddings to real-world situations, where misspellings are common.
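The EmbeddingBag + hashed n-grams trick can be sketched roughly like this (the bucket count, n-gram size and all names here are illustrative assumptions, not the actual pipeline):

```python
import zlib

import torch
import torch.nn as nn

NUM_BUCKETS = 100_000  # stand-in for the ~1m n-gram cut-off from the post
EMB_DIM = 5            # the surprisingly small embedding size that still works

def char_ngrams(word, n=3):
    """All character n-grams of a word, with boundary markers."""
    w = f"<{word}>"
    return [w[i:i + n] for i in range(len(w) - n + 1)]

def encode(words):
    """Hash n-grams into buckets; misprints / OOV words still map somewhere."""
    ids, offsets = [], []
    for w in words:
        offsets.append(len(ids))
        ids.extend(zlib.crc32(g.encode()) % NUM_BUCKETS for g in char_ngrams(w))
    return torch.tensor(ids), torch.tensor(offsets)

# each word becomes the mean of its n-gram embeddings
emb = nn.EmbeddingBag(NUM_BUCKETS, EMB_DIM, mode="mean")
ids, offsets = encode(["hello", "helo", "wrld"])  # misprints still get vectors
word_vectors = emb(ids, offsets)
print(word_vectors.shape)  # torch.Size([3, 5])
```

A small RNN / classifier head on top of these word vectors then does the tagging; because everything is hashed, there is no OOV token at all.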
PyTorch 1.2 release
- Tensorboard logging is now out of beta;
- They continue improving JIT and ONNX;
- nn.Transformer is now a built-in layer;
- Looks like SyncBn is also more or less stable;
- nn.Embedding: support float16 embeddings on CUDA;
- Numpy compatibility;
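A quick look at the new nn.Transformer module (the sizes below are arbitrary, chosen just to show the (seq_len, batch, d_model) input convention):

```python
import torch
import torch.nn as nn

# nn.Transformer landed as a module in the 1.2 release
model = nn.Transformer(d_model=32, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2)

src = torch.rand(10, 2, 32)  # source: 10 steps, batch of 2, 32 features
tgt = torch.rand(7, 2, 32)   # target: 7 steps
out = model(src, tgt)        # output follows the target shape
print(out.shape)             # torch.Size([7, 2, 32])
```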
Managing your DS / ML environment neatly and in style
If you need a sophisticated environment for DS / ML / DL work, then using a set of Docker images may be a good idea.
You can also tap into a vast community of amazing and well-maintained Dockerhub repositories (e.g. nvidia, pytorch).
But what if you have to do this for several people? And use it with a proper IDE via ssh?
Well-known features of Docker include copy-on-write and user "forwarding". With a naive approach, each user will store their own images, which takes quite some space.
You also have to make your ssh daemon work inside the container as a second service.
So I solved these "challenges" and created 2 public layers so far:
- Basic DS / ML layer -
FROM aveysov/ml_images:layer-0 - from dockerfile;
- DS / ML libraries -
FROM aveysov/ml_images:layer-0 - from dockerfile;
Your final dockerfile may look something like this just pulling from any of those layers.
Note that when building this, you will need to pass your
UID as a variable, e.g.:
docker build --build-arg NB_UID=1000 -t av_final_layer -f Layer_final.dockerfile .
When launched, this starts a notebook with extensions. You can just exec into the machine itself to run scripts, or use an ssh daemon inside (do not forget to add your ssh key and run service ssh start).
Using public Dockerhub account for your private small scale deploy
Also a lifehack - you can just use Dockerhub for your private stuff, just separate the public part and the private part.
Push the public part (i.e. libraries and frameworks) to Dockerhub.
Your private Dockerfile will then be something like:
COPY your_app_folder your_app_folder
COPY app.py app.py
CMD ["python3", "app.py"]
An ideal remote IDE?
So, looks like VSCode recently got its remote development extensions (they were insiders-build only a couple of months ago) working just right.
I tried the remote-ssh extension and it looks quite polished. No more syncing your large data folders or installing all python dependencies locally for hours.
The problem? It took me an hour just to open an ssh session properly under Windows (permissions and Linux folder path substitution are hell on Windows). Once I opened it, it worked like a charm.
So for now (this is personal), the best tools in my opinion are:
- Notebooks - for exploration and testing;
- VScode for codebase;
- Atom - for local scripts;
Full IDE in a browser?
You all know all the pros and cons of:
- IDEs (PyCharm);
- Advanced text editors (Atom, Sublime Text);
- Interactive environments (notebook / lab, Atom + Hydrogen);
I personally dislike local IDEs - not because connecting to a remote machine / remote kernel / remote interpreter is a chore (setting it up is easy), but because constantly thinking about what is synced and what is not is just pain. Also, when your daily driver machine is on Windows, using the Linux subsystem all the time with Windows paths is just pain. (I also dislike bulky interfaces, but that is just a habit, and it depends).
But what if I told you there is a third option? =)
If you work as a team on a remote machine / set of machines?
TLDR - you can run a modern web "IDE" (something between Atom and a real IDE - less bulky, but with fewer features) in a browser.
Now you can just run it with one command.
- It is open source (though shipped as a part of some enterprise packages like Eclipse Che);
- Pre-built images available;
- It is extensible - new modules get released - you can build them yourself or just find a pre-built one;
- It has extensive linting and a python language server (standard library only, though);
- It has full text search ... kind of;
- Follow definition in your code;
- Docstrings and auto-complete work for your own modules and the standard library (but not for installed packages);
Looks cool af!
If they ship a build with a remote python kernel, then it will be a perfect option for teams!
I hope it will not follow the path taken by another crowd-favourite web editor (it was purchased by Amazon).
- Pre-built apps for python;
- Language server they are using;
Theia is an open-source cloud desktop IDE framework implemented in TypeScript.
If you know how to add your python kernel to Theia - please ping me)
Trying to migrate to JupyterLab from Jupyter Notebook?
Some time ago I noticed that the Jupyter extensions project was more or less frozen - JupyterLab is obviously trying to shift community attention to npm / nodejs plugins.
So I tried this again (the last time was 6-12 months ago).
This time Lab is more mature:
- Now at version >1;
- Now they have a built-in package manager;
- They have some of the most necessary extensions (i.e. git, toc, google drive, etc);
- UI got polished a bit, but window-in-a-window still produces a bit of mental friction;
- Only the most popular file formats are supported;
- The text editor inherited the best features, but is still a bit rudimentary;
- Full screen width by default;
- Some useful things (like code folding) are now toggled via the settings json file;
- Using these extensions is a bit of a chore in edge cases (i.e. some user permission problems / you have to re-build the app each time you add an extension);
But I could not switch mostly for one reason - this one
If you have a Jupyter environment it is very easy to switch. For me, before it was:
# 5.6 because otherwise I have a bug with installing extensions
RUN conda install notebook=5.6
RUN pip install git+https://github.com/ipython-contrib/jupyter_contrib_nbextensions && \
jupyter contrib nbextension install --user
CMD jupyter notebook --port=8888 --ip=0.0.0.0 --no-browser
And it just became:
RUN conda install -c conda-forge jupyterlab
CMD jupyter lab --port=8888 --ip=0.0.0.0 --no-browser
Allow users to toggle open/close sections by clicking on some kind of UI element. This helps with navigating and organizing large notebooks.
Installing apex ... in style )
Sometimes you just need to try fp16 training (GANs, large networks, rare cases).
There is no better way to do this than use Nvidia's APEX library.
Luckily - they have very nice examples:
Well ... it installs on a clean machine, but I want my environment to work with this always)
So, I ploughed through all the conda / environment setup mumbo-jumbo and created a version of our deep-learning / ds dockerfile, but now installing from the pytorch image (pytorch GPU / CUDA / CUDNN + APEX).
It was kind of painful, because the PyTorch images already contain conda / pip, which was not apparent at first and caused all sorts of problems with my miniconda installation.
So use it and please report if it is still buggy.
Logging your hardware, with logs, charts and alerts - in style
TLDR - we have been looking for THE software to do this easily, with charts / alerts / easy install.
We found prometheus. Configuring alerts was a bit of a problem, but enjoy:
Yeah, scraping image labels from Google / other social networks is a really cool idea ...
A cool old paper - FCN text detector
They were using multi-layer masks for better semantic segmentation supervision before it was mainstream.
Too bad such models are a commodity now - you can just use a pre-trained one)
Previous approaches for scene text detection have already achieved promising performances across various benchmarks. However, they usually fall short when dealing with challenging scenarios, even...
New version of our open STT dataset - 0.5, now in beta
Please share and repost!
What is new?
- A new domain - radio (1000+ new hours);
- A larger YouTube dataset with 1000+ additional hours;
- A small (300 hours) YouTube dataset downloaded in maximum quality;
- Manually annotated ground-truth validation sets for YouTube / books / public calls;
- Now we will start to focus on actually cleaning and distilling the dataset. We have published a second list of "bad" data;
I'm back from vacation)
2019 DS / ML digest 11
Highlights of the week(s)
- New attention block for CV;
- Reducing the amount of data needed for CV by 10x?;
- Brain-to-CNN interfaces start popping up in the mainstream;
Do not use AllenNLP though
Audio noise reduction libraries that really work in the wild
It works, but you need a sample of your noise.
It will work well out of the box for larger files / files with gaps, where you can pay attention to each file and select a part of it to act as the noise example.
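This "give me a noise sample" approach is essentially spectral gating. Below is a minimal numpy sketch of the idea (not the library's actual code; the frame sizes and the threshold multiplier are assumptions): estimate per-frequency noise levels from the noise clip and zero out time-frequency bins that do not rise above them.

```python
import numpy as np

def stft(x, frame=512, hop=256):
    """Naive framed FFT (no window) - enough for a sketch."""
    frames = [x[i:i + frame] for i in range(0, len(x) - frame + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)

def spectral_gate(audio, noise, thresh_mult=1.5, frame=512, hop=256):
    """Attenuate time-frequency bins that look like the noise sample."""
    noise_level = np.abs(stft(noise, frame, hop)).mean(axis=0)  # per-bin noise
    spec = stft(audio, frame, hop)
    spec = spec * (np.abs(spec) > thresh_mult * noise_level)    # hard gate
    frames = np.fft.irfft(spec, n=frame, axis=1)
    out = np.zeros(len(audio))
    for i, f in enumerate(frames):                              # crude overlap-add
        out[i * hop:i * hop + frame] += 0.5 * f
    return out

# demo: a 440 Hz tone buried in noise, plus a separate noise-only clip
rng = np.random.default_rng(0)
noise_clip = rng.normal(0, 0.1, 8000)
noisy = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000) + rng.normal(0, 0.1, 8000)
cleaned = spectral_gate(noisy, noise_clip)
```

Real implementations add windowing, mask smoothing and soft attenuation, but the core "threshold against a noise profile" step is the same.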
RNNoise: Learning Noise Suppression
Works with any arbitrary noise. Just feed your file.
It works more like an adaptive equalizer.
It filters noise when there is no speech.
But it mostly does not change audio when speech is present.
As the authors explain, it improves SNR overall and makes the sound less "tiring" to listen to.
Description / blog posts
Step-by-step instructions in python
New in our Open STT dataset
- An mp3 version of the dataset;
- A torrent for
- A torrent for the original
- Benchmarks on the public dataset / files with "poor" annotation marked;
SWA in the contrib repo of PyTorch )
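For context, SWA (stochastic weight averaging) just keeps a running average of the weights visited late in training. A tiny sketch of that bookkeeping (the torchcontrib wrapper does this over optimizer steps; names here are made up):

```python
def swa_update(avg_weights, new_weights, n_averaged):
    """Incremental mean of checkpoints: avg <- avg + (w - avg) / (n + 1)."""
    return [a + (w - a) / (n_averaged + 1)
            for a, w in zip(avg_weights, new_weights)]

# average three "checkpoints" of a two-parameter model
avg, n = [0.0, 0.0], 0
for ckpt in ([1.0, 2.0], [3.0, 4.0], [5.0, 6.0]):
    avg = swa_update(avg, ckpt, n)
    n += 1
print(avg)  # [3.0, 4.0] - the element-wise mean
```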
Habr.com / TowardsDataScience post for our dataset
In addition to a github release and a medium post, we also made a habr.com post:
Also our post was accepted to an editor's pick part of TDS:
Share / give us a star / clap if you have not already!
PyTorch DP / DDP / model parallel
Finally they made proper tutorials:
Model parallel = have parts of the same model on different devices
Data Parallel (DP) = wrapper to use multi-GPU within a single parent process
Distributed Data Parallel = multiple processes are spawned across cluster / on the same machine
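A minimal illustration of the model-parallel case (both halves on CPU here so it runs anywhere; swap the device strings for "cuda:0" / "cuda:1" on a multi-GPU box; the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

class TwoDeviceNet(nn.Module):
    """Each half of the model lives on its own device;
    activations are moved between devices in forward()."""
    def __init__(self, dev0="cpu", dev1="cpu"):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.part1 = nn.Linear(16, 32).to(dev0)
        self.part2 = nn.Linear(32, 4).to(dev1)

    def forward(self, x):
        x = torch.relu(self.part1(x.to(self.dev0)))
        return self.part2(x.to(self.dev1))

net = TwoDeviceNet()
out = net(torch.rand(8, 16))
print(out.shape)  # torch.Size([8, 4])
```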
The State of ML, end of 2018, in Russian
A quite down-to-earth and clever lecture.
Some nice examples for TTS and some interesting forecasts (some of them happened already).
- Tensorboard (beta);
- DistributedDataParallel new functionality and tutorials;
- Multi-headed attention;
- EmbeddingBag enhancements;
- Other cool, but more niche features:
Russian Open Speech To Text (STT/ASR) Dataset
4000 hours of STT data in Russian
Made by us. Yes, really. I am not joking.
It was a lot of work.
- On the third release, we have ~4000 hours;
- Contributors and help wanted;
- Let's bring the ImageNet moment in STT closer, together!;
Please repost this as much as you can.
Poor man's computing cluster
So, when I last checked, Amazon's p3.4xlarge instances cost around US$12 per hour (unless you reserve them for a year). A tower supercomputer from Nvidia costs probably US$40-50k or more (it was announced at around US$69k).
It is not difficult to crunch the numbers and see that 1 month of renting such a machine would cost at least US$8-10k. There is also the additional cost / problem of actually storing your large datasets. When I last used Amazon, their cheap storage was sloooooow, and their fast storage was prohibitively expensive.
So, why am I saying this?
Let's assume (according to my miner friends' experience) - that consumer Nvidia GPUs can work 2-3 years non-stop given proper cooling and care (test before buying!). Also let's assume that 4xTesla V100 is roughly the same as 7-8 * 1080Ti.
Yeah, I know that you will point out at least one reason why this does not hold, but for practical purposes this is fine (yes, I know that Teslas have some cool features like Nvlink).
Now let me drop the bombshell - modern professional motherboards often boast 2-3 Ethernet ports. And sometimes you can even get 2x10Gbit/s ports (!!!).
This means that you can actually connect at least 2 machines (or maybe you can even daisy-chain more?) into a computing cluster.
Now let's crunch the numbers
According to quotes I collected over the years, you can build a cluster roughly equivalent to Amazon's p3.4xlarge for US$10k (but with storage!) using used GPUs (miners are selling them like crazy now). If you also buy second-hand drives, motherboards and CPUs, you can lower the cost to US$5k or less.
So a cluster that would serve you at least one year (if you test everything properly and take care of it), costing US$10k, is roughly equivalent to:
- 20-25% of DGX desktop;
- 1 month of renting on Amazon;
Assuming that all the hardware will just break in a year:
- It is 4-5x cheaper than buying from Nvidia;
- It is 10x cheaper than renting;
If you buy everything used, then it is 10x and 20x cheaper!
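Spelled out in python, with the rough figures from above:

```python
# rough figures from the post, in USD
aws_hourly = 12                    # p3.4xlarge-class rental per hour
aws_month = aws_hourly * 24 * 30   # one month of non-stop renting
dgx_price = 50_000                 # Nvidia tower supercomputer, ballpark
diy_new, diy_used = 10_000, 5_000  # self-built cluster: new vs used parts

print(aws_month)                 # 8640 - the US$8-10k/month ballpark
print(dgx_price / diy_new)       # 5.0 - 4-5x cheaper than buying from Nvidia
print(12 * aws_month / diy_new)  # 10.368 - ~10x cheaper than a year of renting
print(12 * aws_month / diy_used) # 20.736 - ~20x if everything is used
```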
I would buy that for a dollar!
Ofc you have to invest your free time.
See my calculations here:
Config (RUR/USD = 65; yes, I know that you should use historical exchange rates):
- Thermaltake Core X9 Black case - 12,220 RUR (quoted 11/22/2018) ≈ US$188;
- Gigabyte X399 AORUS XTREME (Socket TR4, AMD X399, 8x DDR4, 7.1CH audio, 2x 1000 Mbit/s + 10000 Mbit/s LAN, Wi-Fi, Bluetooth) - ...
streaming STT lecture now