New in our Open STT dataset
- An mp3 version of the dataset;
- A torrent for the mp3 version;
- A torrent for the original dataset;
- Benchmarks on the public dataset / files with "poor" annotation marked;
- SWA in the contrib repo of PyTorch;
Habr.com / TowardsDataScience post for our dataset
In addition to a GitHub release and a Medium post, we also made a habr.com post:
Also, our post was accepted into the editor's picks section of TDS:
Share / give us a star / clap if you have not already!
PyTorch DP / DDP / model parallel
Finally they made proper tutorials:
Model parallel = parts of the same model live on different devices;
Data Parallel (DP) = a wrapper to use multiple GPUs within a single parent process;
Distributed Data Parallel (DDP) = multiple processes are spawned across a cluster / on the same machine;
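A minimal sketch of these modes (the CPU devices below are stand-ins so the snippet runs anywhere; on a real multi-GPU box you would use `cuda:0` / `cuda:1`, and the DDP lines are shown as comments only):

```python
import torch
import torch.nn as nn

# Toy two-stage model; the devices are assumptions - on a real machine
# you would use torch.device('cuda:0') / torch.device('cuda:1').
dev0, dev1 = torch.device('cpu'), torch.device('cpu')

class TwoStageNet(nn.Module):
    """Model parallel: stages of one model live on different devices."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Linear(16, 32).to(dev0)
        self.stage2 = nn.Linear(32, 4).to(dev1)

    def forward(self, x):
        x = torch.relu(self.stage1(x.to(dev0)))
        return self.stage2(x.to(dev1))  # activations hop between devices

model = TwoStageNet()

# Data Parallel: a one-line wrapper inside a single parent process;
# it replicates the model and splits the batch across visible GPUs
# (on a CPU-only machine it silently falls through to the module).
dp_model = nn.DataParallel(model)

# Distributed Data Parallel: one process per GPU, launched via
# torch.multiprocessing.spawn / the launch utility; key calls only:
# torch.distributed.init_process_group('nccl', rank=rank, world_size=world_size)
# ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[rank])

out = dp_model(torch.randn(8, 16))
print(out.shape)  # torch.Size([8, 4])
```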
The State of ML, eof 2018 in Russian
A quite down-to-earth and clever lecture.
Some nice examples for TTS and some interesting forecasts (some of them happened already).
- Tensorboard (beta);
- DistributedDataParallel new functionality and tutorials;
- Multi-headed attention;
- EmbeddingBag enhancements;
- Other cool, but more niche features:
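A quick sketch of the new multi-headed attention module (sizes are arbitrary; shapes follow the default `(seq_len, batch, embed_dim)` layout):

```python
import torch
import torch.nn as nn

# Multi-headed attention as shipped in core PyTorch.
mha = nn.MultiheadAttention(embed_dim=32, num_heads=4)
x = torch.randn(10, 2, 32)      # self-attention: query = key = value
out, attn_weights = mha(x, x, x)

print(out.shape)           # torch.Size([10, 2, 32])
print(attn_weights.shape)  # torch.Size([2, 10, 10]) - averaged over heads
```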
Russian Open Speech To Text (STT/ASR) Dataset
4000 hours of STT data in Russian
Made by us. Yes, really. I am not joking.
It was a lot of work.
- With the third release, we have ~4,000 hours;
- Contributors and help wanted;
- Let's bring the ImageNet moment for STT closer, together!;
Please repost this as much as you can.
Poor man's computing cluster
So, when I last checked, Amazon's p3.4xlarge instances cost around US$12 per hour (unless you reserve them for a year). A tower supercomputer from Nvidia probably costs US$40-50k or more (it was announced at around US$69k).
It is not difficult to crunch the numbers and see that one month of renting such a machine would cost at least US$8-10k. There will also be the additional cost / problem of actually storing your large datasets. When I last used Amazon, their cheap storage was sloooooow, and the fast storage was prohibitively expensive.
So, why am I saying this?
Let's assume (going by my miner friends' experience) that consumer Nvidia GPUs can work 2-3 years non-stop given proper cooling and care (test before buying!). Let's also assume that 4x Tesla V100 is roughly the same as 7-8x 1080Ti.
Yeah, I know that you will point out at least one reason why this does not hold, but for practical purposes this is fine (yes, I know that Teslas have some cool features like Nvlink).
Now let me drop the bombshell - modern professional motherboards often boast 2-3 Ethernet ports. Sometimes you can even get 2x 10Gbit/s ports (!!!).
This means that you can actually connect at least 2 machines (or maybe more, if you can daisy-chain them?) into a computing cluster.
Now let's crunch the numbers
According to quotes I have collected through the years, you can build a cluster roughly equivalent to Amazon's p3.4xlarge for US$10k (but with storage!) using used GPUs (miners are selling them like crazy now). If you buy second-hand drives, motherboards and CPUs, you can lower the cost to US$5k or less.
So a cluster that would serve you for at least one year (if you test everything properly and take care of it), costing US$10k, is roughly equivalent to:
- 20-25% of DGX desktop;
- 1 month of renting on Amazon;
Assuming that all the hardware will just break in a year:
- It is 4-5x cheaper than buying from Nvidia;
- It is 10x cheaper than renting;
If you buy everything used, then it is 10x and 20x cheaper!
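The arithmetic above, spelled out (the prices are the rough figures from the text, not exact quotes):

```python
# Back-of-the-envelope figures from the text, all in USD.
aws_hourly = 12                    # ~US$12/hour for a 4x V100 cloud instance
aws_month = aws_hourly * 24 * 30   # one month of renting
dgx_desktop = 49_000               # Nvidia tower supercomputer, approx
diy_new = 10_000                   # self-built cluster with used GPUs
diy_used = 5_000                   # everything bought second hand

print(aws_month)                         # 8640 -> the US$8-10k/month quoted
print(round(dgx_desktop / diy_new))      # ~5x cheaper than buying from Nvidia
print(round(12 * aws_month / diy_new))   # ~10x cheaper than a year of renting
print(round(dgx_desktop / diy_used))     # ~10x if everything is second hand
print(round(12 * aws_month / diy_used))  # ~21x vs. a year of renting
```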
I would buy that for a dollar!
Ofc you have to invest your free time.
See my calculations here:
Config spreadsheet excerpt (prices converted at RUR/USD = 65 - yes, I know you should use historical exchange rates):

| Server | Part | Approx quote (RUR) | Quote date | Price, USD |
|---|---|---|---|---|
| 1 | Thermaltake Core X9 Black | 12,220 | 11/22/2018 | 188 |
| 1 | Gigabyte X399 AORUS XTREME (Socket TR4, AMD X399, 8x DDR4, 7.1CH, 2x 1000 Mbit/s + 10000 Mbit/s LAN, Wi-Fi, Bluetooth, U...) | | | |
streaming STT lecture now
Tricky rsync flags
Rsync is the best program ever.
I find these flags the most useful
- --ignore-existing (skip files that already exist on the receiver)
- --update (skip files that are newer on the receiver, based on timestamps)
- --size-only (use only file size to decide which files changed)
- -e 'ssh -p 22 -i /path/to/private/key' (use a custom ssh port / identity)
Sometimes the first three flags get confusing.
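A throwaway local demo of how the flags behave (paths are disposable; the remote-host line at the end is just a placeholder):

```shell
# Local demo of the flag semantics (no remote host needed).
mkdir -p src dst
echo "v1" > src/a.txt
echo "old" > dst/a.txt

# --ignore-existing: dst/a.txt already exists, so it is left untouched
rsync -a --ignore-existing src/ dst/
cat dst/a.txt   # still "old"

# without the flag the file is synced (sizes differ, so it transfers)
rsync -a src/ dst/
cat dst/a.txt   # now "v1"

# a custom ssh identity / port would look like this (placeholder host):
# rsync -a -e 'ssh -p 2222 -i ~/.ssh/id_backup' src/ user@host:/srv/dst/
```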
More about STT from us also ... soon)
Cool docker function
View aggregate load stats by container
From the docs: display a live stream of container resource usage statistics. Usage: `docker stats [OPTIONS] [CONTAINER...]`. Options: `--all, -a` - show all containers (default shows just running)...
2019 DS / ML digest 9
Highlights of the week
- Stack Overflow survey;
- Unsupervised STT (ofc not!);
- A mix between detection and semseg?;
Archive team ... makes monthly Twitter archives
With all the BS around politics / "Russian hackers" / the Arab spring - Twitter has now closed its developer API.
Just pay a visit to the Archive Team page
Donate to them here
A simple collection of JSON grabbed from the general twitter stream, for the purposes of research, history, testing and memory. This is the Spritzer version, the most light and shallow of Twitter grabs. Unfortunately, we do not currently have access to the Sprinkler or Garden Hose versions of the...
Using snakeviz for profiling Python code
Useful for profiling complicated and convoluted code.
Snakeviz is a cool GUI tool to analyze cProfile profile files.
Just launch your code like this
python3 -m cProfile -o profile_file.cprofile your_script.py
And then just analyze with snakeviz.
They have a server GUI and a jupyter notebook plugin.
Also you can launch their tool from within a docker container:
snakeviz -s -H 0.0.0.0 profile_file.cprofile
Do not forget to EXPOSE the necessary ports. An SSH tunnel to the host is also an option.
SnakeViz is a browser based graphical viewer for the output of Python's cProfile module.
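You can also produce the same `.cprofile` file from inside your code, and pstats gives a quick text view of it without the GUI (the profiled function below is just a toy):

```python
import cProfile
import pstats

def slow_sum(n):
    # deliberately naive so it shows up in the profile
    total = 0
    for i in range(n):
        total += i * i
    return total

# Same artifact as `python3 -m cProfile -o profile_file.cprofile script.py`,
# but produced programmatically:
pr = cProfile.Profile()
pr.enable()
slow_sum(100_000)
pr.disable()
pr.dump_stats('profile_file.cprofile')

# The .cprofile file is what snakeviz visualizes; pstats can read the
# same file for a quick top-N text summary:
stats = pstats.Stats('profile_file.cprofile')
stats.sort_stats('cumulative').print_stats(5)
```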
PyTorch DataParallel scalability
TLDR - it works fine for 2-3 GPUs.
For more GPUs - use DDP.
2019 DS / ML digest number 8
Highlights of the week
- Transformer from Facebook with sub-word information;
- How to generate endless sentiment annotation;
- 1M breast cancer images;
Finally! Cool features like SyncBN or CyclicLR are migrating to PyTorch!
Miniaturize / optimize your ... NLP models?
For CV applications there are literally dozens of ways to make your models smaller.
And yeah, I do not mean some "moonshots" or special limited libraries (matrix decompositions, some custom pruning, etc etc).
I mean cheap and dirty hacks that work in 95% of cases regardless of your stack / device / framework:
- Smaller images (x3-x4 easy);
- FP16 inference (30-40% maybe);
- Knowledge distillation into smaller networks (x3-x10);
- Naïve cascade optimizations (feed only Nth frame using some heuristic);
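For illustration, the knowledge-distillation bullet boils down to a loss like this - a numpy sketch of the standard temperature-softened recipe (Hinton-style), not tied to any particular stack:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy between temperature-softened teacher and student
    distributions - the core of training a smaller student network."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T))
    # scaled by T^2 so gradient magnitudes stay comparable across T
    return -(p_teacher * log_p_student).sum(axis=-1).mean() * T**2

teacher = np.array([[5.0, 1.0, -2.0]])
student = np.array([[4.0, 1.5, -1.0]])
print(float(distillation_loss(student, teacher)))
```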
But what can you do with NLP networks?
Turns out not much.
But here are my ideas:
- Use a simpler model - embedding bag + plain self-attention + LSTM can solve 90% of tasks;
- Decrease embedding size from 300 to 50 (or maybe even more). Tried and tested, works like a charm. For harder tasks you lose just 1-3pp of your target metric, for smaller tasks - it is just the same;
- FP16 inference is supported in PyTorch for nn.Embedding, but not for nn.EmbeddingBag (the error is "_embedding_bag is not implemented for type torch.HalfTensor"). But you get the idea;
- You can try distilling your vocabulary / embedding-bag model into a char-level model. If it works, you can trade model size vs. inference time;
- If you have very long sentences or large batches - try distilling / swapping your recurrent network for a CNN / TCN. This way you can also trade model size vs. inference time, but probably in a different direction;
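To make the embedding-size bullet concrete, here is the memory arithmetic for a hypothetical 100k-word vocabulary, comparing 300-dim fp32 embeddings with 50-dim fp16 ones:

```python
import numpy as np

vocab = 100_000

# 300-dim fp32 embeddings vs 50-dim embeddings, also cast to fp16
full = np.zeros((vocab, 300), dtype=np.float32)
slim = np.zeros((vocab, 50), dtype=np.float16)

print(full.nbytes // 2**20)       # 114 (MB)
print(slim.nbytes // 2**20)       # 9 (MB)
print(full.nbytes / slim.nbytes)  # 12.0x smaller
```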
Good old OLS regression
I needed some quick boilerplate to create an OLS regression with confidence intervals for a very plain task.
Found some nice statsmodels examples here:
2019 DS / ML digest number 7
Highlights of the week
- NN normalization techniques (not batch norm);
- Jetson nano for US$99 released;
- A bitter lesson in AI;
Wow, we are not alone in our love for EmbeddingBag!
... or you can just extend the collate_fn that is passed to DataLoader in PyTorch =)
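A sketch of such an extended `collate_fn` that pads variable-length sequences into one batch tensor (the data and names here are illustrative):

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Variable-length sequences break the default collate; a custom
# collate_fn pads them and keeps the original lengths around.
data = [torch.tensor([1, 2, 3]), torch.tensor([4, 5]), torch.tensor([6])]

def pad_collate(batch):
    lengths = torch.tensor([len(seq) for seq in batch])
    padded = pad_sequence(batch, batch_first=True, padding_value=0)
    return padded, lengths

loader = DataLoader(data, batch_size=3, collate_fn=pad_collate)
padded, lengths = next(iter(loader))
print(padded.shape)  # torch.Size([3, 3])
print(lengths)       # tensor([3, 2, 1])
```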
Normalization techniques other than batch norm:
Weight normalization (used in TCN arxiv.org/abs/1602.07868):
- Decouples length of weight vectors from their direction;
- Does not introduce any dependencies between the examples in a minibatch;
- Can be applied successfully to recurrent models such as LSTMs;
- Tested only on small datasets (CIFAR + VAEs + DQN);
Instance norm (used in [style transfer](arxiv.org/abs/1607.08022))
- Proposed for style transfer;
- Essentially batch norm for a single image;
- The mean and standard-deviation are calculated per-dimension separately for each object in a mini-batch;
Layer norm (used in Transformers, [paper](arxiv.org/abs/1607.06450))
- Designed especially for sequential networks;
- Computes the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case;
- The mean and standard deviation are calculated separately over the last few dimensions;
- Unlike Batch Normalization and Instance Normalization, which apply a scalar scale and bias for each entire channel/plane with the affine option, Layer Normalization applies per-element scale and bias;
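To make the per-case statistics and per-element affine parameters concrete, here is layer norm over the last dimension in plain numpy (a sketch of the computation, not the framework implementation):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize over the last dimension: statistics are computed per
    training case, and gamma/beta are per-element (unlike batch norm)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(2, 5, 8)           # (batch, seq, features)
gamma, beta = np.ones(8), np.zeros(8)
out = layer_norm(x, gamma, beta)

# each (batch, seq) position is normalized independently
print(np.allclose(out.mean(axis=-1), 0, atol=1e-6))  # True
```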
2019 DS / ML digest number 6
Highlights of the week
- Cool python features;
- Google's on-device STT;
- Why Facebook invested so much in PyTorch 1.0;
New video from 3B1B
Which is kind of relevant
Our Transformer post was featured by Towards Data Science
New tricks for training CNNs