Andrew Ng released the first 4 chapters of his new book
So far it does not look very technical
DS Bowl 2018 top solution
This is really interesting...their approach to separation is cool
A draft of the article about DS Bowl 2018 on Kaggle.
This time it was a lottery.
Good that I did not really spend much time, but this time I learned a lot about watershed and some other instance segmentation methods!
An article is accompanied by a dockerized PyTorch code release on GitHub:
This is a beta, you are welcome to comment and respond.
In this article I will describe my solution to the DS Bowl 2018, explain why it was a lottery, and post a link to my dockerized solution Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me
2018 DS/ML digest 8
As usual my short bi-weekly (or less) digest of everything that passed my BS detector
Market / blog posts
(0) Fast.ai about the importance of accessibility in ML - www.fast.ai/2018/04/10/stanford-
(1) Some interesting news about market, mostly self-driving cars (the rest is crap) - goo.gl/VKLf48
(2) US$600m investment into Chinese face recognition - goo.gl/U4k2Mg
Libraries / frameworks / tools
(0) New 5 point face detector in Dlib for face alignment task - goo.gl/T73nHV
(3) CNNs on FPGAs by ZFTurbo
(4) Data version control - looks cool
-- but I will not use it - because proper logging and treating data as immutable solve the issue
-- looks like over-engineering for the sake of over-engineering (unless you create 100500 datasets per day)
(0) TF Playground to see how the simplest CNNs work - goo.gl/cu7zTm
(0) Looks like GAN + ResNet + Unet + content loss - can easily solve simpler tasks like deblurring goo.gl/aviuNm
(1) You can apply dilated convolutions to NLP tasks - habrahabr.ru/company/ods/blog/35
(2) High level overview of face detection in ok.ru - goo.gl/fDUXa2
(3) Alternatives to DWT and Mask-RCNN / RetinaNet? medium.com/@barvinograd1/instanc
- Has anybody tried anything here?
(0) A more disciplined approach to training CNNs - arxiv.org/abs/1803.09820 (LR regime, hyper param fitting etc)
(1) GANs for image compression - arxiv.org/pdf/1804.02958.pdf
(2) Paper reviews from ODS - mostly moonshots, but some are interesting
(3) SqueezeNext - the new SqueezeNet - arxiv.org/abs/1803.10615
DS Bowl 2018 stage 2 data was released.
It has completely different distribution from stage 1 data.
How do you like them apples?
Looks like Kaggle admins really have no idea about dataset curation, or all of this is meant to misguide manual annotators.
Anyway - looks like random bs.
Yolov3 - best paper.
Not in terms of scientific contribution, but as a rebuttal of the DS community's BS.
Very funny read.
If you want a proper comparison of object detection algorithms - use this paper arxiv.org/abs/1611.10012
Looks like SSD and YOLO are reasonably good and fast, and RCNN can be properly tuned to be 3-5x slower (not 100x) and more accurate.
As you may know (for newer people on the channel), sometimes we publish small articles on the website.
This time it covers a recent Power Laws challenge on DrivenData, which at first seemed legit and cool, but in the end turned back into a pumpkin.
Here is an article:
In this article I share our experience participating in a recent time series challenge on DrivenData and my personal ideas about ML competitions
NLP project peculiarities
(0) Always handle new words somehow
(1) Easy evaluation of test results - you can just look at it
(2) Key difference is always in the domain - short or long sequences / sentences / whole documents - require different features / models / transfer learning
Basic Approaches to modern NLP projects
(0) Basic pipeline
(1) Basic preprocessing
- Stemming / lemmatization
- Regular expressions
(2) Naive / old school approaches that can just work
- Bag of Words => simple model
- Bag of Words => tf-idf => SVD / PCA / NMF => simple model
- Average / sum of Word2Vec embeddings
- Word2Vec * tf-idf >> Doc2Vec
- Small documents => embeddings work better
- Big documents => bag of features / high level features
(4) Sentiment analysis features
- n-chars => won several Kaggle competitions
(5) Also a couple of articles for developing intuition for sentence2vec
(6) Transfer learning in NLP - looks like it may become more popular / prominent
- Jeremy Howard's preprint on NLP transfer learning - arxiv.org/abs/1801.06146
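The classic "Bag of Words => tf-idf => SVD / PCA / NMF => simple model" route from (2) above can be sketched in a few lines with scikit-learn. The toy corpus and labels below are invented purely for illustration:

```python
# Sketch of the BoW => tf-idf => SVD => simple model pipeline.
# The corpus and labels are made up for illustration only.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

corpus = [
    "the movie was great and fun",
    "a great fun film",
    "the movie was boring and bad",
    "a bad boring film",
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),           # BoW + tf-idf weighting
    ("svd", TruncatedSVD(n_components=2)),  # dimensionality reduction (LSA)
    ("clf", LogisticRegression()),          # simple model on top
])
pipe.fit(corpus, labels)
preds = pipe.predict(["a great film", "a boring movie"])
```

Swapping `TruncatedSVD` for `PCA` or `NMF` and `LogisticRegression` for any other simple estimator gives the remaining variants from the list.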
So, I have briefly watched Andrew Ng's series on RNNs.
It's super cool if you do not know much about RNNs and / or want to refresh your memories and / or want to jump start your knowledge about NLP.
Also he explains stuff with really simple and clear illustrations.
Tasks in the course are also cool (notebook + submit button), but they are far removed from real practice, as they imply coding gradients and forward passes from scratch in python.
(which I did enough during his classic course)
Also no GPU tricks / no research / production boilerplate ofc.
Below are ideas and references that may be useful to know about NLP / RNNs for everyone.
Also for NLP:
(0) Key NLP sota achievements in 2017
(1) Consider fast.ai courses and notebooks github.com/fastai/courses/tree/m
(2) Consider NLP newsletter newsletter.ruder.io
(3) Consider excellent PyTorch tutorials pytorch.org/tutorials/
(4) There is a lot of quality code in the PyTorch community (e.g. 1-2 page GloVe implementations!)
(5) Brief 1-hour intro to practical NLP www.youtube.com/watch?v=Ozm0bEi5
Also related posts on the channel / libraries:
(1) Pre-trained vectors in Russian - snakers41.spark-in.me/1623
(2) How to learn about CTC loss snakers41.spark-in.me/1690 (when our seq2seq )
(3) Most popular NLP libraries for English - snakers41.spark-in.me/1832
(4) NER in Russian - habrahabr.ru/post/349864/
(5) Lemmatization library in Russian - pymorphy2.readthedocs.io/en/late
Basic tasks considered more or less solved by RNNs
(1) Speech recognition / trigger word detection
(2) Music generation
(3) Sentiment analysis
(4) Machine translation
(5) Video activity recognition / tagging
(6) Named entity recognition (NER)
Problems with standard CNN when modelling sequences:
(1) Different length of input and output
(2) Features for different positions in the sequence are not shared
(3) Enormous number of params
Typical word representations
(1) One-hot encoded vectors (10-50k for typical solutions, 100k-1m for commercial / sota solutions)
(2) Learned embeddings - reduce computation burden to ~300-500 dimensions instead of 10k+
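The difference between the two representations is easy to see in code (vocabulary size and embedding dimension below are just the typical figures mentioned above):

```python
import numpy as np

vocab_size, emb_dim = 10_000, 300  # typical sizes from the notes above

# (1) One-hot: each word is a sparse vocab_size-dimensional vector
word_idx = 42
one_hot = np.zeros(vocab_size)
one_hot[word_idx] = 1.0

# (2) Learned embedding: a dense lookup table with one row per word
embedding_matrix = np.random.randn(vocab_size, emb_dim) * 0.01
word_vector = embedding_matrix[word_idx]  # just a row lookup

# 300 dense dimensions instead of a 10k-dimensional sparse vector
```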
Typical rules of thumb / hacks / practical approaches for RNNs
(0) Typical architectures - deep GRU (lighter) and LSTM cells
(1) Tanh or RELU for hidden layer activation
(2) Sigmoid for output when classifying
(3) Usage of EndOfSentence, UNKnown word, StartOfSentence, etc tokens
(4) Usually word level models are used (not character level)
(5) Passing hidden state in encoder-decoder architectures
(6) Vanishing gradients - typically GRUs / LSTMs are used
(7) Very long sequences for time series - AR features are used instead of long windows; typically GRUs and LSTMs are good for sequences of 200-300 steps (easier and more straightforward than attention in this case)
(8) Exploding gradients - standard solution - clipping, though it may lead to inferior results (from practice)
(9) Teacher forcing - substitute predicted y_t+1 with real value when training seq2seq model during forward pass
(10) Peephole connections - let the GRU or LSTM see c_t-1 from the previous hidden state
(11) Finetune imported embeddings for smaller tasks with smaller datasets
(12) On big datasets - may make sense to learn embeddings from scratch
(13) Usage of bidirectional LSTMs / GRUs / Attention where applicable
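Point (8), clipping by global norm, fits in a few lines. This is an illustrative pure-NumPy sketch; in practice you would use your framework's built-in clipping utility:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays if their global L2 norm
    exceeds max_norm - the standard fix for exploding gradients."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

# Fake "exploding" gradients for two parameter tensors
grads = [np.full((3, 3), 10.0), np.full((3,), 10.0)]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
```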
Typical similarity functions for high-dim vectors
(0) Cosine (angle)
Seminal papers / constructs / ideas:
(1) Training embeddings - the later the methods came out - the simpler they are
- Matrix factorization techniques
- Naive approach using a language model + softmax (intractable for large corpora)
- Negative sampling + skip gram + logistic regression = Word2Vec (2013)
-- useful ideas
-- if there is information - a simple model (i.e. logistic regression) will work
-- negative sampling - sample words with frequency between uniform and the 3/4 power of corpus frequency, to down-weight frequent words
-- train only a limited number of classifiers (i.e. 5-15, 1 positive sample + k negative) on each update
-- skip-gram model in a nutshell - prntscr.com/iwfwb2
- GloVe - Global Vectors (2014)
-- supposedly GloVe is better given same resources than Word2Vec - prntscr.com/iwf9bx
-- in practice word vectors with 200 dimensions are enough for applied tasks
-- considered to be one of sota solutions now (afaik)
(2) BLEU score for translation
- essentially an exp of modified precision index for logs of 4 n-grams
(3) Attention is all you need
To be continued.
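The 3/4-power sampling trick from the Word2Vec notes above can be sketched like this (the corpus counts below are made up; pure NumPy, not the actual Word2Vec code):

```python
import numpy as np

# Made-up corpus frequencies: the first word dominates the raw counts
counts = np.array([1000.0, 100.0, 10.0, 1.0])

raw = counts / counts.sum()          # raw corpus frequency (~0.90 for word 0)
# Word2Vec negative sampling: raise counts to the 3/4 power -
# a distribution between uniform and raw frequency, so frequent
# words are sampled less often and rare words more often (~0.82 for word 0)
smoothed = counts ** 0.75
smoothed /= smoothed.sum()

# Draw k = 5 negative samples per update from the smoothed distribution
rng = np.random.default_rng(0)
negatives = rng.choice(len(counts), size=5, p=smoothed)
```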
Finally a proper LightGBM / XGB / CatBoost practical comparison!
A video about realistic state of chat-bots (RU)
A practical note on using df.to_feather()
Works really well, if you have an NVME drive and you want to save a large dataframe to disk in binary format.
If your NVME is properly installed it will give you 1.5-2+GB/s read/write speed, so even if your df is 20+GB in size, it will read literally in seconds.
The ETL process to produce such a df may take minutes.
New cool trick - use df.to_feather() instead of pickle or csv
Supposed to work much faster as it dumps the data the same way it is laid out in RAM
An article about how to use CLI params in python with argparse
If this is too slow - then just use this as a starter boilerplate
- goo.gl/Bm39Bc (this is how I learned it)
Why do you need this?
- Run long overnight (or even day long) jobs in python
- Run multiple experiments
- Make your code more tractable for other people
- Expose a simple API for others to use
The same can be done via newer frameworks, but why learn an abstraction that may die soon, instead of using instruments that have worked for decades?
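A minimal boilerplate of the kind described (the flag names here are just examples, not from any particular project):

```python
import argparse

# Minimal CLI boilerplate for long-running training jobs;
# flag names are illustrative
parser = argparse.ArgumentParser(description="Train a model overnight")
parser.add_argument("--epochs", type=int, default=10,
                    help="number of training epochs")
parser.add_argument("--lr", type=float, default=1e-3,
                    help="learning rate")
parser.add_argument("--resume", action="store_true",
                    help="resume from the last checkpoint")

# In a real script you would call parser.parse_args() with no arguments;
# a list is passed here so the example is self-contained
args = parser.parse_args(["--epochs", "50", "--resume"])
```

Now the same script can drive multiple overnight experiments just by varying the flags, and the `--help` output documents the API for other people.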
2018 DS/ML digest 6
(1) A new amazing post by Google on distil - distill.pub/2018/building-blocks
This is really amazing work, but their notebooks tell me that it is a far cry from being usable by the community - goo.gl/3c1Fza
This is how the CNN sees the image - goo.gl/S4KT5d
Expect this to be packaged as part of Tensorboard in a year or so)
(1) New landmark dataset by Google - goo.gl/veSEhg - looks cool, but ...
Given that the datasets are really huge... ~300G
Also if you win, you will have to buy a ticket to the USA with your own money ...
(2) Useful script to download the images goo.gl/JF93Xx
(3) Imagenet for satellite imagery - xviewdataset.org/#register - pre-register
(4) CVPR 2018 for satellite imagery - deepglobe.org/challenge.html
Papers / new techniques
(1) Improving RNN performance via auxiliary loss - arxiv.org/pdf/1803.00144.pdf
(2) Satellite imaging for emergencies - arxiv.org/pdf/1803.00397.pdf
(3) Baidu - neural voice cloning - goo.gl/uJe852
(1) Google TPU benchmarks - goo.gl/YKL9yx
As usual such charts do not show consumer hardware.
My guess is that a single 1080Ti may deliver comparable performance (i.e. 30-40% of it) for ~US$700-1000, i.e. ~150 hours of rent (this is ~1 week!)
Miners say that 1080Ti can work 1-2 years non-stop
(2) MIT and SenseTime announce effort to advance artificial intelligence research goo.gl/MXB3V9
(3) Google released its ML course - goo.gl/jnVyNF - but generally it is a big TF ad ... Andrew Ng is better for grasping concepts
(1) Interesting thing - all ISPs have some preferential agreements between each other - goo.gl/sEvZMN
So, I have benched XGB vs LightGBM vs CatBoost. Also I endured xgb and lgb GPU installation. This is just general usage impression, not a hard benchmark.
My thoughts are below.
(1) Installation - CPU
(all) - are installed via pip or conda in one line
(2) Installation - GPU
(xgb) - easily done via following their instructions, only nvidia drivers required;
(lgb) - easily done on Azure cloud. On Linux it requires some drivers that may be lagging. Could not integrate my Dockerfile with their instructions, but their Dockerfile worked perfectly;
(cb) - instructions were too convoluted for me to follow;
(3) Docs / examples
(xgb) - the worst one, fine-tuning guidelines are murky and unpolished;
(lgb) - their python API is not entirely well documented (e.g. some options can be found only on forums), but overall the docs are very decent + some ft hints;
(cb) - overall docs are nice, a lot of simple examples + some boilerplate in .ipynb format;
(4) Performance out of the box
(xgb) - works poorly. Maybe my params are bad, but the out-of-the-box params of the sklearn API use 5-10x more time than the rest and lag in accuracy;
(lgb) - best performing one out of the box, fast + accurate;
(cb) - fast but less accurate;
(5) Performance after tuning
(xgb) - best accuracy
(lgb) - fast, high accuracy
(cb) - fast, worse accuracy
(6) GPU usage
(xgb) - for some reason accuracy when using full GPU options is really really bad. Forum advice does not help.
It is tricky to launch XGB fully on GPU. People report that on the same data CatBoost has inferior quality w/o tweaking (but is faster). LightGBM is reported to be faster and to have the same accuracy.
So I tried adding LightGBM with GPU support to my Dockerfile -
One of the caveats I understood - it supports only older Nvidia drivers, up to 384.
Luckily, there is a Dockerfile by MS that seems to be working (+ jupyter, but I could not install extensions)
LightGBM - A fast, distributed, high performance gradient boosting (GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
A great survey - how to work with imbalanced data
Has this happened to you? You are working on your dataset. You create a classification model and get 90% accuracy immediately. “Fantastic” you think. You dive a little deeper and discover that 90% of the data belongs to one class. Damn! This is an example of an imbalanced dataset and the frustrating results it can …
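One of the standard fixes from such surveys, reweighting classes inversely to their frequency, is essentially one line in sklearn (the 90/10 data below is synthetic, mirroring the example in the blurb):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Synthetic 90/10 imbalanced labels, like the 90% accuracy trap above
rng = np.random.default_rng(0)
y = np.array([0] * 900 + [1] * 100)
X = y.reshape(-1, 1) + rng.normal(scale=2.0, size=(1000, 1))

# Weights inversely proportional to class frequency:
# n_samples / (n_classes * class_count)
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)

# Or just let the model handle it internally
clf = LogisticRegression(class_weight="balanced").fit(X, y)
```

Other remedies from the survey (over/under-sampling, SMOTE, different metrics) need more code, but class weights are usually the cheapest first step.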
2017 DS/ML digest 5
(1) Hardcore metal + CNNs + style transfer - goo.gl/VHYfHe
(1) Post by Nvidia goo.gl/6Mw4CB
(2) Some links to sota semseg articles
(3) Useful tools for CV - floodfill and grabcut, but guys from Nvidia did not notice ... that road width was in geojson data...
(4) Looks like they replicated the results just for PR, but their masks do not look appealing
Research / papers / libraries
(1) Neural Voice Cloning with a Few Samples - goo.gl/LwmzRf (demos audiodemos.github.io.)
(2) A library for CRFs in Python - goo.gl/cQc8hA
(4) URLs + CNN - malicious link detection - arxiv.org/abs/1802.03162
(1) 3m anime image dataset - www.gwern.net/Danbooru2017
(2) Google HDR dataset - goo.gl/XEL1Fm
(1) Idea - AMT + blockchain - goo.gl/JfzEPV
(2) ARM to make processors for CNNs? - goo.gl/MpdPSB
(3) Google TPU in beta - goo.gl/gRzq9t - very expensive. + Note the rumours that Google's own people do not use their TPU quota
(4) One guy managed to deploy a PyTorch model using ONNX - goo.gl/QD4DkZ
So ofc I tried the new Jupyter lab.
And it is really cool that something so simple / cool / useful is completely free / no strings attached (yet). But I will not use it professionally.
Use my Dockerfile if you want to check it out with my DL environment:
But in a nutshell it worked with the following jupyter params inside the container
CMD jupyter lab --port=8888 --ip=0.0.0.0 --no-browser
And installation is as easy as
conda install -c conda-forge jupyterlab
Docs are a bit sparse for now
But this is a list of reasons, why you might consider sticking to ssh pass-through for auto-complete / terminal and jupyter notebook with extensions:
(0) It is still in beta, so unless your professional path is connected with node-js / web - you better pass now
(1) The existence of amazing extensions for Jupyter notebook that do 95% of what you might need - goo.gl/K86gjp
(2) The built-in terminal is much better than before, but it pales in comparison with Putty or even a standard linux shell (autocomplete?)
(3) Some of built-in extensions like image viewer are really useful, but overall the product is a bit beta (which they openly say it is)
And here is why turning Jupyter notebook into a real environment is really cool:
(1) Building everything based on extensions IS REALLY COOL - and in the long run will encourage people to port jupyter extensions and build a really powerful tool. Also this implies diversity and freedom unlike shitty tools like Zeppelin
(2) After some effort, it may really replace the terminal, IDE, desktop environment and notebooks for data-oriented people (I guess in 6-12 months)
(3) Structuring extensions and npm packages lures the most fast developing web-developer community to support the project and provides transparency and clarity
Visualizing Large-scale and High-dimensional Data:
A paper behind an awesome library github.com/lmcinnes/umap
Follows the success of T-SNE, but is MUCH faster
Typical visualization pipeline goo.gl/J2ViqE
Also works awesomely with datashader.org
(1) goo.gl/DbBeNQ (on a machine with 512GB memory, 32 cores at 2.13GHz)
(2) On 3m data points * 100 dimensions, LargeVis is up to 30x faster at graph construction and 7x at graph visualization
(1) K-nearest neighbor graph = computational bottleneck
(2) T-SNE constructs the graph using the technique of vantage-point trees, the performance of which significantly deteriorates for high dimensions
(3) Parameters of t-SNE are very sensitive to different data sets
(1) Create a small number of projection trees (similar to random forest). Then for each node of the graph search the neighbors of its neighbors, which are also likely to be candidates of its nearest neighbors
(2) Use SGD (or asynchronous SGD) to minimize the graph loss goo.gl/ps7EMm
(3) Clever sampling - sample the edges with the probability proportional to their weights and then treat the sampled edges as binary edges. Also sample some negative (not observed) edges
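The sampling trick in (3) can be sketched on a toy graph (pure NumPy, illustrative only, not the actual LargeVis code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy weighted KNN graph: edges as (i, j) pairs with positive weights
edges = np.array([[0, 1], [1, 2], [2, 3], [3, 0]])
weights = np.array([10.0, 1.0, 1.0, 1.0])
n_nodes = 4

# Sample edges with probability proportional to weight,
# then treat each sampled edge as a binary edge
probs = weights / weights.sum()
sampled = edges[rng.choice(len(edges), size=100, p=probs)]

# Negative sampling: pair each source node with random
# (most likely unconnected) nodes as negative edges
negative_js = rng.integers(0, n_nodes, size=100)
```

The heavy edge (0, 1) dominates the positive samples, which is exactly how weighted edges get converted into a stream of binary updates for SGD.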
2017 DS/ML digest 4
Applied cool stuff
- How Dropbox build their OCR - via CTC loss - goo.gl/Dumcn9
- CNN forward pass done in Google Sheets - goo.gl/pyr44P
- New Boston Robotics robot - opens doors now - goo.gl/y6G5bo
- Cool but toothless list of jupyter notebooks with illustrations and models modeldepot.io
- Best CNN filter visualization tool ever - ezyang.github.io/convolution-vis
New directions / moonshots / papers
- IMPALA from Google - DMLab-30, a set of new tasks that span a large variety of challenges in a visually unified environment with a common action space
- Trade crypto via RL - goo.gl/NmCQSY?
- SparseNets? - arxiv.org/pdf/1801.05895.pdf
- Use Apple watch data to predict diseases arxiv.org/abs/1802.02511?
- Google - Evolution in auto ML kicks in faster than RL - arxiv.org/pdf/1802.01548.pdf
- R-CNN for human pose estimation + dataset
-- Website + video densepose.org
-- Paper arxiv.org/abs/1802.00434
Google's Colaboratory gives free GPUs?
- Old GPUs
- 12 hours limit, but very cool in theory
Sick sad world
- China has police Google Glass with face recognition goo.gl/qfNGk7
- Why slack sucks - habrahabr.ru/post/348898/
-- Email + google docs is better for real communication
- Globally there are 22k ML developers goo.gl/1Jpt9P
- One more AI chip moonshot - goo.gl/199f5t
- Google made their TPUs public in beta - US$6 per hour
- CNN performance comparable to human level in dermatology (R-CNN) - goo.gl/gtgXVn
- Deep learning is greedy, brittle, opaque, and shallow goo.gl/7amqxB
- One more medical ML investment - US$25m for cancer - goo.gl/anndPP
Article on SpaceNet Challenge Three in Russian on habrahabr - please support us with your comments / upvotes
Also if you missed:
- The original article spark-in.me/post/spacenet-three-
- The original code release github.com/snakers4/spacenet-thr
... and Jeremy Howard from fast.ai retweeted our solution, lol
But to give some idea of the pain the TopCoder platform inflicts on contestants, you can read
- Data Download guide goo.gl/EME8nA
- Final testing guide goo.gl/DCvTNN
- Code release for their verification process
Useful links about Datashader
- Home - datashader.org
-- OpenSky goo.gl/N3dcgD
-- 300M census data goo.gl/qarBVj
-- NYC Taxi data goo.gl/pyJa6v
- Readme (md is broken) goo.gl/hBZ5D2
- Datashader pipeline - what you need to understand to use it with examples - goo.gl/QHT7W1
Also see 2 images above)
Turns even the largest data into images, accurately.
So, I accidentally got a chance to talk to the Vice President of GameWorks at Nvidia in person =)
All of this should be taken with a grain of salt. I am not endorsing Nvidia.
- In the public part of the speech he spoke about public Nvidia research projects - most notable / fresh was Nvidia Holodeck - their VR environment
- Key insight - even though Rockstar forbade using GTA images for deep learning, he believes that artificial images used for annotation will be the future of ML, because game engines and OSes are the most complicated software ever
Obviously, I asked interesting questions afterwards =) Most notably about the GPU market and its forces
- GameWorks = 200 people doing AR / VR / CNN research
- The biggest team in Nvidia is 2000 - drivers
- Ofc he refused to say when the new generation of GPUs will be released and whether the rumour that their current generation GPUs are no longer being produced is true
- He says they are mostly software company focusing on drivers
- Each generation cycle takes 3 years, Nvidia has only one architecture per generation, all the CUDA / ML stuff was planned in 2012-2014
- A rumour about Google TPU. Google has an internal quota - allegedly (!) they cannot buy more GPUs than TPUs, but TPUs are 1% utilized, and allegedly they lure Nvidia people to optimize their GPUs to make sure they use this quota efficiently
- AMD R&D spend on both CPU and GPU is less than Nvidia spend on GPU
- He says that the newest AMD cards have 30-40% more FLOPs, but they are compared against previous generation consumer GT cards on synthetic tests. AMD does not have a 2000-people driver team...
- He says that Intel has 3-5 new architectures in the works - which may be a problem