snakers4 (Alexander), February 15, 09:50

Visualizing Large-scale and High-dimensional Data (the LargeVis paper):

A paper behind an awesome library github.com/lmcinnes/umap

Link arxiv.org/abs/1602.00370

Builds on the success of t-SNE, but is MUCH faster

Typical visualization pipeline goo.gl/J2ViqE

Also works awesomely with datashader.org

Convergence speed

(1) goo.gl/DbBeNQ (on a machine with 512GB memory, 32 cores at 2.13GHz)

(2) On 3M data points with 100 dimensions, LargeVis is up to 30x faster than t-SNE at graph construction and 7x faster at graph visualization

Examples

(1) goo.gl/zQ5kjC

(2) goo.gl/m7d3AW

(3) goo.gl/PtEGBr

T-SNE drawbacks

(1) K-nearest neighbor graph = computational bottleneck

(2) t-SNE builds the graph with vantage-point trees, whose performance deteriorates significantly in high dimensions

(3) The parameters of t-SNE are very sensitive to the data set
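
To make drawback (1) concrete, here is a minimal sketch of an exact, brute-force K-nearest neighbor graph. It is not what t-SNE or LargeVis actually run; the function name `brute_force_knn` and the toy points are made up for illustration. The point is just that every pair of points gets compared, so the cost grows roughly as N^2 * D.

```python
import math

def brute_force_knn(points, k):
    """Return, for each point, the indices of its k nearest neighbors.

    Every point is compared against every other point, which is why this
    exact approach does not scale to millions of points.
    """
    neighbors = []
    for i, p in enumerate(points):
        # Distances from point i to all other points (N - 1 computations).
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()
        neighbors.append([j for _, j in dists[:k]])
    return neighbors

# Toy data: two tight pairs that are far apart from each other.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
knn = brute_force_knn(points, k=1)
```

Approximate methods (vantage-point trees in t-SNE, projection trees plus neighbor exploring in the paper) exist to avoid exactly this all-pairs loop.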

Algorithm itself

(1) Build a small number of random projection trees (similar to a random forest). Then, for each node of the graph, search the neighbors of its neighbors, which are also likely to be among its nearest neighbors

(2) Use SGD (or asynchronous SGD) to minimize the graph loss goo.gl/ps7EMm

(3) Clever sampling: sample edges with probability proportional to their weights, then treat the sampled edges as binary edges. Also sample some negative (unobserved) edges
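
Steps (1) and (3) can be sketched roughly as below. This is a toy illustration under stated assumptions, not the paper's implementation: the helper names `explore_neighbors` and `sample_training_edges` are hypothetical, the starting graph is written by hand instead of coming from projection trees, and negatives are drawn uniformly here for simplicity (which may differ from the noise distribution the paper uses).

```python
import random

def explore_neighbors(knn, dist, k):
    """One refinement pass: a neighbor of my neighbor is a good candidate."""
    refined = []
    for i, nbrs in enumerate(knn):
        candidates = set(nbrs)
        for j in nbrs:
            candidates.update(knn[j])  # neighbors of neighbors
        candidates.discard(i)          # a node is not its own neighbor
        ranked = sorted(candidates, key=lambda j: (dist(i, j), j))
        refined.append(ranked[:k])
    return refined

def sample_training_edges(edges, weights, num_nodes, num_negative, rng):
    """Pick one positive edge with probability proportional to its weight,
    plus a few negative (unobserved) edges from the same source node.

    The sampled edge is then treated as binary (weight 1), so the gradient
    step does not get scaled by wildly different edge weights.
    Assumes the source node has at least one unobserved target.
    """
    i, j = rng.choices(edges, weights=weights, k=1)[0]
    observed = set(edges)
    negatives = []
    while len(negatives) < num_negative:
        n = rng.randrange(num_nodes)  # uniform choice: a simplification
        if n != i and (i, n) not in observed:
            negatives.append(n)
    return (i, j), negatives

# Neighbor exploring on 1-D toy points; node 0 starts with a wrong neighbor.
xs = [0.0, 1.0, 2.5, 4.5]
dist = lambda a, b: abs(xs[a] - xs[b])
knn0 = [[2], [0], [1], [2]]
refined = explore_neighbors(knn0, dist, k=1)

# Weighted edge sampling with negatives.
rng = random.Random(0)
edges = [(0, 1), (1, 2), (2, 3)]
weights = [5.0, 1.0, 0.5]
pos, negs = sample_training_edges(edges, weights, 4, 2, rng)

# Heavier edges should be drawn more often over many samples.
counts = {e: 0 for e in edges}
for _ in range(2000):
    e, _negs = sample_training_edges(edges, weights, 4, 2, rng)
    counts[e] += 1
```

In this toy run, one exploring pass is enough to fix node 0's neighbor, and the edge with weight 5.0 ends up sampled far more often than the others. The SGD update from step (2) would consume each `(positive edge, negatives)` pair.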

## lmcinnes/umap