Spark in me - Internet, data science, math, deep learning, philo

snakers4 @ telegram, 1797 members, 1726 posts since 2016

All this - lost like tears in rain.

Data science, ML, a bit of philosophy and math. No bs.

Our website
- spark-in.me
Our chat
- t.me/joinchat/Bv9tjkH9JHYvOr92hi5LxQ
DS courses review
- goo.gl/5VGU5A
- goo.gl/YzVUKf

Posts by tag «data_science»:

snakers4 (Alexander), April 17, 08:55

Archive team ... makes monthly Twitter archives

With all the BS with politics / "Russian hackers" / Arab spring - Twitter has now closed its developer API.

No problem.

Just pay a visit to archive team page

archive.org/details/twitterstream?and[]=year%3A%222018%22

Donate to them here

archive.org/donate/

#data_science

#nlp

Archive Team: The Twitter Stream Grab : Free Web : Free Download, Borrow and Streaming : Internet Archive

A simple collection of JSON grabbed from the general twitter stream, for the purposes of research, history, testing and memory. This is the Spritzer version, the most light and shallow of Twitter grabs. Unfortunately, we do not currently have access to the Sprinkler or Garden Hose versions of the...


snakers4 (Alexander), April 17, 08:47

Using snakeviz for profiling Python code

Why

To profile complicated and convoluted code.

Snakeviz is a cool GUI tool to analyze cProfile profile files.

jiffyclub.github.io/snakeviz/

Just launch your code like this

python3 -m cProfile -o profile_file.cprofile your_script.py

And then just analyze with snakeviz.

GUI

They have a server GUI and a jupyter notebook plugin.

Also you can launch their tool from within a docker container:

snakeviz -s -H 0.0.0.0 profile_file.cprofile

Do not forget to EXPOSE necessary ports. SSH tunnel to a host is also an option.
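
If you would rather produce the profile file from inside Python than from the command line, a minimal sketch with the standard library could look like this. The function slow_sum and the output path are made up for illustration; the resulting file can then be opened with snakeviz as described above.

```python
import cProfile
import os
import pstats
import tempfile


def slow_sum(n):
    # Hypothetical workload, only here to give the profiler something to measure.
    return sum(i * i for i in range(n))


def profile_to_file(path, n=100_000):
    # Profile a single call and dump the stats in the binary cProfile format,
    # the same format that `python3 -m cProfile -o ...` writes.
    profiler = cProfile.Profile()
    profiler.enable()
    result = slow_sum(n)
    profiler.disable()
    profiler.dump_stats(path)
    return result


profile_path = os.path.join(tempfile.gettempdir(), "profile_file.cprofile")
profile_result = profile_to_file(profile_path)

# Quick text summary without a GUI: top 5 entries by cumulative time.
stats = pstats.Stats(profile_path)
stats.sort_stats("cumulative").print_stats(5)
```

Profiling only the suspicious part of the code this way keeps the profile file small and the snakeviz picture readable.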

#data_science

SnakeViz

SnakeViz is a browser based graphical viewer for the output of Python's cProfile module.


snakers4 (Alexander), March 26, 04:44

Russian sentiment dataset

In a typical Russian fashion - one of these datasets was deleted at the request of bad people, whom I shall not name.

Luckily, someone anonymous backed the dataset up.

Anyway - use it.

Yeah, it is small. But it is free, so whatever.

#nlp

#data_science

Download Dataset.tar.gz 1.57 MB

snakers4 (Alexander), March 25, 05:31

Good old OLS regression

I needed some quick boilerplate to create an OLS regression with confidence intervals for a very plain task.

Found some nice statsmodels examples here:

www.statsmodels.org/devel/examples/notebooks/generated/ols.html
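
For reference, here is a dependency-free sketch of what such boilerplate computes for a single predictor. It is not the statsmodels code from the link: the function name ols_with_ci and the data are made up, and it uses a normal quantile for the intervals, which is an assumption that only holds up for reasonably large samples (as far as I know, statsmodels reports t-based intervals, so its numbers will differ slightly on small data).

```python
import math
from statistics import NormalDist, mean


def ols_with_ci(x, y, confidence=0.95):
    """Fit y = intercept + slope * x; return (estimate, low, high) per coefficient."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Residual variance with n - 2 degrees of freedom.
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma2 = rss / (n - 2)
    se_slope = math.sqrt(sigma2 / sxx)
    se_intercept = math.sqrt(sigma2 * (1.0 / n + x_bar ** 2 / sxx))

    # Assumption: normal quantile instead of Student's t (large-sample shortcut).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return {
        "intercept": (intercept, intercept - z * se_intercept, intercept + z * se_intercept),
        "slope": (slope, slope - z * se_slope, slope + z * se_slope),
    }


# Made-up data: y is roughly 1 + 2x with a small deterministic wiggle.
xs = [float(i) for i in range(50)]
ys = [1.0 + 2.0 * xi + 0.5 * math.sin(xi) for xi in xs]
fit = ols_with_ci(xs, ys)
for name, (est, low, high) in fit.items():
    print(f"{name}: {est:.3f} [{low:.3f}, {high:.3f}]")
```

For anything beyond one predictor, the statsmodels examples from the link are the saner choice.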

#data_science

2019 DS / ML digest number 7

Highlights of the week

- NN normalization techniques (not batch norm);

- Jetson nano for US$99 released;

- A bitter lesson in AI;

spark-in.me/post/2019_ds_ml_digest_07

#digest

#deep_learning

2019 DS/ML digest 07

2019 DS/ML digest 06 Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), March 18, 06:18

6th 2019 DS / ML digest

Highlights of the week

- Cool python features;

- Google's on-device STT;

- Why Facebook invested so much in PyTorch 1.0;

spark-in.me/post/2019_ds_ml_digest_06

#digest

#data_science

#deep_learning

2019 DS/ML digest 06

2019 DS/ML digest 06 Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), March 06, 10:31

5th 2019 DS / ML digest

Highlights of the week

- New Adam version;

- POS tagging and semantic parsing in Russian;

- ML industrialization again;

spark-in.me/post/2019_ds_ml_digest_05

#digest

#data_science

#deep_learning

2019 DS/ML digest 05

2019 DS/ML digest 05 Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), February 18, 09:24

4th 2019 DS / ML digest

Highlights of the week

- OpenAI controversy;

- BERT pre-training;

- Using transformer for conversational challenges;

spark-in.me/post/2019_ds_ml_digest_04

#digest

#data_science

#deep_learning

2019 DS/ML digest 04

2019 DS/ML digest 04 Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), February 08, 10:11

Third 2019 DS / ML digest

Highlights of the week

- quaternions;

- ODEs;

spark-in.me/post/2019_ds_ml_digest_03

#digest

#data_science

#deep_learning

2019 DS/ML digest 03

2019 DS/ML digest 03 Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), January 31, 09:41

Second 2019 DS / ML digest

Highlight of the week - Facebook's LASER.

spark-in.me/post/2019_ds_ml_digest_02

#digest

#data_science

#deep_learning

2019 DS/ML digest 02

2019 DS/ML digest 02 Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), January 31, 08:38

Jupyter widgets + pandas

towardsdatascience.com/interactive-controls-for-jupyter-notebooks-f5c94829aee6

With the @interact decorator, the IPywidgets library automatically gives us a text box and a slider for choosing a column and number! It looks at the inputs

Amazing.

#data_science

Interactive Controls in Jupyter Notebooks

How to use IPywidgets to enhance your data exploration and analysis


snakers4 (Alexander), January 30, 12:06

Serialization of large objects in Python

So far I have found no sane way to do this with 1M chunks / 10GB+ object size.

Of course, chunking / plain txt works.

Feather / parquet - fail with 2+GB size.

Pickle works, but it is kind of slow.

=(
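
The chunking route mentioned above can be combined with pickle: instead of one huge pickle, write a sequence of smaller ones into the same file and read them back one at a time. A minimal sketch, assuming the object can be iterated item by item; the function names, chunk size and toy data are made up for illustration.

```python
import os
import pickle
import tempfile


def dump_in_chunks(items, path, chunk_size=100_000):
    """Write `items` as a sequence of independent pickles, one per chunk."""
    with open(path, "wb") as f:
        chunk = []
        for item in items:
            chunk.append(item)
            if len(chunk) == chunk_size:
                pickle.dump(chunk, f, protocol=pickle.HIGHEST_PROTOCOL)
                chunk = []
        if chunk:
            # Do not lose the last, partially filled chunk.
            pickle.dump(chunk, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_chunks(path):
    """Yield chunks back one at a time, so only one chunk is in memory."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                break


# Toy data standing in for the real 10GB+ object.
records = [{"id": i, "text": f"row {i}"} for i in range(250)]
chunk_path = os.path.join(tempfile.gettempdir(), "big_object.pkl")
dump_in_chunks(records, chunk_path, chunk_size=100)

restored = [row for chunk in load_chunks(chunk_path) for row in chunk]
chunk_sizes = [len(chunk) for chunk in load_chunks(chunk_path)]
```

This does not make pickle itself faster, but it keeps memory bounded and lets you process the data as a stream.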

#data_science

snakers4 (Alexander), January 15, 08:33

First 2019 DS / ML digest

No particular highlights - maybe just that the ML industrialization trend is here to stay?

spark-in.me/post/2019_ds_ml_digest_01

#digest

#deep_learning

#data_science

2019 DS/ML digest 01

2019 DS/ML digest 01 Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), December 30, 2018

Spark in me 2018 annual retrospective

TLDR:

- My personal progress and some views;

- ML is still amazing, but there are no illusions anymore;

- Telegram is still amazing, but commercialization looms;

- FAIR is an inspiration;

- lmcinnes with UMAP and HDBSCAN as well;

spark-in.me/post/2018

PS

I also wrote a bit in Russian, with some local specifics, if that is more convenient for you

tinyletter.com/snakers41/letters/spark-in-me-2018

#data_science

#deep_learning

Spark in me - annual retrospective 2018

Spark in me - annual retrospective 2018 Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), December 19, 2018

DS/ML digest 32

Highlights:

- A way to replace softmax in NMT;

- Large visual reasoning dataset;

- PyText;

spark-in.me/post/2018_ds_ml_digest_32

#digest

#deep_learning

#data_science

2018 DS/ML digest 32

2018 DS/ML digest 32 Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), December 10, 2018

Simpson's paradox

Nice explanation

towardsdatascience.com/simpsons-paradox-and-interpreting-data-6a0443516765
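
To see the paradox in numbers, here is a small sketch using the widely quoted kidney-stone treatment figures (not necessarily the example from the linked article). Treatment A wins inside each stone-size group, yet treatment B looks better once the groups are pooled, because A was given mostly to the harder (large stone) cases.

```python
# (successes, patients) per treatment and stone size.
outcomes = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(successes, total):
    return successes / total


def overall(groups):
    # Pool all subgroups of one treatment before computing the success rate.
    successes = sum(s for s, _ in groups.values())
    total = sum(t for _, t in groups.values())
    return rate(successes, total)


for size in ("small", "large"):
    a = rate(*outcomes["A"][size])
    b = rate(*outcomes["B"][size])
    print(f"{size} stones: A={a:.1%}  B={b:.1%}")

print(f"overall:      A={overall(outcomes['A']):.1%}  B={overall(outcomes['B']):.1%}")
```

The takeaway is the usual one: check how the groups were formed before trusting an aggregated rate.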

#data_science

Simpson’s Paradox and Interpreting Data

The challenge of finding the right view through data


snakers4 (Alexander), December 09, 2018

DS/ML digest 31

Highlights of the week:

- PyTorch 1.0 released;

- Drawing with GANs;

- BERT explained;

spark-in.me/post/2018_ds_ml_digest_31

#digest

#deep_learning

#data_science

2018 DS/ML digest 31

2018 DS/ML digest 31 Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), December 02, 2018

A cheeky ML/DS themed sticker pack for our channel

Thanks to @birdborn for his art.

You are welcome to use it:

t.me/addstickers/ML_spark_in_me_by_BB

If you would like to contribute / create your own stickers - please ask around in our channel chat.

#data_science

snakers4 (Alexander), November 28, 2018

DS/ML digest 30

spark-in.me/post/2018_ds_ml_digest_30

#digest

#deep_learning

#data_science

2018 DS/ML digest 30

2018 DS/ML digest 30 Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), November 23, 2018

Jupyter extensions

Looks like they are nearing the end of their support.

Alas.

On a fresh build you will need this

conda install notebook=5.6

To use them.

Will need to invest some time into making Jupyter Lab actually usable.

#data_science

snakers4 (Alexander), November 22, 2018

Our victory in CFT-2018 competition

TLDR

- Multi-task learning + seq2seq models rule;

- The domain seems to be easy, but it is not;

- You can also build a pipeline based on manual features, but it will not be task agnostic;

- Loss weighting is crucial for such tasks;

- Transformer trains 10x longer;

spark-in.me/post/cft-spelling-2018

#nlp

#deep_learning

#data_science

Winning a CFT 2018 spelling correction competition

Building a task-agnostic seq2seq pipeline on a challenging domain Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), November 15, 2018

DS/ML digest 29

spark-in.me/post/2018_ds_ml_digest_29

#digest

#deep_learning

#data_science

2018 DS/ML digest 29

2018 DS/ML digest 29 Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), November 10, 2018

Towards Data Science

Our article was accepted to their publication:

- towardsdatascience.com/building-client-routing-semantic-search-in-the-wild-14db04687c7e

Also, once you have published there, you can publish your work on TDS on a recurring basis =)

I doubt that this will be properly distributed to all 130k of their subs, but nevertheless this is a milestone.

#data_science

Building client routing / semantic search in the wild

A comparison of novel NLP techniques within an applied business setting


snakers4 (Alexander), November 06, 2018

DS/ML digest 28

Google open sources pre-trained BERT ... with 102 languages ...

spark-in.me/post/2018_ds_ml_digest_28

#digest

#deep_learning

#data_science

2018 DS/ML digest 28

2018 DS/ML digest 28 Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), November 03, 2018

Building client routing / semantic search and clustering arbitrary external corpuses at Profi.ru

A brief executive summary about what we achieved at Profi.ru.

If you have similar experience or have anything similar to share - please do not hesitate to contact me.

Also we are planning to extend this article into a small series, if it gains momentum. So please like / share the article if you like it.

spark-in.me/post/profi-ru-semantic-search-project

#nlp

#data_science

#deep_learning

Building client routing / semantic search and clustering arbitrary external corpuses at Profi.ru

Building client routing / semantic search and clustering arbitrary external corpuses at Profi.ru Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), October 23, 2018

DS/ML digest 27

NLP in the focus again!

spark-in.me/post/2018_ds_ml_digest_27

Also your humble servant learned how to do proper NMT =)

#digest

#deep_learning

#data_science

2018 DS/ML digest 27

2018 DS/ML digest 27 Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), October 22, 2018

Amazing articles about image hashing

Also a python library

- Library github.com/JohannesBuchner/imagehash

- Articles:

fullstackml.com/wavelet-image-hash-in-python-3504fdd282b5

www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html

www.hackerfactor.com/blog/index.php?/archives/529-Kind-of-Like-That.html
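
The simplest idea of this family, the average hash, fits in a few lines. Below is a simplified sketch, not the imagehash library's implementation: it assumes the image has already been downscaled to a tiny grayscale grid (real code would do the resize and grayscale conversion first), and the toy 4x4 "images" are invented for illustration.

```python
def average_hash(pixels):
    """Return a bit string: 1 where a pixel is brighter than the image mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return "".join("1" if value > mean else "0" for value in flat)


def hamming_distance(hash_a, hash_b):
    """Number of differing bits; a small distance suggests similar images."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


# Toy 4x4 grayscale image (values 0-255): dark left half, bright right half.
image = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]
# Same picture, uniformly brightened: the hash should not change.
brighter = [[min(255, v + 30) for v in row] for row in image]
# Mirrored picture: every bit flips.
mirrored = [list(reversed(row)) for row in image]

h_image = average_hash(image)
h_brighter = average_hash(brighter)
h_mirrored = average_hash(mirrored)

print(hamming_distance(h_image, h_brighter), hamming_distance(h_image, h_mirrored))
```

Comparing hashes by Hamming distance is what makes this usable for near-duplicate search: a brightness change keeps the distance at zero, while a different picture moves it far away.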

#data_science

#computer_vision

JohannesBuchner/imagehash

A Python Perceptual Image Hashing Module. Contribute to JohannesBuchner/imagehash development by creating an account on GitHub.


Text iterators in PyTorch

Looks like PyTorch has some handy data-processing / loading tools for text models - torchtext.readthedocs.io.

It is explained here - bastings.github.io/annotated_encoder_decoder/ - how to use them with pack_padded_sequence and pad_packed_sequence to boost PyTorch NLP models substantially.

#nlp

#deep_learning

snakers4 (Alexander), October 15, 2018

An Open source alternative to Mendeley

Looks like Zotero is also cross-platform and open-source.

Also you can import the whole Mendeley library with 1 button push:

www.zotero.org/support/kb/mendeley_import

#data_science

kb:mendeley import [Zotero Documentation]

Zotero is a free, easy-to-use tool to help you collect, organize, cite, and share research.


www.youtube.com/watch?v=KJAnSyB6mME

PyTorch developer conference part 1
Sessions on Applied Research in Industry & Developer Education. Talks from @karpathy (@Tesla), @ctnzr (@nvidia), @ftzo (Pyro/@UberEng), @MarkNeumannnn (@alle...

snakers4 (Alexander), October 15, 2018

DS/ML digest 26

More interesting NLP papers / material ...

spark-in.me/post/2018_ds_ml_digest_26

#digest

#deep_learning

#data_science

2018 DS/ML digest 26

2018 DS/ML digest 26 Author's articles - http://spark-in.me/author/snakers41 Blog - http://spark-in.me


snakers4 (Alexander), October 10, 2018

Another set of links for common crawl for NLP

Looks like we were not the first, ofc.

Below are some projects dedicated to NLP corpus retrieval at scale:

- Java + license detection + boilerplate removal: zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/DKPro%20C4Corpus/org.dkpro.c4corpus$dkpro-c4corpus-doc/doclinks/1/

- Prepared deduplicated CC text archives data.statmt.org/ngrams/deduped/

- Google group

groups.google.com/forum/#!topic/common-crawl/6F-yXsC35xM

Wow!

#nlp

Downloading 200GB files in literally hours

(1) Order 500 Mbit/s Internet connection from your ISP

(2) Use aria2 - aria2.github.io/ with -x

(3) Profit

#data_science

snakers4 (Alexander), October 08, 2018

Going from millions of points of data to billions on a single machine

In my experience pandas works fine with tables up to 50-100m rows.

Ofc plain indexing / caching (i.e. pre-process all of your data in chunks and index it somehow) and / or clever map/reduce-style optimizations work.

But sometimes it is just good to know that such things exist:

- vaex.io/ for large data-frames + some nice visualizations;

- Datashader.org for large visualizations;

- Also you can use Dask for these purposes I guess jakevdp.github.io/blog/2015/08/14/out-of-core-dataframes-in-python/;
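
The plain "process in chunks, then merge" approach mentioned above can be sketched with the standard library alone. This is only an illustration of the pattern: the helper names, the chunk size and the tiny in-memory CSV are made up.

```python
import csv
import io
from collections import Counter
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most `chunk_size` rows from any iterator."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def count_by_key(chunk, key):
    """Map step: aggregate a single chunk on its own."""
    return Counter(row[key] for row in chunk)


def chunked_counts(rows, key, chunk_size=100_000):
    """Reduce step: merge per-chunk partial results into one total."""
    total = Counter()
    for chunk in iter_chunks(rows, chunk_size):
        total.update(count_by_key(chunk, key))
    return total


# Tiny in-memory CSV standing in for a file that does not fit in RAM.
raw = "city,amount\nMoscow,1\nKazan,2\nMoscow,3\nOmsk,4\nKazan,5\n"
reader = csv.DictReader(io.StringIO(raw))
city_counts = chunked_counts(reader, key="city", chunk_size=2)
print(city_counts)
```

With pandas the same pattern is usually written by passing chunksize to read_csv and merging the per-chunk aggregates.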

#data_science

Python3 nvidia driver bindings in glances

They used to have only python2 ones.

If you update your drivers and glances, you will get a nice GPU memory / load indicator within glances.

So convenient.

#linux
