I guess PyTorch is in the bottom left corner, but realistically the author of this snippet did a lot of import A as B
Google's super resolution zoom
Finally Google made something interesting
Mixed precision distributed training ImageNet example in PyTorch
An open-source alternative to Mendeley
Looks like Zotero is also cross-platform and open-source.
Also, you can import your whole Mendeley library with one button push:
Zotero is a free, easy-to-use tool to help you collect, organize, cite, and share research.
Another set of links on Common Crawl for NLP
Looks like we were not the first, ofc.
Below are some projects dedicated to NLP corpus retrieval at scale:
- Java + license detection + boilerplate removal: zoidberg.ukp.informatik.tu-darms
- Prepared deduplicated CC text archives: data.statmt.org/ngrams/deduped/
- Google group
Downloading 200GB files in literally hours
(1) Order a 500 Mbit/s Internet connection from your ISP;
(2) Use aria2 (aria2.github.io) with the -x flag (multiple connections per server) - see the sketch below.
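If you want to kick the download off from a script, here is a minimal sketch (the URL is a placeholder; -x sets the number of connections per server):

import subprocess

# download with up to 16 parallel connections via aria2c
subprocess.run(['aria2c', '-x', '16', 'https://example.com/big_archive.warc.gz'],
               check=True)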
A small continuation of the crawling saga
2 takes on the Common Crawl
It turned out to be a bit tougher than expected
Going from millions of points of data to billions on a single machine
In my experience pandas works fine with tables up to 50-100m rows.
Ofc plain indexing/caching (i.e. pre-process all of your data in chunks and index it somehow) and/or clever map/reduce-style optimizations also work (a small sketch follows the list below).
But sometimes it is just good to know that such things exist:
- vaex.io/ for large data-frames + some nice visualizations;
- Datashader.org for large visualizations;
- Also, you can probably use Dask for these purposes: jakevdp.github.io/blog/2015/08/1
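As promised, a minimal sketch of the plain chunked pre-processing idea (the file name and columns are hypothetical):

import pandas as pd

# process a huge CSV in fixed-size chunks instead of loading it all at once
partial_sums = []
for chunk in pd.read_csv('huge_table.csv', chunksize=1_000_000):
    # reduce each chunk to a small summary, then combine the summaries
    partial_sums.append(chunk.groupby('key')['value'].sum())

totals = pd.concat(partial_sums).groupby(level=0).sum()
print(totals.head())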
Python 3 NVIDIA driver bindings in glances
They used to have only Python 2 ones.
If you update your drivers and glances, you will get a nice GPU memory / load indicator within glances.
Wiki graph database
Just found out that Wikipedia also provides this
May be useful for research in future.
Seems very theoretical and probably works only for English, but it is best to keep such things on the radar.
Example queries:
- People who were born in Berlin before 1900
- German musicians with German and English descriptions
- Musicians who were born in Berlin
PCIE risers that REALLY WORK for DL
Thermaltake TT Premium PCIE 3.0 extender.
All the others I tried were crap.
Monkey patching a PyTorch model
Well, ideally you should not do this.
But sometimes you just need to quickly test something and amend your model on the fly.
import functools

def rsetattr(obj, attr, val):
    # set a nested attribute by its dotted path, e.g. 'path.to.some.block'
    pre, _, post = attr.rpartition('.')
    return setattr(rgetattr(obj, pre) if pre else obj, post, val)

def rgetattr(obj, attr, *args):
    # get a nested attribute by its dotted path
    def _getattr(obj, attr):
        return getattr(obj, attr, *args)
    return functools.reduce(_getattr, [obj] + attr.split('.'))

for old_module_path, old_module_object in model.named_modules():
    # (in practice you would only replace the modules you actually target)
    # replace an old object with the new one,
    # copy some settings and its state
    new_module = SomeOtherClass(old_module_object.some_settings,
                                ...)
    rsetattr(model, old_module_path, new_module)
The above code essentially does the same as:
model.path.to.some.block = some_other_block
Amazingly simple code to mimic fastText's n-gram subword routine
Nuff said. Try it for yourself.
from fastText import load_model
model = load_model('official_fasttext_wiki_200_model')
def find_ngrams(string, n):
    # plain character n-grams of length n
    ngrams = zip(*[string[i:] for i in range(n)])
    ngrams = [''.join(_) for _ in ngrams]
    return ngrams

string = 'грёзоблаженствующий'

# fastText wraps each word in '<' and '>' boundary symbols and
# takes character n-grams of lengths 3 to 6 by default
ngrams = []
for i in range(3, 7):
    ngrams.extend(find_ngrams('<' + string + '>', i))

ft_ngrams, ft_indexes = model.get_subwords(string)
ngrams = set(ngrams)
ft_ngrams = set(ft_ngrams)
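To see how close this simple routine gets, just diff the two sets (note that for in-vocabulary words get_subwords also returns the word itself):

print(ngrams - ft_ngrams)
print(ft_ngrams - ngrams)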
Parsing Wikipedia in 4 plain commands in Python
Wrote a small take on using Wikipedia as a corpus for NLP.
Please like / share / repost the article =)
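The article has the details; as a rough illustration of the idea, here is a sketch using gensim's WikiCorpus (not necessarily the exact commands from the article, and the dump file name is a placeholder):

from gensim.corpora import WikiCorpus

# stream plain-text articles straight out of a Wikipedia XML dump;
# passing an empty dict skips building a gensim dictionary
wiki = WikiCorpus('ruwiki-latest-pages-articles.xml.bz2', dictionary={})
for tokens in wiki.get_texts():
    print(' '.join(tokens)[:200])
    break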
Andrew Ng book
Looks like its draft is finished.
It describes in plain terms how to build ML pipelines:
If you are mining for a large web-corpus
... for any language other than English, and you do not want to scrape anything, buy proxies, or learn this somewhat shady "stack".
In the case of Russian, the araneum link posted above already has pre-processed data, which may be useless for certain domains.
What to do?
In the case of Russian, you can write here.
The author will share her 90+ GB raw corpus with you.
In the case of any other language, there is a second way:
- Go to common crawl website;
- Download the index (200 GB);
- Choose domains in your country / language (now they also have language detection);
- Download only the plain-text files you need (see the sketch below);
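A rough sketch of the last two steps, assuming the per-shard URL index files (the file name, field names and the language code are assumptions - check the current index docs):

import gzip
import json

# filter one shard of the Common Crawl URL index for Russian pages
records = []
with gzip.open('cdx-00000.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        # each line looks like '<urlkey> <timestamp> <json payload>'
        payload = line.split(' ', 2)[-1]
        rec = json.loads(payload)
        if 'rus' in rec.get('languages', ''):
            # filename / offset / length are enough for a ranged download later
            records.append((rec['filename'], rec['offset'], rec['length']))

print(len(records), 'candidate records')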
Links to start with
An open-source corpus for machine learning.
New fast.ai course
Mainly decision tree practice.
A lot about decision tree visualization
I personally would check out the visualization bits.
At least it looks like they are not pushing their crappy library =)
The problem with any such visualizations is that they work only for toy datasets.
The drop / shuffle method (i.e. drop or shuffle a feature and see how much the metric changes) seems to be more robust - see the sketch below.
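For reference, a minimal sketch of what the drop / shuffle check looks like (sklearn toy data here, just for illustration):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# shuffle one feature at a time and measure the drop in validation accuracy
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
baseline = accuracy_score(y_val, clf.predict(X_val))

for col in range(X_val.shape[1]):
    X_shuffled = X_val.copy()
    np.random.shuffle(X_shuffled[:, col])  # destroy the information in one column
    drop = baseline - accuracy_score(y_val, clf.predict(X_shuffled))
    print('feature %d: importance ~ %.4f' % (col, drop))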
Araneum russicum maximum
TLDR - the largest corpus of the Russian Internet. fastText embeddings pre-trained on this corpus work best for broad Internet-related domains.
Pre-processed version can be downloaded from rusvectores.
Afaik, this link is not yet on their website (?)
DS/ML digest 24
Key topics of this one:
- New method to calculate phrase/n-gram/sentence embeddings for rare and OOV words;
- So many releases from Google;
If you like our digests, you can support the channel via:
- Sharing / reposting;
- Giving an article a decent comment / a thumbs-up;
- Buying me a coffee (links on the digest);
Using sklearn pairwise cosine similarity
On a 7k x 7k example with 300-dimensional vectors it turned out to be MUCH faster than doing the same:
- In 10 processes;
- Using numba;
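For reference, the call itself is a one-liner (random data here, shapes as in the example above):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# two sets of 7k 300-dimensional vectors (random data for illustration)
a = np.random.rand(7000, 300)
b = np.random.rand(7000, 300)

# one vectorized call produces the full 7000 x 7000 similarity matrix
sims = cosine_similarity(a, b)
print(sims.shape)  # (7000, 7000)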
The more you know.
If you have used it - please PM me.