September 29, 2018

If you are mining for a large web corpus

... for any language other than English, and you do not want to scrape anything yourself, buy proxies, or learn that somewhat shady "stack".

For Russian, the Araneum link posted above points to already processed data, which may be useless for certain domains.

What to do?

(1)

For Russian, you can write here:

tatianashavrina.github.io/taiga_site/

The author will share her 90+ GB raw corpus with you.

(2)

For any other language, there is a second way:

- Go to the Common Crawl website;

- Download the index (200 GB);

- Choose domains in your country / language (now they also have language detection; a filtering sketch follows this list);

- Download only the plain-text files you need (a record-fetching sketch follows the links below).
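
For the filtering step, here is a minimal sketch of what a local query against the columnar index could look like. It assumes you have already downloaded some of the index's parquet files; the column names (url_host_tld, content_languages, warc_filename, warc_record_offset, warc_record_length) are taken from the columnar-format post linked below, so verify the exact schema against the files you get.

```python
# A sketch, not a recipe: filter locally downloaded parquet files of the
# Common Crawl columnar index down to Russian pages. Column names are an
# assumption based on the columnar-format blog post; check your copy.
import pyarrow.dataset as ds

# Point at a directory of downloaded cc-index parquet files (local path
# is a placeholder).
index = ds.dataset("cc-index-parquet/", format="parquet")

# Keep only rows on the .ru TLD or detected as Russian, and only the
# columns needed to locate the underlying records later.
table = index.to_table(
    columns=["url", "warc_filename", "warc_record_offset", "warc_record_length"],
    filter=(ds.field("url_host_tld") == "ru")
           | (ds.field("content_languages") == "rus"),
)
print(f"{table.num_rows} candidate records")
```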

Links to start with

- commoncrawl.org/connect/blog/

- commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/

- www.slideshare.net/RobertMeusel/mining-a-large-web-corpus
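
Once the index gives you a filename plus a byte offset and record length, each record can be fetched individually with an HTTP range request, so you pay bandwidth only for the pages you selected. A sketch, assuming the data.commoncrawl.org mirror and a filename/offset/length triple taken from an index row:

```python
# A sketch: fetch a single gzipped WARC record by byte range, given the
# warc_filename / offset / length triple from the index.
import gzip
import io

import requests

def fetch_record(filename: str, offset: int, length: int) -> str:
    """Download and decompress one WARC record via an HTTP range request."""
    url = "https://data.commoncrawl.org/" + filename
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    resp = requests.get(url, headers=headers, timeout=60)
    resp.raise_for_status()
    # Each record in the file is an independent gzip member, so the
    # returned byte range decompresses on its own.
    with gzip.open(io.BytesIO(resp.content), "rt", errors="replace") as f:
        return f.read()

# Placeholder values; take real ones from an index row:
# print(fetch_record("crawl-data/CC-MAIN-.../example.warc.gz", 12345, 6789))
```

Common Crawl also publishes plain-text (WET) extracts alongside the raw WARCs, which is what the "plain-text files" step above refers to.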

#nlp

Taiga is a corpus where text sources and their meta-information are collected according to popular ML tasks.

An open-source corpus for machine learning.