If you are after a large web corpus for any language other than English, and you do not want to scrape anything yourself, buy proxies, or learn that somewhat shady "stack":
For Russian, the Araneum link posted above points to already-processed data, which may be useless for certain domains.
What to do?
For Russian, you can write here;
the author will share her 90+ GB raw corpus with you.
For any other language, there is a second way:
- Go to the Common Crawl website;
- Download the index (about 200 GB);
- Pick the domains in your country / language (the index now also includes language detection);
- Download only the plain-text (WET) files you need.
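The steps above can be sketched in code. This is a minimal example, assuming the public Common Crawl CDX index API at index.commoncrawl.org and range requests against data.commoncrawl.org; the crawl ID and domain pattern are placeholders you would replace with your own.

```python
# Sketch: query the Common Crawl CDX index for one domain pattern,
# then range-request individual gzipped WARC records.
import gzip
import json
import urllib.parse
import urllib.request

INDEX_HOST = "https://index.commoncrawl.org"
DATA_HOST = "https://data.commoncrawl.org"


def index_query_url(crawl_id: str, url_pattern: str) -> str:
    """Build a CDX index query URL for one crawl and one URL pattern."""
    return (f"{INDEX_HOST}/{crawl_id}-index"
            f"?url={urllib.parse.quote(url_pattern, safe='')}&output=json")


def fetch_records(crawl_id: str, url_pattern: str):
    """Yield index records (dicts with 'filename', 'offset', 'length')."""
    with urllib.request.urlopen(index_query_url(crawl_id, url_pattern)) as resp:
        for line in resp:
            yield json.loads(line)


def fetch_warc_record(record: dict) -> bytes:
    """Range-request one gzipped record from the crawl data and decompress it."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    req = urllib.request.Request(
        f"{DATA_HOST}/{record['filename']}",
        headers={"Range": f"bytes={start}-{end}"},
    )
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read())


# Usage (network access required; crawl ID and pattern are illustrative):
#   for rec in fetch_records("CC-MAIN-2023-50", "*.example.ru/*"):
#       text = fetch_warc_record(rec)
#       ...
```

This downloads only the records you ask for instead of whole crawl archives, which is what makes the per-domain approach feasible on a normal connection.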
Links to start with:
An open-source corpus for machine learning.