October 10, 2018

Another set of links on Common Crawl for NLP

Looks like we were not the first, ofc.

Below are some projects dedicated to NLP corpus retrieval at scale:

- Java + license detection + boilerplate removal: zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/DKPro%20C4Corpus/org.dkpro.c4corpus$dkpro-c4corpus-doc/doclinks/1/

- Prepared, deduplicated CC text archives: data.statmt.org/ngrams/deduped/

- Google group thread: groups.google.com/forum/#!topic/common-crawl/6F-yXsC35xM

Wow!

#nlp

Downloading 200GB files in literally hours

(1) Order a 500 Mbit/s Internet connection from your ISP

(2) Use aria2 (aria2.github.io/) with -x, which sets the number of parallel connections per server

(3) Profit
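
A rough sketch of the recipe above, in Python. It does two things: estimates the ideal transfer time for a given link speed, and builds an aria2c command line. Assumptions, none of which come from the post itself: the URL is a placeholder, 16 connections is just the maximum aria2 allows for -x, and pairing -x with -s (split the file into that many pieces) is our own addition. The helper names are hypothetical.

```python
import shlex
from typing import List


def estimate_hours(size_gb: float, link_mbit_s: float) -> float:
    """Ideal transfer time in hours; ignores protocol overhead and server-side limits."""
    size_mbit = size_gb * 1000 * 8  # decimal gigabytes -> megabits
    return size_mbit / link_mbit_s / 3600


def aria2_command(url: str, connections: int = 16) -> List[str]:
    """Build an aria2c argument list.

    -x is --max-connection-per-server (aria2 accepts 1 to 16),
    -s is --split; using the same value for both is an assumption.
    """
    if not 1 <= connections <= 16:
        raise ValueError("aria2 accepts 1 to 16 connections per server")
    return ["aria2c", "-x", str(connections), "-s", str(connections), url]


if __name__ == "__main__":
    # 200 GB over a 500 Mbit/s link, best case
    print(f"{estimate_hours(200, 500):.2f} h")
    # Placeholder URL, replace with the real archive
    print(shlex.join(aria2_command("https://example.com/big-file.xz")))
```

On a perfect 500 Mbit/s link the estimate comes out just under an hour; real servers rarely saturate the link, which is where the parallel connections help.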

#data_science