January 25, 12:31

Downsides of using Common Crawl

I took a look at the Common Crawl data I pre-processed myself last year and could not find any abstracts, only sentences.

I also took a look at the archives at data.statmt.org/ngrams/deduped/. They also contain only sentences, though the sentences sometimes seem to be in logical order.
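
A rough sketch of how this kind of check could be scripted, not the procedure used above. It assumes the data is plain text with one record per line, and it uses a naive sentence splitter (end punctuation followed by whitespace and a capital letter). The function names and the sample lines are made up for illustration.

```python
import re

# Naive sentence boundary: '.', '!' or '?' followed by whitespace
# and an uppercase letter. This is an assumption, not a real tokenizer.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def sentences_per_line(lines):
    """Return the naive sentence count for each non-empty line."""
    counts = []
    for line in lines:
        line = line.strip()
        if line:
            counts.append(len(_BOUNDARY.split(line)))
    return counts


def share_multi_sentence(lines):
    """Fraction of non-empty lines that hold more than one sentence."""
    counts = sentences_per_line(lines)
    if not counts:
        return 0.0
    return sum(1 for c in counts if c > 1) / len(counts)


# Hypothetical sample: one abstract-like line among single sentences.
sample = [
    "The cat sat on the mat.",
    "We propose a new method. It improves accuracy. Code is available.",
    "",
    "Another single sentence",
]

print(sentences_per_line(sample))
print(share_multi_sentence(sample))
```

A share close to zero on a large sample would suggest the dump holds one sentence per line, with no abstract-like passages.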

So you can use any form of CC, but only to learn word representations, not sentence-level ones.