Asked by Tt22 on December 6, 2021
Any help on finding the biggest freely available English corpus that can be used for research?
So far I have found the OANC, with about 15 million words.
Common Crawl crawls the web and freely provides its archives and datasets to the public. Its web archive consists of petabytes of data (a few billion pages) collected since 2011, with new crawls completed roughly every month.
You can also find versions of it that are already cleaned, de-duplicated and split by language.
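For example, here is a minimal sketch of streaming one such cleaned derivative; the Hugging Face `datasets` library and the `allenai/c4` dataset are my own choices for illustration, not something Common Crawl itself provides:

```python
# Minimal sketch: stream a cleaned, de-duplicated Common Crawl derivative (C4)
# without downloading the whole multi-terabyte dataset.
# Assumes `pip install datasets`; "allenai/c4" is just one example derivative.
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, example in enumerate(c4):
    print(example["text"][:200])  # first 200 characters of each document
    if i >= 2:                    # stop after a few documents
        break
```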
Answered by Adam Bittlingmayer on December 6, 2021
What about the 1 Billion Word Language Model Benchmark? It is freely available for download.
You might also find this Reddit thread useful for links to other corpora.
Answered by hafiz031 on December 6, 2021
The COBUILD corpus (18M tokens) is available through WebCelex, if the arcane user interface isn't a deal-breaker. It's valuable more for its extensive manual annotations than its size, with quite a lot of morphological and phonological information available.
(It's smaller than most of the others listed here, but seems worth mentioning, since it's larger than the OANC mentioned in the question and is well-annotated.)
Answered by Draconis on December 6, 2021
I found the Exquisite Corpus, and it's freely available. Details of the sources can be seen here. I don't know the exact size, but it's on the scale of billions of words.
Answered by ofou on December 6, 2021
Sketch Engine, a corpus manager and text analysis tool, provides a few corpora with open access for research at https://app.sketchengine.eu/#open The largest freely available English corpus there is the ACL Anthology Reference Corpus, with 62 million words.
Alternatively, you can try Sketch Engine's 30-day free trial and search one of the biggest English corpora currently in existence, with over 35 billion words; see https://www.sketchengine.eu/timestamped-english-corpus/ for details.
Answered by Rodrigo on December 6, 2021
Westbury Lab provides a ~1-billion-word Wikipedia dump, built in 2010 from all articles with more than 1,000 words: http://www.psych.ualberta.ca/~westburylab/downloads/westburylab.wikicorp.download.html
BYU has a larger dump (1.9 billion words) from 2014, but it's not available for download.
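If you download the Westbury Lab corpus, a rough token count is easy to get with a short script. This is only a sketch: it assumes a bzip2-compressed plain-text file, and the file name below is illustrative (check the download page for the actual one):

```python
# Rough whitespace-token count over a downloaded plain-text corpus.
# The file name is illustrative; adjust it to whatever the download page provides.
import bz2

total_tokens = 0
with bz2.open("westburylab.wikicorp.txt.bz2", mode="rt",
              encoding="utf-8", errors="replace") as f:
    for line in f:
        total_tokens += len(line.split())  # crude whitespace tokenization

print(f"approximate token count: {total_tokens:,}")
```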
Answered by Jeremy Salwen on December 6, 2021
Can't beat the Global Web-Based English Corpus proposed by robert, but here is another big one:
A Wikipedia dump is also huge ...
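If you work from a raw dump, the XML has to be reduced to plain text first. Below is a minimal standard-library sketch that streams article text out of a pages-articles dump; the file name is illustrative, and the wiki markup is left untouched (a dedicated tool such as WikiExtractor is the usual choice for proper cleaning):

```python
# Minimal sketch: stream raw article text out of a Wikipedia pages-articles dump.
# The file name is illustrative; wiki markup is NOT stripped here.
import bz2
import xml.etree.ElementTree as ET

def iter_article_text(path):
    with bz2.open(path, "rb") as f:
        # iterparse streams the XML so the whole dump never sits in memory
        for _, elem in ET.iterparse(f, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
            if tag == "text" and elem.text:
                yield elem.text
            elem.clear()  # release elements we are done with

if __name__ == "__main__":
    for n, text in enumerate(iter_article_text("enwiki-latest-pages-articles.xml.bz2")):
        print(text[:200])
        if n >= 2:
            break
```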
Answered by jk - Reinstate Monica on December 6, 2021