Computer science corpus for training a language model

Question

I am looking for a domain-specific computer science corpus of at least 20M words (preferably >50M words) for the purpose of training a language model on it.

Is there anything out of the box that I could use?
*I tried to look for the SciBERT corpus but could not find how to access it.

Thanks!

gust · Answer

It depends on the domain and language, but I'll share an example that can be adapted.

The English version of the Wikipedia corpus contains more than 1.9 billion words from 4.4 million articles.

You can create virtual corpora from the full corpus that contain only topics of interest, such as biology, investments, Buddhism, psychology, cars, basketball, etc.
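As a rough illustration of the virtual-corpus idea, here is a minimal sketch in Python. It assumes you have already extracted the articles into plain-text (title, text) pairs; the keyword set, the `min_hits` threshold, and all function names are hypothetical and only for illustration. A real filter would more likely rely on Wikipedia categories than on keyword matching.

```python
import re
from typing import Iterable, Iterator, Set, Tuple

# Hypothetical seed terms for "computer science"; tune or replace these.
CS_KEYWORDS: Set[str] = {"algorithm", "compiler", "database", "programming", "software"}

TOKEN_RE = re.compile(r"[a-z]+")


def is_on_topic(text: str, keywords: Set[str] = CS_KEYWORDS, min_hits: int = 2) -> bool:
    """Return True if at least `min_hits` distinct keywords occur in the text."""
    tokens = set(TOKEN_RE.findall(text.lower()))
    return len(tokens & keywords) >= min_hits


def build_virtual_corpus(
    articles: Iterable[Tuple[str, str]], min_hits: int = 2
) -> Iterator[Tuple[str, str]]:
    """Yield only the (title, text) pairs that pass the keyword filter."""
    for title, text in articles:
        if is_on_topic(text, min_hits=min_hits):
            yield title, text


def count_words(corpus: Iterable[Tuple[str, str]]) -> int:
    """Count whitespace-separated words, to check against a size target such as 20M."""
    return sum(len(text.split()) for _, text in corpus)
```

You would pass your extracted articles to `build_virtual_corpus` and then call `count_words` on the result to see whether the subset reaches the size you need; lowering `min_hits` or widening the keyword set trades precision for corpus size.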
