TransWikia.com

Computer science corpus for training a language model

Data Science Asked on June 18, 2021

I am looking for a domain-specific computer science corpus of at least 20M words (preferably >50M words) to train a language model on.

Is there anything out of the box that I could use?
I looked for the SciBERT corpus, but could not find how to access it.

Thanks!

One Answer

It depends on the domain and language, but here is an example you can adapt.

The English version of the Wikipedia corpus contains more than 1.9 billion words from 4.4 million articles.

You can create virtual corpora from the full corpus that contain only the topics of interest, such as biology, investments, Buddhism, psychology, cars, basketball, etc. For your case, that would mean selecting only the computer science articles.
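As a rough illustration of the idea, here is a minimal Python sketch that filters a stream of articles down to a topic-specific subset. It assumes you already have the articles as (title, text) pairs, for example extracted from a Wikipedia dump. The keyword list, the `min_hits` threshold, and the function names are all hypothetical and only for illustration; a real pipeline would more likely select articles by category or with a trained classifier.

```python
import re
from typing import Iterable, Iterator, Set, Tuple

# Hypothetical seed terms for the target domain; tune these for your own use.
CS_KEYWORDS: Set[str] = {"algorithm", "compiler", "programming", "software", "database"}

WORD_RE = re.compile(r"[a-z]+")


def is_on_topic(text: str, keywords: Set[str] = CS_KEYWORDS, min_hits: int = 2) -> bool:
    """Return True if at least `min_hits` distinct keywords occur in the text."""
    words = set(WORD_RE.findall(text.lower()))
    return len(words & keywords) >= min_hits


def build_virtual_corpus(
    articles: Iterable[Tuple[str, str]], target_words: int
) -> Iterator[Tuple[str, str]]:
    """Yield on-topic (title, text) pairs until roughly `target_words` words are collected."""
    total = 0
    for title, text in articles:
        if total >= target_words:
            break
        if is_on_topic(text):
            total += len(text.split())  # whitespace-based word count
            yield title, text


# Toy input standing in for articles read from a real dump.
sample = [
    ("Quicksort", "Quicksort is a sorting algorithm used in software and programming."),
    ("Basketball", "Basketball is a team sport played on a court."),
]
selected = list(build_virtual_corpus(sample, target_words=20_000_000))
print([title for title, _ in selected])
```

Setting `target_words` to 20,000,000 or 50,000,000 stops the selection once the subset reaches the size you asked for. Keyword matching is crude, so inspect a sample of the selected articles before training on them.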

Answered by gust on June 18, 2021
