Data Science Asked on June 18, 2021
I am looking for a domain-specific computer science corpus of at least 20M words (preferably >50M words) to train a language model on.
Is there anything out-of-the box that I could use?
I tried to find the SciBERT corpus but could not figure out how to access it.
Thanks!
It depends on the domain and language, but I'll share an example you can adapt.
The English version of the Wikipedia corpus contains more than 1.9 billion words from 4.4 million articles.
You can create virtual corpora from the full corpus that contain only topics of interest, such as biology, investments, Buddhism, psychology, cars, basketball, etc.
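To make the "virtual corpus" idea concrete, here is a rough sketch of one way to do the filtering yourself. It assumes you already have the articles as `(title, text)` pairs (for example, extracted from a Wikipedia dump, which is not shown here). The keyword list, the `min_hits` threshold, and all function names are my own illustrative choices, not part of any existing tool, and plain keyword matching is only a crude stand-in for proper topic or category selection.

```python
import re
from typing import Iterable, Iterator, List, Set, Tuple

Article = Tuple[str, str]  # (title, text)

# Hypothetical keyword list for a computer science sub-corpus; adjust to your domain.
CS_KEYWORDS: Set[str] = {
    "algorithm", "compiler", "database", "operating system", "programming language",
}


def count_words(text: str) -> int:
    """Count word-like tokens (runs of letters, digits, or underscores)."""
    return len(re.findall(r"\w+", text))


def select_articles(
    articles: Iterable[Article], keywords: Set[str], min_hits: int = 2
) -> Iterator[Article]:
    """Yield articles whose text mentions the keywords at least `min_hits` times in total."""
    for title, text in articles:
        lowered = text.lower()
        hits = sum(lowered.count(keyword) for keyword in keywords)
        if hits >= min_hits:
            yield title, text


def build_virtual_corpus(
    articles: Iterable[Article], keywords: Set[str], min_hits: int = 2
) -> Tuple[List[Article], int]:
    """Return the selected articles and their total word count."""
    selected = list(select_articles(articles, keywords, min_hits))
    total_words = sum(count_words(text) for _, text in selected)
    return selected, total_words
```

Run it over the full article stream and check the returned word count against your 20M to 50M target; if it falls short, widen the keyword list or lower `min_hits`.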
Answered by gust on June 18, 2021