
Sampling methods for Text datasets (NLP)

Data Science Asked by Aaditya ura on October 21, 2020

I am working with two text datasets: one has 68k text samples and the other has 100k. I have encoded both datasets into BERT embeddings.

Text sample > 'I am working on NLP' ==> bert encoding ==> [0.98, 0.11, 0.12....nth]
               # raw text 68k                              # bert encoding [68000, 1024]

I want to try different custom NLP models on these embeddings, but the datasets are too large to evaluate each model's performance quickly.

To compare models quickly, my plan is to draw a small subset from each full dataset, run the candidate algorithms on it, and then fit only the top-performing algorithms on the entire dataset.

I am planning to draw a subset of at least 10k samples from the 68k dataset and another 10k subset from the 100k dataset. I could simply select 10k samples at random from the 68k, but I suspect that is not the best way to sample.
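For reference, the plain random baseline I mean could be sketched roughly as below. This is only a minimal illustration using the standard library; the helper name `sample_indices` and the seed values are my own placeholders, not part of any existing pipeline.

```python
import random

def sample_indices(n_total, n_subset, seed=0):
    """Draw n_subset distinct row indices uniformly at random from range(n_total)."""
    if n_subset > n_total:
        raise ValueError("subset size cannot exceed dataset size")
    rng = random.Random(seed)  # fixed seed so the same subset can be redrawn later
    # Sample without replacement, then sort so rows keep their original order.
    return sorted(rng.sample(range(n_total), n_subset))

# Sizes from the question: 10k out of 68k, and 10k out of 100k.
idx_68k = sample_indices(68_000, 10_000, seed=0)
idx_100k = sample_indices(100_000, 10_000, seed=1)
```

Sampling indices instead of rows keeps the raw text and its embedding aligned: the same index list can select from both, for example `embeddings[idx_68k]` if the embeddings are held in a NumPy array.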

Any advice on how to sample the text embeddings from the 68k samples while preserving the probability distribution of the original population? And how many samples would be enough for one subset?

Thank you!
