How to compare two datasets of lexical density?

Data Science Asked by Samir Ahmane on February 11, 2021

I have two datasets from two different texts, each representing lexical density as a proportion based on a corpus. Both datasets are represented in the images below. Now, suppose I want to know which text has the more uncommon vocabulary. How should I proceed? What statistics should I use? Should it be a Student's t-test or a Wilcoxon signed-rank test? I'm lost on this one, and I don't want to apply inference blindly. I am using the Python library wordfreq to get word frequency data.
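To make the setup concrete, here is a minimal sketch of one possible comparison, not a definitive recommendation. It assumes the two texts are independent samples, which would rule out a paired test such as the Wilcoxon signed-rank test. It swaps in a simple two-sided permutation test on the difference in means, which avoids the normality assumption of a t-test. The function name `permutation_test` and the sample values are hypothetical placeholders; in practice the values might be per-word scores such as those returned by `wordfreq.zipf_frequency(word, 'en')`.

```python
import random
from statistics import mean


def permutation_test(a, b, n_resamples=2000, seed=0):
    """Two-sided permutation test for a difference in means
    between two independent samples (hypothetical helper)."""
    rng = random.Random(seed)
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    n_a = len(a)
    extreme = 0
    for _ in range(n_resamples):
        # Reassign the group labels at random and recompute the difference.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value


# Made-up placeholder scores, NOT real data: one frequency score per word,
# where a lower score means a rarer word.
text_a = [5.1, 4.8, 3.2, 6.0, 2.9, 4.4, 3.7, 5.5]
text_b = [3.0, 2.5, 4.1, 2.2, 3.3, 2.8, 3.9, 2.6]

observed, p_value = permutation_test(text_a, text_b)
print(f"difference in means: {observed:.2f}, p-value: {p_value:.4f}")
```

Under these assumptions, a positive difference would suggest that the second text uses rarer words on average, and the p-value indicates how often a difference at least that large arises when the labels are shuffled at random.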

data cleaning dataset hypothesis testing nlp statistics
