Data Science Asked on May 30, 2021
I have a dataset with ‘n’ features and corresponding labels (binary in nature). How can I calculate the data distribution and the frequency distribution of this dataset? What is the difference between the two?
If you don't know the definitions, how could you calculate them anyway?
I looked at a few reliable definitions:
https://www.spss-tutorials.com/frequency-distribution-what-is-it/
https://www.statisticshowto.datasciencecentral.com/data-distribution/
http://makemeanalyst.com/observational-studies-and-experiments/population-distribution-sample-distribution-and-sampling-distribution/
The differences are subtle, and sometimes depend on who you ask.
What I conclude is that the frequency (or sample) distribution describes an actual sample: the number of observations counted per bin, possibly with percentages added.
The (population) data distribution is the distribution that you'd expect from the whole population.
For fair coin tosses, the data distribution would be 50/50, though a sample of 10 tosses could give a frequency distribution of 6/4.
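To make the coin example concrete, here is a minimal sketch that counts a simulated sample of 10 tosses and prints it next to the 50/50 population distribution. The function name `toss_frequencies` and the fixed seed are my own choices for illustration; the exact counts you get depend on the seed.

```python
import random
from collections import Counter

def toss_frequencies(n_tosses, seed=0):
    """Simulate fair coin tosses and count heads ("H") and tails ("T")."""
    rng = random.Random(seed)  # fixed seed only so the run is repeatable
    tosses = [rng.choice("HT") for _ in range(n_tosses)]
    return Counter(tosses)

n = 10
sample_counts = toss_frequencies(n)

# Population (data) distribution of a fair coin, known from theory.
population = {"H": 0.5, "T": 0.5}

for side in "HT":
    count = sample_counts[side]
    print(f"{side}: count={count}, sample share={count / n:.2f}, "
          f"population share={population[side]:.2f}")
```

The counts and sample shares are the frequency distribution; the population shares are what you'd expect from the whole population.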
My advice: either use the textbook definitions, or present the statistics that you see fit for your analysis. Repeat the definition if necessary.
If you have a large enough random sample, the frequency distribution becomes an estimate of the data distribution anyway (but sometimes you have to show that your sample really is random).
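A rough way to see this for the fair coin: simulate samples of increasing size and watch the observed share of heads move towards 0.5. This is only a sketch; `heads_fraction` and the sample sizes are assumptions picked for illustration.

```python
import random

def heads_fraction(n_tosses, seed=0):
    """Share of heads observed in n_tosses simulated fair coin tosses."""
    rng = random.Random(seed)  # fixed seed only so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

# Larger random samples tend to give shares closer to the population value 0.5.
for n in (10, 100, 10_000, 100_000):
    print(f"n={n}: share of heads = {heads_fraction(n):.4f}")
```

Small samples can be far off; the large ones should land close to the population value.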
When you have $n$ features, you repeat this for each feature. E.g. when you have people's 'gender', 'married', 'smokes', and 'employed' features, you compute a separate frequency distribution for each of them.
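As a sketch of the per-feature repetition, the snippet below builds one frequency table (count and share per value) for each feature. The four rows are made-up data, and `frequency_table` is a hypothetical helper, not something from a library.

```python
from collections import Counter

# Made-up sample, for illustration only.
rows = [
    {"gender": "F", "married": True,  "smokes": False, "employed": True},
    {"gender": "M", "married": False, "smokes": True,  "employed": True},
    {"gender": "F", "married": True,  "smokes": False, "employed": False},
    {"gender": "M", "married": True,  "smokes": False, "employed": True},
]

def frequency_table(rows, feature):
    """Map each observed value of `feature` to (count, share of the sample)."""
    counts = Counter(row[feature] for row in rows)
    total = sum(counts.values())
    return {value: (count, count / total) for value, count in counts.items()}

# Repeat the same calculation for every feature in the dataset.
tables = {feature: frequency_table(rows, feature) for feature in rows[0]}

for feature, table in tables.items():
    print(feature, table)
```

The binary label can be treated the same way, as one more column.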
Answered by Pieter21 on May 30, 2021