Cross Validated Asked by kaecvtionr on December 11, 2020
Say I have a dataset of 1 million high school students.
I’m trying to determine whether high school test scores can be used to predict college performance.
While the dataset has 1 million students, it has many more rows because I have multiple tests for each student.
What I’m trying to determine is whether I should exclude some students because I don’t have a sufficient number of tests for them, and including them could therefore throw off my results.
For example, Student A, I have 20 tests to use as datapoints; but for Student B, I only have 2 tests. Should I keep Student B when conducting my analysis or drop Student B?
In other words, this makes me think there should be a way to calculate the required sample size for a subgroup within a larger sample.
The comments mentioned that you should first identify what kind of bias you're concerned about. My thought is that the correlation of test scores with college performance may vary with the number of tests taken (there could be some confounding variable that affects both the number of tests and college performance).
One thought I had on how you could incorporate all of the scores is to use a Bayesian model in which you marginalize out the unknown test scores.
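To make the idea concrete, here is a minimal sketch of the related empirical-Bayes / partial-pooling approach: rather than dropping students with few tests, shrink each student's average score toward the overall mean, with less shrinkage the more tests a student has. All names, scores, and variance values below are illustrative assumptions, not data from the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: unequal numbers of tests per student,
# mirroring Student A (20 tests) and Student B (2 tests).
tests = {
    "Student A": rng.normal(75, 10, size=20),
    "Student B": rng.normal(80, 10, size=2),
}

grand_mean = np.mean(np.concatenate(list(tests.values())))
sigma2 = 10.0 ** 2  # assumed within-student test variance
tau2 = 5.0 ** 2     # assumed between-student variance (prior)

def shrunk_mean(scores):
    """Posterior mean of a student's true ability under a normal-normal model.

    With n tests, the weight on the student's own average is
    (n / sigma2) / (n / sigma2 + 1 / tau2); the rest goes to the grand mean.
    """
    n = len(scores)
    w = (n / sigma2) / (n / sigma2 + 1 / tau2)
    return w * np.mean(scores) + (1 - w) * grand_mean

for name, scores in tests.items():
    print(f"{name}: n={len(scores)}, raw mean={np.mean(scores):.1f}, "
          f"shrunk mean={shrunk_mean(scores):.1f}")
```

Student B's estimate is pulled strongly toward the grand mean because two tests carry little information, while Student A's estimate stays close to his own average; this keeps every student in the analysis while limiting the noise that small subgroups contribute.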
Answered by MONODA43 on December 11, 2020