TransWikia.com

Do we need hypothesis testing when we have all the population?

Cross Validated Asked by Siddhi Kiran Bajracharya on November 2, 2021

From what I understand, hypothesis testing is done to identify whether a finding in a sample is statistically significant. But if I have census data, do I really need hypothesis testing?

I was thinking that maybe I should perform repeated random sampling from the census data and see whether there is any random behavior.

7 Answers

Let me add something to the good answers above. Some of them, such as the accepted one, mainly address the reliability of the claim to "have all the population", along with related practical points. I propose a more theoretical perspective, related to Sergio's answer but distinct from it.

If you say you "have all the population", I focus on the case where the population is finite; I also consider the case of infinite data below. Another aspect seems relevant to me as well: whether the data concern one variable only (case 1) or several variables (case 2):

  1. If the data is about one variable, you can compute exactly all the moments and any other indicators you want. Moreover, by plotting, you know the exact distribution. Note that if the variable is continuous, finite data hardly ever fits any parametric distribution perfectly. Ideally, if the data is infinite, all incorrect distributions are definitely rejectable by some test and only the correct one is not rejected (a test remains useful only because plotting can lose some information). In this case, parameters can also be computed exactly. Hypothesis testing about the reliability of some statistical quantity (its proper meaning) becomes senseless.

  2. If several variables are collected, the above considerations still hold, but something must be added. In a purely descriptive situation, as in case 1, it is relevant to note that multivariate concepts like correlations and any other dependence metrics become perfectly known.

    However, I am wary of description in the multivariate case because, in my experience, any multivariate measure, above all regression, leads one to think about some kind of effect, which has more to do with causation and/or prediction than with description (see: Regression: Causation vs Prediction vs Description). If you want to use the data to answer causal questions, the fact that you know the entire population (the exact joint distribution) does not guarantee anything. The causal effects that you try to measure with your data, by regression or other metrics, can be completely wrong. The standard deviation of these effects is $0$, but a bias can remain.

    If your goal is prediction, the question gets a bit more complicated. If the population is finite, nothing remains to predict. If the data is infinite, you cannot have all of it. From a purely theoretical point of view, staying with the regression case, you can have an infinite amount of data that permits you to compute (rather than estimate) the parameters, so you can predict new data. However, which data you have still matters. It is possible to show that, given an infinite amount of data, the best prediction model coincides with the true model (the data-generating process), as in the causal question (see the reference in the previous link). Yet your prediction model can still be far from the best one. As before, the standard deviation is $0$, but a bias can remain.

Answered by markowitz on November 2, 2021

I would be very wary about anyone claiming to have knowledge about the complete population. There is a lot of confusion about what this term means in a statistical context, leading to people claiming they have the complete population, when they actually don't. And where the complete population is known, the scientific value is not clear.

Assume you want to figure out if higher education leads to higher income in the US. So you get the level of education and the annual income of every person in the US in 2015. That's your demographic population.

But it isn't. The data is from 2015 but the question was about the relation in general. The actual population would be the data from every person in the US in every year in the past and yet to come. There is no way to ever get data for this statistical population.

Also, if you look at the definition of a theory given, e.g., by Popper, a theory is about predicting something unknown. That is, you need to generalize. If you have a complete population, you are merely describing that population. That may be relevant in some fields, but in theory-driven fields it doesn't have much value.

In psychology, some researchers have abused this misunderstanding between population and sample. There have been cases where researchers claimed that their sample was the actual population, i.e. that the results only apply to the people who were sampled, and that a failure to replicate the results is therefore just due to the use of a different population. A nice way out, but I really don't know why I should read a paper that only theorizes about a small number of anonymous people whom I will probably never encounter and whose results may not apply to anyone else.

Answered by LiKao on November 2, 2021

Let's say you are measuring height in the current world population and you want to compare male and female height.

To check the hypothesis "average height for men alive today is higher than for women alive today", you can just measure every man and woman on the planet and compare the results. If male height is on average 0.0000000000000001 cm greater, even with a standard deviation trillions of times larger, your hypothesis is proven correct.

However, such a conclusion is probably not useful in practice. Since people are constantly being born and dying, you probably don't care about the current population, but about a more abstract population of "potentially existing humans" or "all humans in history" of which you take people alive today as a sample. Here you need hypothesis testing.
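The contrast drawn above can be sketched in code. This is a minimal illustration with simulated, hypothetical height data (the means, standard deviations, and sample sizes are invented): with the full population in hand, comparing means is plain arithmetic with no uncertainty attached; treating the same data as a sample from a larger conceptual population is what brings a test statistic into play.

```python
import random
import statistics

random.seed(0)

# Hypothetical stand-ins for "every man and woman on the planet".
men = [random.gauss(175.0, 7.0) for _ in range(100_000)]
women = [random.gauss(162.0, 6.5) for _ in range(100_000)]

# With the whole population, "testing" is just arithmetic: the
# population means are known exactly, so any nonzero difference
# is a fact, not an estimate.
pop_diff = statistics.mean(men) - statistics.mean(women)
print(f"population difference: {pop_diff:.2f} cm")  # no uncertainty attached

# If instead these were samples from a conceptual population
# ("all humans in history"), we would need a test statistic, e.g.
# a two-sample z statistic:
def z_stat(a, b):
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / ((va / na + vb / nb) ** 0.5)

sample_m = random.sample(men, 50)
sample_w = random.sample(women, 50)
print(f"z statistic on samples of 50: {z_stat(sample_m, sample_w):.1f}")
```

The first number is a population parameter; the second is an inference, and only the second needs a sampling distribution to interpret.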

Answered by David on November 2, 2021

To illustrate my points, I will assume that everybody has been asked whether they prefer Star Trek or Doctor Who and has to choose one of them (there is no neutral option). To keep things simple, let’s also assume that your census data is actually complete and accurate (which it rarely ever is).

There are some important caveats about your situation:

  1. Your demographic population hardly ever is your statistical population. In fact, I cannot think of a single example where it is reasonable to ask the kind of questions answered by statistical tests about a statistical population that is a demographic population.

    For example, suppose you want to settle once and for all the question whether Star Trek or Doctor Who is better, and you define better via the preference of everybody alive at the time of the census. You find that 1234567 people prefer Star Trek and 1234569 people prefer Doctor Who. If you want to accept this verdict as it is, no statistical test is needed.

    However, you may want to find out whether this difference reflects actual preference or can be explained by undecided people being forced to make a random choice. For example, you can investigate the null model that people choose between the two at random and see how extreme a difference of 2 is for your demographic population size. In that case, your statistical population is not your demographic population, but the aggregated outcome of an infinite number of censuses performed on your current demographic population.

  2. If your data is the size of the population of a reasonably sized administrative region, then for the questions such data usually answers, you should focus on effect size, not on significance.

    For example, it has no practical implications whether Star Trek is better than Doctor Who by a small margin, but you may want to decide practical matters such as how much screen time to allot to the shows on national television. If 1234567 people prefer Star Trek and 1234569 people prefer Doctor Who, you would allot both an equal amount of screen time, whether that tiny difference is statistically significant or not.

    On a side note, once you care about effect size, you may want to know its margin of error, and this can indeed be determined by the kind of random sampling you are alluding to in your question, namely bootstrapping.

  3. Using demographic populations tends to lead to pseudoreplication. Your typical statistical test assumes uncorrelated samples. In some cases you can drop this requirement if you have good information on the correlation structure and build a null model based on it, but that is rather the exception. For smaller samples, you instead avoid correlated samples by explicitly avoiding sampling two people from the same household or similar. When your sample is the entire demographic population, you cannot do this, and thus you inevitably have correlations. If you treat the data as independent samples nonetheless, you commit pseudoreplication.

    In our example, people do not arrive at a preference of Star Trek or Doctor Who independently, but instead are influenced by their parents, friends, partners, etc. and their fates align. If the matriarch of some popular clan prefers Doctor Who, this is going to influence many other people thus leading to pseudoreplication. Or, if four fans are killed in a car crash on their way to a Star Trek convention, boom, pseudoreplication.
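The bootstrapping mentioned in point 2 can be sketched as follows. This is an illustrative toy with a scaled-down, hypothetical census (1234 vs. 1236 preferences instead of millions): resample the census with replacement many times and read off how much the Star Trek share would fluctuate across hypothetical replications.

```python
import random

random.seed(1)

# Hypothetical scaled-down census: 1 = prefers Star Trek, 0 = Doctor Who.
population = [1] * 1234 + [0] * 1236

def star_trek_share(votes):
    return sum(votes) / len(votes)

observed = star_trek_share(population)

# Bootstrap: resample the whole census with replacement many times
# and record the share each time.
boot = [
    star_trek_share(random.choices(population, k=len(population)))
    for _ in range(2000)
]
boot.sort()
lo, hi = boot[50], boot[1949]  # ~95% percentile interval
print(f"share: {observed:.4f}, 95% interval: [{lo:.4f}, {hi:.4f}]")
```

The interval straddles 0.5 comfortably, which is exactly the "margin of error on the effect size" reading: the two-vote gap is far smaller than the fluctuation one would expect across replications.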

To give another perspective on this, let’s consider another example that avoids the second and third problem as much as possible and is somewhat more practical: Suppose you are in charge of a wildlife reserve featuring the only remaining pink elephants in the world. As pink elephants stand out (guess why they are endangered), you can easily perform a census on them. You notice that you have 50 female and 42 male elephants and wonder whether this indicates a true imbalance or can be explained by random fluctuations. You can perform a statistical test with the null hypothesis that the sex of pink elephants is random (with equal probability) and uncorrelated (e.g., no monozygotic twins). But here again, your statistical population is not your ecological population, but all pink elephants ever in the multiverse, i.e., it includes infinite hypothetical replications of the experiment of running your wildlife reserve for a century (details depend on the scope of your scientific question).
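The elephant test described above can be carried out as an exact binomial test; a minimal sketch using only the standard library (the 50/92 counts come from the example, the rest is generic):

```python
from math import comb

# Census counts from the example: 50 females out of 92 elephants,
# under the null that each sex is equally likely and independent.
n, k = 92, 50

def binom_pmf(n, k, p=0.5):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Two-sided p-value: sum the probabilities of all outcomes at least
# as improbable as the observed count.
p_obs = binom_pmf(n, k)
p_value = sum(binom_pmf(n, i) for i in range(n + 1) if binom_pmf(n, i) <= p_obs)
print(f"two-sided p = {p_value:.3f}")
```

The p-value comes out far above any conventional threshold, so a 50-to-42 split is entirely consistent with random fluctuation in sex at birth.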

Answered by Wrzlprmft on November 2, 2021

Funny. I spent years explaining to clients that in cases with true census information there was no variance and therefore statistical significance was meaningless.

Example: If I have data from 150 stores in a supermarket chain that says 15000 cases of Coke and 16000 cases of Pepsi were sold in a week, we can definitely say that more cases of Pepsi were sold. [There might be measurement error, but not sampling error.]

But, as @Sergio notes in his answer, you might want an inference. A simple example might be: is this difference between Pepsi and Coke larger than it typically is? For that, you'd look at the variation in the sales difference versus the sales difference in previous weeks, and you'd draw a confidence interval or do a statistical test to see if this difference was unusual.
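The "is this gap unusual?" inference sketched above can be illustrated with a few lines of code. The weekly figures here are invented for the sake of the example; the point is that even with census-quality store data, judging whether this week stands out requires comparing it to the variation across past weeks.

```python
import statistics

# Hypothetical weekly sales differences (Pepsi - Coke, in cases) over
# the previous 12 weeks. The chain has full census data each week, so
# the question is not "is the gap real" but "is it unusual".
past_diffs = [300, 450, 520, 280, 610, 390, 470, 350, 540, 410, 480, 360]
this_week = 16000 - 15000  # 1000 cases

mean = statistics.mean(past_diffs)
sd = statistics.stdev(past_diffs)
z = (this_week - mean) / sd
print(f"this week's gap is {z:.1f} standard deviations above the typical gap")
```

A gap several standard deviations above the historical mean would flag this week as unusual, which is the inference zbicyclist describes; a formal test or confidence interval would dress the same comparison up with a sampling model.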

Answered by zbicyclist on November 2, 2021

In typical applications of hypothesis testing, you do not have access to the whole population of interest, but you want to make statements about the parameters that govern the distribution of the data in the population (mean, variance, correlation, ...). Then you take a sample from the population, and assess whether the sample is compatible with the hypothesis that the population parameter is some pre-specified value (hypothesis testing), or you estimate the parameter from your sample (parameter estimation).

However, when you really have the whole population, you are in the rare position that you have direct access to the true population parameters - for example, the population mean is just the mean of all the values of the population. Then you don't need to perform any further hypothesis testing or inference - the parameter is exactly what you have.

Of course, the situations where you really have data from the whole population of interest are exceptionally rare, and mostly constrained to textbook examples.

Answered by Lukas McLengersdorff on November 2, 2021

It all depends on your goal.

If you want to know how many people smoke and how many people die of lung cancer you can just count them, but if you want to know whether smoking increases the risk for lung cancer then you need statistical inference.

If you want to know high school students' educational attainments, you can just look at complete data, but if you want to know the effects of high school students' family backgrounds and mental abilities on their eventual educational attainments you need statistical inference.

If you want to know workers' earnings, you can just look at census data, but if you want to study the effects of educational attainment on earnings, you need statistical inference (you can find more examples in Morgan & Winship, Counterfactuals and Causal Inference: Methods and Principles for Social Research.)

Generally speaking, if you are only looking for summary statistics in order to communicate the largest amount of information as simply as possible, you can just count, sum, divide, plot etc.

But if you wish to predict what will happen, or to understand what causes what, then you need statistical inference: assumptions, paradigms, estimation, hypothesis testing, model validation, etc.

Answered by Sergio on November 2, 2021
