TransWikia.com

Is it bad to have error bars constructed with standard deviation that extend into the negative range while the variable itself shouldn't be negative?

Cross Validated Asked on December 13, 2021

I have a question regarding error bars. I understand that error bars (EBs) constructed with 1 standard deviation (SD) present different things about the population than EBs constructed with 95% confidence intervals (CI). Namely, EBs with SD show the spread (or dispersion) of the variable’s actual values, while EBs with CI show the range that the actual mean should most likely fall within.

My data include a variable, the number (count) of times a person visits the doctor per year. The mean visit number is 3 and the SD is 5, while the confidence interval is 2.5 to 3.5. Is it inherently wrong to show the EBs based on SD since it would extend to negative values (i.e., 3-5 = -2)? Does it violate any assumption?

If I draw a bar graph showing the mean of 3 with EBs based on 1 SD, truncated at zero, the EBs will range from 0 to 8. Can I still claim that ~68% of values fall within 0 to 8? Or does this no longer hold, because the distribution is right-skewed and the lower EB would otherwise extend well into negative values? If so, how should I interpret the 0 to 8 range, which truncates the negative part?

One Answer

No, in this case, it does not make sense to draw error bars using SDs.

Take a step back. Why do we draw error bars with SDs? As you write, it's to show where "much" of the data lies. This makes sense if your data come from a normal distribution: 68% of your data will lie within $\pm 1$ SD from the mean, so showing the mean with an error bar of $\pm 1$ SD will give you an interval that contains 68% of your data.
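As a quick check of the 68% figure, this R line computes the probability that a normally distributed variable falls within one SD of its mean. The name `coverage_1sd` is only illustrative.

```r
# Probability mass of a normal distribution within +/- 1 SD of the mean
coverage_1sd <- diff(pnorm(c(-1, 1)))
print(round(coverage_1sd, 4))  # 0.6827
```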

However, the number of visits to a doctor is a count, so it is discrete. And it can't be negative. Thus, it can't be normal. For high counts, you can often treat counts as normal, but not for a mean of 3 and an SD of 5. Using SD-based error bars is the wrong way of answering the original question, i.e., showing where "much" of the data falls.
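To get a feel for how poorly a normal approximation fits here, one can ask what share of a normal distribution with mean 3 and SD 5 would lie below zero. This is only an illustration of the mismatch, not a model anyone would fit to counts, and `p_negative` is an illustrative name.

```r
# Share of a Normal(mean = 3, sd = 5) distribution that falls below zero
p_negative <- pnorm(0, mean = 3, sd = 5)
print(round(p_negative, 3))  # 0.274
```

Under that approximation, more than a quarter of the "visits" would be negative, which is impossible for a count.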

Better: calculate the top and bottom ends of your interval directly, e.g., as the 16% and 84% quantiles of your observations. The range between them will again contain about 68% of your data, just as the interval of mean $\pm 1$ SD does in the normal case.
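A minimal sketch of the quantile approach in R might look as follows. The vector `visits` is simulated stand-in data (skewed counts with a mean near 3 and an SD near 5), not the asker's data, and `type = 1` is one possible choice that makes `quantile()` return actually observed counts instead of interpolated values.

```r
set.seed(1)  # only to make the illustration reproducible

# Hypothetical stand-in for the observed yearly visit counts:
# skewed counts with mean near 3 and SD near 5 (an assumption, not real data)
visits <- rnbinom(1000, mu = 3, size = 9 / 22)

# Empirical 16% and 84% quantiles; type = 1 returns observed (integer) values
bounds <- quantile(visits, probs = c(0.16, 0.84), type = 1)
print(bounds)

# Share of observations inside the interval (at least 68% with type = 1)
inside <- mean(visits >= bounds[1] & visits <= bounds[2])
print(inside)
```

Because counts have many ties, the interval can cover noticeably more than 68% of the observations, and its lower end can never be negative.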

Alternatively, you can fit a distribution. For instance, a mean of 3 and an SD of 5 are consistent with a negative binomial distribution with a mean of 3 and a size parameter of $\frac{3^2}{5^2-3}$ (see R's help page ?qnbinom - there are many different parameterizations of the negbin). For such a distribution, we can again calculate the parametric 16%/84% quantiles, which turn out to give us an interval $[0,6]$:

> qnbinom(pnorm(c(-1,1)),mu=3,size=3^2/(5^2-3))
[1] 0 6
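The size parameter above follows from the negative binomial variance in R's `mu`/`size` parameterization, variance = mu + mu^2/size. The sketch below solves that for `size`, confirms that the implied SD is 5, and also evaluates how much of the fitted distribution lies in the question's 0 to 8 range. Variable names such as `sdv` and `p_0_to_8` are illustrative.

```r
mu  <- 3   # mean from the question
sdv <- 5   # SD from the question

# In R's mu/size parameterization: variance = mu + mu^2 / size
size <- mu^2 / (sdv^2 - mu)            # 9 / 22
implied_sd <- sqrt(mu + mu^2 / size)   # should recover the SD of 5

# Parametric 16%/84% quantiles, as in the answer
q <- qnbinom(pnorm(c(-1, 1)), mu = mu, size = size)

# Probability of a count between 0 and 8 under the fitted distribution
p_0_to_8 <- pnbinom(8, mu = mu, size = size)

print(c(size = size, implied_sd = implied_sd))
print(q)
print(round(p_0_to_8, 2))
```

Under this fit, the truncated 0 to 8 range from the question covers close to 90% of values, not 68%, so it should not be read as a one-SD interval.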

Answered by Stephan Kolassa on December 13, 2021
