Cross Validated Asked by Stephen Clark on November 12, 2021
I have a patchy time series of means, medians and sample sizes (n) that occasionally appears to be "contaminated" (either by bad or atypical data). For example:
median <- c(605, 598, 600, 752, 610, 662, 600, 730, 715, 650, 750, 275, 675, 650, 1267, 800, 675, 650)
mean <- c(682, 632, 800, 886, 709, 789, 730, 775, 840, 810, 850, 538, 685, 783, 1296, 959, 2315, 707)
n <- c(159, 107, 112, 39, 82, 45, 122, 79, 96, 73, 198, 189, 79, 38, 225, 174, 115, 108)
Here the penultimate observation has a mean roughly 3.4 times its median (2315 vs. 675) and is probably "contaminated". Is there a way to use information on just the mean, median and sample size (or neighbouring observations?) to identify a sample that is atypical? For instance, under a normality assumption, should mean/median = 1.0? I do not have access to the original sample observations. Also, here I have data for every quarter, but for other locations there are quarters for which I have no data.
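As a first look at the numbers above, the mean/median ratio can be computed per quarter. The following is only a rough screening sketch, not a formal test. The vector names `avg` and `med` are renamed here to avoid shadowing R's `mean()` and `median()`. The robust z-score on the log ratio and its cutoff of 3.5 are assumptions added for illustration and do not come from the question.

```r
# Quarterly summaries from the question.
med <- c(605, 598, 600, 752, 610, 662, 600, 730, 715, 650, 750, 275, 675, 650, 1267, 800, 675, 650)
avg <- c(682, 632, 800, 886, 709, 789, 730, 775, 840, 810, 850, 538, 685, 783, 1296, 959, 2315, 707)
n   <- c(159, 107, 112, 39, 82, 45, 122, 79, 96, 73, 198, 189, 79, 38, 225, 174, 115, 108)

# Mean/median ratio for each quarter.
ratio <- avg / med
worst <- which.max(ratio)   # index of the quarter with the largest ratio

# Heuristic screen (an assumption, not a formal test): robust z-score of
# log(ratio), centred on the series median and scaled by the MAD.
log_ratio <- log(ratio)
robust_z  <- (log_ratio - median(log_ratio)) / mad(log_ratio)
flagged   <- which(abs(robust_z) > 3.5)   # 3.5 is an arbitrary illustrative cutoff

print(data.frame(quarter = seq_along(ratio), n = n,
                 ratio = round(ratio, 2), robust_z = round(robust_z, 1)))
print(flagged)
```

This compares each quarter only with the other quarters in the same series, so it says that a ratio is unusual for this series, not that the underlying sample is contaminated.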
I found this reference, but it is a bit dated and only covers cases with low N. Dixon, W. J. "Processing data for outliers." Biometrics 9, no. 1 (1953): 74-89. https://www.jstor.org/stable/pdf/3001634.pdf
I would say that you need a measure of spread: to build a test you would need the standard deviation, the interquartile range, or something similar. In theory we have $|E[X] - \operatorname{Median}(X)| \le \sigma$ (see Wikipedia), and this bound is tight: one can find distributions for which equality holds. So if you have no information on the spread of your numbers, a mean three times larger than the median can occur even for non-contaminated data. For example, take a log-normal distribution with parameters $\mu, \sigma$: the median is $\exp(\mu)$ and the mean is $\exp(\mu + \sigma^2/2)$, so depending on $\sigma$ the gap between the mean and the median can be huge, even though a log-normal distribution is not contaminated by outliers.
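To make the log-normal example concrete: the formulas above give mean/median $= \exp(\sigma^2/2)$, which does not depend on $\mu$. The sketch below solves for the $\sigma$ that would reproduce the ratio in the questioner's penultimate quarter (2315 / 675) and checks it by simulation. The helper name `lognormal_ratio`, the seed and the simulation size are illustrative choices. Nothing here shows that the real data are log-normal, only that such a ratio is attainable without contamination.

```r
# For a log-normal with parameters mu and sigma:
#   median = exp(mu), mean = exp(mu + sigma^2 / 2)
# so mean / median = exp(sigma^2 / 2), independent of mu.
lognormal_ratio <- function(sigma) exp(sigma^2 / 2)

# sigma needed to reproduce the observed ratio in the penultimate quarter.
observed_ratio <- 2315 / 675
sigma_needed   <- sqrt(2 * log(observed_ratio))
print(sigma_needed)

# Simulation check (seed and size are arbitrary illustration choices).
set.seed(1)
x <- rlnorm(1e5, meanlog = log(675), sdlog = sigma_needed)
print(c(sample_median = median(x),
        sample_mean   = mean(x),
        sample_ratio  = mean(x) / median(x)))
```

With only 115 observations, as in that quarter, the sample mean of such a heavy-tailed distribution would itself be quite variable, which is another reason the ratio alone cannot settle the question.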
Answered by TMat on November 12, 2021