
A tail bound for an unknown distribution via sampling

Cross Validated. Asked by Bashir on February 22, 2021.

I know that many results exist for bounding the tail of a distribution, i.e., for a random variable $X$, one can find a bound $\epsilon$ such that $\Pr[X \geq a] < \epsilon$. Some examples are Markov's inequality and the Chernoff bound. In order to use such tail bounds, one needs to know the distribution of $X$, or some property of the distribution such as the mean or variance.
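
For example, Markov's inequality states that for a nonnegative random variable $X$ and any $a > 0$,
$$\Pr[X \geq a] \leq \frac{\mathbb{E}[X]}{a},$$
so knowing only the mean already yields an upper tail bound, albeit often a loose one.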

On the other hand, if the distribution of $X$ is unknown but one can draw samples from that distribution, it is possible to estimate properties such as the mean or the variance of $X$ by sampling and there are results that guarantee how good our estimation is based on the number of samples used.

Now I have the following question. I do not have any knowledge about the distribution of $X$, but I can query as many samples as I want from its distribution. My goal is to find a bound for the lower tail of $X$. In particular, for a given $a$, I am looking for a guarantee in the following form: $\Pr[X < a] < \epsilon$ (alternatively, I can say I am looking for an $a$ given an $\epsilon$).

A naive sampling approach would be to build a histogram of the values of the samples, and compute the empirical probability $\Pr[X < a]$. However, I am wondering if there are better ways to find such a bound by sampling. Ideally the number of samples should appear in the probability bound. Does it make sense to first estimate the mean and then use something like Markov's inequality or the Chernoff bound? How does the uncertainty about the mean affect the final tail bound? Do you have any suggestion about a sampling algorithm which results in a fairly tight bound that depends on the number of samples?
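
One concrete version of the naive approach does come with an explicit finite-sample guarantee. Below is a minimal sketch (the function name, the NumPy implementation, and the example distribution are illustrative assumptions, not part of the question): each indicator $\mathbb{1}[X_i < a]$ lies in $[0,1]$, so Hoeffding's inequality implies that with probability at least $1-\delta$, $\Pr[X < a] \leq \hat{p} + \sqrt{\ln(1/\delta)/(2n)}$, where $\hat{p}$ is the empirical frequency.

```python
import numpy as np

def empirical_lower_tail_bound(samples, a, delta=0.05):
    """Upper confidence bound on Pr[X < a] from n i.i.d. samples.

    Each indicator 1[X_i < a] lies in [0, 1], so Hoeffding's
    inequality gives, with probability at least 1 - delta:
        Pr[X < a] <= p_hat + sqrt(ln(1/delta) / (2 n)).
    """
    samples = np.asarray(samples)
    n = samples.size
    p_hat = np.mean(samples < a)                      # empirical Pr[X < a]
    slack = np.sqrt(np.log(1.0 / delta) / (2.0 * n))  # shrinks as 1/sqrt(n)
    return p_hat + slack

# Illustration: the procedure itself never uses knowledge of the distribution.
rng = np.random.default_rng(seed=1)
x = rng.exponential(scale=1.0, size=10_000)
print(empirical_lower_tail_bound(x, a=0.1))  # valid with prob. >= 0.95
```

The slack term makes the dependence on the number of samples explicit, which is the kind of guarantee asked for above.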

One Answer

There is a generalization of the Chernoff bound, due to Hoeffding, called Hoeffding's inequality; the version below applies to sampling without replacement from a finite population.

Let $\mathcal{X} = (x_1, \ldots, x_N)$ be a finite population of $N$ points and $X_1, \ldots, X_n$ be a random sample drawn without replacement from $\mathcal{X}$. Let $$a = \min_{1 \leq i \leq N} x_i, \qquad b = \max_{1 \leq i \leq N} x_i.$$ Then for all $\epsilon > 0$,

$$\mathbb{P}\bigg(\frac{1}{n}\sum_{i=1}^n X_i - \mu \geq \epsilon\bigg) \leq \exp\bigg(-\frac{2n\epsilon^2}{(b-a)^2}\bigg),$$ where $\mu = \frac{1}{N}\sum_{i=1}^N x_i$ is the mean of $\mathcal{X}$. For a reference, see "Concentration inequality using sample".
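
As a small usage sketch (the function names and the numbers in the example are assumptions, not from the answer): the code below evaluates the right-hand side of the inequality and inverts it to find the smallest deviation $\epsilon$ that holds with a target failure probability $\delta$.

```python
import numpy as np

def hoeffding_tail_prob(n, eps, a, b):
    """Right-hand side of the bound: exp(-2 n eps^2 / (b - a)^2)."""
    return np.exp(-2.0 * n * eps**2 / (b - a) ** 2)

def hoeffding_eps(n, delta, a, b):
    """Smallest eps such that the bound above is at most delta."""
    return (b - a) * np.sqrt(np.log(1.0 / delta) / (2.0 * n))

# Example: values known to lie in [0, 10], n = 1000 samples, delta = 0.05.
eps = hoeffding_eps(n=1000, delta=0.05, a=0.0, b=10.0)
print(eps, hoeffding_tail_prob(1000, eps, 0.0, 10.0))  # second value ~ 0.05
```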

Answered by annie_lee on February 22, 2021
