Cross Validated. Asked on January 14, 2021
I want to sample from the geometric distribution for a very small success rate. The success rate is so small that I represent it by its log. I want the result to also be represented by its log. Is there a numerically stable way to do this?
E.g.
log(success rate) = -2000
log(sample) ~ 2000
Let $X$ be the number of independent trials of a Bernoulli$(p)$ variable needed to observe the first success, so that $X$ can have the values $1, 2, 3, \ldots.$ For any such value its cumulative probability function is given by
$$F_p(k) = \Pr(X\le k) = 1 - (1-p)^k.$$
This is readily inverted for drawing values of $X.$ That is, given a uniform variable $U$ in $[0,1],$ the value
$$x_p(U) = \lceil \log_{1-p} U \rceil = \left\lceil \frac{\log(U)}{\log(1-p)}\right\rceil\tag{*}$$
has the same distribution as $X.$ (The brackets $\lceil\ \rceil$ denote the ceiling, or next greatest integer.)
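For ordinary (not too small) success rates, $(*)$ can be coded directly. A minimal R sketch (the function name rgeom_direct is assumed here, and log1p(-p) stands in for log(1-p) for accuracy):

    # Direct inverse-CDF sampler from (*): n draws with success rate p.
    rgeom_direct <- function(n, p) ceiling(log(runif(n)) / log1p(-p))
    rgeom_direct(5, 1e-3)   # draws on the order of 1/p = 1000

This direct route is unavailable in the regime of the question: $p = 10^{-2000}$ underflows to zero as a double, and $X \approx 10^{2000}$ would overflow, so both must be kept on the log scale.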
When $p$ is tiny, $X$ is likely going to be a huge value. This is intuitively clear: when you have little chance of success, you will likely see a very long string of failures before you do eventually succeed. One rigorous demonstration goes like this: pick a huge number $M\gg 0$ that is still very tiny compared to $1/p,$ so that $pM$ is small. The chance that $X$ exceeds $M$ is
$$\Pr(X \gt M) = 1 - F_p(M) = \left((1-p)^{1/p}\right)^{pM} \approx e^{-pM} \approx 1 - pM \approx 1.$$
In fact, when computing in double-precision floating point arithmetic, these approximations are equalities when $pM \le 2^{-53}\approx 10^{-16}.$ In other words, on a log (base 10) scale, you just won't see any values of $\log(X)$ that are more than $16$ less than $\log(1/p) = -\log(p).$
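A quick R illustration of that floating-point claim (the particular value of pM is just an arbitrary choice below $2^{-53}$):

    # With pM below 2^-53, both approximations round to exactly 1
    # in double precision.
    pM <- 1e-17
    exp(-pM) == 1   # TRUE
    1 - pM  == 1    # TRUE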
When any number is huge, there's no discernible difference between it and its ceiling (the next highest integer). This permits us to drop the ceiling operation from $(*)$ and take the logarithms of both sides to obtain
$$\log(x_p(U)) \approx \log(-\log(U)) - \log(-\log(1-p)) \approx \log(-\log(U)) - \log(p).$$
That's the solution. Just generate uniform variables $U$ in the interval $(0,1)$ and apply this formula. (The error in the last approximation is proportional to $-p/2,$ which is truly tiny.)
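A minimal R sketch of this recipe (the function name rlgeom, and the convention that both the input log(p) and the output log(X) are natural logarithms, are choices made for the example):

    # Draw n values of log(X), given log(p), both on the natural-log scale.
    rlgeom <- function(n, log_p) log(-log(runif(n))) - log_p

    # Example: p = 10^-2000, so log(p) = -2000 * log(10).
    log_x <- rlgeom(5, -2000 * log(10))
    log_x / log(10)   # converted to base 10: values near 2000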
This, by the way, is a reverse Gumbel distribution.
As an example, let $p = 10^{-2000}.$ Here is a histogram of a million realizations of $X$ using this solution.
Notice how in this simulation of size $10^6$ there were no values of $\log X$ more than $6$ less than $-\log(p) = 2000.$ This is as we would expect.
The R code to run this large simulation took well under one second, attesting to its efficiency:
    x <- log(-log(runif(1e6))) / log(10) + 2000
The three constants are 2000, the negative log of $p$; 10, the base of that logarithm; and 1e6, the number of values to generate.
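As a quick check on the histogram's message, the simulated values can be inspected directly (using the x generated above):

    # Per the argument above, no value of log10(X) should fall more than
    # about 6 below -log10(p) = 2000 in a simulation of this size.
    min(x)
    max(x)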
Answered by whuber on January 14, 2021