Cross Validated. Asked on January 14, 2021
I want to sample from the geometric distribution for a very small success rate. The success rate is so small that I represent it by its log. I want the result to also be represented by its log. Is there a numerically stable way to do this?
E.g.
log(success rate) = -2000
log(sample) ~ 2000
Let $X$ be the number of independent trials of a Bernoulli$(p)$ variable needed to observe the first success, so that $X$ can have the values $1, 2, 3, \ldots.$ For any such value its cumulative probability function is given by
$$F_p(k) = \Pr(X\le k) = 1 - (1-p)^k.$$
This is readily inverted for drawing values of $X.$ That is, given a uniform variable $U$ in $[0,1],$ the value
$$x_p(U) = \lceil \log_{1-p} U \rceil = \left\lceil \frac{\log(U)}{\log(1-p)}\right\rceil\tag{*}$$
has the same distribution as $X.$ (The brackets $\lceil\ \rceil$ denote the ceiling, or next greatest integer.)
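For ordinary (not too small) success rates, $(*)$ can be coded directly. A minimal R sketch (the function name rgeom_direct is assumed here, and log1p(-p) stands in for log(1-p) for accuracy):

    # Direct inverse-CDF sampler from (*): n draws with success rate p.
    rgeom_direct <- function(n, p) ceiling(log(runif(n)) / log1p(-p))
    rgeom_direct(5, 1e-3)   # draws on the order of 1/p = 1000

This direct route is unavailable in the regime of the question: $p = 10^{-2000}$ underflows to zero as a double, and $X \approx 10^{2000}$ would overflow, so both must be kept on the log scale.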
When $p$ is tiny, $X$ is likely going to be a huge value. This is intuitively clear: when you have little chance of success, you will likely see a very long string of failures before you do eventually succeed. One rigorous demonstration goes like this: pick a huge number $M\gg 0$ that is still very tiny compared to $1/p,$ so that $pM$ is small. The chance that $X$ exceeds $M$ is
$$\Pr(X \gt M) = 1 - F_p(M) = \left((1-p)^{1/p}\right)^{pM} \approx e^{-pM} \approx 1 - pM \approx 1.$$
In fact, when computing in double-precision floating point arithmetic, these approximations are equalities when $pM \le 2^{-53}\approx 10^{-16}.$ In other words, on a log (base 10) scale, you just won't see any values of $\log(X)$ that are more than $16$ less than $\log(1/p) = -\log(p).$
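A quick R illustration of that floating-point claim (the particular value of pM is just an arbitrary choice below $2^{-53}$):

    # With pM below 2^-53, both approximations round to exactly 1
    # in double precision.
    pM <- 1e-17
    exp(-pM) == 1   # TRUE
    1 - pM  == 1    # TRUE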
When any number is huge, there's no discernible difference between it and its ceiling (the next highest integer). This permits us to drop the ceiling operation from $(*)$ and take the logarithms of both sides to obtain
$$\log(x_p(U)) \approx \log(-\log(U)) - \log(-\log(1-p)) \approx \log(-\log(U)) - \log(p).$$
That's the solution. Just generate uniform variables $U$ in the interval $(0,1)$ and apply this formula. (The error in the last approximation is proportional to $-p/2,$ which is truly tiny.)
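A minimal R sketch of this recipe (the function name rlgeom, and the convention that both the input log(p) and the output log(X) are natural logarithms, are choices made for the example):

    # Draw n values of log(X), given log(p), both on the natural-log scale.
    rlgeom <- function(n, log_p) log(-log(runif(n))) - log_p

    # Example: p = 10^-2000, so log(p) = -2000 * log(10).
    log_x <- rlgeom(5, -2000 * log(10))
    log_x / log(10)   # converted to base 10: values near 2000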
This, by the way, is a reverse Gumbel distribution.
As an example, let $p = 10^{-2000}.$ Here is a histogram of a million realizations of $X$ using this solution.
Notice how in this simulation of size $10^6$ there were no values of $\log X$ more than $6$ less than $-\log(p) = 2000.$ This is as we would expect.
The R code to run this large simulation took well under one second, attesting to its efficiency:
    x <- log(-log(runif(1e6))) / log(10) + 2000
The three constants are 2000, the negative log of $p$; 10, the base of that logarithm; and 1e6, the number of values to generate.
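As a quick check on the histogram's message, the simulated values can be inspected directly (using the x generated above):

    # Per the argument above, no value of log10(X) should fall more than
    # about 6 below -log10(p) = 2000 in a simulation of this size.
    min(x)
    max(x)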
Answered by whuber on January 14, 2021