Intuition Behind binomial (logistic) GLM

Question

This is a question about logistic regression and how it relates to the Gaussian and binomial distributions.
# type = 'response' is an argument of predict(), not glm(), so it is dropped here
model <- glm(target ~ x1, data = data, family = 'binomial')
model <- glm(target ~ x1, data = data)  # family defaults to gaussian

My understanding of binomial is that it is
theta=chance of success
z=trials ending in success
k=trials ending in failure
(theta^z)*(1-theta)^k

And something Gaussian is
σ = standard deviation
x = observed value
μ = mean
$Y = \dfrac{1}{\sigma\sqrt{2\pi}} \, e^{-(x-\mu)^2/(2\sigma^2)}$

So I understand how to fit a GLM in R, and I kind of understand what binomial and Gaussian mean, but I do not understand how binomial or Gaussian relates to logistic regression, or how the two differ in this context.
Question 1- Can someone explain the intuition behind how "family='binomial'" is used when building a model with GLM?
Question 2- Given that the shapes of a binomial distribution and a Gaussian distribution look very much the same (they both peak in the middle and gradually go down towards the ends), how does choosing either binomial or Gaussian lead to different models built from GLM?

bdeonovic · Answer

Let's say your response variable is $Y$. In regression we want to model the response as a linear combination of the predictor variables ($X$), e.g. $Y=\beta_0 + \beta_1X + \epsilon$ or $E[Y]=\beta_0 + \beta_1X$. But what happens when the response variable only takes values in $[0,1]$ (i.e. it is a probability, a proportion, or strictly 0 or 1)? Notice that $\beta_0 + \beta_1X$ may take any value on the real line: it could be 0 or 1, but also 100 or a negative number. If the response is strictly in $[0,1]$, it makes no sense to use a model whose predictions can fall outside that range.
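To see the problem concretely, here is a minimal sketch on simulated data. The variable names, the seed, and the true coefficient 1.5 are illustrative assumptions, not taken from the question. An ordinary linear fit to a 0/1 response produces fitted values on both sides of the $[0,1]$ interval:

```r
# Simulated example; x, y, and the coefficient 1.5 are made up for illustration.
set.seed(1)
x <- seq(-5, 5, length.out = 200)
y <- rbinom(length(x), size = 1, prob = plogis(1.5 * x))  # 0/1 response

# Ordinary linear regression treats y as an unbounded continuous response.
fit_lm <- lm(y ~ x)
pred_lm <- fitted(fit_lm)

# The fitted "probabilities" escape the [0, 1] interval at both ends.
range(pred_lm)
```

The straight line has to keep rising as x grows, so somewhere it must cross 1, and likewise drop below 0 at the other end.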

Therefore, when we want to model a probability or a proportion, we instead model a function of the expected response: $g(E[Y])=\beta_0 + \beta_1X$. The function $g$ is called the link function.

$g(E[Y]) = E[Y]$ Identity link, used in linear regression.

$g(P(Y=1)) = \log\dfrac{P(Y=1)}{1-P(Y=1)}$ Logit link, used in logistic regression. Notice that here we are modeling the probability that $Y=1$, which is also the expected value of a 0/1 response. We can then solve for the quantity we want: $P(Y=1) = \dfrac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}}$

$g(E[Y]) = \log E[Y]$ Log link, used in Poisson regression.
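As a rough illustration of the logit link and its inverse, base R provides `qlogis()` (logit) and `plogis()` (inverse logit). The values in `eta` below are arbitrary stand-ins for the linear predictor $\beta_0+\beta_1X$:

```r
# Linear predictor values: any real number is allowed.
eta <- c(-5, -1, 0, 1, 5)

# Inverse logit written out as in the logistic regression formula ...
p_manual <- exp(eta) / (1 + exp(eta))
# ... and the built-in equivalent (the logistic CDF).
p <- plogis(eta)

# The logit link maps the probabilities back to the linear predictor.
eta_back <- qlogis(p)

all.equal(p, p_manual)    # TRUE
all.equal(eta_back, eta)  # TRUE
```

Whatever value the linear predictor takes, the inverse link squeezes it into the interval between 0 and 1.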

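Question 1: In R's glm function, the family argument specifies the distribution assumed for the response, and each family carries a default link function (logit for binomial, identity for gaussian, log for poisson). A different link can be requested explicitly, e.g. binomial(link = "probit").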
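One way to check which link a family uses is to inspect the family object itself. This is a small sketch using the family constructors from base R's stats package; the names `probit_family` and `logit_family` are just illustrative:

```r
# Each family object records its default link function.
binomial()$link  # "logit"
gaussian()$link  # "identity"
poisson()$link   # "log"

# The link can be overridden while keeping the same response distribution.
probit_family <- binomial(link = "probit")
probit_family$link  # "probit"

# linkfun and linkinv are the link g and its inverse.
logit_family <- binomial()
logit_family$linkfun(0.5)  # 0
logit_family$linkinv(0)    # 0.5
```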

Question 2: First of all, it is not true that a binomial distribution always looks like a normal one. If $X \sim \text{Binomial}(n, p=0.1)$ with a small number of trials $n$, the distribution is right-skewed and does not look bell-shaped.
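The skew is easy to check numerically. The sketch below assumes $n = 10$ purely for illustration and computes the skewness directly from the probability mass function:

```r
# Binomial(n = 10, p = 0.1); n = 10 is an illustrative choice.
n <- 10
p <- 0.1
k <- 0:n
pmf <- dbinom(k, size = n, prob = p)

# Most of the mass sits at 0, 1 and 2, with a long tail to the right.
round(pmf[1:5], 3)

# Skewness computed directly from the probability mass function.
mu <- sum(k * pmf)
sigma <- sqrt(sum((k - mu)^2 * pmf))
skew <- sum(((k - mu) / sigma)^3 * pmf)
skew  # positive, i.e. right-skewed
```

The result agrees with the closed form $(1-2p)/\sqrt{np(1-p)}$, which shrinks toward 0 as $n$ grows; that is why a binomial with many trials does start to look bell-shaped.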

Aghila · Answer

You use logistic regression when your response variable is binary (0/1) or a proportion (e.g. 10/30), so you can't relate it to a Gaussian distribution, which is continuous and unbounded. That's why you specify family="binomial" to perform logistic regression in R and family="gaussian" to perform linear regression.
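A minimal sketch of the two calls side by side, on simulated data. The data frame `dat`, the seed, and the coefficients are assumptions made for illustration; `target` and `x1` mirror the names in the question:

```r
# Simulated 0/1 data; names and coefficients are illustrative assumptions.
set.seed(42)
x1 <- rnorm(300)
target <- rbinom(300, size = 1, prob = plogis(-0.5 + 1.2 * x1))
dat <- data.frame(target, x1)

# Logistic regression: binomial response, logit link by default.
fit_binom <- glm(target ~ x1, data = dat, family = binomial)
# Gaussian family with identity link reproduces ordinary least squares.
fit_gauss <- glm(target ~ x1, data = dat, family = gaussian)
fit_ols <- lm(target ~ x1, data = dat)

all.equal(coef(fit_gauss), coef(fit_ols))  # same coefficients as lm()

# type = "response" belongs to predict(): it returns probabilities
# for the binomial fit rather than values on the logit scale.
p_hat <- predict(fit_binom, newdata = data.frame(x1 = c(-3, 0, 3)),
                 type = "response")
p_hat
```

The binomial fit always returns predicted probabilities between 0 and 1, while the gaussian fit is simply a straight line through the 0/1 values.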

Answered by Aghila on December 15, 2021
