Cross Validated Asked by user1272262 on February 16, 2021
What is the exact formula used in R lm() for the Adjusted R-squared? How can I interpret it?
There seem to be several formulas for calculating adjusted R-squared.

I had previously thought, and read widely, that adjusted R-squared penalizes for adding additional variables to the model. Now the use of these different formulas seems to call for different interpretations. I also looked at a related question on Stack Overflow (What is the difference between Multiple R-squared and Adjusted R-squared in a single-variate least squares regression?) and at the Wharton School's statistical dictionary at UPenn.
What formula does lm() in R (see ?lm) use for the adjusted R-squared?

As already mentioned, typing summary.lm will give you the code that R uses to calculate the adjusted R-squared. Extracting the most relevant line you get:
ans$adj.r.squared <- 1 - (1 - ans$r.squared) * ((n - df.int)/rdf)
which corresponds in mathematical notation to:
$$R^2_{adj} = 1 - (1 - R^2) \frac{n-1}{n-p-1}$$
assuming that there is an intercept (i.e., df.int = 1), where $n$ is your sample size and $p$ is your number of predictors. Thus, your error degrees of freedom (i.e., rdf) equals $n - p - 1$.
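As a quick sanity check, here is a minimal sketch (my own illustration, using the built-in mtcars data) that recomputes the adjusted R-squared by hand from this formula and compares it to what summary() reports:

# Fit a model with p = 2 predictors on the built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)

n <- nrow(mtcars)   # sample size
p <- 2              # number of predictors (wt, hp)

# The formula above, as used by summary.lm when the model has an intercept
adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)

all.equal(adj_manual, s$adj.r.squared)   # TRUE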
The formula corresponds to what Yin and Fan (2001) label Wherry Formula-1 (there is apparently another, less common Wherry formula that uses $n-p$ in the denominator instead of $n-p-1$). They suggest its most common names, in order of occurrence, are "Wherry formula", "Ezekiel formula", "Wherry/McNemar formula", and "Cohen/Cohen formula".
$R^2_{adj}$ aims to estimate $\rho^2$, the proportion of variance explained in the population by the population regression equation. While this is clearly related to sample size and the number of predictors, which estimator is best is less clear. Thus, there are simulation studies such as Yin and Fan (2001) that have evaluated different adjusted R-squared formulas in terms of how well they estimate $\rho^2$ (see this question for further discussion).
You will see with all the formulas that the difference between $R^2$ and $R^2_{adj}$ gets smaller as the sample size increases, approaching zero as the sample size tends to infinity. The difference also gets smaller with fewer predictors.
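To make the convergence concrete, here is a small simulation sketch (my own assumed setup, with p = 3 standard-normal predictors), showing the gap shrinking as n grows:

set.seed(1)
gap <- function(n, p = 3) {
  x <- matrix(rnorm(n * p), n, p)
  y <- drop(x %*% rep(0.5, p)) + rnorm(n)   # drop() keeps y a plain vector
  s <- summary(lm(y ~ x))
  s$r.squared - s$adj.r.squared             # gap between R^2 and adjusted R^2
}
sapply(c(20, 100, 1000), gap)               # gap shrinks towards 0 as n grows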
$R^2_{adj}$ is an estimate of $\rho^2$, the proportion of variance explained by the true regression equation in the population. You would typically be interested in $\rho^2$ when you are interested in the theoretical linear prediction of a variable. In contrast, if you are more interested in prediction using the sample regression equation, as is often the case in applied settings, then some form of cross-validated $R^2$ would be more relevant.
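If you want the cross-validated flavour, here is a minimal sketch of a leave-one-out cross-validated $R^2$ (my own illustration; the helper cv_r2 is hypothetical, not part of any package):

# Leave-one-out cross-validated R^2: refit without each observation,
# predict it, and compare predictive error to total variance
cv_r2 <- function(formula, data) {
  y <- model.response(model.frame(formula, data))
  pred <- vapply(seq_len(nrow(data)), function(i) {
    fit <- lm(formula, data = data[-i, ])
    predict(fit, newdata = data[i, , drop = FALSE])
  }, numeric(1))
  1 - sum((y - pred)^2) / sum((y - mean(y))^2)
}
cv_r2(mpg ~ wt + hp, mtcars)

For lm specifically, the same leave-one-out quantity can be computed without refitting, via the PRESS statistic: sum((residuals(fit) / (1 - hatvalues(fit)))^2).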
Answered by Jeromy Anglim on February 16, 2021
Regarding your first question: if you don't know how something is calculated, look at the code! If you type summary.lm in your console, you get the code for this function. If you skim through the code you'll find the line:

ans$adj.r.squared <- 1 - (1 - ans$r.squared) * ((n - df.int)/rdf)

A few lines above it you will notice that:

ans$r.squared is your $R^2$
n is the number of residuals, i.e. the number of observations
df.int is 0 or 1, depending on whether the model has an intercept
rdf is your residual degrees of freedom

Question 2: From Wikipedia: 'Adjusted $R^2$ is a modification of $R^2$ that adjusts for the number of explanatory terms in a model.'
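To see df.int in action, here is a small sketch (my own example, using mtcars) comparing a model with and without an intercept:

n <- nrow(mtcars)

s1 <- summary(lm(mpg ~ wt, data = mtcars))       # intercept: df.int = 1, rdf = n - 2
1 - (1 - s1$r.squared) * (n - 1) / (n - 2)       # equals s1$adj.r.squared

s0 <- summary(lm(mpg ~ wt - 1, data = mtcars))   # no intercept: df.int = 0, rdf = n - 1
1 - (1 - s0$r.squared) * n / (n - 1)             # equals s0$adj.r.squared

Note that without an intercept, R also computes $R^2$ relative to zero rather than the mean, so the two $R^2$ values are not directly comparable.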
Answered by EDi on February 16, 2021