Cross Validated Asked by user1272262 on February 16, 2021
What is the exact formula used in R lm() for the Adjusted R-squared? How can I interpret it?
There seem to be several formulas for calculating adjusted R-squared.

I had previously thought, and read widely, that adjusted R-squared penalizes for adding additional variables to the model. Now the use of these different formulas seems to call for different interpretations. I also looked at a related question on Stack Overflow (What is the difference between Multiple R-squared and Adjusted R-squared in a single-variate least squares regression?) and at the Wharton School's statistical dictionary at UPenn.
What formula does lm() in R (see ?lm) use for the adjusted R-squared?

As already mentioned, typing summary.lm will give you the code that R uses to calculate the adjusted R-squared. Extracting the most relevant line you get:
ans$adj.r.squared <- 1 - (1 - ans$r.squared) * ((n - df.int)/rdf)
which corresponds in mathematical notation to:
$$R^2_{adj} = 1 - (1 - R^2) \frac{n-1}{n-p-1}$$
assuming that there is an intercept (i.e., df.int = 1), where $n$ is your sample size and $p$ is your number of predictors. Thus, your error degrees of freedom (i.e., rdf) equals $n - p - 1$.
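As a quick sanity check, here is a minimal sketch (my own illustration, using the built-in mtcars data) that recomputes the adjusted R-squared by hand from this formula and compares it to what summary() reports:

# Fit a model with p = 2 predictors on the built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)

n <- nrow(mtcars)   # sample size
p <- 2              # number of predictors (wt, hp)

# The formula above, as used by summary.lm when the model has an intercept
adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)

all.equal(adj_manual, s$adj.r.squared)   # TRUE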
The formula corresponds to what Yin and Fan (2001) label Wherry Formula-1 (there is apparently another, less common Wherry formula that uses $n-p$ in the denominator instead of $n-p-1$). They suggest its most common names, in order of occurrence, are "Wherry formula", "Ezekiel formula", "Wherry/McNemar formula", and "Cohen/Cohen formula".
$R^2_{adj}$ aims to estimate $\rho^2$, the proportion of variance explained in the population by the population regression equation. While this is clearly related to sample size and the number of predictors, which estimator is best is less clear. Thus, there are simulation studies such as Yin and Fan (2001) that have evaluated different adjusted R-squared formulas in terms of how well they estimate $\rho^2$ (see this question for further discussion).
You will see with all the formulas that the difference between $R^2$ and $R^2_{adj}$ gets smaller as the sample size increases, approaching zero as the sample size tends to infinity. The difference also gets smaller with fewer predictors.
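To make the convergence concrete, here is a small simulation sketch (my own assumed setup, with p = 3 standard-normal predictors), showing the gap shrinking as n grows:

set.seed(1)
gap <- function(n, p = 3) {
  x <- matrix(rnorm(n * p), n, p)
  y <- drop(x %*% rep(0.5, p)) + rnorm(n)   # drop() keeps y a plain vector
  s <- summary(lm(y ~ x))
  s$r.squared - s$adj.r.squared             # gap between R^2 and adjusted R^2
}
sapply(c(20, 100, 1000), gap)               # gap shrinks towards 0 as n grows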
$R^2_{adj}$ is an estimate of $\rho^2$, the proportion of variance explained by the true regression equation in the population. You would typically be interested in $\rho^2$ when you are interested in the theoretical linear prediction of a variable. In contrast, if you are more interested in prediction using the sample regression equation, as is often the case in applied settings, then some form of cross-validated $R^2$ would be more relevant.
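If you want the cross-validated flavour, here is a minimal sketch of a leave-one-out cross-validated $R^2$ (my own illustration; the helper cv_r2 is hypothetical, not part of any package):

# Leave-one-out cross-validated R^2: refit without each observation,
# predict it, and compare predictive error to total variance
cv_r2 <- function(formula, data) {
  y <- model.response(model.frame(formula, data))
  pred <- vapply(seq_len(nrow(data)), function(i) {
    fit <- lm(formula, data = data[-i, ])
    predict(fit, newdata = data[i, , drop = FALSE])
  }, numeric(1))
  1 - sum((y - pred)^2) / sum((y - mean(y))^2)
}
cv_r2(mpg ~ wt + hp, mtcars)

For lm specifically, the same leave-one-out quantity can be computed without refitting, via the PRESS statistic: sum((residuals(fit) / (1 - hatvalues(fit)))^2).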
Answered by Jeromy Anglim on February 16, 2021
Regarding your first question: if you don't know how something is calculated, look at the code! If you type summary.lm in your console, you get the code for this function. If you skim through the code you'll find the line:

ans$adj.r.squared <- 1 - (1 - ans$r.squared) * ((n - df.int)/rdf)

A few lines above it you will notice that:

ans$r.squared is your $R^2$
n is the number of residuals, i.e. the number of observations
df.int is 0 or 1, depending on whether the model has an intercept
rdf is your residual degrees of freedom

Question 2: From Wikipedia: 'Adjusted $R^2$ is a modification of $R^2$ that adjusts for the number of explanatory terms in a model.'
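To see df.int in action, here is a small sketch (my own example, using mtcars) comparing a model with and without an intercept:

n <- nrow(mtcars)

s1 <- summary(lm(mpg ~ wt, data = mtcars))       # intercept: df.int = 1, rdf = n - 2
1 - (1 - s1$r.squared) * (n - 1) / (n - 2)       # equals s1$adj.r.squared

s0 <- summary(lm(mpg ~ wt - 1, data = mtcars))   # no intercept: df.int = 0, rdf = n - 1
1 - (1 - s0$r.squared) * n / (n - 1)             # equals s0$adj.r.squared

Note that without an intercept, R also computes $R^2$ relative to zero rather than the mean, so the two $R^2$ values are not directly comparable.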
Answered by EDi on February 16, 2021