Cross Validated Asked by Joanne Cheung on December 30, 2020

I’m trying a relaxed lasso logistic regression by first using sklearn’s cross validation to find an optimal penalty parameter (C = 1/lambda). Then, I use that parameter to fit statsmodel’s logit model to the data (lambda = 1/C). At this step, I removed coefficients that are really small (< 1e-5). When I performed cross validation again on the reduced feature set to estimate my second penalty parameter, it ends up being stronger than the first penalty. This doesn’t seem right to me, since the second step should be to find better estimates for features that you’ve already identified as non-zero from the first feature reduction step (so the lambda should be weaker).

I’ve also noticed that a lot of p values for the coefficients in the statsmodel results are NaN, even though the model has converged. Does this happen because there is high multicollinearity among my features? I had assumed that the Lasso would be able to handle collinearity. Perhaps my cross validation isn’t providing the best penalty parameter?

From Elements of Statistical Learning:

…the lasso shrinkage causes the estimates of the non-zero coefficients to be biased towards zero, and in general they are not consistent. One approach for reducing this bias is to run the lasso to identify the set of non-zero coefficients, and then fit an unrestricted linear model to the selected set of features. This is not always feasible, if the selected set is large. Alternatively, one can use the lasso to select the set of non-zero predictors, and then apply the lasso again, but using only the selected predictors from the first step. This is known as the relaxed lasso (Meinshausen, 2007). The idea is to use cross-validation to estimate the initial penalty parameter for the lasso, and then again for a second penalty parameter applied to the selected set of predictors. Since Statistical consistency means as the sample size grows, the estimates converge to the true values. The variables in the second step have less “competition” from noise variables, cross-validation will tend to pick a smaller value for λ, and hence their coefficients will be shrunken less than those in the initial estimate.

It looks like the summary table for logit always gives a NaN standard error for parameters whose coefficient estimate is shrunk to zero. Is this where you are seeing NaN standard errors? If so, these are not really NaN's, it's just that there isn't a good way to come up with a standard error for parameters that are shrunk to zero (or at least one has not been implemented here). I'm also not sure how the SE's for the nonzero coefficients are computed here.

```
Logit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 1000
Model: Logit Df Residuals: 998
Method: MLE Df Model: 1
Date: Thu, 28 Feb 2019 Pseudo R-squ.: 0.1600
Time: 20:48:32 Log-Likelihood: -582.11
converged: True LL-Null: -692.99
LLR p-value: 3.763e-50
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0 nan nan nan nan nan
x1 0.8917 0.079 11.355 0.000 0.738 1.046
x2 -0.4602 0.073 -6.284 0.000 -0.604 -0.317
x3 0 nan nan nan nan nan
==============================================================================
```

Answered by Kerby Shedden on December 30, 2020

Are you using statsmodels GLM or statsmodels Logit? They are equivalent, but the scaling of the penalty parameter for regularized fitting is different. Below is an example of how to get equivalent results from these two procedures and from sklearn.

```
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
n = 1000
x = np.random.normal(size=(n, 3))
lp = x[:, 0] - x[:, 1] / 2
pr = 1 / (1 + np.exp(-lp))
y = (np.random.uniform(size=n) < pr).astype(np.int)
df = pd.DataFrame({"y": y, "x1": x[:, 0], "x2": x[:, 1], "x3": x[:, 2]})
alpha = 10
model1 = sm.GLM.from_formula("y ~ x1 + x2 + x3", family=sm.families.Binomial(),
data=df)
result1 = model1.fit_regularized(alpha=alpha/n, L1_wt=1)
model2 = sm.Logit.from_formula("y ~ x1 + x2 + x3", data=df)
result2 = model2.fit_regularized(alpha=alpha)
x0 = np.concatenate((np.ones((n, 1)), x), axis=1)
model3 = LogisticRegression(C=1/alpha, penalty='l1')
result3 = model3.fit(x0, y)
print(result1.params, result2.params, result3.coef_)
```

Answered by Kerby Shedden on December 30, 2020

Get help from others!

Recent Answers

- Peter Machado on Why fry rice before boiling?
- haakon.io on Why fry rice before boiling?
- Joshua Engel on Why fry rice before boiling?
- Lex on Does Google Analytics track 404 page responses as valid page views?
- Jon Church on Why fry rice before boiling?

Recent Questions

- How can I transform graph image into a tikzpicture LaTeX code?
- How Do I Get The Ifruit App Off Of Gta 5 / Grand Theft Auto 5
- Iv’e designed a space elevator using a series of lasers. do you know anybody i could submit the designs too that could manufacture the concept and put it to use
- Need help finding a book. Female OP protagonist, magic
- Why is the WWF pending games (“Your turn”) area replaced w/ a column of “Bonus & Reward”gift boxes?

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP