Correlations, p-values and feature selection

Question

Using a correlation matrix, I got the following results:
Count_words          -0.098857
Count_numbers        -0.008305
Count_symbols        -0.025853
Count_question       -0.031649
Count_equal           0.224223
Count_characters      0.09

I used this line of code (in case you are familiar with Python):
    df.drop("Target", axis=1).apply(lambda x: x.corr(df.Target))
If I understand correctly, these results suggest that there is no correlation between the variables considered.
Since I would like to add these variables (or some of them) to a model already built with other (textual) features, I would like to know whether I can include all of them, given that they are not correlated with each other and that the p-value is less than 0.05. My concern is that the results above may not make sense, or may not indicate that these variables can be used in the model.
I hope you can give me some advice on this. Thanks.

Erwan · Accepted Answer

The fact that a feature has low correlation with the target variable shows that it's not a good indicator on its own, but that doesn't mean that it can't be useful for the model when combined with the other features.
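To illustrate the point about combinations, here is a minimal sketch on made-up data (the arrays x1, x2 and target are hypothetical and have nothing to do with the asker's dataset). The target is the XOR of two binary features, so each feature alone has no linear correlation with it, while the two together determine it completely.

```python
import numpy as np

# Hypothetical toy data: every combination of two binary features, repeated 25 times.
x1 = np.tile([0, 0, 1, 1], 25)
x2 = np.tile([0, 1, 0, 1], 25)

# The target is 1 only when the two features differ (XOR).
target = x1 ^ x2

# Pearson correlation of each feature, taken on its own, with the target.
r_x1 = np.corrcoef(x1, target)[0, 1]
r_x2 = np.corrcoef(x2, target)[0, 1]

# A rule that looks at both features at once recovers the target.
combined_prediction = (x1 != x2).astype(int)
accuracy = (combined_prediction == target).mean()

print(f"corr(x1, target) = {r_x1:.3f}")
print(f"corr(x2, target) = {r_x2:.3f}")
print(f"accuracy using both features = {accuracy:.3f}")
```

This is an extreme, constructed case. Real features rarely behave this cleanly, but it shows why a per-feature correlation alone cannot rule a feature out.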
The only way to know if these features are useful is to use them to train a model, then evaluate on a validation set and see if it improves performance.
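The comparison described above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the data is synthetic, `existing` stands in for the features already in the model, `extra` stands in for the new count features, and logistic regression from scikit-learn is just one possible model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Hypothetical stand-ins for the real data.
existing = rng.normal(size=(n, 3))   # features already used by the model
extra = rng.normal(size=(n, 2))      # candidate features to add

# Synthetic target that, by construction, depends on one feature from each group.
signal = existing[:, 0] + 0.8 * extra[:, 0]
y = (signal + rng.normal(scale=0.5, size=n) > 0).astype(int)


def validation_accuracy(X, y):
    """Train on one part of the data and return accuracy on a held-out part."""
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.25, random_state=0, stratify=y
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return model.score(X_val, y_val)


baseline = validation_accuracy(existing, y)
augmented = validation_accuracy(np.hstack([existing, extra]), y)

print(f"validation accuracy without the extra features: {baseline:.3f}")
print(f"validation accuracy with the extra features:    {augmented:.3f}")
```

Here the extra features help because the synthetic target was built to depend on them. On real data the second number may be equal or lower, and that outcome is exactly what this check is meant to reveal. Cross-validation instead of a single split would give a more stable comparison.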

the p-value is less than 0.05

Is this the result of a correlation significance test? It depends on the test, but in general a p-value lower than 0.05 means that the result is statistically significant, i.e. in this case it probably means that the correlation is truly not zero. That says nothing about how strong the correlation is: with enough data, even a very weak correlation can be significant. Anyway, imho this wouldn't prove anything with respect to using these features or not.
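For reference, one common way to get a p-value alongside each correlation is `scipy.stats.pearsonr`, which tests the null hypothesis that the true correlation is zero. The sketch below is only an illustration: the helper `correlation_with_pvalues` and the columns of the `demo` frame are hypothetical, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr


def correlation_with_pvalues(df, target_col):
    """Return Pearson r and its two-sided p-value for each feature vs. the target."""
    rows = {}
    for col in df.columns.drop(target_col):
        r, p = pearsonr(df[col], df[target_col])
        rows[col] = {"r": r, "p_value": p}
    return pd.DataFrame.from_dict(rows, orient="index")


# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
n = 10_000
weak = rng.normal(size=n)
unrelated = rng.normal(size=n)
target = 0.1 * weak + rng.normal(size=n)  # weak dependence on one feature only

demo = pd.DataFrame(
    {"weak_feature": weak, "unrelated_feature": unrelated, "Target": target}
)

result = correlation_with_pvalues(demo, "Target")
print(result)
```

With this many rows, the weakly related feature should come out with a small r and a p-value far below 0.05, which is the situation where "significant" and "useful" are easy to confuse.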
