TransWikia.com

Correlations, p-values and features selection

Data Science Asked on April 21, 2021

Using a correlation matrix, I got the following results:

Count_words          -0.098857
Count_numbers        -0.008305
Count_symbols        -0.025853
Count_question       -0.031649
Count_equal           0.224223
Count_characters      0.09

I used this line of code (in case you are familiar with Python): df.drop("Target", axis=1).apply(lambda x: x.corr(df.Target))

If I understand correctly, the results above suggest that there is no correlation between the variables considered.
Since I would like to add these variables (or some of them) to a model already built with other (textual) features, I would like to know whether I can include all of them, given that they are not correlated with each other and that the p-value is less than 0.05. My concern is that the results above may not make sense, or may not indicate that these variables can be used in the model.
I hope you can give me some advice on that. Thanks.

One Answer

The fact that a feature has low correlation with the target variable shows that it's not a good indicator on its own, but that doesn't mean that it can't be useful for the model when combined with the other features.
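To illustrate that point, here is a minimal sketch on synthetic data. The columns `a` and `b` and the XOR-style target are made up for the example and have nothing to do with the features in the question; the correlation line is the same one the question uses. Each feature alone has a correlation close to zero with the target, yet the two together determine it exactly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Two synthetic binary features (hypothetical names) and an XOR-style target.
df = pd.DataFrame({
    "a": rng.integers(0, 2, size=n),
    "b": rng.integers(0, 2, size=n),
})
df["Target"] = df["a"] ^ df["b"]

# Per-feature correlation with the target, computed as in the question.
corr_with_target = df.drop("Target", axis=1).apply(lambda x: x.corr(df["Target"]))
print(corr_with_target)

# Combining the two features recovers the target on every row.
combined_accuracy = ((df["a"] != df["b"]).astype(int) == df["Target"]).mean()
print(combined_accuracy)
```

This is an extreme case built on purpose, but it shows why a low individual correlation is not enough to discard a feature.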

The only way to know if these features are useful is to use them to train a model, then evaluate on a validation set and see if it improves performance.
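A rough sketch of that comparison, assuming scikit-learn and a binary target. The data is a synthetic stand-in (random arrays named `X_existing` and `X_counts`), and the logistic regression and F1 score are only placeholders for whatever model and metric are actually in use.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def validation_f1(X, y, seed=0):
    """Fit on a training split and return F1 on a held-out validation split."""
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    return f1_score(y_val, model.predict(X_val))


# Synthetic stand-in data (assumption): 5 "existing" features, 6 "count" features.
rng = np.random.default_rng(0)
n = 2000
X_existing = rng.normal(size=(n, 5))
X_counts = rng.normal(size=(n, 6))

# The target depends on one feature from each group, so there is something to detect.
signal = X_existing[:, 0] + 0.5 * X_counts[:, 4]
y = (signal + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Same split and same model both times; only the feature set changes.
score_without = validation_f1(X_existing, y)
score_with = validation_f1(np.hstack([X_existing, X_counts]), y)
print(f"F1 without count features: {score_without:.3f}")
print(f"F1 with count features:    {score_with:.3f}")
```

Keeping the split fixed means the two scores differ only because of the added features. With a small validation set, repeating the comparison over several seeds or using cross-validation gives a more reliable picture.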

"the p-value is less than 0.05"

Is this the result of a correlation significance test? It depends on the test, but in general a p-value below 0.05 means that the result is statistically significant, i.e. in this case it probably means that the correlation is truly nonzero. In any case, in my opinion this wouldn't prove anything about whether or not to use these features.
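If the p-value does come from a correlation test, one common choice is `scipy.stats.pearsonr`, which returns the correlation coefficient together with a p-value for the null hypothesis of zero correlation. The sketch below uses synthetic data (the 0.1 coefficient and the sample size are arbitrary assumptions) to show that a weak correlation can still be highly significant when the sample is large.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n = 5000

# Synthetic feature with a weak but real linear link to the target (assumption).
feature = rng.normal(size=n)
target = 0.1 * feature + rng.normal(size=n)

# pearsonr returns the coefficient and the two-sided p-value for H0: correlation = 0.
r, p_value = pearsonr(feature, target)
print(f"r = {r:.3f}, p-value = {p_value:.2e}")
```

So significance says the correlation is unlikely to be exactly zero; it says nothing about how strong the relationship is or how much the feature helps a model.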

Correct answer by Erwan on April 21, 2021
