Should features be correlated or uncorrelated for classification?

Question

I have seen researchers using pearson's correlation coefficient to find out the relevant features --  to keep the features that have a high correlation value with the target. The implication is that the correlated features contribute more information in finding out the target in classification problems. Whereas, we remove the features which are redundant and have very negligible correlation value.

Q1) Should highly correlated features with the target variable be included or removed from classification problems ? Is there a better/elegant explanation to this step?

Q2) How do we know that the dataset is linear when there are multiple variables involved? What does it mean by dataset being linear?

Q3) How to check for feature importance for non-linear case?

Erwan · Accepted Answer

Q1) Should highly correlated features with the target variable be included or removed from classification and regression problems? Is there a better/elegant explanation to this step?

Actually there's no strong reason either to keep or remove features which have a low correlation with the target response, other than reducing the number of features if necessary:

It is correct that correlation is often used for feature selection. Feature selection is used for dimensionality reduction purposes, i.e. mostly to avoid overfitting due to having too many features / not enough instances (it's a bit more complex than this but that's the main idea). My point is that there's little to no reason to remove features if the number of features is not a problem, but if it is a problem then it makes sense to keep only the most informative features, and high correlation is an indicator of "informativeness" (information gain is another common measure to select features).
In general feature selection methods based on measuring the contribution of individual features are used because they are very simple and don't require complex computations. However they are rarely optimal because they don't take into account the complementarity of groups of features together, something that most supervised algorithms can use very well. There are more advanced methods available which can take this into account: the most simple one is a brute-force method which consists in repeatedly measuring the performance (usually with cross-validation) with any possible subset of features... But that can take a lot of time for a large set of features.

However features which are highly correlated together (i.e. between features, not with the target response), should usually be removed because they are redundant and some algorithms don't deal very well with those. It's rarely done systematically though, because again this involves a lot of calculations.

Q2) How do we know that the dataset is linear when there are multiple variable involved? What does it mean by dataset being linear?

It's true that correlation measures are based on linearity assumptions, but that's rarely the main issue: as mentioned above it's used as an easy indicator of "amount of information" and it's known to be imperfect anyway, so the linearity assumption is not so crucial here.

A dataset would be linear if the response variable can be expressed as a linear equation of the features (i.e. in theory one would obtain near-perfect performance with a linear regression).

Q3) How to do feature importance for nonlinear case?

Information gain, KL divergence, and probably a few other measures. But using these to select features individually is also imperfect.

Desmond · Answer

for feature engineering there are different methods.

Pearson Correlation comes under Filter methods. Filter methods gives intuition on the high level. This can be the first step for feature engineering. In this process

the features having high correlation with target should be considered.
the features having high correlation among themselves should also be removed as, "they are acting two independent variables doing same work" then why keep both.

After considering the correlation approaches you can also dig in to the Wrapper based methods which are more robust for feature selection but that includes the burden of training process.

Refer this for introduction to the different approaches.

Subhash C. Davar · Answer

Given that several  correlation measures are in vogue. A high correlation does not guarantee  a substantive relation. Test it before inclusion in model                     2. Linear or nonlinear relationship needs an examination of individual variables. Some variables are likely  to have either a linear or nonlinear relationship with target  variable. Remaining variables may not have any relationship with target variable. 3. Your question is too vague and checking for features importance has nothing to  do with non-
linearity.

Should features be correlated or uncorrelated for classification?

3 Answers

Add your own answers!

Ask a Question