TransWikia.com

When combined correlation of features decreases

Data Science Asked by Caldass_ on August 28, 2020

I’m building a machine learning model in Python to predict soccer player values. I’m trying to predict a "player_value" column containing the value of a specific player. Consider a sample of the columns (features) I’m using.

appearances | goals | goals_per_game
------------|-------|---------------
     20     |   2   |     0.1
     60     |  20   |     0.33
     54     |  30   |     0.55
     43     |  15   |     0.34
     30     |  17   |     0.56

I thought the correct way to use those columns would be to create a goals-per-game statistic (goals divided by appearances), since one player can have more goals than another simply because he has played more matches.

After that, the correlation of these columns with the column I’m trying to predict (player value) decreased. The "goals" and "appearances" columns each had a correlation of about 45% with the player value column, while the new "goals_per_game" column has a correlation of only around 18%.

Should I use the "appearances" and "goals" columns individually and not use the "goals_per_game" column? Is my analysis wrong, and does it not make sense to use a "goals_per_game" metric, since the correlation with the player value is higher when using those features individually?
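The comparison described in the question can be reproduced along these lines. This is only a minimal sketch: the `appearances` and `goals` values are the five sample rows from the table, but the `player_value` numbers are invented purely so the code runs, so the printed correlations say nothing about the real dataset.

```python
import numpy as np

# Rows from the sample table in the question.
appearances = np.array([20, 60, 54, 43, 30], dtype=float)
goals = np.array([2, 20, 30, 15, 17], dtype=float)

# HYPOTHETICAL target values, made up for illustration only.
player_value = np.array([1.0, 12.0, 15.0, 6.0, 5.0])

# Derived ratio feature: goals divided by appearances.
goals_per_game = goals / appearances

features = {
    "appearances": appearances,
    "goals": goals,
    "goals_per_game": goals_per_game,
}

# Pearson correlation of each feature with the target.
correlations = {
    name: float(np.corrcoef(column, player_value)[0, 1])
    for name, column in features.items()
}

for name, r in correlations.items():
    print(f"{name:>15}: r = {r:+.2f}")
```

Pearson correlation only measures a linear, one-feature-at-a-time relationship, so a ratio feature can score lower here and still be useful to a model that sees it alongside other features.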

2 Answers

The only metric you can use to assess the added value of each feature is the predictive accuracy of your model. You can simply include all three feature variables and build your model from those. Then you can remove each of the $3$ features in turn and build the three possible models from the remaining $2$ features (for example, 'appearances' and 'goals', 'appearances' and 'goals_per_game', and so forth). Comparing these models with the full one gives you the added value of each of the three features to your predictive accuracy. Note that 'goals_per_game' is computed from 'goals' and 'appearances', so these features are not independent of each other.

This approach to feature selection is called sequential backward search. If you have more than $10$ features, more advanced algorithms such as floating search and MCMC are likely to yield better-performing subsets of features. In your case, with only $3$ feature variables, sequential backward search should work fine.
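One step of the backward search described above could be sketched as follows. Everything here is an assumption made for illustration: the data is synthetic (randomly generated, not real player data), the model is plain least-squares linear regression, and cross-validated RMSE stands in for "accuracy" because player value is a continuous target. The helper name `cv_rmse` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# SYNTHETIC data, generated only so the example runs.
appearances = rng.integers(5, 61, size=n).astype(float)
scoring_rate = rng.beta(2, 6, size=n)
goals = rng.binomial(appearances.astype(int), scoring_rate).astype(float)
goals_per_game = goals / appearances
player_value = 0.1 * appearances + 0.5 * goals + rng.normal(0, 2, size=n)

columns = {
    "appearances": appearances,
    "goals": goals,
    "goals_per_game": goals_per_game,
}

# Fixed 5-fold split so every feature subset is scored on the same folds.
folds = np.array_split(rng.permutation(n), 5)


def cv_rmse(feature_names):
    """Cross-validated RMSE of a linear model using the given features."""
    X = np.column_stack([np.ones(n)] + [columns[f] for f in feature_names])
    squared_errors = []
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        coef, *_ = np.linalg.lstsq(
            X[train_mask], player_value[train_mask], rcond=None
        )
        residuals = player_value[test_idx] - X[test_idx] @ coef
        squared_errors.append(residuals ** 2)
    return float(np.sqrt(np.mean(np.concatenate(squared_errors))))


# Score the full model, then each model with one feature removed.
all_features = list(columns)
results = {tuple(all_features): cv_rmse(all_features)}
for dropped in all_features:
    kept = [f for f in all_features if f != dropped]
    results[tuple(kept)] = cv_rmse(kept)

for subset, rmse in sorted(results.items(), key=lambda item: item[1]):
    print(f"{rmse:6.3f}  {', '.join(subset)}")
```

If dropping a feature leaves the error essentially unchanged, that feature adds little on top of the other two; if the error rises clearly, the feature carries information the others do not.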

Correct answer by Match Maker EE on August 28, 2020

I believe (I am not a soccer expert) there will be many players who have a lot of match experience but not a very high number of goals per match.
Most probably those players play in defence, or they may be goalkeepers. So this new feature will definitely not align well with the player's value.

There is another statistic, "assists", that is used in soccer to evaluate performance.
A categorical feature for the player's position can also help.

Since you are not saving many dimensions, you can keep all three features and try.
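The suggestion of a categorical position feature could be encoded roughly like this. It is a sketch under stated assumptions: the position labels are hypothetical and were assigned to the question's five sample rows only for illustration, and one-hot encoding is just one common way to feed a category to a model.

```python
import numpy as np

# goals_per_game values from the sample table in the question.
goals_per_game = np.array([0.1, 0.33, 0.55, 0.34, 0.56])

# HYPOTHETICAL position labels, invented for illustration.
positions = np.array(
    ["defender", "midfielder", "forward", "midfielder", "forward"]
)

# Map each label to an integer code, then to a one-hot row.
categories, codes = np.unique(positions, return_inverse=True)
one_hot = np.eye(len(categories))[codes]

# Feature matrix: the numeric feature followed by one column per position.
X = np.column_stack([goals_per_game, one_hot])

print(categories)
print(X)
```

With a linear model that includes an intercept, one of the position columns is usually dropped to avoid perfectly collinear features.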

Answered by 10xAI on August 28, 2020
