Data Exploration Led Conversion to Ordinal Variable

Question

[I encountered this question at an interview a few weeks ago and I am still not clear on the answer.]

If all the values in a categorical column fuel_mileage come from the set {poor, good, very_good}, then we can make the column ordinal, because {poor, good, very_good} have a universal, ordered relationship. This much is fairly obvious.
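For concreteness, here is a minimal sketch of what such an ordinal encoding might look like in pandas. The sample rows and the fuel_mileage_code column name are made up for illustration; only the category names come from the question.

```python
import pandas as pd

# Hypothetical sample rows; only the category names come from the question.
df = pd.DataFrame({"fuel_mileage": ["good", "poor", "very_good", "good"]})

# Declare the order explicitly so that poor < good < very_good.
order = ["poor", "good", "very_good"]
df["fuel_mileage"] = pd.Categorical(
    df["fuel_mileage"], categories=order, ordered=True
)

# Integer codes follow the declared order: poor -> 0, good -> 1, very_good -> 2.
df["fuel_mileage_code"] = df["fuel_mileage"].cat.codes
print(df)
```

The order here is stated by hand because it follows from the meaning of the values, not from anything in the data.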

However, imagine the label column in this same dataset is engine_longevity, so that we are studying all other variables in the context of their relationship with it. During data exploration, it turns out that another categorical column, manufacturer, all of whose values come from the set {H, S, J, K}, has a very strong correlation with the label engine_longevity, so much so that the choice of H, S, J, or K in a given sample essentially dictates the label. Therefore, as far as this dataset is concerned, H, S, J, K have an ordered relationship with respect to the label engine_longevity. The question is:

Would you make the manufacturer column ordinal? If yes, how strong should the relationship between manufacturer and the label engine_longevity be? And what metric would you use to measure it?
If you would not make the manufacturer column ordinal, why not?
More generally, should the choice of making a column ordinal come from the mutual relationship of the values within that column alone? Or should the relationship of the values in a column with the label also be taken into consideration?

If there is no hard-and-fast rule, I would like to know how the community here will approach this situation.

Brian Spiering · Answer

You are describing highly correlated features. The most common way to measure the correlation between two variables measured on at least an ordinal scale is the Spearman rank-order correlation coefficient.
Generally, if two features are nearly perfectly correlated, one of them can be dropped from the analysis.
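As a rough sketch of how that measurement might look, the snippet below computes Spearman's coefficient with scipy.stats.spearmanr. The data values are invented, engine_longevity is assumed to be numeric, and the manufacturer_rank column is a hypothetical helper. Because manufacturer has no inherent order, the sketch ranks manufacturers by their mean label value, which is an assumption of this example and not something the question or answer prescribes.

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical data: the numbers are invented for illustration only.
df = pd.DataFrame({
    "manufacturer": ["H", "H", "S", "S", "J", "J", "K", "K"],
    "engine_longevity": [10.0, 11.0, 7.0, 8.0, 4.0, 5.0, 1.0, 2.0],
})

# Assumption: a nominal column has no order of its own, so rank each
# manufacturer by its mean label value (1 = lowest mean longevity).
means = df.groupby("manufacturer")["engine_longevity"].mean()
rank = means.rank(method="dense").astype(int)
df["manufacturer_rank"] = df["manufacturer"].map(rank)

# Spearman rank-order correlation between the derived rank and the label.
rho, p_value = spearmanr(df["manufacturer_rank"], df["engine_longevity"])
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3g}")
```

Note that an order derived from the label will tend to look strongly correlated with that same label, so a coefficient computed this way is best checked on held-out data before reading much into it.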
