Should I apply PCA on the entire dataset or just the nominal values?

Question

I have a dataset with about 14 attributes, roughly half of them nominal. I've used a binary vectorizer to convert the nominal values into binary attributes. The number of attributes naturally ballooned; I'm sitting on roughly 50 at the moment. I've looked at using PCA to reduce this number.

As far as I can tell from what I've been reading, I should exclude my target variable from the analysis. But I'm not sure whether I should perform PCA on the whole remaining dataset (including the values that were already numerical, like 'age'), or only on the values I converted from nominal to numerical, and then add the result back to the already numeric values.

To clarify, I've already converted this dataset from nominal to binary, and I'm not sure if I should apply PCA to just the generated binary columns or to the entire thing.

Toby · Answer

Applying PCA to a dataset with nominal values is not advised. You can do it, but PCA works by re-expressing variables as directions in a continuous space, and it is hard to define a meaningful spatial relationship between nominal values. For example, how would one quantify the distance between 'male' and 'female', or 'white' and 'red', or 'PC' and 'mobile phone'?
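To make the point about distances concrete, here is a minimal sketch in plain Python. The colour categories and variable names are hypothetical and chosen only for illustration. It suggests that whatever "distance" appears between nominal values comes from the encoding rather than from the data: integer codes impose an arbitrary ordering, while binary (one-hot) codes place every pair of categories equally far apart.

```python
import math

# Hypothetical nominal attribute; the category names are illustrative only.
colours = ["white", "red", "blue"]

def one_hot(value, categories):
    """Encode a category as a binary indicator vector."""
    return [1.0 if value == c else 0.0 for c in categories]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Integer coding: distances depend on the arbitrary order of the list.
int_code = {c: float(i) for i, c in enumerate(colours)}
int_dist_white_red = abs(int_code["white"] - int_code["red"])
int_dist_white_blue = abs(int_code["white"] - int_code["blue"])

# Binary (one-hot) coding: every pair of distinct categories is equally far apart.
oh_dist_white_red = euclidean(one_hot("white", colours), one_hot("red", colours))
oh_dist_white_blue = euclidean(one_hot("white", colours), one_hot("blue", colours))

print(int_dist_white_red, int_dist_white_blue)
print(oh_dist_white_red, oh_dist_white_blue)
```

With integer codes, 'blue' ends up twice as far from 'white' as 'red' is, purely because of list order. With binary codes, both distances are the same, so the geometry PCA sees carries no information about how similar the categories really are.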

Some alternatives are:

Use a tree model, like random forest, which handles nominal values easily.
Use FactoMineR in R.

David Marx · Answer

For categorical attributes, use correspondence analysis rather than PCA. Since you tagged this "pandas", here's a Python package.
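As a rough illustration of what correspondence analysis computes, independent of any particular package, here is a minimal NumPy sketch of simple correspondence analysis on a two-way contingency table. The function name `correspondence_analysis` and the counts are hypothetical. It takes the SVD of the standardized residuals from the independence model, which is the textbook formulation and not necessarily how any given library implements it.

```python
import numpy as np

def correspondence_analysis(table):
    """Simple correspondence analysis of a two-way contingency table.

    Returns row principal coordinates, column principal coordinates,
    and the principal inertias (squared singular values).
    """
    N = np.asarray(table, dtype=float)
    P = N / N.sum()              # correspondence matrix (relative frequencies)
    r = P.sum(axis=1)            # row masses
    c = P.sum(axis=0)            # column masses
    expected = np.outer(r, c)    # frequencies expected under independence
    # Standardized residuals: deviation from independence, scaled by the masses.
    S = (P - expected) / np.sqrt(expected)
    U, sing, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * sing) / np.sqrt(r)[:, None]
    col_coords = (Vt.T * sing) / np.sqrt(c)[:, None]
    return row_coords, col_coords, sing ** 2

# Hypothetical counts: rows are devices ('PC', 'mobile phone'),
# columns are colours ('white', 'red', 'blue').
counts = [[30, 10, 5],
          [10, 25, 20]]

rows, cols, inertia = correspondence_analysis(counts)
print(rows[:, 0])      # row coordinates on the first axis
print(cols[:, 0])      # column coordinates on the first axis
print(inertia.sum())   # total inertia, equal to chi-square divided by n
```

For a dataset with several nominal attributes, the usual extension is multiple correspondence analysis, which applies the same decomposition to the binary indicator matrix.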
