
Should I apply PCA on the entire dataset or just the nominal values?

Data Science Asked by Matt J on June 29, 2021

I have a dataset with about 14 attributes, roughly half of them nominal. I’ve used a binary vectorizer to convert the nominal values into binary attributes. The number of attributes naturally ballooned; I’m at roughly 50 at the moment. I’ve looked at using PCA to reduce this number.

As far as I understand from what I’ve been reading, I should exclude my target variable from the analysis. But I’m not sure whether I should perform PCA on the whole remaining dataset (including the values that were already numerical, like ‘age’) or only on the values I converted from nominal to numerical, and then recombine the result with the already numeric values.

To clarify, I’ve already converted this data-set from nominal to binary, and I’m not sure if I should apply PCA to just the binary columns generated, or the entire thing.

2 Answers

Applying PCA to a dataset with nominal values is not advised. You can do it, but PCA projects variables into a continuous space in which distances are assumed to be meaningful, and it is hard to define such distances between nominal values. For example, how would one quantify the distance between 'male' and 'female', or 'white' and 'red', or 'PC' and 'mobile phone'?

Some alternatives are

  1. Use a tree model, like random forest, which will easily handle nominal values.
  2. Use FactoMineR in R.
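
As a rough illustration of the first alternative, here is a minimal sketch using scikit-learn (an assumption; the answer does not name a library). scikit-learn's random forest still expects numeric input, so the nominal columns are one-hot encoded inside the pipeline, but no PCA step is involved. The toy data and column names are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: column names and values are made up for illustration.
df = pd.DataFrame({
    "age": [23, 45, 31, 52, 37, 29, 41, 60],
    "device": ["PC", "mobile", "PC", "mobile", "PC", "mobile", "PC", "mobile"],
    "colour": ["red", "white", "red", "red", "white", "white", "red", "white"],
    "target": [0, 1, 0, 1, 0, 1, 0, 1],
})
X = df.drop(columns="target")  # the target is kept out of the features
y = df["target"]

# One-hot encode only the nominal columns; numeric columns pass through as-is.
nominal_cols = ["device", "colour"]
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), nominal_cols)],
    remainder="passthrough",
)

# The forest is fitted directly on the encoded features, with no PCA in between.
model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])
model.fit(X, y)
predictions = model.predict(X)
```

Keeping the encoder inside the pipeline means the same encoding is applied at training and prediction time.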

Answered by Toby on June 29, 2021

For categorical attributes, use correspondence analysis rather than PCA. Since you tagged this "pandas", here's a Python package.
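
To give a feel for what correspondence analysis computes, here is a minimal NumPy sketch of simple correspondence analysis on a two-way contingency table. It is not the API of the package mentioned above; the function name `correspondence_analysis` and the toy counts are hypothetical. Multiple correspondence analysis applies the same decomposition to the indicator (one-hot) matrix of several nominal variables.

```python
import numpy as np

def correspondence_analysis(table):
    """Simple correspondence analysis via SVD of standardized residuals."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                      # correspondence matrix (proportions)
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    expected = np.outer(r, c)            # proportions expected under independence
    S = (P - expected) / np.sqrt(expected)  # standardized residuals
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * sigma) / np.sqrt(r)[:, None]     # row principal coordinates
    col_coords = (Vt.T * sigma) / np.sqrt(c)[:, None]  # column principal coordinates
    inertia = sigma ** 2                 # principal inertias per dimension
    return row_coords, col_coords, inertia

# Hypothetical counts: rows are one nominal attribute, columns another.
counts = np.array([
    [20, 5, 10],
    [8, 15, 7],
    [4, 6, 25],
])
row_coords, col_coords, inertia = correspondence_analysis(counts)
explained = inertia / inertia.sum()      # share of total inertia per dimension
```

The leading columns of `row_coords` and `col_coords` play the role that principal component scores play in PCA, but the distances between categories are chi-square distances derived from the counts rather than Euclidean distances between arbitrary numeric codes.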

Answered by David Marx on June 29, 2021
