Data Science Asked by Matt J on June 29, 2021
I have a dataset with roughly 14 attributes, about half of them nominal. I’ve used a binary vectorizer to convert the nominal values into binary attributes. Naturally, the number of attributes ballooned; I’m sitting on roughly 50 at the moment. I’ve looked at using PCA to reduce this number.
As far as I can understand from what I’ve been reading, I should exclude my target variable from the analysis. But I’m not sure whether I should perform PCA on the whole remaining dataset (including the values that were already numerical, like ‘age’) or only on the columns I converted from nominal to numerical, and then re-add the result to the already numeric columns.
To clarify, I’ve already converted this data-set from nominal to binary, and I’m not sure if I should apply PCA to just the binary columns generated, or the entire thing.
Applying PCA to a dataset with nominal values is not advised. You can do it, but PCA treats each variable as a coordinate in a numeric space, and it is hard to define meaningful distances between nominal values in such a space. For example, how would one quantify the distance between 'male' and 'female', 'white' and 'red', or 'PC' and 'mobile phone'?
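To make the distance problem concrete, here is a minimal sketch (not taken from the question's data) comparing two encodings of one nominal variable. The category 'tablet' and the integer codes are assumptions added purely for illustration.

```python
import numpy as np

# Hypothetical nominal variable with three unordered categories
# ('tablet' is an illustrative addition, not from the question).
categories = ["PC", "mobile phone", "tablet"]

# Integer coding: the codes are arbitrary, yet they imply an ordering
# and unequal distances between categories.
int_codes = np.array([0.0, 1.0, 2.0])
int_dist = np.abs(int_codes[:, None] - int_codes[None, :])

# One-hot (binary) coding: one column per category.
one_hot = np.eye(len(categories))
diff = one_hot[:, None, :] - one_hot[None, :, :]
one_hot_dist = np.sqrt((diff ** 2).sum(axis=-1))

print("Integer-coded distances:\n", int_dist)
print("One-hot distances:\n", one_hot_dist)
```

With integer codes, 'PC' ends up twice as far from 'tablet' as from 'mobile phone', which is purely a side effect of the coding. With one-hot columns every pair of distinct categories is equally far apart, so the geometry PCA sees carries no information about how the categories relate to each other.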
Some alternatives are
Answered by Toby on June 29, 2021
For categorical attributes, use correspondence analysis rather than PCA. Since you tagged this "pandas", here's a Python package.
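Independently of any particular package, the core of simple correspondence analysis can be sketched with NumPy alone: it is an SVD of the standardized residuals of a contingency table. The function name `correspondence_analysis` and the example counts below are hypothetical, and this is an illustrative sketch rather than a replacement for a maintained library.

```python
import numpy as np

def correspondence_analysis(table, n_components=2):
    """Simple correspondence analysis of a non-negative table.

    Assumes every row and every column has a non-zero total.
    Returns row coordinates, column coordinates, and the inertia
    (squared singular value) of every axis.
    """
    N = np.asarray(table, dtype=float)
    P = N / N.sum()              # correspondence matrix
    r = P.sum(axis=1)            # row masses
    c = P.sum(axis=0)            # column masses
    expected = np.outer(r, c)    # proportions expected under independence
    # Standardized residuals: (p_ij - r_i * c_j) / sqrt(r_i * c_j)
    S = (P - expected) / np.sqrt(expected)
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    k = n_components
    # Principal coordinates of the rows and columns
    row_coords = (U[:, :k] * sigma[:k]) / np.sqrt(r)[:, None]
    col_coords = (Vt.T[:, :k] * sigma[:k]) / np.sqrt(c)[:, None]
    return row_coords, col_coords, sigma ** 2

# Hypothetical counts: rows are one nominal attribute, columns another.
counts = np.array([
    [20,  5, 10],
    [ 8, 15,  7],
    [ 6,  9, 20],
])

row_coords, col_coords, inertia = correspondence_analysis(counts)
print("Row coordinates:\n", row_coords)
print("Column coordinates:\n", col_coords)
print("Share of inertia per axis:", inertia / inertia.sum())
```

The total inertia equals the table's chi-squared statistic divided by the number of observations, so the axes decompose the association between categories rather than variance in an arbitrary numeric coding. With more than two nominal attributes, the same computation applied to the one-hot indicator matrix (rows as observations) is the usual starting point for multiple correspondence analysis.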
Answered by David Marx on June 29, 2021