How to structure my data into features and targets for PCA on Big Data?

Question

I want to apply the PCA algorithm from Scikit-Learn (https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html). I got a little lost at the part where I have to separate the features and the targets. On the Iris dataset this is easy and I understood it, but now I am facing a real dataset. My problem is that part of the original data was meant to be stored as an array, but the database could not hold arrays, so the data was split into columns. As a result, some columns refer to the same thing. Please consider the example below.

To be more specific, I designed an example dataframe which is almost the same as mine.

Explained dataset: On a given date, at a specific place, the same four kids were measured. Robin, Ted, Lilly and Jamie are their names. Each number next to their names refers to their height, weight, and other specific measurements. The database could not keep an array of measurements (height, weight, etc.) for each child, so the data from the array was split into columns (Robin_1, Robin_2, Robin_3, etc. for each child).

Specific problem: I do not know how to structure my features/targets. I selected the Date (which is also the index), Country, City and School as the features. On them I apply PCA with 3 principal components (for a 3D plot). How should I select the target? Is there any way to group the columns before clustering? In the final result, I want one cluster for the full Robin, one for the full Ted, one for the full Lilly and so on. I have exactly four kids, but the number of attributes differs per kid (e.g. Robin: 2, Ted: 10, Lilly: 7, etc.), so I expect four clusters.
P.S.: I cannot cluster them by the height, weight, etc. Those are only example attributes.

[Image: plot of the desired final result]

Any help would be useful, and I would highly appreciate any effort in helping me with this problem. I have been reading a lot but I have not met this problem before. If there is another way to structure or consider the columns, please let me know. Please also let me know if I need to be more explicit about any part, or if I failed to explain something important.

Has QUIT--Anony-Mousse · Answer

First of all, if the "weight" etc. attributes should not be used for clustering, then don't include them. It makes your question hard to follow. On the other hand, if you don't include them, there is obviously no difference between the kids... So I'm not certain you have understood what you want, need to do, or ask for.

PCA is not a magic tool that makes your data clusterable! It is a very specific tool to decorrelate data that should only be used when you have strong, known, linear correlations in your data that you need to ignore. It will not work when you have non-linear correlations (e.g., income and number of yachts), or when your data is not continuous on a linear scale. In particular, you shouldn't use it on your first attributes such as date, state, etc.
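
To illustrate what "decorrelate" means here, the following is a minimal sketch on synthetic data (not the dataset from the question). The two columns, the noise level and the random seed are all assumptions made up for the illustration. It only shows the case PCA is suited to: continuous features with a strong linear correlation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic, made-up data: two continuous features with a strong linear relation.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
X = np.column_stack([x, 2.0 * x + rng.normal(scale=0.1, size=500)])

# Standardize first so that both features are on a comparable scale.
X_scaled = StandardScaler().fit_transform(X)

# Keep both components so the effect on the correlation can be inspected.
pca = PCA(n_components=2)
Z = pca.fit_transform(X_scaled)

# Correlation between the two columns before and after the projection.
corr_before = np.corrcoef(X_scaled, rowvar=False)[0, 1]
corr_after = np.corrcoef(Z, rowvar=False)[0, 1]

print("correlation before PCA:", corr_before)
print("correlation after PCA:", corr_after)
print("explained variance ratio:", pca.explained_variance_ratio_)
```

The projected columns are uncorrelated, and almost all of the variance ends up in the first component. Nothing comparable can be expected from columns such as Country, City or School, which are categories rather than values on a linear scale.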

What you will want to do first (and this has nothing to do with PCA or clustering) is to pivot your data to separate the kids. Consider transforming your messy table into "tidy data" form first; this will make such cleanup much more structured and easier to do.
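
As a rough sketch of that reshaping step with pandas: the table below is invented to mimic the layout described in the question (the values, the place names and the `Name_N` column pattern are assumptions), and `melt` is just one way to get to a long, tidy form.

```python
import pandas as pd

# Hypothetical wide table; the real one has more kids and more columns.
wide = pd.DataFrame({
    "Date": ["2019-01-01", "2019-01-02"],
    "Country": ["US", "US"],
    "City": ["NYC", "NYC"],
    "School": ["A", "A"],
    "Robin_1": [1.0, 1.1],
    "Robin_2": [2.0, 2.1],
    "Ted_1": [3.0, 3.1],
    "Ted_2": [4.0, 4.1],
    "Ted_3": [5.0, 5.1],
})

id_cols = ["Date", "Country", "City", "School"]

# Wide -> long: one row per (date, place, original column).
tidy = wide.melt(id_vars=id_cols, var_name="column", value_name="value")

# Split "Robin_1" into kid="Robin" and measurement=1 (assumes the Name_N pattern).
tidy[["kid", "measurement"]] = tidy["column"].str.rsplit("_", n=1, expand=True)
tidy["measurement"] = tidy["measurement"].astype(int)
tidy = tidy.drop(columns="column")

# Each kid can now be selected or grouped on its own,
# even though the kids have different numbers of measurements.
measurements_per_kid = tidy.groupby("kid")["measurement"].nunique()

print(tidy.head())
print(measurements_per_kid)
```

In this form the kid is an ordinary column, so the differing number of attributes per kid stops being a problem of table layout.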
