Customer Segmentation: Should I use a variable, representing a product, that is unpopular in the dataset for K-Means Clustering?

Question

I am working with a data set that, besides customer age and income, gives the balance each customer holds in different types of bank accounts: Checking, Shares, Investment, Savings, Deposit, Mortgage, Loan, and Certificates. For accounts other than Checking, a balance of 0 means the customer does not have that account. There are 9,800 customer observations, with roughly 6,000 checking accounts and 4,000 savings accounts. Each of the other account types has fewer than 300 observations.

I have to use K-Means clustering for the segmentation, with the objective of understanding how customers use savings and investment offerings, and I am using the Elbow Method to choose the number of clusters. I am unsure whether to use a variable like Investment, which has just 250 observations, alongside one like Savings, which has 4,000. If I do include such variables, they are heavily positively skewed, and I'm not sure K-Means handles that well. Can someone advise whether to include such variables or not?

bonez001 · Answer

I suggest you use an algorithm that accommodates categorical variables, since there are missing data. You can one-hot encode the account variables so that a missing account becomes informative. Treating a missing account as a zero balance will be misleading.
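As a rough illustration of the encoding idea, here is a minimal sketch that adds a 0/1 ownership flag next to each balance, so "no account" is recorded explicitly instead of being read as a zero balance. The sample customers, the subset of columns, and the `has_` prefix are all made up for the example and are not from the original data set.

```python
# Hypothetical sample rows; per the question, a balance of 0 means
# the customer does not hold that account type.
customers = [
    {"Savings": 1200.0, "Investment": 0.0, "Mortgage": 0.0},
    {"Savings": 0.0, "Investment": 5000.0, "Mortgage": 150000.0},
    {"Savings": 300.0, "Investment": 0.0, "Mortgage": 0.0},
]

# Assumed subset of the account columns, for illustration only.
ACCOUNT_COLUMNS = ["Savings", "Investment", "Mortgage"]


def add_ownership_flags(row, columns=ACCOUNT_COLUMNS):
    """Return a copy of row with a has_<account> 0/1 flag per account column."""
    flagged = dict(row)
    for col in columns:
        flagged[f"has_{col}"] = 1 if row[col] != 0 else 0
    return flagged


encoded = [add_ownership_flags(row) for row in customers]

for row in encoded:
    print(row)
```

The flags can then be used on their own as categorical features, or alongside the balances, depending on which algorithm you pick.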

Try algorithms like t-SNE and Self-Organizing Maps, and use the Jaccard/Tanimoto distance.
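For reference, the Jaccard distance between two customers can be computed from the sets of account types each one holds: one minus the size of the intersection divided by the size of the union. A minimal sketch, assuming each customer is represented by a hypothetical set of account names (how you feed the resulting distances into a particular algorithm depends on the library you use):

```python
def jaccard_distance(accounts_a, accounts_b):
    """Jaccard distance between two collections of account types.

    Returns 0.0 when the sets are identical and 1.0 when they share nothing.
    Two empty sets are treated as identical (distance 0.0), which is a
    convention chosen for this sketch.
    """
    a, b = set(accounts_a), set(accounts_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical customers described by the accounts they hold.
alice = {"Checking", "Savings"}
bob = {"Checking", "Investment"}

# One shared account out of three distinct ones: 1 - 1/3.
print(round(jaccard_distance(alice, bob), 3))  # 0.667
```

Because the distance only looks at which accounts exist, rarely held products such as Investment still contribute to the comparison without their skewed balances dominating it.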
