K-Means clustering for mixed numeric and categorical data

Data Science Asked by IharS on February 24, 2021

My data set contains a number of numeric attributes and one categorical.

Say, NumericAttr1, NumericAttr2, ..., NumericAttrN, CategoricalAttr,

where CategoricalAttr takes one of three possible values: CategoricalAttrValue1, CategoricalAttrValue2 or CategoricalAttrValue3.

I’m using the default k-means clustering algorithm implementation in Octave.
It works with numeric data only.

So my question: is it correct to split the categorical attribute CategoricalAttr into three numeric (binary) variables, like IsCategoricalAttrValue1, IsCategoricalAttrValue2, IsCategoricalAttrValue3 ?

13 Answers

The standard k-means algorithm isn't directly applicable to categorical data, for various reasons. The sample space for categorical data is discrete, and doesn't have a natural origin. A Euclidean distance function on such a space isn't really meaningful. As someone put it, "The fact a snake possesses neither wheels nor legs allows us to say nothing about the relative value of wheels and legs." (from here)

There's a variation of k-means known as k-modes, introduced in this paper by Zhexue Huang, which is suitable for categorical data. Note that the solutions you get are sensitive to initial conditions, as discussed here (PDF), for instance.

Huang's paper (linked above) also has a section on "k-prototypes" which applies to data with a mix of categorical and numeric features. It uses a distance measure which mixes the Hamming distance for categorical features and the Euclidean distance for numeric features.
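
To make this concrete, here is a minimal Python sketch of that kind of mixed dissimilarity. The function name, the feature split, and the weight gamma are illustrative assumptions, not Huang's reference implementation:

```python
import numpy as np

def mixed_distance(x_num, x_cat, y_num, y_cat, gamma=1.0):
    """Dissimilarity in the spirit of Huang's k-prototypes: squared Euclidean
    distance on the numeric part plus gamma times the number of categorical
    mismatches (simple matching / Hamming)."""
    numeric_part = float(np.sum((np.asarray(x_num) - np.asarray(y_num)) ** 2))
    categorical_part = sum(a != b for a, b in zip(x_cat, y_cat))
    return numeric_part + gamma * categorical_part

# Toy usage: two observations with two numeric attributes and one categorical one.
print(mixed_distance([1.0, 2.5], ["Value1"], [0.5, 2.0], ["Value2"], gamma=0.7))
```

The weight gamma controls the trade-off between the numeric and categorical contributions; Huang's paper discusses how to choose it (for example, relative to the spread of the numeric attributes).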

A Google search for "k-means mix of categorical data" turns up quite a few more recent papers on various algorithms for k-means-like clustering with a mix of categorical and numeric data. (I haven't yet read them, so I can't comment on their merits.)


Actually, what you suggest (converting categorical attributes to binary values, and then doing k-means as if these were numeric values) is another approach that has been tried before (predating k-modes). (See Ralambondrainy, H. 1995. A conceptual version of the k-means algorithm. Pattern Recognition Letters, 16:1147–1157.) But I believe the k-modes approach is preferred for the reasons I indicated above.

Correct answer by Tim Goodman on February 24, 2021

You can also give the Expectation Maximization clustering algorithm a try. It can work on categorical data and will give you a statistical likelihood of which categorical value (or values) a cluster is most likely to take on.

Answered by user490 on February 24, 2021

In my opinion, there are solutions for dealing with categorical data in clustering. R comes with a distance designed for this situation, called the Gower distance, and it works pretty well.
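
For illustration, here is a minimal Python sketch of a Gower-style dissimilarity. The helper function is hypothetical and the column names simply follow the question's layout; in R, the same measure is available, for instance, through daisy() in the cluster package:

```python
import numpy as np
import pandas as pd

def gower_distance(a, b, num_cols, cat_cols, ranges):
    """Gower dissimilarity between two rows: numeric attributes contribute
    |a - b| / range, categorical ones contribute 0 (match) or 1 (mismatch);
    the result is the average over all attributes."""
    parts = [abs(a[c] - b[c]) / ranges[c] for c in num_cols]
    parts += [0.0 if a[c] == b[c] else 1.0 for c in cat_cols]
    return float(np.mean(parts))

# Hypothetical frame shaped like the question's data.
df = pd.DataFrame({
    "NumericAttr1": [1.0, 3.0, 2.0],
    "NumericAttr2": [10.0, 30.0, 20.0],
    "CategoricalAttr": ["Value1", "Value2", "Value1"],
})
num_cols, cat_cols = ["NumericAttr1", "NumericAttr2"], ["CategoricalAttr"]
ranges = {c: df[c].max() - df[c].min() for c in num_cols}
print(gower_distance(df.iloc[0], df.iloc[1], num_cols, cat_cols, ranges))
```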

Answered by adesantos on February 24, 2021

This question seems really about representation, and not so much about clustering.

Categorical data is a problem for most algorithms in machine learning. Suppose, for example, you have some categorical variable called "color" that could take on the values red, blue, or yellow. If we simply encode these numerically as 1,2, and 3 respectively, our algorithm will think that red (1) is actually closer to blue (2) than it is to yellow (3). We need to use a representation that lets the computer understand that these things are all actually equally different.

One simple way is to use what's called a one-hot representation, and it's exactly what you thought you should do. Rather than having one variable like "color" that can take on three values, we separate it into three variables. These would be "color-red," "color-blue," and "color-yellow," which all can only take on the value 1 or 0.

This increases the dimensionality of the space, but now you could use any clustering algorithm you like. It does sometimes make sense to z-score or whiten the data after doing this, but your idea is definitely reasonable.
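
As a rough sketch of that workflow in Python (assuming pandas and scikit-learn; the column names are placeholders taken from the question):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical frame shaped like the question's data.
df = pd.DataFrame({
    "NumericAttr1": [1.2, 0.7, 3.1, 2.8],
    "NumericAttr2": [10.0, 12.5, 9.3, 11.1],
    "CategoricalAttr": ["Value1", "Value2", "Value3", "Value1"],
})

# One-hot encode the categorical column into 0/1 indicator columns.
encoded = pd.get_dummies(df, columns=["CategoricalAttr"])

# Optionally z-score everything so no single column dominates the Euclidean distance.
X = StandardScaler().fit_transform(encoded)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```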

Answered by Jordan A on February 24, 2021

(In addition to the excellent answer by Tim Goodman)

K-modes is definitely the way to go if you care about the stability of the clustering algorithm.

  1. The clustering algorithm is free to choose any distance metric / similarity score. Euclidean is the most popular, but any other metric that scales according to the data distribution in each dimension/attribute can be used, for example the Mahalanobis metric. The metric you pick determines how the distance of data points from the cluster center is measured.

  2. With regards to mixed (numerical and categorical) clustering a good paper that might help is: INCONCO: Interpretable Clustering of Numerical and Categorical Objects

  3. Beyond k-means: Since plain vanilla k-means has already been ruled out as an appropriate approach to this problem, I'll venture beyond to the idea of thinking of clustering as a model-fitting problem. Different measures, like the information-theoretic Kullback-Leibler divergence, work well when trying to converge a parametric model towards the data distribution. (Of course, parametric clustering techniques like GMMs are slower than k-means, so there are drawbacks to consider; see the sketch at the end of this answer.)

  4. Fuzzy k-modes clustering also sounds appealing since fuzzy logic techniques were developed to deal with something like categorical data. See Fuzzy clustering of categorical data using fuzzy centroids for more information.

Also check out: ROCK: A Robust Clustering Algorithm for Categorical Attributes
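
As referenced in point 3 above, here is a minimal scikit-learn sketch of clustering as model fitting with a Gaussian mixture. Note that it still assumes numeric features, so categorical attributes would have to be encoded first; the data here are synthetic:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic numeric data; categorical attributes would need to be encoded first.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(5.0, 1.0, size=(50, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)        # hard cluster assignments
probs = gmm.predict_proba(X)   # soft (probabilistic) memberships
print(labels[:5], probs[:5].round(2))
```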

Answered by Dynamic Stardust on February 24, 2021

It depends on the categorical variable being used. For ordinal variables, say bad, average, and good, it makes sense to use a single variable with values 0, 1, 2, and distances are meaningful here (average is closer to bad and to good than bad is to good). However, if there is no inherent order, you should ideally use one-hot encoding as mentioned above.
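
A minimal pandas sketch of the ordinal case (the column name and ordering are made up for illustration):

```python
import pandas as pd

# Hypothetical ordinal column.
df = pd.DataFrame({"quality": ["bad", "good", "average", "bad"]})

# Map the ordered categories onto integers so distances respect the order.
order = {"bad": 0, "average": 1, "good": 2}
df["quality_encoded"] = df["quality"].map(order)
print(df)
```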

Answered by Ram on February 24, 2021

Consider a scenario where the categorical variable cannot reasonably be one-hot encoded, for example because it has 200+ categories.

In such cases you can use the R package clustMixType.

It can handle mixed data (numeric and categorical); you just feed in the data, and it automatically segregates the categorical and numeric columns.

If you find any issues, such as a numeric column being treated as categorical, you can use as.factor() (or, vice versa, as.numeric()) on that field to convert it, and feed the corrected data to the algorithm.

Calculate lambda, so that you can feed it in as input at the time of clustering.

You can even get a WSS (within sum of squares) plot (elbow chart) to find the optimal number of clusters.

Hope this answer helps you in getting more meaningful results.

Answered by Toros91 on February 24, 2021

You should not use k-means clustering on a dataset containing mixed datatypes. Rather, there are a number of clustering algorithms that can appropriately handle mixed datatypes. Some possibilities include the following:

1) Partitioning-based algorithms: k-Prototypes, Squeezer
2) Hierarchical algorithms: ROCK, Agglomerative single, average, and complete linkage
3) Density-based algorithms: HIERDENC, MULIC, CLIQUE
4) Model-based algorithms: SVM clustering, Self-organizing maps

If you would like to learn more about these algorithms, the manuscript 'Survey of Clustering Algorithms' written by Rui Xu offers a comprehensive introduction to cluster analysis.

Answered by Sam - Founder of AceAINow.com on February 24, 2021

K-Means' goal is to reduce the within-cluster variance, and because it computes the centroids as the mean point of a cluster, it is required to use the Euclidean distance in order to converge properly. Therefore, if you want to absolutely use K-Means, you need to make sure your data works well with it.

Representation

K-Means, and clustering in general, tries to partition the data in meaningful groups by making sure that instances in the same clusters are similar to each other. Therefore, you need a good way to represent your data so that you can easily compute a meaningful similarity measure.

Using one-hot encoding on categorical variables is a good idea when the categories are equidistant from each other. For instance, if you have the colours light blue, dark blue, and yellow, using one-hot encoding might not give you the best results, since dark blue and light blue are likely "closer" to each other than either is to yellow.

If the categorical values are not "equidistant" and can be ordered, you could also give the categories numerical values. For instance, kid, teenager, and adult could be represented as 0, 1, and 2. This would make sense because a teenager is "closer" to being a kid than an adult is.

K-Medoids

A more generic approach than K-Means is K-Medoids. K-Medoids works similarly to K-Means, but the main difference is that the center of each cluster is defined as the point that minimizes the within-cluster sum of distances. Enforcing this allows you to use any distance measure you want, so you can build your own custom measure that takes into account which categories should be considered close.
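
A possible sketch, assuming the scikit-learn-extra package is available and that you have already built a pairwise dissimilarity matrix D with whatever custom measure you like (e.g. a Gower-style mix of numeric and categorical contributions):

```python
import numpy as np
from sklearn_extra.cluster import KMedoids  # provided by the scikit-learn-extra package

# D is assumed to be a precomputed (n x n) pairwise dissimilarity matrix,
# built with any custom measure you like.
D = np.array([
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.1],
    [0.8, 0.9, 0.1, 0.0],
])

model = KMedoids(n_clusters=2, metric="precomputed", random_state=0).fit(D)
print(model.labels_)           # cluster assignment per observation
print(model.medoid_indices_)   # which observations serve as cluster centers
```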

Answered by Valentin Calomme on February 24, 2021

Mixture models can be used to cluster a data set composed of continuous and categorical variables.

You can use the R package VarSelLCM (available on CRAN), which models, within each cluster, the continuous variables with Gaussian distributions and the ordinal/binary variables with appropriate discrete distributions. Take care to store your data in a data.frame where continuous variables are "numeric" and categorical variables are "factor".

A tutorial is available here

Moreover, missing values can be managed by the model at hand.

Answered by user200668 on February 24, 2021

Many of the answers above point out that k-means can be applied to variables that mix categorical and continuous types. This is not quite right, and the results need to be taken with a pinch of salt.

As @Tim mentioned above, it doesn't make sense to compute the Euclidean distance between points that have neither a scale nor an order. When you one-hot encode the categorical variables, you generate a sparse matrix of 0s and 1s. Since the range of these values is fixed between 0 and 1, they need to be normalised in the same way as the continuous variables, and z-scores are then used to find the distances between points. Even this is still not perfectly right, which I will explain with an example. Because the categories are mutually exclusive, the distance between two points with respect to a categorical variable takes one of only two values, high or low: either the two points belong to the same category or they don't. Due to these extreme values, the algorithm ends up giving the encoded categorical variables more weight than the continuous variables in influencing cluster formation. This can be verified by a simple check of which variables are driving the clusters, and you'll be surprised to see that most of them are categorical. (See the link at the end of this answer for ways to find the most influential variables.)

An example: consider a categorical variable country. We know that the distance (dissimilarity) between observations from different countries is equal (assuming no other notion of similarity, such as neighbouring countries or countries from the same continent). But contrary to this, if you calculate the distances between observations after normalising the one-hot encoded values, they will be inconsistent (though the difference is minor), on top of the fact that they only take high or low values.

Ultimately, the best option available for Python is k-prototypes, which can handle both categorical and continuous variables.
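
For example, a minimal sketch using the kmodes package's KPrototypes (the data layout is hypothetical; the categorical argument lists the indices of the categorical columns):

```python
import numpy as np
from kmodes.kprototypes import KPrototypes

# Hypothetical mixed-type matrix: two numeric columns and one categorical column.
X = np.array([
    [1.0, 10.0, "Value1"],
    [1.2, 11.0, "Value1"],
    [5.0, 30.0, "Value2"],
    [5.3, 29.0, "Value3"],
], dtype=object)

kproto = KPrototypes(n_clusters=2, init="Cao", random_state=0)
labels = kproto.fit_predict(X, categorical=[2])  # column index 2 is categorical
print(labels)
print(kproto.cost_)  # within-cluster cost, useful for an elbow plot over n_clusters
```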

Finding most influential variables in cluster formation

Answered by Tarun Kumar Yellapu on February 24, 2021

I came across the very same problem and tried to work my head around it (without knowing that k-prototypes existed). The rich literature I found myself confronted with originated from the idea of not measuring all the variables with the same distance metric at all. Furthermore, there may exist various sources of information that imply different structures or "views" of the data. This is a natural problem whenever you deal with social relationships such as those on Twitter / websites etc.

One possible solution is to address each subset of variables (i.e. numerical and categorical) separately. It is easy to understand what a distance measure does on a numeric scale. Categorical data on its own can just as easily be understood: consider having binary observation vectors. The contingency table of 0/1 counts between two observation vectors contains lots of information about the similarity between those two observations. There is a rich literature on the various customized similarity measures for binary vectors, most of them starting from the contingency table.

Given both distance/similarity matrices, both describing the same observations, one can extract a graph from each of them (Multi-View Graph Clustering), or extract a single graph with multiple edges, where each node (observation) has as many edges to another node as there are information matrices (Multi-Edge Clustering), each edge carrying the weight of the corresponding similarity/distance measure. Start here: Github listing of Graph Clustering Algorithms & their papers. As there are multiple information sets available on a single observation, these must be interwoven, using e.g. descendants of spectral analysis or linked matrix factorization. Spectral analysis is the default method for finding highly connected or heavily weighted parts of a single graph. Once you have a spectral embedding of the interwoven data, any clustering algorithm for numerical data will work. The literature's default is k-means for the sake of simplicity, but far more advanced, and less restrictive, algorithms exist and can be used interchangeably in this context.
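
As a very rough sketch of the simplest version of this idea (the similarity choices and the equal view weights are assumptions on my part), one can build a similarity matrix per view, combine them, and hand the result to scikit-learn's spectral clustering, which performs the spectral embedding followed by k-means internally:

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import rbf_kernel

# Hypothetical data split into a numeric view and a categorical view.
X_num = np.array([[1.0, 2.0], [1.1, 2.1], [8.0, 9.0], [8.2, 8.9]])
X_cat = np.array([["a"], ["a"], ["b"], ["b"]])

# Similarity on the numeric view (RBF kernel) ...
S_num = rbf_kernel(X_num, gamma=0.5)
# ... and on the categorical view (fraction of matching attributes).
S_cat = (X_cat[:, None, :] == X_cat[None, :, :]).mean(axis=2)

# Combine the views into a single affinity matrix (equal weights assumed here).
S = 0.5 * S_num + 0.5 * S_cat

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(S)
print(labels)
```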

I liked the beauty and generality of this approach, as it is easily extendible to multiple information sets rather than mere dtypes, and it further respects the specific "measure" on each data subset. This does not relieve you from fine-tuning the model with various distance and similarity metrics or from scaling your variables (I found myself scaling the numerical variables to ratio scales in the context of my analysis).

From a scalability perspective, consider that there are mainly two problems:

  1. Eigenproblem approximation (where a rich literature of algorithms exists as well)
  2. Distance matrix estimation (a purely combinatorial problem that grows large very quickly; I haven't found an efficient way around it yet)

Have fun with it!

Answered by Tim Ruhkopf on February 24, 2021

You might want to look at automatic feature engineering. The method is based on Bourgain embedding and can be used to derive numerical features from mixed categorical and numerical data frames, or for any data set that supports distances between two data points. Having transformed the data to only numerical features, one can then use K-means clustering directly.

Answered by user42229 on February 24, 2021
