
What is the intuition of using clustering for performing feature engineering in machine learning tasks?

Data Science Asked by Aman Savaria on May 14, 2021

I am trying to implement the research paper Combining Boosted Trees with Metafeature Engineering for Predictive Maintenance. The paper has a section on meta-feature engineering in which the authors use hierarchical clustering to create features. The paper says:

The third method we used to analyze the outliers in the dataset is based on a
hierarchical Agglomerative Clustering algorithm [5].
Hierarchical Agglomerative Clustering starts with Z groups (Z being the
number of observations), each initially containing one object, and then at each
step it merges the two most similar groups until there is only one single group,
containing all data.
The rationale for this method is that the last observations to be merged
might still be significantly different from the group they are merged into. By
definition, outliers are different cases and will typically not fit well into a cluster,
unless that cluster is itself comprised of other outliers. Yet again, since these
are not ordinary data points, we do not expect them to form large groups.

I am unable to understand the authors' intuition behind doing this.
The problem I am trying to solve, and the one the paper addresses, is the IDA-2016 competition dataset. You can find more about the competition here

One Answer

Overall the paper is not very clear so there are a few uncertainties, but the general approach is this:

  • Their main idea is to create new features which represent the "outlyingness" of each instance. They use several different methods to detect outliers; however, they do not explain how exactly the new features are created.
  • One of the methods they use to detect outliers is based on hierarchical clustering: the result of such clustering is a binary tree in which the most distant clusters/instances are connected last, i.e. close to the root of the tree. Their assumption is that an outlier tends not to be close to any other instance or cluster, so it is connected last. While the method makes sense, it's not clear from the paper whether they retain only the very last instance as an outlier, or the last few instances (or some other variant).
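To make that assumption concrete, here is a minimal sketch with scipy. The paper does not specify a library or a linkage criterion, so `scipy.cluster.hierarchy.linkage` with average linkage and the planted-outlier toy data are my own choices, not the authors':

```python
# Sketch of clustering-based outlier detection: build the merge tree
# and find which original instance joins it last.
# Assumptions (not from the paper): scipy, average linkage, toy data.
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
# 20 ordinary points in a tight cluster, plus one far-away outlier.
X = np.vstack([rng.normal(0, 1, size=(20, 2)),
               [[10.0, 10.0]]])          # index 20 is the planted outlier

# Each row of Z records one merge step: (cluster_a, cluster_b, dist, size).
# Indices < len(X) are original observations; larger ones are merged groups.
Z = linkage(X, method="average")

# Scan the merge rows from the root downwards and find the first
# original observation, i.e. the instance that is connected last.
last_instance = None
for row in Z[::-1]:
    candidates = [int(i) for i in (row[0], row[1]) if int(i) < len(X)]
    if candidates:
        last_instance = candidates[0]
        break

print(last_instance)  # the planted outlier at index 20 joins the tree last
```

The final row of `Z` corresponds to the root of the tree; because the outlier sits far from the dense cluster, it remains a singleton until that very last merge.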

So the clustering and the feature creation are only indirectly related:

  1. The clustering is used to detect one or several outliers in the instances, and several other methods are used for the same purpose.
  2. Based on the detection of these outliers, one or several new features are created which contain a value describing the "outlier status" of the instance. The simplest option would be to create a single boolean feature which is true if the instance was detected as an outlier by any of the methods, but one can imagine more advanced options. For example, based on the hierarchical clustering one can obtain the order in which the instances are connected, and the rank of every instance can be used as a feature.
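The second step can be sketched as follows. The paper does not specify the encoding, so both the merge-rank feature and the boolean flag below are illustrative guesses, again using scipy with average linkage:

```python
# Hypothetical encoding of "outlier status" as features, derived from
# the merge order of hierarchical clustering (not the paper's exact scheme).
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(10, 2)),
               [[8.0, 8.0]]])            # index 10 is a planted outlier

Z = linkage(X, method="average")
n = len(X)

# merge_step[i] = the step at which original instance i is first merged
# into a larger group; a late step suggests the instance fits poorly.
merge_step = np.zeros(n, dtype=int)
for step, (a, b, _dist, _size) in enumerate(Z):
    for idx in (int(a), int(b)):
        if idx < n:
            merge_step[idx] = step

# merge_step itself can serve as a rank feature; a simple boolean
# variant flags only the instance(s) merged at the very last step.
is_outlier = merge_step == merge_step.max()
print(merge_step)
print(is_outlier)   # True only for the planted outlier at index 10
```

Either column (the integer rank or the boolean flag) could then be appended to the original feature matrix before training the boosted trees.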

Correct answer by Erwan on May 14, 2021
