Data Science Asked by Akash Dubey on January 12, 2021
I am trying to implement k-means clustering on 60-70 features and I came across a post for feature selection technique on quora by Julian Ramos, but I fail to understand few steps mentioned. I am also wondering if its the right method to select the best features for clustering?
These are the steps mentioned in the post :
Sf={∅} #Set of features selected, initially empty
- Perform k-means on each of the features individually for some k.
- For each cluster measure some clustering performance metric like the Dunn’s index or silhouette.
- Take the feature which gives you the best performance and add it to Sf
- Perform k-means on Sf and each of the remaining features individually
- Take the feature which gives you the best performance and add it to Sf
- If you have reached the desired number of features stop, else go back to 4
Also, how do we implement the same in python. I wish to write function for the same that selects best k
and implement all the other steps.
You might be interested in this paper and python implementation of various other feature selection for clustering tools and papers:
http://www.public.asu.edu/~huanliu/papers/pakdd00clu.pdf
https://github.com/danilkolikov/fsfc
An excerpt sumarizing the approach:
We address the problem of selecting a subset of important features for clus tering for the whole data and not just for clusters unlike in [1,2] This helps in knowing the important features before doing clustering and the clustering task becomes more ecient and focused as only the important features can be used Finding the important original features for the whole data helps in under standing the data better unlike principal components Data storage collection and processing tasks become more efficient and noise is reduced as the data is pruned
Our approach is a 2-step method we first rank and then select a subset of important features. Ranking of features is done according to their importance on clustering An entropy based ranking measure is introduced We then select a subset of features using a criterion function for clustering that is invariant with respect to different numbers of features A novel scalable method based on random sampling is introduced for large data commonly found in data mining applications
Additionally, this paper gives an overview of different methods available:
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.295.8115&rep=rep1&type=pdf
Answered by Jed on January 12, 2021
Often people confuse between unsupervised feature selection (UFS) and dimensionality reduction (DR) algorithms as the same. For instance, a famous DR algorithm is Principal Component Analysis (PCA) which is often confused as a UFS method! Researchers have suggested that PCA is a feature extraction algorithm and not feature selection because it transforms the original feature set into a subset of interrelated transformed features, which are difficult to emulate (Abdi & Williams, 2010).
An UFS approach present in literature is Principal Feature Analysis PFA. The way it works is given as; Steps:
A possible python implementation of PFA is given below;
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from collections import defaultdict
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.preprocessing import StandardScaler
import pandas as pd
# create some dummy data
df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
'num_wings': [2, 0, 0, 0],
'num_specimen_seen': [10, 2, 1, 8]},
index=['falcon', 'dog', 'spider', 'fish'])
print(df)
class PFA(object):
def __init__(self, n_features, q=None):
self.q = q
self.n_features = n_features
def fit(self, X):
if not self.q:
self.q = X.shape[1]
sc = StandardScaler()
X = sc.fit_transform(X)
pca = PCA(n_components=self.q).fit(X) # calculation Cov matrix is embeded in PCA
A_q = pca.components_.T
kmeans = KMeans(n_clusters=self.n_features).fit(A_q)
clusters = kmeans.predict(A_q)
cluster_centers = kmeans.cluster_centers_
dists = defaultdict(list)
for i, c in enumerate(clusters):
dist = euclidean_distances([A_q[i, :]], [cluster_centers[c, :]])[0][0]
dists[c].append((i, dist))
self.indices_ = [sorted(f, key=lambda x: x[1])[0][0] for f in dists.values()]
self.features_ = X[:, self.indices_]
# Usage
pfa = PFA(n_features=3)
pfa.fit(df)
# To get the transformed matrix
x = pfa.features_
print(x)
# To get the column indices of the kept features
column_indices = pfa.indices_
Results
num_legs num_wings num_specimen_seen
falcon 2 2 10
dog 4 0 2
spider 8 0 1
fish 0 0 8
[[-0.50709255 1.73205081 1.23942334]
[ 0.16903085 -0.57735027 -0.84802649]
[ 1.52127766 -0.57735027 -1.10895772]
[-1.18321596 -0.57735027 0.71756088]]
Reference
Abdi, H., & Williams, L. J. (2010). Principal component analysis. Wiley interdisciplinary reviews: computational statistics, 2(4), 433-459.
Answered by mnm on January 12, 2021
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP