Tool for clustering and cleansing a data set

Data Science: Asked by Matthew Gertner on August 16, 2021

I have a large-ish data set (400K records) composed of two fields (both strings). I am looking for a tool that will enable me to cluster the data, e.g. around the first column, using either exact matches or some kind of string-proximity function like Levenshtein distance. I would also like to be able to find all duplicate records and merge them into one.

OpenRefine looks ideal for my purposes but it is so slow when clustering my data or creating a text facet that it is unusable. Apparently this is a known issue.

I looked around but couldn’t find another tool that would enable me to explore a data set of this size, cluster, eliminate dupes, look for anomalies, etc. Can anyone recommend something that might fit the bill?

2 Answers

  • What do you mean by clustering the data around a column? You can use standard clustering models, such as k-Means or Gaussian Mixture Models, for an exploratory analysis of your dataset (they appear in the sketch after this list). However, I'm not sure this is what you're looking for, so please let me know.

  • I often study a new dataset by employing some dimensionality reduction technique. The most common is PCA, but I don't recommend it, since it can only extract latent factors that are linearly associated with your variables. You can use t-SNE models (available in sklearn) or, if you are familiar with Deep Learning, Autoencoders. Once your dataset is compressed/reduced, you can observe how different values/factors are distributed in it; see the sketch after this list. Dimensionality reduction is also very important in case your dataset suffers from high multicollinearity.

  • You can remove duplicates using pandas' drop_duplicates() function, explained here; a short sketch follows below.
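A minimal sketch of the exploratory route above, combining dimensionality reduction with k-Means. The file name "data.csv", the column name "name", the sample size, and the cluster count are all illustrative assumptions, not details from the question:

```python
# Minimal sketch: vectorize strings, reduce dimensionality, cluster.
# "data.csv" and the "name" column are hypothetical; adjust to your schema.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# t-SNE scales poorly, so work on a sample of the 400K records.
df = pd.read_csv("data.csv", dtype=str).sample(20_000, random_state=0)

# Character n-grams put similar spellings close together in feature space.
X = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit_transform(df["name"])

# Reduce with TruncatedSVD before t-SNE (a common preprocessing step).
X_svd = TruncatedSVD(n_components=50, random_state=0).fit_transform(X)
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X_svd)

# Exploratory clustering on the reduced space, e.g. with k-Means.
labels = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(X_svd)
print(X_2d[:5], labels[:5])  # 2-D coordinates to scatter-plot, plus cluster ids
```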
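And a short sketch of deduplication with drop_duplicates(), again assuming hypothetical column names "name" and "value":

```python
# Minimal sketch of dropping duplicate records with pandas.
import pandas as pd

df = pd.read_csv("data.csv", dtype=str)

# Drop rows duplicated across all columns, keeping the first occurrence.
deduped = df.drop_duplicates()

# Or treat rows as duplicates when only the first column matches.
deduped_by_name = df.drop_duplicates(subset=["name"], keep="first")

print(len(df), "->", len(deduped))
```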

Hope this helps, otherwise let me know.

Answered by Leevo on August 16, 2021

OpenRefine is the tool that you need. The clustering functionality is accessible without creating a text facet: the "Cluster & Edit" function (under "Edit Cells") takes you directly to the clustering dialog. The clustering itself is actually pretty fast, but until recently there were scalability issues with displaying large numbers of clusters/items. Last summer I introduced a cap on the number of choices displayed, which dramatically speeds things up: it went from ~200 seconds to ~10 seconds for my test case of 41K clusters containing 118K values. This will be available in OpenRefine 3.5, but in the meantime you can grab our snapshot releases.

If you do outgrow OpenRefine, you could implement the same algorithms that we use internally. String-similarity libraries should be easy to find for the language of your choice; then the "only" problem is to figure out how to block your records so that your O(n^2) pairwise distance measurements complete in a reasonable amount of time AND you don't miss any valid matches. Because this differs from data set to data set, OpenRefine gives the user control over the blocking strategy, but for a fixed data set you may be able to choose just the one that works best. A sketch of the idea follows.
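For concreteness, here is a minimal sketch of blocked pairwise matching in Python, using only the standard library. The blocking key, the similarity threshold, and the file name "data.csv" are illustrative assumptions; OpenRefine's own keying methods are more sophisticated:

```python
# Minimal sketch: block records on a cheap key, then run the O(n^2)
# similarity comparisons only within each block.
import csv
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def blocking_key(s: str) -> str:
    # Deliberately naive key: the lowercased first token. A real data
    # set needs a key tuned to how its values actually vary.
    tokens = s.lower().split()
    return tokens[0] if tokens else ""

blocks = defaultdict(set)
with open("data.csv", newline="") as f:
    for row in csv.reader(f):
        if row and row[0].strip():
            blocks[blocking_key(row[0])].add(row[0])

for key, values in blocks.items():
    # Pairwise comparisons stay quadratic, but only inside a block.
    for a, b in combinations(values, 2):
        if SequenceMatcher(None, a, b).ratio() > 0.9:
            print(f"candidate match: {a!r} ~ {b!r}")
```

The threshold of 0.9 is arbitrary; picking it, like picking the blocking key, is exactly the per-data-set tuning described above.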

Obviously, the degenerate case where the first field is identical is easier still and could be managed using *nix tools.

Answered by Tom Morris on August 16, 2021
