Data Science Asked by Imago on May 5, 2021
Recently I got interested in the process of data cleansing and specifically in record linkage.
Thus far I have read about deterministic and probabilistic approaches to deduplicating data sets, and to a lesser degree about machine learning methods. It struck me that the key part of all these algorithms is that they basically introduce a metric space: every two data points can be assigned a distance, and that distance is a measure of how closely the two data points are related to one another.
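As an illustration of such a metric, a common choice in record linkage is the Levenshtein (edit) distance between string fields. The following is a minimal sketch (the function name and the example names are my own, not from any particular library):

```python
def levenshtein(a: str, b: str) -> int:
    # Dynamic-programming edit distance: the minimum number of
    # single-character insertions, deletions, and substitutions
    # needed to turn string a into string b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]

# Near-duplicate names get a small distance, unrelated names a large one.
print(levenshtein("Jonathan Smith", "Jonathon Smith"))  # 1
print(levenshtein("Jonathan Smith", "Maria Garcia"))
```

A deduplication pipeline would then flag record pairs whose distance falls below some threshold as candidate duplicates.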
However, I wonder: are there also kinds of algorithms that do not work on this principle?
One option is fingerprinting. If two objects have the same fingerprint, they are probably the same object. Depending on the technique used, the fingerprint may not be able to detect approximate duplicates.
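A simple sketch of this idea, assuming a hash-based fingerprint over normalized record fields (the normalization steps here are illustrative, not a prescribed recipe):

```python
import hashlib

def fingerprint(record: dict) -> str:
    # Canonicalize the record: lowercase, strip whitespace, and sort
    # fields so that field order does not affect the hash. Real pipelines
    # would normalize more aggressively (abbreviations, accents, etc.).
    canonical = "|".join(
        f"{k}={str(v).strip().lower()}" for k, v in sorted(record.items())
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = {"name": "Jane Doe ", "city": "Berlin"}
b = {"city": "berlin", "name": "jane doe"}   # same record, different form
c = {"name": "Jane Does", "city": "Berlin"}  # one-character typo

print(fingerprint(a) == fingerprint(b))  # True: duplicates collide
print(fingerprint(a) == fingerprint(c))  # False: typo breaks the match
```

Note how this avoids pairwise distances entirely: matching becomes a hash lookup, but as the last line shows, an exact-hash fingerprint cannot catch approximate duplicates unless the normalization happens to erase the difference.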
Answered by Brian Spiering on May 5, 2021