Data Science Asked on December 4, 2021
I am new to ML and still learning it.
My problem is to identify duplicate products. I have a dataset containing product details such as name, colour, size, description, features, etc. (roughly 70 columns in total).
I need to remove duplicate products.
I have just completed some supervised ML models (classification and regression) and unsupervised clustering (K-means and hierarchical clustering). I am also in the process of learning word2vec and doc2vec.
However, due to time constraints, I need to deliver a solution to the above problem statement soon, and I am unsure how to proceed.
Any help and guidance would be appreciated.
You can run K-means clustering on your products and check whether some of them sit very close together, i.e. in the same cluster. Products in the same cluster can then be treated as similar. You do have to find the optimal number of clusters k, though.
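As a rough illustration of the idea, here is a minimal pure-Python sketch of K-means (Lloyd's algorithm). It assumes the products have already been encoded as numeric vectors; the product IDs, the two-dimensional vectors, and k=3 are all made up for the example. Seeding the centroids with the first k points is a naive choice that makes the result depend on row order, and libraries such as scikit-learn use smarter seeding (k-means++) by default.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm. The first k points seed the centroids (naive, deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical products, already encoded as scaled numeric features.
products = {
    "A1": (1.0, 1.0),
    "B1": (5.0, 5.0),
    "C1": (9.0, 1.0),
    "A2": (1.1, 0.9),
    "B2": (5.1, 5.2),
}

ids = list(products)
labels = kmeans([products[i] for i in ids], k=3)

# Group product IDs by cluster label.
clusters = {}
for pid, lab in zip(ids, labels):
    clusters.setdefault(lab, []).append(pid)

# Clusters with more than one member are only *candidates* for duplicates.
candidate_groups = [group for group in clusters.values() if len(group) > 1]
print(candidate_groups)
```

Sharing a cluster only means two products are close relative to the others, so each candidate group would still need a pairwise check before anything is deleted.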
Answered by SrJ on December 4, 2021
This problem is called record linkage. Various techniques can be used, usually involving some distance measure between records and/or approximate string matching between string fields.
FYI, it is quite a complex problem, especially if high-quality deduplication is expected and the volume of data is high.
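To make this concrete, here is a minimal sketch of pairwise record comparison using only the Python standard library. It scores each string field with `difflib.SequenceMatcher` and combines the fields with a weighted average. The sample records, the field weights, and the 0.85 threshold are assumptions chosen purely for illustration; real values would have to be tuned on your own data.

```python
from difflib import SequenceMatcher
from itertools import combinations


def field_similarity(a, b):
    """Approximate string match in [0, 1] after light normalisation."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def record_similarity(r1, r2, weights):
    """Weighted average of the per-field similarities."""
    total = sum(weights.values())
    return sum(
        w * field_similarity(r1[field], r2[field]) for field, w in weights.items()
    ) / total


# Hypothetical product records (only 3 of the ~70 columns, for brevity).
records = [
    {"id": 1, "name": "Blue Cotton T-Shirt", "colour": "Blue", "size": "M"},
    {"id": 2, "name": "blue cotton tshirt", "colour": "blue", "size": "M"},
    {"id": 3, "name": "Leather Wallet", "colour": "Brown", "size": "One Size"},
]

# Assumed weights and threshold; both need tuning on real data.
weights = {"name": 0.6, "colour": 0.2, "size": 0.2}
threshold = 0.85

# Compare every pair of records and keep those above the threshold.
duplicates = [
    (a["id"], b["id"])
    for a, b in combinations(records, 2)
    if record_similarity(a, b, weights) >= threshold
]
print(duplicates)
```

Comparing every pair is quadratic in the number of records, which is where the data volume starts to hurt. A common way around that is blocking: only compare records that already agree on some cheap key, such as a category or the first few characters of the name.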
Answered by Erwan on December 4, 2021