Asked by Jab on May 7, 2021
I have the following scenario: I need to detect duplicate products based on their description fields. The description field contains the product's technical name, dimensions, and characteristics. My model needs to account for the fact that different notations and abbreviations may have been used for technical names, that there may be typing errors from data entry, and that similar or different dimensions or characteristics might still point to the same product. I therefore think that plain fuzzy matching or other NLP text-matching approaches will not perform well in my case. I am trying to approach this as a supervised learning problem, but I am still not sure how, so any suggestions or ideas are much appreciated.
If you think fuzzy matching does not work, you basically have two options:
Unsupervised: Try topic modeling to find "similar" products. The drawback is that you need to pre-define the number of products (groups, aka "topics") up front, which can be a problem. Topic modeling will also only work well if there are sufficient differences between the descriptions. https://github.com/Bixi81/R-ml/blob/master/NLP_topic_modelling.R
An unsupervised approach seems to be a good option for you, judging from what you wrote about your problem.
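As a minimal sketch of that unsupervised route, assuming a character vector `descriptions` of product descriptions; the choice of k = 10 groups is an illustrative assumption, not a recommendation:

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical input: a character vector `descriptions`, one per product entry
desc_df <- tibble(doc_id = seq_along(descriptions), text = descriptions)

# Document-term matrix of word counts, with common stop words removed
dtm <- desc_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(doc_id, word) %>%
  cast_dtm(doc_id, word, n)

# Fit LDA; k (the pre-defined number of product groups) must be chosen up front
lda_fit <- LDA(dtm, k = 10, control = list(seed = 42))

# Most likely topic per description: entries sharing a topic are duplicate candidates
topics(lda_fit)
```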
Supervised: If you have a good set of descriptions with a label (e.g., product 1, product 2, etc.), you can use some NLP model to predict which text belongs to which product. Not sure how much you know about it: you could start with a "bag of words" and try things like the Lasso, boosting, or even neural nets on this.
Here is an example of how to apply neural nets to text classification. But be aware: in real-world applications, text classification can be very tricky, especially if you have many classes (products, in your case) and not so many observations per class (descriptions, in your case).
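To make the supervised route concrete, here is a minimal sketch of a bag-of-words model with a Lasso-penalised multinomial regression via glmnet; the `labelled` data frame and its `text` and `product` columns are hypothetical placeholders, not from the answer above:

```r
library(dplyr)
library(tidytext)
library(glmnet)

# Hypothetical input: a data frame `labelled` with columns
# `text` (description) and `product` (known product label)
labelled <- labelled %>% mutate(doc_id = row_number())

# Bag of words: a sparse document-term matrix of word counts
X <- labelled %>%
  unnest_tokens(word, text) %>%
  count(doc_id, word) %>%
  cast_sparse(doc_id, word, n)

# Align labels with the rows of X (descriptions with no tokens are dropped)
y <- labelled$product[as.integer(rownames(X))]

# Cross-validated Lasso (alpha = 1) multinomial logistic regression
fit <- cv.glmnet(X, y, family = "multinomial", alpha = 1)

# Predicted product for each description
predict(fit, X, s = "lambda.min", type = "class")
```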
Answered by Peter on May 7, 2021
Perhaps this approach can work for you: using topic modeling as part of the text pre-processing.
This is an example of this approach with accommodation descriptions: https://medium.com/@actsusanli/when-topic-modeling-is-part-of-the-text-pre-processing-294b58d35514
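A hedged sketch of that idea in R, assuming an LDA model `lda_fit` fitted with the topicmodels package (as in the sketch in the first answer); the 0.05 cutoff is purely illustrative:

```r
library(topicmodels)

# Per-description topic probabilities (rows = descriptions, columns = topics)
gamma <- posterior(lda_fit)$topics

# Pairwise distances between topic distributions;
# near-zero distance marks likely duplicate candidates (0.05 is an assumed cutoff)
d <- as.matrix(dist(gamma))
which(d < 0.05 & upper.tri(d), arr.ind = TRUE)
```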
Answered by malcubierre on May 7, 2021
I am now using the tidytext, dplyr, and widyr packages to find duplicates. I have a data frame named short_text containing the product descriptions in a column named header. I add an index column, tokenize the data into sentences, and calculate the term frequency-inverse document frequency (TF-IDF). I then calculate the cosine similarity between descriptions using the pairwise_similarity() function.
```r
library(dplyr)
library(tidytext)
library(widyr)

# Index each description, split it into sentences, and weight each
# sentence by TF-IDF (bind_tf_idf expects the term first, then the document)
st_sentences_weights <- short_text %>%
  mutate(text_ID = row_number()) %>%
  unnest_tokens(input = header,
                output = "sentences",
                token = "sentences") %>%
  count(text_ID, sentences, sort = TRUE) %>%
  bind_tf_idf(sentences, text_ID, n)

# Cosine similarity between every pair of descriptions, most similar first
sentences_similarity <- st_sentences_weights %>%
  pairwise_similarity(text_ID, sentences, tf_idf) %>%
  arrange(desc(similarity))
```
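A possible follow-up (the 0.8 cutoff is an illustrative value, not part of the answer above): keep only the pairs whose cosine similarity exceeds a threshold and review those as duplicate candidates.

```r
# Candidate duplicate pairs above an illustrative similarity cutoff
duplicate_candidates <- sentences_similarity %>%
  filter(similarity > 0.8)
```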
Answered by Mohamed ALI on May 7, 2021