Asked by Jab on May 7, 2021
I have the following scenario: I need to detect duplicate products based on their description fields. The description field contains the product's technical name, dimensions, and characteristics. My model needs to account for the fact that different notations and abbreviations may have been used for technical names, that there may be typing errors from data entry, and that similar or different dimensions or characteristics might still point to the same product. I therefore think that plain fuzzy matching or other NLP text-matching approaches will not perform well in my case. I am trying to approach this as a supervised learning problem, but I am still not sure how, so any suggestions or ideas are much appreciated.
If you think fuzzy matching does not work, you basically have two options:
Unsupervised: Try topic modeling to find "similar" products. The drawback is that you need to pre-define the number of products (groups, aka "topics") up front, which can be a problem. Topic modeling will also only work well if there are sufficient differences between the descriptions. https://github.com/Bixi81/R-ml/blob/master/NLP_topic_modelling.R
An unsupervised approach seems to be a good option for you, judging from what you wrote about your problem.
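As a minimal sketch of that unsupervised route, assuming a character vector `descriptions` of product descriptions; the choice of k = 10 groups is an illustrative assumption, not a recommendation:

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical input: a character vector `descriptions`, one per product entry
desc_df <- tibble(doc_id = seq_along(descriptions), text = descriptions)

# Document-term matrix of word counts, with common stop words removed
dtm <- desc_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(doc_id, word) %>%
  cast_dtm(doc_id, word, n)

# Fit LDA; k (the pre-defined number of product groups) must be chosen up front
lda_fit <- LDA(dtm, k = 10, control = list(seed = 42))

# Most likely topic per description: entries sharing a topic are duplicate candidates
topics(lda_fit)
```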
Supervised: If you have a good set of descriptions with a label (e.g., product 1, product 2, etc.), you can use some NLP model to predict which text belongs to which product. Not sure how much you know about it: you could start with a "bag of words" and try things like the Lasso, boosting, or even neural nets on this.
Here is an example of how to apply neural nets to text classification. But be aware: in real-world applications, text classification can be very tricky, especially if you have many classes (products, in your case) and not so many observations per class (descriptions, in your case).
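To make the supervised route concrete, here is a minimal sketch of a bag-of-words model with a Lasso-penalised multinomial regression via glmnet; the `labelled` data frame and its `text` and `product` columns are hypothetical placeholders, not from the answer above:

```r
library(dplyr)
library(tidytext)
library(glmnet)

# Hypothetical input: a data frame `labelled` with columns
# `text` (description) and `product` (known product label)
labelled <- labelled %>% mutate(doc_id = row_number())

# Bag of words: a sparse document-term matrix of word counts
X <- labelled %>%
  unnest_tokens(word, text) %>%
  count(doc_id, word) %>%
  cast_sparse(doc_id, word, n)

# Align labels with the rows of X (descriptions with no tokens are dropped)
y <- labelled$product[as.integer(rownames(X))]

# Cross-validated Lasso (alpha = 1) multinomial logistic regression
fit <- cv.glmnet(X, y, family = "multinomial", alpha = 1)

# Predicted product for each description
predict(fit, X, s = "lambda.min", type = "class")
```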
Answered by Peter on May 7, 2021
Perhaps this approach can work for you: using topic modeling as part of the text pre-processing.
This is an example of this approach with accommodation descriptions: https://medium.com/@actsusanli/when-topic-modeling-is-part-of-the-text-pre-processing-294b58d35514
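A hedged sketch of that idea in R, assuming an LDA model `lda_fit` fitted with the topicmodels package (as in the sketch in the first answer); the 0.05 cutoff is purely illustrative:

```r
library(topicmodels)

# Per-description topic probabilities (rows = descriptions, columns = topics)
gamma <- posterior(lda_fit)$topics

# Pairwise distances between topic distributions;
# near-zero distance marks likely duplicate candidates (0.05 is an assumed cutoff)
d <- as.matrix(dist(gamma))
which(d < 0.05 & upper.tri(d), arr.ind = TRUE)
```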
Answered by malcubierre on May 7, 2021
I am now using the tidytext, dplyr, and widyr packages to find duplicates. I have a data frame named short_text containing the product descriptions in a column named header. I add an index column, tokenize the data into sentences, and calculate the term frequency-inverse document frequency (TF-IDF). I then calculate the cosine similarity between descriptions using the pairwise_similarity() function.
```r
library(dplyr)
library(tidytext)
library(widyr)

# Index each description, split it into sentences, and weight each
# sentence by TF-IDF (bind_tf_idf expects the term first, then the document)
st_sentences_weights <- short_text %>%
  mutate(text_ID = row_number()) %>%
  unnest_tokens(input = header,
                output = "sentences",
                token = "sentences") %>%
  count(text_ID, sentences, sort = TRUE) %>%
  bind_tf_idf(sentences, text_ID, n)

# Cosine similarity between every pair of descriptions, most similar first
sentences_similarity <- st_sentences_weights %>%
  pairwise_similarity(text_ID, sentences, tf_idf) %>%
  arrange(desc(similarity))
```
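A possible follow-up (the 0.8 cutoff is an illustrative value, not part of the answer above): keep only the pairs whose cosine similarity exceeds a threshold and review those as duplicate candidates.

```r
# Candidate duplicate pairs above an illustrative similarity cutoff
duplicate_candidates <- sentences_similarity %>%
  filter(similarity > 0.8)
```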
Answered by Mohamed ALI on May 7, 2021