Which phrase should be returned in case of multiple matches when comparing text?

Data Science Asked by Hefaz on December 30, 2020

I want to compare one sentence to some other sentences using the Bag of Words model. Suppose that my comparing sentence is:

I am playing football

and there are three other sentences that I want to compare it with:

1. and I am playing Cricket

2. Why do you play Cricket

3. I love playing Cricket when I am at school

Now, if I compare my comparing sentence to the three sentences above by counting shared words, sentences 1 and 3 share the same number of words with it, namely 3 (I, am, playing).

Now the question is: which sentence is more related to my comparing sentence in this case? No semantic meaning is involved at all.

In some places I have seen people say that it is less convoluted to return the shortest sentence in this case. What are your thoughts?

One Answer

This is usually done by carefully choosing two things:

  • The sentence representation. Word counts are the simplest option, but there are many others: TF-IDF weights, with or without stop-word removal, with or without lemmatization, etc. In a deep learning (DL) approach, the sentence would be represented as a sentence embedding.
  • The similarity measure between two sentences. Again there are many options; in BoW approaches the standard ones include measures based on the words in common (e.g. Jaccard) and cosine similarity over TF-IDF vectors.
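As a rough illustration of how the choice of measure matters, here is a minimal sketch that applies a plain shared-word count and the Jaccard index to the sentences from the question. The helper names (`tokenize`, `overlap_count`, `jaccard`) are made up for this example, and tokenization is assumed to be lowercasing plus whitespace splitting, with no stop-word removal or lemmatization (so "play" and "playing" do not match).

```python
def tokenize(sentence):
    # Assumption: lowercase + whitespace split; no stop words, no lemmatization.
    return sentence.lower().split()


def overlap_count(a, b):
    # Number of distinct words the two sentences have in common.
    return len(set(tokenize(a)) & set(tokenize(b)))


def jaccard(a, b):
    # Shared distinct words divided by all distinct words in either sentence.
    words_a, words_b = set(tokenize(a)), set(tokenize(b))
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


query = "I am playing football"
candidates = [
    "and I am playing Cricket",
    "Why do you play Cricket",
    "I love playing Cricket when I am at school",
]

overlaps = [overlap_count(query, c) for c in candidates]
jaccards = [jaccard(query, c) for c in candidates]

for sentence, n_shared, jac in zip(candidates, overlaps, jaccards):
    print(f"{n_shared} shared, Jaccard {jac:.3f}: {sentence}")
```

Under these assumptions the raw count ties sentences 1 and 3 at three shared words, while Jaccard ranks sentence 1 higher (0.5 versus roughly 0.33) because the longer sentence contributes more non-matching words to the union. In that sense, a length-normalized measure already behaves somewhat like the "return the shortest sentence" heuristic.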

So the answer is: it depends on the similarity score. A more fine-grained similarity score such as cosine TF-IDF rarely produces ties, so the highest-scoring sentence can be selected. Simpler methods do produce ties, and then the logical answer is to return all the tied sentences.
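The following sketch shows one way to return all tied sentences, and compares a simple shared-word count with a hand-rolled cosine TF-IDF score. Everything here is an assumption made for illustration: the function names (`overlap_score`, `make_tfidf_cosine`, `best_matches`) are hypothetical, the IDF is a smoothed variant `log((1 + N) / (1 + df)) + 1`, and the IDF statistics are computed over the comparing sentence plus the three candidates. In practice a library vectorizer would likely be used instead.

```python
import math
from collections import Counter


def tokenize(sentence):
    # Assumption: lowercase + whitespace split; no stop words, no lemmatization.
    return sentence.lower().split()


def overlap_score(query, candidate):
    # Simple measure: number of distinct shared words.
    return float(len(set(tokenize(query)) & set(tokenize(candidate))))


def make_tfidf_cosine(corpus):
    # Build a cosine TF-IDF scorer whose IDF statistics come from `corpus`.
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))

    def vectorize(sentence):
        counts = Counter(tokenize(sentence))
        # Smoothed IDF (an assumed variant): log((1 + N) / (1 + df)) + 1
        return {
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, tf in counts.items()
        }

    def score(query, candidate):
        q, c = vectorize(query), vectorize(candidate)
        dot = sum(weight * c.get(term, 0.0) for term, weight in q.items())
        norm = math.sqrt(sum(w * w for w in q.values())) * math.sqrt(
            sum(w * w for w in c.values())
        )
        return dot / norm if norm else 0.0

    return score


def best_matches(query, candidates, score):
    # Return every candidate whose score equals the maximum (all ties).
    if not candidates:
        return []
    scores = [score(query, c) for c in candidates]
    top = max(scores)
    return [c for c, s in zip(candidates, scores) if math.isclose(s, top)]


query = "I am playing football"
candidates = [
    "and I am playing Cricket",
    "Why do you play Cricket",
    "I love playing Cricket when I am at school",
]

by_overlap = best_matches(query, candidates, overlap_score)
by_tfidf = best_matches(query, candidates, make_tfidf_cosine([query] + candidates))

print("shared-word count:", by_overlap)
print("cosine TF-IDF:   ", by_tfidf)
```

With these particular assumptions, the shared-word count returns two tied sentences (1 and 3), whereas cosine TF-IDF returns only sentence 1, since the vector norm penalizes the extra words in the longer sentence. A different tokenization or weighting scheme could change the ranking.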

Answered by Erwan on December 30, 2020
