Extracting sections from document based on list of keywords - Python

Question

I am new to NLP and would like to ask how I can extract sentences from a text based on keywords, using Python. I have created a list of keywords that will be used to extract sentences from the document.

If this is a simple tokenization problem in which you loop over the tokens and check them against the keyword list, how can I capture synonyms or related words?

For example:

Keyword: Internal business

Sentence: You can only use this software for your business only.

Keyword: Confidentiality

Sentence: Information will be kept as secure as possible.

I actually implemented text categorization using TF-IDF, but with a small dataset and a large number of keywords, I don't think this will work either.
Thanks in advance.

Is it possible to apply pre-trained models like word2vec?

Is it also possible to create a custom model that will fit my concerns?

Gyan Ranjan · Answer

The ideal way to get the related sentences is to compute a sentence vector for each sentence you want to categorise and then compare the vectors of your predefined keywords with those sentence vectors.
You can get a sentence vector by simply averaging the word vectors of the words present in the sentence. Once the sentence vectors are obtained, you can use cosine similarity to compare the keyword vectors with the sentence vectors. For each sentence, the keyword with the highest cosine similarity is the match.
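
The steps above can be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the 3-dimensional vectors in TOY_VECTORS are made up for the example (in practice they would come from a pre-trained model such as word2vec), and the helper names tokenize, average_vector, cosine_similarity and best_keyword are hypothetical.

```python
import math
import re

# Toy 3-dimensional word vectors, invented for illustration only.
# In practice these would be looked up in a pre-trained model such as word2vec.
TOY_VECTORS = {
    "internal": [0.9, 0.1, 0.0],
    "business": [1.0, 0.0, 0.1],
    "software": [0.7, 0.2, 0.1],
    "use": [0.5, 0.3, 0.2],
    "confidentiality": [0.0, 0.1, 1.0],
    "information": [0.1, 0.3, 0.8],
    "secure": [0.0, 0.2, 0.9],
    "kept": [0.1, 0.4, 0.6],
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def average_vector(text, vectors):
    """Average the vectors of all known words; return None if none are known."""
    known = [vectors[tok] for tok in tokenize(text) if tok in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_keyword(sentence, keywords, vectors):
    """Return the keyword whose averaged vector is closest to the sentence vector."""
    sent_vec = average_vector(sentence, vectors)
    if sent_vec is None:
        return None
    scored = []
    for keyword in keywords:
        kw_vec = average_vector(keyword, vectors)
        if kw_vec is not None:
            scored.append((cosine_similarity(kw_vec, sent_vec), keyword))
    return max(scored)[1] if scored else None


keywords = ["Internal business", "Confidentiality"]
sentences = [
    "You can only use this software for your business only.",
    "Information will be kept as secure as possible.",
]
matches = {s: best_keyword(s, keywords, TOY_VECTORS) for s in sentences}
for sentence, keyword in matches.items():
    print(keyword, "<-", sentence)
```

Words missing from the vector table are simply skipped here, and a sentence with no known words gets no match. With real embeddings you would probably also want a minimum similarity threshold, so that sentences unrelated to every keyword are not forced onto the nearest one.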
