TransWikia.com

Identify specific areas in the text

Data Science Asked by Peter Krejzl on March 24, 2021

I’m interested in identifying distinct areas within a text message. Say I have a text that contains an introduction, then a poem, and finally some URLs to web pages.
I’d like to break the text down into these sections and process them separately.
I should be able to collect a fair amount of training data for each type of section.
Any ideas or references to papers would be highly appreciated!
Thanks

One Answer

If you are still looking for an answer, here are some suggestions. Please keep in mind that none of these will be 100% accurate:

  1. Train a classifier such as RandomForest on examples of the known structures (poems, URLs, etc.). Then split the input document into sentences and pass each sentence to the classifier. The classifier returns a type for each sentence, and a change of type between consecutive sentences marks the beginning of a new section.
  2. Use bag of words (BOW), as mentioned in the comment above. From the bag-of-words features you can determine the type of each sentence.
  3. Using the same BOW approach, you can cluster the sentences (KMeans, for example), which groups similar sentences together.
  4. If you have enough data to train on, you can opt for neural-network-based classification.
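
To make option 1 concrete, here is a minimal sketch using scikit-learn. The training lines, the label names (`intro`, `poem`, `url`), the `segment` helper, and the choice of character n-gram TF-IDF features are all assumptions made for illustration; real training data would need to be much larger. It works line by line rather than sentence by sentence, since poems and URL lists are usually line-oriented.

```python
from itertools import groupby

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data; a real training set would be far larger.
train_lines = [
    "Hi all, I wanted to share something I wrote last week.",
    "Hello everyone, here is a short note before the poem.",
    "Dear friends, please find my latest piece below.",
    "The moon hangs low above the silent sea,",
    "And winds that whisper carry dreams to me.",
    "Soft falls the rain upon the sleeping town,",
    "https://example.com/poems/moon",
    "http://example.org/about",
    "https://example.net/archive?page=2",
]
train_labels = ["intro"] * 3 + ["poem"] * 3 + ["url"] * 3

# Character n-grams can pick up cues such as "http://" or line-final commas.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    RandomForestClassifier(n_estimators=200, random_state=0),
)
model.fit(train_lines, train_labels)


def segment(text):
    """Label each non-empty line, then merge consecutive lines that share a label."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    labels = model.predict(lines)
    sections = []
    for label, group in groupby(zip(labels, lines), key=lambda pair: pair[0]):
        sections.append((label, [line for _, line in group]))
    return sections
```

Calling `segment(message)` returns a list of `(label, lines)` pairs in document order, so each section can be processed separately. With so little training data the predicted labels will be unreliable; the point is only the shape of the pipeline.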

These are not the only options. If you have found a better one, please post it.
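
For the unsupervised route (BOW plus KMeans clustering), a rough sketch with scikit-learn could look like the following. The sample lines and the choice of three clusters are assumptions for illustration only; cluster ids are arbitrary numbers, so you would still have to inspect the clusters to decide which one corresponds to which kind of section.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical unlabeled lines taken from one or more messages.
lines = [
    "Hello everyone, here is a short note before the poem.",
    "Dear friends, please find my latest piece below.",
    "The moon hangs low above the silent sea,",
    "And winds that whisper carry dreams to me.",
    "https://example.com/poems/moon",
    "https://example.com/poems/archive",
]

# Bag-of-words counts: one row per line, one column per vocabulary word.
bow = CountVectorizer().fit_transform(lines)

# n_clusters=3 is an assumption (intro, poem, URLs); tune it for real data.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(bow)

for cluster_id, line in zip(cluster_ids, lines):
    print(cluster_id, line)
```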

Answered by Sandeep Bhutani on March 24, 2021
