TransWikia.com

General approach to extract key text from sentence (nlp)

Data Science Asked by William Falcon on February 21, 2021

Given a sentence like:

Complimentary gym access for two for the length of stay ($12 value per person per day)

What general approach can I take to identify the word gym or gym access?

3 Answers

If you're an R user, there is a lot of good practical information here. Look at their text mining examples.
Also, take a look at the tm package.
This is also a good aggregation site.

Answered by Michael Cox on February 21, 2021

You need to analyze the sentence structure and extract the syntactic categories of interest (in this case, I think it would be the noun phrase, which is a phrasal category). For details, see the corresponding Wikipedia article and the "Analyzing Sentence Structure" chapter of the NLTK book.

As for software tools that implement this approach and more, I would suggest considering either NLTK (if you prefer Python) or the StanfordNLP software (if you prefer Java). For many other NLP frameworks and libraries, and for support in various programming languages, see the corresponding (NLP) sections in this excellent curated list.

Answered by Aleksandr Blekh on February 21, 2021

A shallow Natural Language Processing (NLP) technique can be used to extract concepts from a sentence.

-------------------------------------------

Shallow NLP technique steps:

  1. Convert the sentence to lowercase

  2. Remove stopwords (common words in a language, such as for, very, and, of, are, etc.)

  3. Extract n-grams, i.e., contiguous sequences of n items from a given sequence of text (increasing n lets the model retain more context)

  4. Assign a syntactic label (noun, verb, etc.) to each token

  5. Extract knowledge from the text through semantic/syntactic analysis, i.e., try to retain the words that carry more weight in a sentence, such as nouns and verbs

-------------------------------------------
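
Steps 1 through 3 can be sketched in a few lines of plain Python. This is only a minimal illustration, not the code used for the results in this answer: the three-word stop word list and the choice to keep alphabetic tokens only (which drops "$12") are assumptions made so that the output lines up with the n-gram tables in this answer.

```python
import re

# Assumed minimal stop word list; a real list (e.g. from an NLP library) is much longer.
STOPWORDS = {"for", "the", "of"}

def preprocess(sentence):
    """Steps 1-2: lowercase, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [t for t in tokens if t not in STOPWORDS]

def ngrams(tokens, n):
    """Step 3: every contiguous sequence of n tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = ("Complimentary gym access for two for the length of stay "
            "($12 value per person per day)")
tokens = preprocess(sentence)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(tokens)
print(bigrams)
```

Calling ngrams(tokens, n) with a larger n gives longer phrases at the cost of fewer matches.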

Let's examine the results of applying the above steps to your sentence: Complimentary gym access for two for the length of stay ($12 value per person per day).

1-gram Results: gym, access, length, stay, value, person, day

Summary of steps 1 through 4 of shallow NLP:

1-gram          PoS_Tag   Stopword (Yes/No)?    PoS Tag Description
-------------------------------------------------------------------    
Complimentary   NNP                             Proper noun, singular
gym             NN                              Noun, singular or mass
access          NN                              Noun, singular or mass
for             IN         Yes                  Preposition or subordinating conjunction
two             CD                              Cardinal number
for             IN         Yes                  Preposition or subordinating conjunction
the             DT         Yes                  Determiner
length          NN                              Noun, singular or mass
of              IN         Yes                  Preposition or subordinating conjunction
stay            NN                              Noun, singular or mass
($12            CD                              Cardinal number
value           NN                              Noun, singular or mass
per             IN                              Preposition or subordinating conjunction
person          NN                              Noun, singular or mass
per             IN                              Preposition or subordinating conjunction
day)            NN                              Noun, singular or mass

Step 5: Retaining only the nouns/verbs, we end up with gym, access, length, stay, value, person, day
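
As a rough sketch, that filtering step could look like the following in Python. The (word, tag) pairs are copied from the table above. The rule itself is an assumption: it keeps common nouns (NN, NNS) and verbs (VB...), and it leaves out the proper-noun tag NNP so that "Complimentary" is dropped, as it is in the 1-gram result listed here.

```python
# (word, tag) pairs copied from the 1-gram table in this answer.
tagged = [
    ("complimentary", "NNP"), ("gym", "NN"), ("access", "NN"), ("for", "IN"),
    ("two", "CD"), ("for", "IN"), ("the", "DT"), ("length", "NN"),
    ("of", "IN"), ("stay", "NN"), ("$12", "CD"), ("value", "NN"),
    ("per", "IN"), ("person", "NN"), ("per", "IN"), ("day", "NN"),
]

def is_content_tag(tag):
    """Assumed rule: common nouns (NN, NNS) or any verb tag (VB, VBD, ...)."""
    return tag in {"NN", "NNS"} or tag.startswith("VB")

unigrams = [word for word, tag in tagged if is_content_tag(tag)]
print(unigrams)
```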

Let's increase n to store more context and remove stopwords.

2-gram Results: complimentary gym, gym access, length stay, stay value

Summary of steps 1 through 4 of shallow NLP:

2-gram              Pos Tag
---------------------------
access two          NN CD
complimentary gym   NNP NN
gym access          NN NN
length stay         NN NN
per day             IN NN
per person          IN NN
person per          NN IN
stay value          NN NN
two length          CD NN
value per           NN IN

Step 5: Retaining only the noun/verb combinations, we end up with complimentary gym, gym access, length stay, stay value
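
One way to express this selection in Python is to slide a window over the tagged tokens and keep a window only when every tag in it is a noun tag. This is a sketch under assumptions: the tags are copied from the tables in this answer, and "noun tag" here means any tag starting with NN (so NNP counts, which is what lets "complimentary gym" through).

```python
# Stop-word-free tokens with the tags shown in this answer's tables.
tagged_tokens = [
    ("complimentary", "NNP"), ("gym", "NN"), ("access", "NN"), ("two", "CD"),
    ("length", "NN"), ("stay", "NN"), ("value", "NN"), ("per", "IN"),
    ("person", "NN"), ("per", "IN"), ("day", "NN"),
]

def noun_ngrams(tagged, n):
    """Return the n-grams in which every token carries a noun tag (NN*)."""
    kept = []
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        if all(tag.startswith("NN") for _, tag in window):
            kept.append(" ".join(word for word, _ in window))
    return kept

noun_bigrams = noun_ngrams(tagged_tokens, 2)
print(noun_bigrams)
```

With n = 3 the same rule keeps complimentary gym access and length stay value. It would not keep person per day, because per is tagged IN, so a looser rule is needed if that phrase should survive.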

3-gram Results: complimentary gym access, length stay value, person per day

Summary of steps 1 through 4 of shallow NLP:

3-gram                      Pos Tag
-------------------------------------
access two length           NN CD NN
complimentary gym access    NNP NN NN
gym access two              NN NN CD
length stay value           NN NN NN
per person per              IN NN IN
person per day              NN IN NN
stay value per              NN NN IN
two length stay             CD NN NN
value per person            NN IN NN


Step 5: Retaining only the noun/verb combinations, we end up with complimentary gym access, length stay value, person per day

Things to remember:

  • Refer to the Penn Treebank tag set to understand the PoS tag descriptions
  • Depending on your data and the business context, you can decide which value of n to use when extracting n-grams from a sentence
  • Adding domain-specific stop words would improve the quality of concept/theme extraction
  • A deep NLP technique will give better results: rather than relying on n-grams, it detects relationships within the sentences and represents them as complex constructions to retain the context. For additional info, see this
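
To illustrate the point about domain-specific stop words, here is a small Python sketch. The domain list is purely hypothetical: it is a guess at words that carry little meaning in hotel-offer text, and you would build your own from your data.

```python
import re

# Assumed general-purpose stop words (a real list is much longer).
GENERAL_STOPWORDS = {"for", "the", "of"}

# Hypothetical domain-specific stop words for hotel-offer text.
DOMAIN_STOPWORDS = {"complimentary", "two", "length", "stay",
                    "value", "per", "person", "day"}

def content_tokens(sentence, stopwords):
    """Lowercase, keep alphabetic tokens, and drop the given stop words."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [t for t in tokens if t not in stopwords]

sentence = ("Complimentary gym access for two for the length of stay "
            "($12 value per person per day)")
general_only = content_tokens(sentence, GENERAL_STOPWORDS)
with_domain = content_tokens(sentence, GENERAL_STOPWORDS | DOMAIN_STOPWORDS)
print(general_only)
print(with_domain)
```

With only the general list, eleven tokens remain. With the domain list added, only gym and access are left.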

Tools:

You can consider using OpenNLP or StanfordNLP for part-of-speech tagging. Most programming languages have a supporting library for OpenNLP/StanfordNLP, so you can choose the language you are most comfortable with. Below is the sample R code I used for PoS tagging.

Sample R code:

# Point rJava at the Java runtime (adjust the path to your installation)
Sys.setenv(JAVA_HOME = 'C:\\Program Files\\Java\\jre7') # for 32-bit version
library(rJava)
library(openNLP)
library(NLP)

s <- "Complimentary gym access for two for the length of stay $12 value per person per day"

# Tokenize x as a single sentence and tag each word with its part of speech
tagPOS <- function(x, ...) {
  s <- as.String(x)
  word_token_annotator <- Maxent_Word_Token_Annotator()
  # Treat the whole string as one sentence annotation
  a2 <- Annotation(1L, "sentence", 1L, nchar(s))
  a2 <- annotate(s, word_token_annotator, a2)
  a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
  # Keep only the word-level annotations and pull out their POS tags
  a3w <- a3[a3$type == "word"]
  POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
  POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
  list(POStagged = POStagged, POStags = POStags)
}

tagged_str <- tagPOS(s)
tagged_str

#$POStagged
#[1] "Complimentary/NNP gym/NN access/NN for/IN two/CD for/IN the/DT length/NN of/IN stay/NN $/$ 12/CD value/NN per/IN person/NN per/IN day/NN"
#
#$POStags
#[1] "NNP" "NN"  "NN"  "IN"  "CD"  "IN"  "DT"  "NN"  "IN"  "NN"  "$"   "CD" 
#[13] "NN"  "IN"  "NN"  "IN"  "NN" 
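
Once you have a tagged string like the one above, a crude way to pull out candidate phrases is to group consecutive noun-tagged words. The Python sketch below does that on a copy of the $POStagged output. Treating any run of NN* tags as a phrase is an assumption and only a rough stand-in for real noun-phrase chunking.

```python
# Copy of the $POStagged string printed by the R code above.
tagged_str = ("Complimentary/NNP gym/NN access/NN for/IN two/CD for/IN the/DT "
              "length/NN of/IN stay/NN $/$ 12/CD value/NN per/IN person/NN "
              "per/IN day/NN")

def parse_tagged(text):
    """Split 'word/TAG' items into (word, tag) pairs."""
    return [tuple(item.rsplit("/", 1)) for item in text.split()]

def noun_runs(pairs):
    """Group maximal runs of consecutive noun-tagged (NN*) words."""
    runs, current = [], []
    for word, tag in pairs:
        if tag.startswith("NN"):
            current.append(word)
        else:
            if current:
                runs.append(" ".join(current))
            current = []
    if current:
        runs.append(" ".join(current))
    return runs

pairs = parse_tagged(tagged_str)
phrases = noun_runs(pairs)
print(phrases)
```

For this sentence the first run is "Complimentary gym access", which contains the gym access phrase from the question.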


Answered by Manohar Swamynathan on February 21, 2021
