Data Science Asked by William Falcon on February 21, 2021
Given a sentence like:
Complimentary gym access for two for the length of stay ($12 value per person per day)
What general approach can I take to identify the word gym or gym access?
If you're an R user, there is a lot of good practical information here. Look at their text mining examples.
Also, take a look at the tm package.
This is also a good aggregation site.
Answered by Michael Cox on February 21, 2021
You need to analyze the sentence structure and extract the syntactic categories of interest (in this case, I think it would be the noun phrase, which is a phrasal category). For details, see the corresponding Wikipedia article and the "Analyzing Sentence Structure" chapter of the NLTK book.
As for software tools for implementing this approach and beyond, I suggest considering either NLTK (if you prefer Python) or the StanfordNLP software (if you prefer Java). For many other NLP frameworks and libraries, with support for various programming languages, see the corresponding (NLP) sections in this excellent curated list.
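As a rough illustration of the noun-phrase idea, here is a minimal Python sketch. It is a simplification, not a real chunker: it only groups maximal runs of consecutive noun-tagged tokens, and the tags are hard-coded (copied from the tagger output quoted elsewhere in this thread) rather than produced by NLTK or StanfordNLP. The names `TAGGED` and `noun_phrases` are made up for the example.

```python
# Hand-tagged tokens (Penn Treebank tags). In practice a PoS tagger,
# such as the ones shipped with NLTK or StanfordNLP, would supply these.
TAGGED = [
    ("Complimentary", "NNP"), ("gym", "NN"), ("access", "NN"),
    ("for", "IN"), ("two", "CD"), ("for", "IN"), ("the", "DT"),
    ("length", "NN"), ("of", "IN"), ("stay", "NN"),
]


def noun_phrases(tagged):
    """Return maximal runs of consecutive noun-tagged tokens."""
    phrases, current = [], []
    for word, tag in tagged:
        if tag.startswith("NN"):
            current.append(word)
        else:
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:  # flush a run that reaches the end of the sentence
        phrases.append(" ".join(current))
    return phrases


print(noun_phrases(TAGGED))
# ['Complimentary gym access', 'length', 'stay']
```

A grammar-based chunker would normally also allow determiners and adjectives inside a noun phrase; this sketch leaves them out to stay short.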
Answered by Aleksandr Blekh on February 21, 2021
A shallow natural language processing (NLP) technique can be used to extract concepts from a sentence.
-------------------------------------------
Shallow NLP technique steps:
1. Convert the sentence to lowercase.
2. Remove stopwords (common words in a language, such as for, very, and, of, are).
3. Extract n-grams, i.e., contiguous sequences of n items from the text (increasing n lets the model retain more context).
4. Assign a syntactic label (noun, verb, etc.) to each word.
5. Extract knowledge from the text through semantic/syntactic analysis, i.e., retain the words that carry more weight in a sentence, such as nouns and verbs.
-------------------------------------------
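The lowercasing, stopword-removal and n-gram steps can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the stop list is a tiny hand-written one built from the words this answer names as stopwords, and the tokenizer keeps alphabetic tokens only (so "$12" and the parentheses are dropped), which appears to be what the n-gram tables in this answer reflect. The function names are made up for the example.

```python
import re

# Assumption: a tiny stop list built from the stopwords named in this answer.
STOPWORDS = {"for", "very", "and", "of", "are", "the"}


def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", sentence.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t not in stopwords]


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


SENTENCE = ("Complimentary gym access for two for the length of stay "
            "($12 value per person per day)")

tokens = remove_stopwords(tokenize(SENTENCE))
print(tokens)
print(sorted(ngrams(tokens, 2)))
print(sorted(ngrams(tokens, 3)))
```

A real pipeline would normally use a fuller stop list and a proper tokenizer from an NLP library.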
Let's examine the results of applying the above steps to your sentence: Complimentary gym access for two for the length of stay ($12 value per person per day).
1-gram Results: gym, access, length, stay, value, person, day
Summary of steps 1 through 4 of shallow NLP:
1-gram         PoS_Tag  Stopword (Yes/No)?  PoS Tag Description
--------------------------------------------------------------------------------
Complimentary  NNP      No                  Proper noun, singular
gym            NN       No                  Noun, singular or mass
access         NN       No                  Noun, singular or mass
for            IN       Yes                 Preposition or subordinating conjunction
two            CD       No                  Cardinal number
for            IN       Yes                 Preposition or subordinating conjunction
the            DT       Yes                 Determiner
length         NN       No                  Noun, singular or mass
of             IN       Yes                 Preposition or subordinating conjunction
stay           NN       No                  Noun, singular or mass
($12           CD       No                  Cardinal number
value          NN       No                  Noun, singular or mass
per            IN       No                  Preposition or subordinating conjunction
person         NN       No                  Noun, singular or mass
per            IN       No                  Preposition or subordinating conjunction
day)           NN       No                  Noun, singular or mass
Step 5: Retaining only the nouns/verbs, we end up with gym, access, length, stay, value, person, day.
Let's increase n to store more context and remove stopwords.
2-gram Results: complimentary gym, gym access, length stay, stay value
Summary of steps 1 through 4 of shallow NLP:
2-gram             PoS Tag
---------------------------
access two         NN CD
complimentary gym  NNP NN
gym access         NN NN
length stay        NN NN
per day            IN NN
per person         IN NN
person per         NN IN
stay value         NN NN
two length         CD NN
value per          NN IN
Step 5: Retaining only the noun/verb combinations, we end up with complimentary gym, gym access, length stay, stay value.
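The filtering in step 5 can be sketched as follows in Python. The rule used here, keeping an n-gram only when every word in it is tagged as a noun or a verb, is an assumption made for the illustration and is not necessarily the exact filter applied in this answer. The tags are copied from the 1-gram table above, and the names `TAGS`, `BIGRAMS` and `keep_content_ngrams` are made up for the example.

```python
# Tags copied from the 1-gram table in this answer (Penn Treebank tag set).
TAGS = {
    "complimentary": "NNP", "gym": "NN", "access": "NN", "two": "CD",
    "length": "NN", "stay": "NN", "value": "NN", "per": "IN",
    "person": "NN", "day": "NN",
}

# The ten 2-grams listed in the table above.
BIGRAMS = [
    "access two", "complimentary gym", "gym access", "length stay",
    "per day", "per person", "person per", "stay value",
    "two length", "value per",
]


def is_noun_or_verb(tag):
    """True for Penn Treebank noun (NN*) and verb (VB*) tags."""
    return tag.startswith("NN") or tag.startswith("VB")


def keep_content_ngrams(ngrams, tags=TAGS):
    """Keep n-grams in which every word is tagged as a noun or a verb."""
    return [g for g in ngrams
            if all(is_noun_or_verb(tags[w]) for w in g.split())]


print(keep_content_ngrams(BIGRAMS))
# ['complimentary gym', 'gym access', 'length stay', 'stay value']
```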
3-gram Results: complimentary gym access, length stay value, person per day
Summary of steps 1 through 4 of shallow NLP:
3-gram                    PoS Tag
-------------------------------------
access two length         NN CD NN
complimentary gym access  NNP NN NN
gym access two            NN NN CD
length stay value         NN NN NN
per person per            IN NN IN
person per day            NN IN NN
stay value per            NN NN IN
two length stay           CD NN NN
value per person          NN IN NN
Step 5: Retaining only the noun/verb combinations, we end up with complimentary gym access, length stay value, person per day.
Things to remember:
Tools:
You can consider using OpenNLP or StanfordNLP for part-of-speech tagging. Most programming languages have a supporting library for OpenNLP/StanfordNLP, so you can choose the language you are most comfortable with. Below is the sample R code I used for PoS tagging.
Sample R code:
Sys.setenv(JAVA_HOME = 'C:\\Program Files\\Java\\jre7') # for 32-bit version
library(rJava)
library(NLP)
library(openNLP)

s <- "Complimentary gym access for two for the length of stay $12 value per person per day"

# Tag each word of x with its part of speech using the openNLP Maxent models
tagPOS <- function(x, ...) {
  s <- as.String(x)
  word_token_annotator <- Maxent_Word_Token_Annotator()
  # Treat the whole string as one sentence, then tokenize it into words
  a2 <- Annotation(1L, "sentence", 1L, nchar(s))
  a2 <- annotate(s, word_token_annotator, a2)
  a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
  # Keep only the word annotations and pull out their POS features
  a3w <- a3[a3$type == "word"]
  POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
  POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
  list(POStagged = POStagged, POStags = POStags)
}

tagged_str <- tagPOS(s)
tagged_str
#$POStagged
#[1] "Complimentary/NNP gym/NN access/NN for/IN two/CD for/IN the/DT length/NN of/IN stay/NN $/$ 12/CD value/NN per/IN person/NN per/IN day/NN"
#
#$POStags
#[1] "NNP" "NN" "NN" "IN" "CD" "IN" "DT" "NN" "IN" "NN" "$" "CD"
#[13] "NN" "IN" "NN" "IN" "NN"
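If the tagged string is handed to another tool, it can be split back into (word, tag) pairs. The following Python sketch assumes the word/TAG format shown in the output above; `parse_tagged` is a hypothetical helper written for this example and is not part of openNLP.

```python
def parse_tagged(tagged):
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs."""
    pairs = []
    for item in tagged.split():
        # Split on the last slash so a token such as "$/$" still parses.
        word, tag = item.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs


# The POStagged string from the R output above.
TAGGED = ("Complimentary/NNP gym/NN access/NN for/IN two/CD for/IN the/DT "
          "length/NN of/IN stay/NN $/$ 12/CD value/NN per/IN person/NN "
          "per/IN day/NN")

pairs = parse_tagged(TAGGED)
nouns = [word for word, tag in pairs if tag.startswith("NN")]
print(nouns)
```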
Additional readings on Shallow & Deep NLP:
Answered by Manohar Swamynathan on February 21, 2021