Stack Overflow Asked on November 27, 2021
After these steps:
library(quanteda)
df <- data.frame(text = c("only a small text","only a small text","only a small text","only a small text","only a small text","only a small text","remove this word lower frequency"))
tdfm <- df$text %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
  dfm()

dfm_keep(tdfm, pattern = featnames(tdfm)[docfreq(tdfm) > 5])
How is it possible to find pairs or triples of words (ngram = 2:3) which exist in more than 5 documents?
The ngrams need to be constructed before converting to a dfm, because the order of words is lost once the text is in a dfm.
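To see why, note that a dfm stores only per-document counts while a tokens object still retains word order. A minimal illustration with two toy texts (my own example, not from the question):

```r
library(quanteda)

# two documents containing the same words in a different order
toks <- tokens(c(d1 = "a small text", d2 = "text small a"))

# their dfm rows are identical: a dfm keeps counts, not positions
m <- dfm(toks)

# the tokens still know the order, so their bigrams differ
tokens_ngrams(toks, n = 2)
```

This is why `tokens_ngrams()` has to run on the tokens, before `dfm()`.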
The clean quanteda way then would be:
library(quanteda)
df <- data.frame(text = c("only a small text","only a small text","only a small text","only a small text","only a small text","only a small text","remove this word lower frequency"))
tdfm <- df %>%
  corpus() %>%  # with a data.frame it usually makes sense to construct a corpus first, to retain the other columns as meta-data
  tokens(remove_punct = TRUE,
         remove_numbers = TRUE) %>%
  tokens_ngrams(n = 2:3) %>%  # construct ngrams
  dfm() %>%                   # convert to dfm
  dfm_trim(min_docfreq = 5)   # keep ngrams that appear in at least 5 documents
tdfm
#> Document-feature matrix of: 7 documents, 5 features (14.3% sparse).
#>        features
#> docs    only_a a_small small_text only_a_small a_small_text
#>   text1      1       1          1            1            1
#>   text2      1       1          1            1            1
#>   text3      1       1          1            1            1
#>   text4      1       1          1            1            1
#>   text5      1       1          1            1            1
#>   text6      1       1          1            1            1
#> [ reached max_ndoc ... 1 more document ]
Created on 2020-07-22 by the reprex package (v0.3.0)
If you want to create ngrams only from words which appear in at least 4 documents, I think it makes most sense to first construct a dfm without ngrams, filter it to the terms that meet the document-frequency threshold, and use this dfm to subset the tokens before constructing the ngrams (as no tokens_trim function exists):
# first construct a dfm without ngrams, trimmed by document frequency
dfm_onegram <- df %>%
  corpus() %>%
  tokens(remove_punct = TRUE,
         remove_numbers = TRUE) %>%
  dfm() %>%
  dfm_trim(min_docfreq = 4)
dfm_ngram <- df %>%
  corpus() %>%
  tokens(remove_punct = TRUE,
         remove_numbers = TRUE) %>%
  tokens_keep(featnames(dfm_onegram)) %>% # keep only tokens that appear in at least 4 docs (the features of dfm_onegram)
  tokens_ngrams(n = 2:3) %>%
  dfm() %>%
  dfm_trim(min_docfreq = 5)
Keep in mind, though, that rare words are now ignored when forming ngrams. If you have the text "only a rare small text", the resulting ngram will still be "only_a_small".
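A quick sketch of that caveat, with a toy text and a hand-picked keep list standing in for the document-frequency filter:

```r
library(quanteda)

toks <- tokens("only a rare small text")

# drop "rare", as if it had failed the document-frequency filter
toks_kept <- tokens_keep(toks, c("only", "a", "small", "text"))

# the surviving tokens become adjacent, so the trigram bridges the gap
tokens_ngrams(toks_kept, n = 3)
```

If you want to avoid such bridging ngrams, you would need to keep the rare tokens in place (e.g. replace them with a placeholder instead of removing them) before calling `tokens_ngrams()`.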
Answered by JBGruber on November 27, 2021
As in the previous question, just expand the pipeline with the ngrams you are looking for:
tdfm <- df$text %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
  # 2- and 3-grams
  tokens_ngrams(n = 2:3) %>%
  dfm()

dfm_keep(tdfm, pattern = featnames(tdfm)[docfreq(tdfm) > 5])
Document-feature matrix of: 7 documents, 5 features (14.3% sparse).
       features
docs    only_a a_small small_text only_a_small a_small_text
  text1      1       1          1            1            1
  text2      1       1          1            1            1
  text3      1       1          1            1            1
  text4      1       1          1            1            1
  text5      1       1          1            1            1
  text6      1       1          1            1            1
  text7      0       0          0            0            0
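As a side note, the dfm_keep() call above is equivalent to dfm_trim() with min_docfreq = 6, since docfreq > 5 means a feature appears in at least 6 documents. A rough check under that assumption:

```r
library(quanteda)

df <- data.frame(text = c(rep("only a small text", 6),
                          "remove this word lower frequency"))

tdfm <- dfm(tokens_ngrams(tokens(df$text), n = 2:3))

a <- dfm_keep(tdfm, pattern = featnames(tdfm)[docfreq(tdfm) > 5])
b <- dfm_trim(tdfm, min_docfreq = 6)

identical(featnames(a), featnames(b))  # same ngram features survive either way
```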
Answered by phiver on November 27, 2021