Stack Overflow Asked by David Perea on December 19, 2020
I am using the keyword_directory function from the pdfsearch package to extract the paragraphs containing the phrase "Artificial Intelligence" from PDF documents. However, it does not extract all the paragraphs containing that phrase; many are missed. In particular, it fails to extract those at the ends of the document.
library(textreadr)
library(tidyverse)
library(pdfsearch)
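# directory_path is a placeholder for the folder that contains the PDF files
dirct <- directory_path
# Search every PDF in that folder for the keyword, with no surrounding lines of context
result <- keyword_directory(dirct, keyword = 'Artificial Intelligence', split_pdf = TRUE, surround_lines = 0, full_names = TRUE)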
For example, in this file:
https://www.telefonica.com/documents/153952/13347920/2019-Telefonica-Consolidated-Management-Report.pdf/0a9c8382-c9ff-ba52-1d5b-e431a7efab3f
I only get 22 mentions, but there are about 40 mentions of this keyword (Artificial Intelligence) in the document.
Why is this?
You might want to try grepl
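Example for a data frame:
library(dplyr)  # provides mutate()
data_frame <- read.csv2(...)
# columx is the 0/1 flag column; text_column is a hypothetical name for the column that holds the text
data_frame <- mutate(data_frame, columx = 0)
data_frame$columx[grepl("artificial intelligence", data_frame$text_column, ignore.case = TRUE)] <- 1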
The match is case-insensitive, as indicated by ignore.case = TRUE.
You should also account for intra-word dashes, etc.
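To illustrate the point about dashes, here is a minimal sketch of a pattern that tolerates a hyphen or any whitespace (including a line break) between the two words. The sample strings are made up for illustration and are not taken from the Telefonica report; whether this is what causes the missing matches in your file is an assumption you would need to check.

```r
# Hypothetical paragraphs standing in for text extracted from a PDF
paragraphs <- c(
  "We invest in Artificial Intelligence.",
  "Progress in artificial\nintelligence is fast.",
  "An Artificial-Intelligence platform.",
  "Nothing relevant here."
)

# Allow one or more whitespace characters or hyphens between the two words
pattern <- "artificial[[:space:]-]+intelligence"

# Case-insensitive match, one logical value per paragraph
hits <- grepl(pattern, paragraphs, ignore.case = TRUE)

# Keep only the paragraphs that mention the phrase
matched <- paragraphs[hits]
```

A plain search for "Artificial Intelligence" would only find the first string here, since the other two have a line break or a dash where the space is expected.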
When your source file is a PDF, try creating a corpus (VCorpus) and transforming it into a document-term matrix (DocumentTermMatrix); both functions come from the tm package.
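A rough sketch of that corpus route, assuming the tm package is installed. The page texts below are hypothetical and given in memory so the example is self-contained; with a real PDF you would first have to extract the text. Note that a document-term matrix counts single terms by default, so this only flags pages that contain both words, not necessarily the exact phrase.

```r
library(tm)

# Hypothetical page texts; in practice these would come from the PDF
pages <- c(
  "Artificial Intelligence drives our network operations.",
  "Revenue grew in all regions.",
  "We apply artificial intelligence to customer care."
)

# Build a corpus with one document per page
corpus <- VCorpus(VectorSource(pages))

# Normalise case and strip punctuation before counting terms
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)

# Rows are documents (pages), columns are terms
dtm <- DocumentTermMatrix(corpus)
m <- as.matrix(dtm)

# Flag pages in which both terms occur at least once
has_both <- m[, "artificial"] > 0 & m[, "intelligence"] > 0
```

You could then run grepl only on the flagged pages to confirm the exact phrase.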
Answered by arndtupb on December 19, 2020