
Python function to loop through PDFs in a folder, and find keywords

Stack Overflow Asked by Michael H on January 1, 2021

Thank you so much for taking the time. Please see the code below. The code works, but instead of searching for one word, I need to search for several words. I’ve tried:

search_word = ['python', 'aws', 'sql']

but this doesn’t work. Any ideas on how to make this work?

Any suggestions to improve the code are all welcome!

Code:

import os
import PyPDF2  # legacy (pre-3.0) PyPDF2 API is used below

directory = r"/Users/resumes_for_testing/"

# define keywords
search_word = 'python'

# Loop through all PDFs in specified directory:
for filename in os.listdir(directory):
    if filename.endswith(".pdf"):
        # open the pdf file; join with the directory so this also works
        # when the script is run from a different working directory
        with open(os.path.join(directory, filename), 'rb') as f:
            pdf_reader = PyPDF2.PdfFileReader(f)

            # search for keywords
            for i in range(pdf_reader.numPages):
                page = pdf_reader.getPage(i)
                text = page.extractText()
                search_text = text.lower().split()
                for word in search_text:
                    # substring test against each lowercased token
                    if search_word in word:
                        print("The word '{}' was found in '{}'".format(search_word, filename))

2 Answers

Try pdfreader to extract the text:

import os
from pdfreader import SimplePDFViewer, PageDoesNotExist

def search_in_file(fname, search_words):
    # the context manager closes the file even if parsing fails
    with open(fname, "rb") as fd:
        viewer = SimplePDFViewer(fd)
        try:
            while True:
                viewer.render()
                # lowercase the page text so that matching ignores case
                text = "".join(viewer.canvas.strings).lower()
                for word in search_words:
                    if word in text:
                        print("The word '{}' was found in '{}' on page {}".format(word, fname, viewer.current_page_number))
                viewer.next()
        except PageDoesNotExist:
            # no more pages: stop scanning this file
            pass

# define keywords (lowercase, to match the lowercased page text)
search_words = ['python', 'aws', 'sql']

# define directory
directory = "./"

# Loop through all PDFs in specified directory:
for fname in os.listdir(directory):
    if fname.endswith(".pdf"):
        # join with the directory so any directory works, not only "./"
        search_in_file(os.path.join(directory, fname), search_words)
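
The matching step can also be tried on its own, without any PDF involved. The following is only a minimal sketch: `find_keywords` is a hypothetical helper name, not part of pdfreader, and it assumes a plain case-insensitive substring test is what you want.

```python
def find_keywords(text, keywords):
    """Return the keywords that occur anywhere in text, ignoring case."""
    lowered = text.lower()
    # Substring test: 'sql' also matches inside 'postgresql'.
    return [kw for kw in keywords if kw.lower() in lowered]


# Stand-in for the text of one page (made-up sample, not from a real PDF).
page_text = "Experienced in Python, AWS and PostgreSQL."
print(find_keywords(page_text, ["python", "aws", "sql", "java"]))
# → ['python', 'aws', 'sql']
```

Because this is a substring test, short keywords can match inside longer words, which may or may not be what you want for resumes.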

Answered by Maksym Polshcha on January 1, 2021

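You could try a small change in approach: instead of looping over search_text, loop through your list of search_words and use an if statement to check whether each word is in search_text. Note that search_text is a list of tokens, so this matches whole tokens only; a token with punctuation attached, such as 'python,', will not match 'python'.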

e.g.

import os
import PyPDF2  # legacy (pre-3.0) PyPDF2 API is used below

directory = r"/Users/resumes_for_testing/"

# define keywords
search_words = ['python', 'aws', 'sql']

# Loop through all PDFs in specified directory:
for filename in os.listdir(directory):
    if filename.endswith(".pdf"):
        # open the pdf file (path joined with the directory)
        with open(os.path.join(directory, filename), 'rb') as f:
            pdf_reader = PyPDF2.PdfFileReader(f)

            # search for keywords
            for i in range(pdf_reader.numPages):
                page = pdf_reader.getPage(i)
                text = page.extractText()
                search_text = text.lower().split()

                for word in search_words:
                    # list membership: the keyword must equal a whole token
                    if word in search_text:
                        print("The word '{}' was found in '{}'".format(word, filename))
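
One thing to watch with `word in search_text`: `split()` leaves punctuation attached to the tokens. A rough sketch of one way around that is to tokenise with a regular expression first. `match_whole_words` is a hypothetical helper name, and the token pattern (runs of ASCII letters and digits) is an assumption that will not suit keywords such as 'c++'.

```python
import re


def match_whole_words(text, keywords):
    """Return the keywords that appear in text as whole words, ignoring case."""
    # Keep only runs of letters and digits, so "Python," yields the token "python".
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return [kw for kw in keywords if kw.lower() in tokens]


# Made-up sample text standing in for one extracted page.
sample = "Skills: Python, SQL; also worked with PostgreSQL."
keywords = ["python", "aws", "sql"]

# Plain split() keeps the punctuation: the tokens are 'python,' and 'sql;'.
naive = [kw for kw in keywords if kw in sample.lower().split()]
print(naive)  # → []
print(match_whole_words(sample, keywords))  # → ['python', 'sql']
```

Unlike a substring test, this does not count 'sql' as found just because 'postgresql' is present.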

Answered by Matthew King on January 1, 2021
