TransWikia.com

How to extract text from a PDF and then answer questions about that same document with a question-answering model?

Data Science Asked by Arijit Das on May 24, 2021

To illustrate the title:

Suppose you have a PDF document that was scanned from a hard copy, and there is a fixed set of questions to answer from the document itself.
For example, the document is a land contract, and the fixed questions are “Who is the seller?” and “What is the price of the asset?”. The document probably refers to each answer 2-3 times, so for a human it is a simple task.

How can this be automated?

One Answer

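You can use PyPDF2 to extract text from a PDF. Note that it reads only the text layer embedded in the file, so a purely scanned PDF with no text layer would first need OCR.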

import PyPDF2

# Written for PyPDF2 3.x, where the older names used previously
# (PdfFileReader, getNumPages, getPage, extractText) were removed.
with open('sample.pdf', 'rb') as pdf_file, open('sample_output.txt', 'w', encoding='utf-8') as text_file:
    reader = PyPDF2.PdfReader(pdf_file)
    # reader.pages behaves like a list of page objects; number them from 1
    for page_number, page in enumerate(reader.pages, start=1):
        print('Page No - ' + str(page_number))
        # extract_text() returns the page's embedded text (may be empty for scanned pages)
        page_content = page.extract_text()
        text_file.write(page_content)
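
Once the text is in a file, the fixed questions still have to be matched against it. The following is only a minimal baseline sketch, not part of the original answer: it cleans up the extracted text and returns the sentence that shares the most keywords with each question. The names `normalize`, `tokens`, `best_sentence`, the `STOPWORDS` set and the sample contract text are all hypothetical and chosen for illustration.

```python
import re

# Hypothetical, deliberately tiny stopword list; extend it for real documents.
STOPWORDS = {"who", "what", "is", "the", "of", "a", "an", "in", "to", "and"}


def normalize(text):
    """Rejoin words hyphenated across line breaks and collapse whitespace."""
    # Note: this also joins genuine hyphenated compounds that happen to
    # break at a line end, which is an accepted limitation of the sketch.
    text = re.sub(r"-\s*\n\s*", "", text)
    return re.sub(r"\s+", " ", text).strip()


def tokens(text):
    """Lowercased set of alphanumeric words, minus stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def best_sentence(question, text):
    """Return the sentence sharing the most keywords with the question,
    or None when no sentence shares any keyword."""
    sentences = re.split(r"(?<=[.!?])\s+", normalize(text))
    q = tokens(question)
    best = max(sentences, key=lambda s: len(q & tokens(s)))
    return best if q & tokens(best) else None


# Made-up sample text standing in for the contents of sample_output.txt.
sample = (
    "This agreement is made on 1 May 2021. "
    "The seller is John Smith and the buyer\nis Jane Doe. "
    "The price of the asset is 50,000 USD."
)

for question in ("who is the seller?", "what is price of the asset?"):
    print(question, "->", best_sentence(question, sample))
```

Keyword overlap only locates the relevant sentence; it does not pick out the exact name or amount. If that is needed, the same loop could pass each candidate sentence to an extractive question-answering model instead, which is left out here because the choice of model is an open assumption.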

Answered by Musakkhir Sayyed on May 24, 2021
