Data Science Asked by Arijit Das on May 24, 2021
To illustrate the title:
Suppose you have a PDF document that was scanned from a hard copy, and there is a fixed set of questions to answer from the document itself.
For example, the document is a land contract and the fixed questions are “Who is the seller?” and “What is the price of the asset?”. The document probably states each answer two or three times, so for a human this is a simple task.
How can this be automated?
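You can use PyPDF2 to extract text from a PDF. Note that it only reads text already embedded in the file; a purely scanned PDF with no text layer needs an OCR step first.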
import PyPDF2

# PyPDF2 3.x removed PdfFileReader, getNumPages, getPage and extractText;
# PdfReader, reader.pages and extract_text() are the current names.
with open('sample.pdf', 'rb') as pdf_file, open('sample_output.txt', 'w', encoding='utf-8') as text_file:
    reader = PyPDF2.PdfReader(pdf_file)
    for page_number, page in enumerate(reader.pages, start=1):
        print('Page No - ' + str(page_number))
        # Fall back to an empty string if nothing could be extracted from the page
        page_content = page.extract_text() or ''
        text_file.write(page_content)
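Once the text is in a file, one simple way to start on the fixed questions is to pull out the sentences that mention words related to each question. The sketch below is only an illustration under assumptions: the keyword lists, the QUESTION_KEYWORDS mapping and the helper names split_sentences and candidate_sentences are hypothetical, and plain keyword matching returns candidate sentences, not final answers.

```python
import re

# Hypothetical mapping from each fixed question to keywords that are assumed
# to appear near its answer; adjust these for the actual documents.
QUESTION_KEYWORDS = {
    "Who is the seller?": ["seller", "vendor"],
    "What is the price of the asset?": ["price", "consideration"],
}


def split_sentences(text):
    """Split text into rough sentences on '.', '!' or '?' followed by whitespace."""
    flat = " ".join(text.split())  # collapse line breaks left by text extraction
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]


def candidate_sentences(text, keywords):
    """Return the sentences that contain any keyword as a whole word, ignoring case."""
    alternatives = "|".join(re.escape(k) for k in keywords)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return [s for s in split_sentences(text) if pattern.search(s)]


if __name__ == "__main__":
    # Made-up sample text standing in for the contents of sample_output.txt
    sample = (
        "This agreement is made on 1 March 2020. "
        "The Seller is John Smith of Pune.\n"
        "The Buyer agrees to pay the price of 500000 rupees for the land."
    )
    for question, keywords in QUESTION_KEYWORDS.items():
        print(question)
        for sentence in candidate_sentences(sample, keywords):
            print("  -", sentence)
```

Because the contract repeats each answer several times, several candidate sentences per question are expected; they can be reviewed by hand or passed to a more capable extraction step.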
Answered by Musakkhir Sayyed on May 24, 2021