Extract Information from PDF using DL

Data Science Asked by Rajesh Sharma on December 8, 2020

We need to extract information from credit history documents. These are usually computer-generated PDFs.
Because the PDFs are generated by different sources, the layout differs from source to source, and so do the column header labels.
Currently 4 sources generate this document, but many more will be added going forward. From each document, we need to extract information such as lender name, lending amount, outstanding balance, etc.

What are the steps, and what is a practical approach, for extracting the data I want (lender name, amount, balance, etc.)?

Is there an established Machine Learning / Deep Learning approach that can be implemented here? I am just getting to know the basics of ML/DL, so I need some direction, please.

One Answer

The task is doable, but time-consuming and not easy. This is how I would plan the work:

  1. Write a PDF scraper that navigates the document and converts all the information it contains into some standardized textual format.

  2. Label the words/elements of this text that correspond to the information you are looking for ("lender name", "lending amount", "outstanding balance", etc.).

  3. Train some NLP model, such as an RNN classifier or CRF, on the labeled text and use it to extract the information.
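
As a rough illustration of the "standardized textual format" in step 1, here is a minimal sketch. It assumes the table rows have already been pulled out of the PDF by some extraction library (not shown), and that each source uses its own header labels. The alias table, the canonical field names, and the sample rows are all hypothetical.

```python
import re

# Hypothetical mapping from canonical field names to header labels
# seen in different sources. Extend it as new sources appear.
HEADER_ALIASES = {
    "lender_name": {"lender", "lender name", "creditor", "bank name"},
    "lending_amount": {"loan amount", "lending amount", "sanctioned amount"},
    "outstanding_balance": {"outstanding", "outstanding balance", "current balance"},
}


def normalize_header(label):
    """Map a raw header label to a canonical field name, or None if unknown."""
    # Lowercase, replace punctuation/digits with spaces, collapse whitespace.
    cleaned = re.sub(r"[^a-z ]", " ", label.lower())
    cleaned = " ".join(cleaned.split())
    for field, aliases in HEADER_ALIASES.items():
        if cleaned in aliases:
            return field
    return None


def standardize_rows(header, rows):
    """Turn one extracted table (header plus rows of strings) into canonical dicts."""
    fields = [normalize_header(h) for h in header]
    records = []
    for row in rows:
        # Columns whose header is not recognized are dropped.
        record = {f: cell.strip() for f, cell in zip(fields, row) if f is not None}
        records.append(record)
    return records


# Hypothetical tables from two different sources.
source_a = (["Lender Name", "Loan Amount", "Outstanding"],
            [["Acme Bank", "50,000", "12,500"]])
source_b = (["Creditor", "Sanctioned Amount:", "Current Balance", "Remarks"],
            [["Beta Finance", "20000", "0", "closed"]])

records = standardize_rows(*source_a) + standardize_rows(*source_b)
```

A lookup table like this only covers sources you have already seen. The labeled data and model from steps 2 and 3 are what would let you handle layouts from new sources.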

Steps 1 and 2 are very time-consuming (2 more than 1), but it's certainly doable. It will be a lot of work, especially labeling observations to create the training set, but a nice thing to put on your CV.
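
To give a sense of what the labeling work produces, here is a small sketch that turns annotated spans into per-token BIO tags, a common input format for sequence models such as a CRF or an RNN tagger. The whitespace tokenizer, the label names, the hand-crafted features, and the sample line are assumptions for illustration, not a prescribed scheme.

```python
def bio_tags(tokens, spans):
    """Build BIO tags from spans given as (start_token, end_token_exclusive, label)."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


def token_features(tokens, i):
    """Simple hand-crafted features for token i, of the kind often fed to a CRF."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_numeric": tok.replace(",", "").replace(".", "").isdigit(),
        "is_title": tok.istitle(),
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


# Hypothetical line of text produced by the PDF scraper.
text = "Lender: Acme Bank Amount: 50,000 Balance: 12,500"
tokens = text.split()

# Manual annotation: token index ranges for each field of interest.
spans = [(1, 3, "LENDER"), (4, 5, "AMOUNT"), (6, 7, "BALANCE")]

tags = bio_tags(tokens, spans)
features = [token_features(tokens, i) for i in range(len(tokens))]
```

Each `(features, tags)` pair is one training example. Collecting many of these across all sources is the time-consuming part.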

Answered by Leevo on December 8, 2020
