
How can we extract fields from images?

Data Science Asked by hR 312 on December 31, 2020

I am making a document parser that extracts data fields from documents and stores them in a structured way. Each field in my dataset is laid out horizontally, which is easy to extract.

[example image: document with horizontally aligned label/value fields]

But the model fails on the following type of example:

[example image: document where the labels and values are not on the same line]

Is there any way to extract the invoice number and date from such images?

5 Answers

I have a similar use-case and a working product based on the tensorflow object-detection api and pytesseract for OCR. On top of the extracted text, I run regex to validate the extracted information and clean it to meet the requirements of other processes.

Steps:
1. Annotate images with a tool like labelimg.
I annotated a set of 1K images, similar to yours, with 23 different classes. The dataset is unbalanced: some classes appear in almost every image, while others appear in as few as 60 images. However, there are ways to ensure that this imbalance does not affect the performance of the network.
2. Choose a model from the tf model zoo (I use this frcnn model) and retrain the last two layers using transfer learning.
3. Export the inference graph, perform object detection to identify the region of interest, and run OCR on the region of interest to extract the text.
I'd recommend storing the extracted data in a dictionary with class of the object as key and the extracted text as value.
4. Finally, have regex validate the text in each extracted field and perform any manipulation/transformation that is necessary, as sketched after this list.
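Here is a minimal sketch of steps 3 and 4, assuming the object-detection step has already produced class names and bounding boxes. The detection results, box coordinates, and regex patterns below are illustrative placeholders, not values from my actual product.

```python
# Sketch of steps 3-4: crop each detected region, OCR it with pytesseract,
# then validate/clean the text with a per-class regex.
import re
from PIL import Image
import pytesseract

# Hypothetical detections: class label -> bounding box (left, top, right, bottom),
# as produced by the object-detection model.
detections = {
    "invoice_number": (620, 90, 780, 130),
    "invoice_date":   (620, 140, 780, 180),
}

# Per-class validation patterns (illustrative; adjust to your documents).
patterns = {
    "invoice_number": re.compile(r"\d{3,}"),
    "invoice_date":   re.compile(r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}"),
}

def extract_fields(image_path, detections, patterns):
    """Return {class_name: cleaned_text} for every detected region."""
    page = Image.open(image_path)
    results = {}
    for cls, box in detections.items():
        text = pytesseract.image_to_string(page.crop(box)).strip()
        match = patterns[cls].search(text)
        results[cls] = match.group(0) if match else None
    return results

# "invoice.png" is a placeholder path for one of your document images.
print(extract_fields("invoice.png", detections, patterns))
```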

The trained model can be deployed to production with the help of tfserving. The same trained network can be deployed in a mobile app as well; look for tutorials on tensorflowlite for this.

Hope my answer helps you! I had a tough (but interesting) time gathering the knowledge required to build a production-grade product that currently serves hundreds of requests every day. I would recommend reading all the links I have shared in this answer in full, and feel free to ask more questions. Good luck!

Correct answer by qNin on December 31, 2020

I think you already have some OCR in place? I don't know if you also have the x-y locations and size of the recognized texts?

I hope you have a model that knows (has learned) occurrences of 'invoice #' as a label.

And maybe you can use machine learning to recognize values that could be invoice numbers: '2034' or '200.00' could be invoice numbers, while 'Date' and 'Service fee' could not.

You could machine learn relations between objects, probably with the help of a distance function.

I would say that a string value that contains mostly digits, is near a label that matches 'invoice #', and also has a similar size, is the most likely invoice number.

564 could be an invoice number, but it is too far away from invoice # (further than 2034).

'Date' is close to invoice number, but it does not match an expected string for invoice numbers, since it is mostly letters.
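A rough sketch of this distance-based heuristic, assuming your OCR already gives you each token's text and position. The token list, coordinates, and "mostly digits" rule below are illustrative, not part of the original answer.

```python
# Pick the numeric-looking token closest to a label that matches 'invoice #'.
import math
import re

# Hypothetical OCR output: (text, x, y) per token.
tokens = [
    ("Invoice #", 400, 50),
    ("2034", 400, 80),     # close to the label and mostly digits -> best match
    ("Date", 400, 120),
    ("564", 60, 300),      # numeric, but far from the label
]

def looks_like_invoice_number(text):
    digits = sum(c.isdigit() for c in text)
    return digits >= max(1, len(text) // 2)   # "mostly digits"

def find_invoice_number(tokens):
    labels = [t for t in tokens if re.search(r"invoice\s*#", t[0], re.I)]
    if not labels:
        return None
    _, lx, ly = labels[0]
    candidates = [t for t in tokens if looks_like_invoice_number(t[0])]
    # Choose the candidate nearest to the label.
    return min(candidates, key=lambda t: math.hypot(t[1] - lx, t[2] - ly))[0]

print(find_invoice_number(tokens))   # -> "2034"
```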

Answered by Pieter21 on December 31, 2020

I would suggest that you use a pre-trained OCR model and train your own custom model which outputs only the required data.

Training method:

Just use a pre-trained OCR model like this, remove the tail of the model, and add a custom output layer with the required number of fields (in your case, invoice number and date). After this, freeze the pre-trained layers and train your custom model with the data you have.

Note:

  • The accuracy of the model can be improved if you train this custom model on as many different invoice templates as possible, because the custom OCR model has to learn to figure out the position of the invoice number and date by itself.
  • If the model's output is incorrect for a certain template, you can always generate your own (synthetic) data by editing the template and adding many examples of that particular template.

Using a pre-trained model, you can get pretty decent results with less training data. If you haven't used a pre-trained model before, here is a more general introduction to transfer learning in PyTorch. I hope it will help you.
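A minimal PyTorch sketch of this freeze-and-retrain idea. The specific pre-trained OCR model linked in the answer isn't reproduced here, so a generic torchvision backbone stands in for it; the number of fields and the hyperparameters are placeholders.

```python
# Freeze a pre-trained network and train only a custom output head.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pre-trained layers.
for param in backbone.parameters():
    param.requires_grad = False

# Replace the final layer with a custom head sized for the fields of interest.
num_fields = 2          # e.g. invoice number, date
backbone.fc = nn.Linear(backbone.fc.in_features, num_fields)

# Only the new head's parameters are given to the optimizer.
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
```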

Answered by Vikas Bhandary on December 31, 2020

A completely different answer:

I am currently following a course Computer Vision and Image Analysis.

With your problem in mind you could follow along. Depending on previous knowledge you could skip a few sections. (I skipped immediately to Beyond Classification/Object Detection)

Globally, you could train an image classification model that recognizes regions of interest, together with a classification of their content. Where the course addresses people, cars, and buses, in your problem you have images, labels, and content (of various types). You may need to experiment: maybe have 'label-value pair' as a class, or even 'label-multiline'. Or maybe labels and values work better as separate classes? Or even a third option where you identify all combinations: 'label', 'value', and 'label-value'.

You should probably define a low, wide bounding box for horizontal label/value pairs, and a more square one for vertically aligned label/value pairs.

You should end up with labeled regions of interest.

If you are happy with these regions, then the second step could be OCR. For the OCR you could use a similar problem decomposition to recognize and label the separate characters. Then you still have to combine the characters into words or values.

Answered by Pieter21 on December 31, 2020

More or less, this would be helpful.

Link: Extracting information from documents

Approach & Algorithm from the above blog

Approach

The algorithm looks for phrases that look like a date. Then it picks the one which appears highest in the document. In the corpus we used, almost every date contained the month written as a word (e.g. April) and the day written in digits (13), followed by the year (1994). Sometimes the day was printed before the month (e.g. 4th September, 1984). The algorithm looks for the patterns M D Y and D M Y, where M is a month given as a word, D is a number representing the day, and Y a number representing a year.

Software Tools

Our implementation runs in a Jupyter Notebook with Python 3. We use Tesseract version 4, for doing OCR through the wrapper pytesseract. Since the software sometimes gets a letter of the month wrong (e.g., duly instead of July), we accept all strings which almost look like a month in the sense that only a few letters need to be changed to reach a valid month. The number of these operations is called the Levenshtein distance, a common string metric in natural language processing (NLP). For example, the Levenshtein distance of duly and July is 1. Similarly for Moy, Septenber or similar errors. We use python-Levenshtein. For detecting numbers (years and days), we use regular expressions. We process all the tables in Pandas and use tqdm to have a neat progress bar.

Algorithm step by step
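The blog's own step-by-step code is not reproduced here, but a rough sketch of the approach described above (fuzzy month matching via Levenshtein distance, plus regex-style token scanning for day and year) could look like the following. The threshold, token window, and example string are illustrative.

```python
# Find "D Month Y" or "Month D Y" phrases, tolerating OCR errors in the month
# (e.g. 'duly' instead of 'July') via the python-Levenshtein package.
import re
import Levenshtein

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def match_month(word, max_dist=2):
    """Return the month a word almost spells (e.g. 'duly' -> 'July'), else None."""
    for month in MONTHS:
        if Levenshtein.distance(word.lower(), month.lower()) <= max_dist:
            return month
    return None

def find_dates(text):
    """Scan word/number tokens for M D Y and D M Y patterns."""
    dates = []
    tokens = re.findall(r"[A-Za-z]+|\d+", text)
    for i in range(len(tokens) - 2):
        window = tokens[i:i + 3]
        for m_pos, d_pos in ((0, 1), (1, 0)):          # M D Y  or  D M Y
            month = match_month(window[m_pos])
            day, year = window[d_pos], window[2]
            if month and day.isdigit() and year.isdigit() and len(year) == 4:
                dates.append(f"{day} {month} {year}")
    return dates

print(find_dates("Signed in New York on 13 duly 1994 by ..."))  # ['13 July 1994']
```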

Similar Questions from Stackoverflow:

  1. How to Extract Information from the Image
  2. How to Extract Information from the Image(PNG)

Answered by Pluviophile on December 31, 2020
