Data Science Asked by icl1c on October 10, 2020
I am trying to find the best way to extract information from bank statements. A bank transaction is not natural text, but it is still human-readable.
I would like to extract data such as the payment method, date, amount, vendor/customer name, and even details like an order/invoice ID or the payment reason, whenever they are present. I have datasets I can use for training (100k+ vendor names, payment-method keywords, etc.). The solution must also work across multiple languages, since the bank transactions are not only in English.
Is named entity recognition the best way to go?
Here is some sample data I can have as input:
Thanks for your help
I am currently working on something in this domain.
The rough process I am currently following is:
1. Bank statements from a specific bank generally tend to be structured in the same format. Hence, while converting the TXT to CSV, you can structure the algorithm so that it knows what to pick up, based on a rough analysis of the TXT file.
2. Perform your analysis on the Description column, or on the entire transactional DataFrame you have thus generated.
You can use the following code (Python 3, using pdfminer.six) to convert a given PDF to a txt file:
import os
from io import StringIO

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams

def pdfparser(data):
    # Write the extracted text next to the PDF, with a .txt extension.
    dest = data[:-3] + "txt"
    print(dest)
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    # Create a PDF interpreter object.
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Process each page contained in the document.
    with open(data, "rb") as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
    text = retstr.getvalue()
    device.close()
    retstr.close()
    # Write the extracted text to a file.
    with open(dest, "w", encoding="utf-8") as f:
        f.write(text)

# Set the working directory and convert every PDF in it.
path = "F:/banking/payments/"
os.chdir(path)
for x in os.listdir(path):
    if x[-3:] == "pdf":
        pdfparser(path + x)
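Once the statements are plain text, the TXT-to-CSV step can be sketched roughly as follows. The line layout, column names, and regex here are hypothetical examples; you would adapt them to the format your bank actually uses:

import csv
import io
import re

# Hypothetical statement-line layout: date, description, amount,
# separated by runs of two or more spaces, e.g.
# "03/01/2020  CARD PAYMENT TO AMAZON  -23.99"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4})\s{2,}(.+?)\s{2,}(-?\d+\.\d{2})$")

def txt_to_rows(text):
    """Parse statement lines into (date, description, amount) tuples,
    skipping lines that do not look like transactions."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            rows.append((m.group(1), m.group(2), float(m.group(3))))
    return rows

sample = """03/01/2020  CARD PAYMENT TO AMAZON  -23.99
04/01/2020  SALARY ACME CORP  2500.00
not a transaction line"""

rows = txt_to_rows(sample)

# Write the parsed rows out as CSV (here to an in-memory buffer).
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["date", "description", "amount"])
writer.writerows(rows)
print(buf.getvalue())

The resulting CSV can then be loaded into a DataFrame for the analysis on the Description column mentioned above.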
Answered by vsdaking on October 10, 2020
I got good results by treating this question as a classification problem, using embeddings (GloVe 50-dimensional word embeddings) and a bidirectional LSTM. I know this problem looks more like an entity-recognition problem, but in my use case I only need to classify a known subset of merchants, so it works well. As the training data was very unbalanced, I also used data synthesis to boost the accuracy.
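The data-synthesis step for the unbalanced classes can be sketched roughly like this. The perturbations (random reference digits, case changes) and the oversampling scheme are illustrative assumptions, not the exact method used:

import random

def synthesize(description, n, seed=0):
    """Generate n noisy variants of a transaction description by
    appending random reference digits and varying the case, so that
    minority merchants get more training examples."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        ref = "".join(rng.choice("0123456789") for _ in range(6))
        text = description.upper() if rng.random() < 0.5 else description.lower()
        variants.append(f"{text} REF {ref}")
    return variants

def balance(dataset, target_per_class, seed=0):
    """Oversample each class up to target_per_class examples using
    synthesized variants of its existing descriptions."""
    by_class = {}
    for text, label in dataset:
        by_class.setdefault(label, []).append(text)
    balanced = []
    for label, texts in by_class.items():
        balanced.extend((t, label) for t in texts)
        deficit = target_per_class - len(texts)
        i = 0
        while deficit > 0:
            take = min(deficit, 3)
            variants = synthesize(texts[i % len(texts)], take, seed)
            balanced.extend((v, label) for v in variants)
            deficit -= take
            i += 1
    return balanced

dataset = [
    ("CARD PAYMENT AMAZON", "amazon"),
    ("AMAZON PRIME", "amazon"),
    ("TRANSFER FROM ACME", "acme"),
]
balanced = balance(dataset, 4)
print(len(balanced))  # → 8 (each class padded to 4 examples)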
My Keras model:
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
words_input (InputLayer)        (None, None)         0
__________________________________________________________________________________________________
casing_input (InputLayer)       (None, None)         0
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, None, 50)     20000000    words_input[0][0]
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, None, 9)      81          casing_input[0][0]
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, None, 59)     0           embedding_1[0][0]
                                                                 embedding_2[0][0]
__________________________________________________________________________________________________
bidirectional_1 (Bidirectional) [(None, 400), (None, 416000      concatenate_1[0][0]
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 1591)         637991      bidirectional_1[0][0]
==================================================================================================
Total params: 21,054,072
Trainable params: 1,053,991
Non-trainable params: 20,000,081
__________________________________________________________________________________________________
Answered by Phililippe on October 10, 2020