Data Science Asked by icl1c on October 10, 2020
I am trying to find the best way to extract information from bank statements. A bank transaction is not natural text, but it is still human-readable.
I would like to extract data such as the payment method, date, amount, vendor/customer name, and even details like an order/invoice ID or the payment reason, whenever they are present. I have datasets I can use for training (100k+ vendor names, payment-method keywords, etc.). The solution must also work across multiple languages, since the bank transactions are not only in English.
Is named entity recognition the best way to go?
Here is some sample data I can have as input:
Thanks for your help
I am currently working on something in this domain.
The rough process I am currently following is:
1. Bank statements from a specific bank generally tend to be structured in the same format. Hence, while converting the TXT to CSV, you can structure the algorithm so that it knows what to pick up, based on a rough analysis of the TXT file.
2. Perform your analysis on the Description column, or on the entire transactional DataFrame you have thus generated.
You can use the following code (Python 3, using pdfminer.six) to convert a given PDF to a txt file:
import os
from io import StringIO

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams

def pdfparser(data):
    # Write the extracted text next to the PDF, with a .txt extension.
    dest = data[:-3] + "txt"
    print(dest)
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    # Create a PDF interpreter object.
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Process each page contained in the document.
    with open(data, "rb") as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
    text = retstr.getvalue()
    device.close()
    retstr.close()
    # Write the extracted text to a file.
    with open(dest, "w", encoding="utf-8") as f:
        f.write(text)

# Set the working directory and convert every PDF in it.
path = "F:/banking/payments/"
os.chdir(path)
for x in os.listdir(path):
    if x[-3:] == "pdf":
        pdfparser(path + x)
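Once the statements are plain text, the TXT-to-CSV step can be sketched roughly as follows. The line layout, column names, and regex here are hypothetical examples; you would adapt them to the format your bank actually uses:

import csv
import io
import re

# Hypothetical statement-line layout: date, description, amount,
# separated by runs of two or more spaces, e.g.
# "03/01/2020  CARD PAYMENT TO AMAZON  -23.99"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4})\s{2,}(.+?)\s{2,}(-?\d+\.\d{2})$")

def txt_to_rows(text):
    """Parse statement lines into (date, description, amount) tuples,
    skipping lines that do not look like transactions."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            rows.append((m.group(1), m.group(2), float(m.group(3))))
    return rows

sample = """03/01/2020  CARD PAYMENT TO AMAZON  -23.99
04/01/2020  SALARY ACME CORP  2500.00
not a transaction line"""

rows = txt_to_rows(sample)

# Write the parsed rows out as CSV (here to an in-memory buffer).
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["date", "description", "amount"])
writer.writerows(rows)
print(buf.getvalue())

The resulting CSV can then be loaded into a DataFrame for the analysis on the Description column mentioned above.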
Answered by vsdaking on October 10, 2020
I got good results by treating this question as a classification problem, using embeddings (GloVe 50-dimensional word embeddings) and a bidirectional LSTM. I know this problem looks more like an entity-recognition problem, but in my use case I only need to classify a known subset of merchants, so it works well. As the training data was very unbalanced, I also used data synthesis to boost the accuracy.
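The data-synthesis step for the unbalanced classes can be sketched roughly like this. The perturbations (random reference digits, case changes) and the oversampling scheme are illustrative assumptions, not the exact method used:

import random

def synthesize(description, n, seed=0):
    """Generate n noisy variants of a transaction description by
    appending random reference digits and varying the case, so that
    minority merchants get more training examples."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        ref = "".join(rng.choice("0123456789") for _ in range(6))
        text = description.upper() if rng.random() < 0.5 else description.lower()
        variants.append(f"{text} REF {ref}")
    return variants

def balance(dataset, target_per_class, seed=0):
    """Oversample each class up to target_per_class examples using
    synthesized variants of its existing descriptions."""
    by_class = {}
    for text, label in dataset:
        by_class.setdefault(label, []).append(text)
    balanced = []
    for label, texts in by_class.items():
        balanced.extend((t, label) for t in texts)
        deficit = target_per_class - len(texts)
        i = 0
        while deficit > 0:
            take = min(deficit, 3)
            variants = synthesize(texts[i % len(texts)], take, seed)
            balanced.extend((v, label) for v in variants)
            deficit -= take
            i += 1
    return balanced

dataset = [
    ("CARD PAYMENT AMAZON", "amazon"),
    ("AMAZON PRIME", "amazon"),
    ("TRANSFER FROM ACME", "acme"),
]
balanced = balance(dataset, 4)
print(len(balanced))  # → 8 (each class padded to 4 examples)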
My Keras model:
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
words_input (InputLayer)        (None, None)         0
__________________________________________________________________________________________________
casing_input (InputLayer)       (None, None)         0
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, None, 50)     20000000    words_input[0][0]
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, None, 9)      81          casing_input[0][0]
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, None, 59)     0           embedding_1[0][0]
                                                                 embedding_2[0][0]
__________________________________________________________________________________________________
bidirectional_1 (Bidirectional) [(None, 400), (None, 416000      concatenate_1[0][0]
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 1591)         637991      bidirectional_1[0][0]
==================================================================================================
Total params: 21,054,072
Trainable params: 1,053,991
Non-trainable params: 20,000,081
__________________________________________________________________________________________________
Answered by Phililippe on October 10, 2020