Data Science Asked by Doxcos44 on May 24, 2021
I want to perform information extraction from documents. I wanted to try Spacy’s NER method, so I follow following steps :
1)OCR on text document, using Tesseract. As output I have a list of words with corresponding box on the document
Reconstruct text based on the box coordinates of each word detected (to get full text of document)
Tag each box with a custom entity
Generate Training data adapted for Spacy NER (located in variable TRAIN_DATA)
I have 3000+ samples inside Training and 9 custom entities. Documents are in french. My code for training is :
from __future__ import unicode_literals,print_function
import plac
import random
from pathlib import Path
import spacy
from tqdm import tqdm
model = None
import shutil
n_iter = 100
if model is not None :
nlp = spacy.load(model)
print(f"Loaded model {model}")
else :
nlp = spacy.blank('fr')
print('Import blank model')
if 'ner' not in nlp.pipe_names:
print('creating ner pipeline')
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner,last = True)
else :
ner = nlp.get_pipe('ner')
for _,annotations in train_data:
for ent in annotations.get('entities'):
ner.add_label(ent[2]) #ent is (start_span,end_span,entity_name)
other_pipes = [pipe for pipe in nlp.pipe_names if pipe!= ner]
with nlp.disable_pipes(*other_pipes):
optimizer = nlp.begin_training()
for itn in range(n_iter):
random.shuffle(TRAIN_DATA)
losses = {}
for texts,annotations in tqdm(TRAIN_DATA):
nlp.update([texts], [annotations], sgd=optimizer, drop=0.35, losses=losses)
print("Losses", losses)
The problem is that at each iteration I have Losses {}. It seems the train is not working.
If someone have tips of have experience with ner on ocr output thanks you. The reconstructed text is quit long on average (2347 characters). I can’t provide example of training data.
In fact I think it is difficult to perform NER because my documents are not really structured with sentences (my documents are bills)
Get help from others!
Recent Answers
Recent Questions
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP