TransWikia.com

Training custom NER on OCR text with SpaCy won't train

Data Science Asked by Doxcos44 on May 24, 2021

I want to perform information extraction from documents. I wanted to try spaCy's NER, so I followed these steps:

  1. Run OCR on the document with Tesseract. The output is a list of words, each with its bounding box on the page.

  2. Reconstruct the text from the box coordinates of each detected word (to get the full text of the document).

  3. Tag each box with a custom entity.

  4. Generate training data in the format spaCy's NER expects (stored in the variable TRAIN_DATA).
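Steps 2 to 4 can be sketched roughly as follows. This is a minimal illustration under assumptions, not the pipeline actually used here: the `Word` tuple, the `build_example` helper, the `line_tol` tolerance, and the sample words are all hypothetical. Only the output shape, `(text, {"entities": [(start, end, label)]})`, matches what the training loop below consumes. Each tagged box becomes its own entity span.

```python
from collections import namedtuple

# Hypothetical OCR record: word text, top-left corner of its box, optional entity label.
Word = namedtuple("Word", ["text", "left", "top", "label"])


def build_example(words, line_tol=10):
    """Rebuild text from word boxes and return (text, {"entities": [...]})."""
    # Sort top-to-bottom, then group words whose `top` is within
    # line_tol pixels of the first word of the current line.
    ordered = sorted(words, key=lambda w: (w.top, w.left))
    lines = []
    for w in ordered:
        if lines and abs(w.top - lines[-1][0].top) <= line_tol:
            lines[-1].append(w)
        else:
            lines.append([w])

    text = ""
    entities = []
    for line in lines:
        # Within a line, read the words left to right.
        for i, w in enumerate(sorted(line, key=lambda w: w.left)):
            if i > 0:
                text += " "
            start = len(text)
            text += w.text
            if w.label is not None:
                # Character offsets into the reconstructed text.
                entities.append((start, len(text), w.label))
        text += "\n"
    return text, {"entities": entities}


# Made-up sample: two lines of a bill, boxes given out of order.
sample_words = [
    Word("Total", 10, 52, None),
    Word("Facture", 10, 10, None),
    Word("F-2021-001", 90, 12, "INVOICE_ID"),
    Word("120,00", 80, 50, "AMOUNT"),
]
text, annotations = build_example(sample_words)
TRAIN_DATA_SAMPLE = [(text, annotations)]
```

Slicing `text[start:end]` for each entity is a quick way to check that the offsets still point at the tagged word after reconstruction.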

I have 3000+ training samples and 9 custom entities. The documents are in French. My training code is:

from __future__ import unicode_literals, print_function
import random
import shutil
from pathlib import Path

import plac
import spacy
from tqdm import tqdm

model = None
n_iter = 100

if model is not None:
    nlp = spacy.load(model)
    print(f"Loaded model {model}")
else:
    nlp = spacy.blank('fr')
    print('Import blank model')

if 'ner' not in nlp.pipe_names:
    print('creating ner pipeline')
    ner = nlp.create_pipe('ner')
    nlp.add_pipe(ner, last=True)
else:
    ner = nlp.get_pipe('ner')

# TRAIN_DATA is a list of (text, {'entities': [...]}) pairs
for _, annotations in TRAIN_DATA:
    for ent in annotations.get('entities'):
        ner.add_label(ent[2])  # ent is (start_span, end_span, entity_name)

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != ner]
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()
    for itn in range(n_iter):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for texts, annotations in tqdm(TRAIN_DATA):
            nlp.update([texts], [annotations], sgd=optimizer, drop=0.35, losses=losses)
        print("Losses", losses)

The problem is that every iteration prints Losses {}. It seems the training is not doing anything.
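One possible explanation, offered as an assumption rather than a confirmed diagnosis: in the list comprehension that builds `other_pipes`, `pipe` is a string taken from `nlp.pipe_names`, while `ner` is the component object. The test `pipe != ner` is therefore always true, so `'ner'` itself lands in `other_pipes` and gets disabled. With the only trainable component disabled, `nlp.update` would have nothing to update, which would leave `losses` empty. The plain-Python sketch below reproduces only that comparison; `FakeNER` is a hypothetical stand-in for the spaCy component so the snippet runs without spaCy installed.

```python
class FakeNER:
    """Hypothetical stand-in for the object returned by nlp.create_pipe('ner')."""


pipe_names = ["ner"]  # what nlp.pipe_names would hold for the blank model
ner = FakeNER()

# Comparison as written in the training code: a string never equals the object,
# so every pipe name passes the filter, including 'ner'.
disabled_as_written = [pipe for pipe in pipe_names if pipe != ner]

# Comparing against the component's name leaves 'ner' out of the disabled list.
disabled_by_name = [pipe for pipe in pipe_names if pipe != "ner"]

print(disabled_as_written)
print(disabled_by_name)
```

If this is indeed the cause, changing the filter to `pipe != 'ner'` should let the loop report a non-empty `losses` dictionary.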

If anyone has tips or experience with NER on OCR output, thank you. The reconstructed text is quite long on average (2347 characters). I can't provide an example of the training data.

In fact, I think NER is difficult here because my documents are bills and are not really structured into sentences.
