Software Recommendations Asked by KSA on September 25, 2021
Is there a free or paid OCR library that can capture invoice data from PDF files?
It needs to have a low error rate.
We need to take that data and do some further processing.
Take a look at this article: https://bitmiracle.com/blog/ocr-pdf-in-net
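Basically, you need 2 tools: a PDF library that can render pages as images, and an OCR engine that recognizes the text in those images.
Here is the sample code from the article above that uses Tesseract with the paid Docotic.Pdf library: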
    using System;
    using System.IO;
    using System.Text;
    using BitMiracle.Docotic.Pdf;
    using Tesseract;

    namespace OCR
    {
        public static class OcrAndExtractText
        {
            public static void Main()
            {
                // BitMiracle.Docotic.LicenseManager.AddLicenseData("temporary or permanent license key here");
                var documentText = new StringBuilder();
                using (var pdf = new PdfDocument("Partner.pdf"))
                {
                    using (var engine = new TesseractEngine(@"tessdata", "eng", EngineMode.Default))
                    {
                        for (int i = 0; i < pdf.PageCount; ++i)
                        {
                            // Separate pages with a blank line.
                            if (documentText.Length > 0)
                                documentText.Append("\r\n\r\n");

                            PdfPage page = pdf.Pages[i];
                            string searchableText = page.GetText();

                            // Simple check if the page contains searchable text.
                            // We do not need to perform OCR in that case.
                            if (!string.IsNullOrWhiteSpace(searchableText))
                            {
                                documentText.Append(searchableText);
                                continue;
                            }

                            // This page is not searchable.
                            // Save the page as a high-resolution image
                            PdfDrawOptions options = PdfDrawOptions.Create();
                            options.BackgroundColor = new PdfRgbColor(255, 255, 255);
                            options.HorizontalResolution = 300;
                            options.VerticalResolution = 300;
                            string pageImage = $"page_{i}.png";
                            page.Save(pageImage, options);

                            // Perform OCR
                            using (Pix img = Pix.LoadFromFile(pageImage))
                            {
                                using (Page recognizedPage = engine.Process(img))
                                {
                                    Console.WriteLine($"Mean confidence for page #{i}: {recognizedPage.GetMeanConfidence()}");
                                    string recognizedText = recognizedPage.GetText();
                                    documentText.Append(recognizedText);
                                }
                            }

                            // Remove the temporary page image.
                            File.Delete(pageImage);
                        }
                    }
                }

                using (var writer = new StreamWriter("result.txt"))
                    writer.Write(documentText.ToString());
            }
        }
    }
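Since the question asks for a low error rate, the mean confidence printed in the loop above can also be used to pick out pages that deserve a manual check. The following is only a sketch: it assumes GetMeanConfidence() reports a value between 0 and 1, and the class name, method name and the 0.85 cutoff are hypothetical and not part of Tesseract or Docotic.Pdf.

```csharp
using System;
using System.Collections.Generic;

public static class OcrReview
{
    // Hypothetical cutoff; tune it on a sample of your own invoices.
    public const float MinConfidence = 0.85f;

    // Returns the zero-based indexes of pages whose mean OCR confidence
    // is below the cutoff. A null entry stands for a page that already
    // had searchable text and therefore was not OCR'ed.
    public static List<int> PagesNeedingReview(
        IReadOnlyList<float?> meanConfidences,
        float minConfidence = MinConfidence)
    {
        var result = new List<int>();
        for (int i = 0; i < meanConfidences.Count; ++i)
        {
            float? confidence = meanConfidences[i];
            if (confidence.HasValue && confidence.Value < minConfidence)
                result.Add(i);
        }
        return result;
    }
}
```

To use it, collect one value per page inside the loop (the confidence for OCR'ed pages, null for searchable ones) and pass the list to OcrReview.PagesNeedingReview after the loop finishes.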
Answered by Vitaliy Shibaev on September 25, 2021
Syncfusion Essential PDF supports OCR by using the Tesseract open-source engine. With a few lines of code, a scanned paper document containing raster images is converted to a searchable and selectable document.
You can get the data from an invoice PDF or image using the OCR processor in our Essential PDF. Please refer to the link below for more details:
https://www.syncfusion.com/blogs/post/optical-character-recognition-in-pdf-using-tesseract-open-source-engine.aspx
You can download the OCR processor product setup here and find the required NuGet package from here.
Note: Essential PDF supports OCR processing of PDF documents and images on the ASP.NET Core platform.
The following code demonstrates how to get the OCR'ed text from an existing invoice document:
    // Initialize the OCR processor by providing the path of tesseract binaries
    using (OCRProcessor processor = new OCRProcessor(@"TesseractBinaries"))
    {
        // Load a PDF document
        PdfLoadedDocument lDoc = new PdfLoadedDocument("Input.pdf");
        // Set OCR language to process
        processor.Settings.Language = Languages.English;
        // Process OCR by providing the PDF document and Tesseract data
        string extractedText = processor.PerformOCR(lDoc, @"TessData");
        // Save the OCR processed PDF document in the disk
        lDoc.Save("Sample.pdf");
        lDoc.Close(true);
    }
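For the further processing mentioned in the question, the text returned by PerformOCR still has to be parsed into fields. Below is a rough sketch that uses only System.Text.RegularExpressions. The label wording ("Invoice No", "Total"), the amount format, and the class and method names are assumptions about what the invoices look like; none of this is part of Essential PDF.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class InvoiceFields
{
    // Assumed label formats; real invoices will need their own patterns.
    private static readonly Regex NumberPattern = new Regex(
        @"Invoice\s*(?:No\.?|Number|#)\s*:?\s*(?<value>[A-Z0-9-]+)",
        RegexOptions.IgnoreCase);

    // \b keeps "Subtotal" from being picked up as the total.
    private static readonly Regex TotalPattern = new Regex(
        @"\bTotal\s*:?\s*\$?\s*(?<value>\d[\d,]*\.\d{2})",
        RegexOptions.IgnoreCase);

    // Returns the invoice number, or null when no label is found.
    public static string FindInvoiceNumber(string ocrText)
    {
        Match match = NumberPattern.Match(ocrText);
        return match.Success ? match.Groups["value"].Value : null;
    }

    // Returns the total amount, or null when it is missing or unparsable.
    public static decimal? FindTotal(string ocrText)
    {
        Match match = TotalPattern.Match(ocrText);
        if (!match.Success)
            return null;

        // NumberStyles.Number accepts thousands separators such as "1,234.50".
        bool parsed = decimal.TryParse(
            match.Groups["value"].Value,
            NumberStyles.Number,
            CultureInfo.InvariantCulture,
            out decimal total);
        return parsed ? total : (decimal?)null;
    }
}
```

In the snippet above this would be called as InvoiceFields.FindTotal(extractedText). OCR can misread individual characters, so it is worth validating the parsed values (for example, checking that line items add up to the total) before relying on them.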
Note: I am working for Syncfusion.
Answered by Sowmiya on September 25, 2021
The LEADTOOLS toolkit is a professional SDK that provides the ability to recognize multiple field types using OCR for detection and extraction.
If the invoice image or document you are extracting from has an organized and defined structure, you can use the LEADTOOLS Forms Recognition and Processing features (https://www.leadtools.com/sdk/forms) to create a single template with multiple data fields defined in it, and then use that template to extract data from multiple filled sheets. (Disclaimer: I am an employee of the vendor of this toolkit.)
The code to extract text from an invoice master form template with defined fields would look like this:
    using (RasterCodecs codecs = new RasterCodecs())
    {
        string masterFormRepository = @"Invoice Master Form Path";
        string filledFormDirectory = @"Filled Invoice Path";
        using (IOcrEngine ocrEngine = OcrEngineManager.CreateEngine(OcrEngineType.LEAD, false))
        {
            ocrEngine.Startup(codecs, null, null, null);
            IMasterFormsRepository repository = new DiskMasterFormsRepository(codecs, masterFormRepository);
            using (AutoFormsEngine engine = new AutoFormsEngine(repository, ocrEngine, null, AutoFormsRecognitionManager.Default | AutoFormsRecognitionManager.Ocr, 30, 80, false))
            {
                foreach (var file in Directory.EnumerateFiles(filledFormDirectory))
                {
                    using (RasterImage image = codecs.Load(file))
                    {
                        // Run the recognition
                        AutoFormsRunResult runResult = engine.Run(image, null, null, null);
                    }
                }
            }
        }
    }
Answered by Hussam Barouqa on September 25, 2021