Manually add text layer (OCR) over a scanned image

Question

I have a PDF consisting of scanned images of pages from an old printed book. (It has not been OCRed, so it is not searchable.) Using the Google Cloud Vision API, one can perform OCR and, more interestingly, get the position (bounding box) of each word. Now, using TeX/LaTeX (with any engine), is there a way to add these words to the PDF at the corresponding positions, i.e. manually add an (invisible) OCR/text layer to the PDF, so that the scanned image is still what's visible but the text can be selected and copied?
(I realize that as we're not using any of LaTeX's structured-document features, nor any of TeX's typesetting features—breaking paragraphs into lines, doing kerning etc—and are manually positioning text that will not even be visible, it may seem that TeX is not really needed for this job. But I don't know any other tool either: there are tools like tesseract that automatically do OCR and add the text, but I want control, to be able to choose what text goes where. There's probably a way to do it from within TeX/XeTeX/LuaTeX.)

Ulrike Fischer · Answer

You can use the transparent package to make text transparent. Copy&paste should work fine, but finding the text to copy is a bit more difficult ;-) The transparent package currently works with pdflatex and lualatex; in the next TeX Live it will also work with (x)dvipdfmx.
\documentclass{article}
\usepackage{pdfpages,transparent}
\usepackage{eso-pic}

\AddToShipoutPictureFG{\AtPageCenter{\texttransparent{0}{\Huge This is some text in the center}}}
\begin{document}
\includepdf[pages=1]{example-image-a}
\end{document}
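
To place individual OCR words rather than one centered string, the same idea could be extended roughly as follows. This is only a sketch under some assumptions: the helper \ocrword is a hypothetical macro (not part of any package), the coordinates are made-up sample values, and the scan is assumed to fill the page exactly. It also assumes the OCR positions are pixel offsets from the top-left corner of the page image, so a pixel offset p at a scan resolution of d dpi corresponds to p*72/d bp.

```latex
\documentclass{article}
\usepackage{pdfpages,transparent}
\usepackage{eso-pic}

% Hypothetical helper (an assumption, not a package macro):
% \ocrword{x}{y}{text} puts invisible text with the left end of its
% baseline x bp to the right of, and y bp below, the top-left page corner.
\newcommand\ocrword[3]{%
  \put(\LenToUnit{#1bp},\LenToUnit{-#2bp}){\texttransparent{0}{#3}}}

\begin{document}
% The starred form affects only the next page that is shipped out,
% so each scanned page can get its own list of words.
\AddToShipoutPictureFG*{\AtPageUpperLeft{%
  \ocrword{72}{100}{Some}%      sample coordinates, not real OCR output
  \ocrword{104}{100}{example}%
  \ocrword{150}{100}{words}%
}}
\includepdf[pages=1]{example-image-a}
\end{document}
```

The \ocrword lines would normally be generated by a script from the OCR output, one block per page. The selectable area of each word will only match the scan approximately unless the text is also scaled to the width of its bounding box, for example with \resizebox from graphicx (which pdfpages already loads).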
