Invisible text even when selected (in Evince)

TeX - LaTeX Asked by Invitor on March 15, 2021

When you process an image with an OCR tool like tesseract, you get a PDF file containing that image with the recognized text laid invisibly on top. You can’t see the text, but you can select, search and copy it. When you select such text in an OCR-generated PDF file, it looks like this in Evince.

created by tesseract

You only see the image and the highlighting of the text. The selected text itself does not appear, only its highlighting. In Evince the highlighting for this type of text is transparent. When I try to create something like this by hand in LaTeX, for example with the transparent package or text rendering mode 3 (see code below), I get invisible text, but the highlighting looks different in Evince.

latex result

When selected, the text becomes visible in the highlighting and the highlighting is not transparent as in the file generated by tesseract. It seems that the texts in both files are marked as different content types (or something similar), so they are highlighted differently.

TL;DR

How can I create text like tesseract that is invisible, searchable, selectable and has that special highlighting (transparent, without text appearing (in Evince)), so you can see an image behind it even when the text is selected?

Example

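\documentclass{article}
\usepackage{transparent}
\begin{document}
% Invisible 1: fully transparent text via the transparent package
\fbox{Invisible 1: {\transparent{0}{invisible}}}

% Invisible 2: PDF text rendering mode 3 (3 Tr) inside a q ... Q group
\fbox{Invisible 2: {\pdfliteral page{q 3 Tr}invisible\pdfliteral page{Q}}}
\end{document}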

2 Answers

The accsupp package allows the typeset text to differ from the copy/paste text. See the SUPPLEMENT below for an alternative.

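\documentclass{article}
\usepackage{accsupp}
% \nosee{<text>}: reserve bracketed space as wide as <text>, typeset nothing
% visible inside it, and attach <text> as the ActualText used for copy/search.
\newcommand\nosee[1]{%
  \BeginAccSupp{method=escape,ActualText={\detokenize{#1}}}%
  [\phantom{#1}]%
  \EndAccSupp{}%
}
\begin{document}
Now you see me\nosee{, now you don't}.
\end{document}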

The PDF visual output shows "Now you see me" followed by an empty pair of brackets that spans the width of the hidden text.

If I press Ctrl-A, Ctrl-C to copy the whole document contents and then paste, the resulting text is

Now you see me, now you don't .

This latter text is searchable. I now allocate bracketed space equal in width to the invisible text. If you highlight between the brackets, the text can be marked, but only in full: you can highlight either the complete hidden text or none of it. The present approach does not allow a portion of the hidden text to be highlighted.

SUPPLEMENT

Here, I use tokcycle to pass each token of the invisible input stream through \nosee separately, which allows partial/continuous text highlighting. The downside is that the environment is intended for pure text; any macros will simply be detokenized into the cut/paste output.

This approach has the added advantage of providing line breaking.

Bracketed delimiting of the output is no longer needed. Instead, I output a white \textunderscore for each character or space encountered.

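\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage{accsupp,xcolor,lmodern,tokcycle}
% \nosee{<token>}: print a white underscore in a box as wide as <token>
% and attach <token> as the ActualText; a line break is allowed afterwards.
\newcommand\nosee[1]{%
  \sbox0{\detokenize{#1}}%
  \BeginAccSupp{method=escape,ActualText={\detokenize{#1}}}%
  \makebox[\wd0]{\textcolor{white}{\textunderscore}}\allowbreak%
  \EndAccSupp{}%
}
% Directives, in order: characters, groups, macros, spaces.
\tokcycleenvironment\invisible
  {\addcytoks{\nosee{##1}}}
  {\processtoks{##1}}
  {\addcytoks{\nosee{##1}}}
  {\addcytoks{\nosee{ }}}
\begin{document}
Now you see me\invisible , now you don't\endinvisible.
\end{document}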

Answered by Steven B. Segletes on March 15, 2021

That "transparent selection" feature was implemented in Poppler, the PDF library used by Evince and Okular. The motivation was to correctly support documents created by tesseract: see original issue and implementation commit.

As you can see there, a word is considered invisible when it is painted in "Text rendering mode 3". That special "no paint" mode comes from the PDF spec, as is well explained in this answer.

So what matters is that the PDF producer software uses Tr 3 when inserting the text into the PDF.
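
As a minimal sketch of one way to request Tr 3 from LaTeX, the pdfrender package can set the text rendering mode for a piece of text. The example below is an illustration rather than part of the original answer: the image name example-image is an assumption (it ships with the mwe package in common TeX distributions), and the zero-width box is just one simple way to draw the text over the image. Whether Evince then shows the transparent highlighting depends on the installed Poppler version, so treat the viewer behaviour as something to verify.

```latex
\documentclass{article}
\usepackage{graphicx}  % \includegraphics
\usepackage{pdfrender} % \textpdfrender
\begin{document}
% Put the image in a zero-width box so that the following text is
% typeset on top of it (assumed placeholder image: example-image).
\makebox[0pt][l]{\includegraphics[width=6cm]{example-image}}%
% TextRenderingMode=Invisible corresponds to "3 Tr" in the PDF content
% stream: the glyphs are neither filled nor stroked, but remain text.
\textpdfrender{TextRenderingMode=Invisible}{This text is not painted}
\end{document}
```

Compile with pdfLaTeX or LuaLaTeX. Unlike the white-underscore approach, nothing is painted for the hidden text at all, so the image behind it stays fully visible.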

Answered by Nelson on March 15, 2021