TeX - LaTeX Asked on April 7, 2021
Note: The problem discussed in this question might affect not just Devanagari, but also other scripts in which the character (logical/typed) order and the glyph (presentational/typeset) order differ for some character sequences.
In Devanagari, which is an abugida writing system, some vowel notations (comparable to accents in the Latin alphabet) that are typed after a consonant change the shape of the resulting unit so that the vowel notation occupies space to the left of the consonant, pushing the consonant glyph to the right (even though it’s a left-to-right writing/typing system).[1] LuaTeX produces a pdf that is good for reading and printing. However, it doesn’t always[2] produce a pdf (for Devanagari) that is good for searching or for copying text to a text editor, even in the simplest Hello World-type test (which I have added below). Note: This question is not necessarily just about copying/searching text correctly, so please read further before posting a reply. I discovered this while testing my own and user michal-h21’s text extraction techniques for extracting text from a TeX box. The technique is simple: after a box has been set by TeX, traverse its nodelist to find and concatenate all Unicode characters, which yields the Unicode string of the text set in the box.
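The traversal technique described above can be sketched roughly as follows. This is only a minimal illustration, not michal-h21’s code or mine in full: the helper names collect_chars and show_box_text and the file name are hypothetical, every glue node is crudely treated as a space, and discretionary nodes are ignored. It relies only on LuaTeX’s node library (node.id, node.traverse), tex.box, and Lua’s utf8.char.

```latex
% >>lualatex extractsketch.tex   (hypothetical file name)
\documentclass{article}
\usepackage{fontspec}
\usepackage{luacode}
\newfontfamily{\devanagarifam}{Noto Sans Devanagari}[Script=Devanagari, Renderer=Node]
\begin{luacode*}
local GLYPH = node.id("glyph")
local GLUE  = node.id("glue")
local HLIST = node.id("hlist")
local VLIST = node.id("vlist")

-- Hypothetical helper: collect characters in nodelist order.
function collect_chars(head, out)
  out = out or {}
  for n in node.traverse(head) do
    if n.id == GLYPH then
      -- n.char is the slot stored in the glyph node; it reflects the
      -- order of the nodelist, which need not be the typed order.
      out[#out + 1] = utf8.char(n.char)
    elseif n.id == GLUE then
      -- Crude assumption: every glue is an inter-word space.
      out[#out + 1] = " "
    elseif (n.id == HLIST or n.id == VLIST) and n.head then
      -- Recurse into nested boxes.
      collect_chars(n.head, out)
    end
  end
  return out
end

-- Hypothetical helper: write the extracted string to the log/terminal.
function show_box_text(boxnumber)
  local box = tex.box[boxnumber]
  if box and box.head then
    texio.write_nl("extracted: " .. table.concat(collect_chars(box.head)))
  end
end
\end{luacode*}
\begin{document}
\setbox0=\hbox{Hello \devanagarifam पिताजी}
\directlua{show_box_text(0)}
\box0
\end{document}
```

Because the string is assembled purely from the order of glyph nodes, any reordering done during shaping carries over into the extracted text, which is exactly the issue this question is about.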
Let’s take an example where a vowel notation typed after a consonant ‘appears’ to precede the consonant: Hello पिताजी. In this text, the first consonant-vowel pair पि is typed in the following order: प (consonant), then ि (vowel notation); yet, as you can see, the resulting text (correctly) appears with the presentational order reversed: पि (as if ि preceded प). You can try this in your text editor to see the magic.

Now let’s discuss the options to typeset this Hello पिताजी text in a pdf using LuaLaTeX and the fontspec package, along with the related problems of copying/searching and text extraction. The font used in the examples below is Noto Sans Devanagari; it’s available for download. I am using Adobe Acrobat Reader DC (free) to try copying from and searching the pdf, as not all pdf readers have good support for non-Latin scripts (you might encounter problems copy-pasting non-Latin text from other pdf readers).
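To check the typed (logical) order of the pair पि independently of any font or renderer, one can print its code points from Lua. This is a small sketch that assumes the file is saved as UTF-8 and uses the luacode package so that the percent sign in the format string is not read as a TeX comment.

```latex
\documentclass{article}
\usepackage{luacode}
\begin{document}
\begin{luacode*}
-- Write the code points of the string, in stored (typed) order,
-- to the log and terminal.
for _, cp in utf8.codes("पि") do
  texio.write_nl(string.format("U+%04X", cp))
end
\end{luacode*}
Code points written to the log.
\end{document}
```

The log should list U+092A (the consonant प) before U+093F (the vowel notation ि), i.e. the consonant comes first in the stored text even though it is displayed second.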
Fontspec with Renderer=Node (the default value for Renderer): This mode seems particularly broken for copying/searching, and I am not sure if there is any way to decipher the real character order of Devanagari text. For our test text Hello पिताजी, if you copy and paste the text (from the produced pdf) into a text editor, it shows up as Hello िपताजी. The root of the problem is that, in the internal nodelist representation of this text, TeX smartly swapped the order of [प, ि] to [ि, प], since in the typeset output ि has to appear to the left of प. But while doing so, it also "baked" this erroneous order of glyphs into the pdf. Thus searching the pdf for पि doesn’t produce a hit, and copying from the pdf yields िप instead of the expected पि. Lastly, since TeX changed the internal order of glyphs, text extracted by traversing the glyph nodes also has िप instead of पि. So the questions for Renderer=Node are:
Is there a way to decipher the real order of glyphs while traversing the nodelist? This would help with extracting the text and operating on it in other ways.
Is there a way to produce a correct pdf, in which text set in Devanagari can be searched/copied correctly, just like text set in Latin script?
Fontspec with Renderer=HarfBuzz (new): This mode seems to produce, at least in this small test case, a pdf that is correct for searching/copying text. I am still trying to figure out how to correctly extract text by traversing nodes. The nodelist is structured differently, and I think a solution might be found there. So the questions for Renderer=HarfBuzz are:
What fields in the nodelist should we look at to extract text in the correct order?
The fontspec documentation says "Support for the Harfbuzz renderer is preliminary and may be improved over time."; moreover, I vaguely remember (from TUG 2020) that this renderer has some limitations compared to Node mode. Can someone list the most serious limitations? And what do the authors of fontspec mean by "preliminary" and "may" in the above excerpt?
Here’s the test code for Renderer=Node:
% >>lualatex testdevanode.tex
\documentclass{article}
\usepackage[lmargin=0.5in,tmargin=0.5in,rmargin=0.5in,bmargin=0.5in]{geometry}
\usepackage{fontspec}
\usepackage[callback={}]{nodetree}
%\newfontscript{Devanagari}{deva,dev2}
\newfontfamily{\devanagarifam}{Noto Sans Devanagari}[Script=Devanagari, Scale=1, Renderer=Node]
\begin{document}
\NodetreeRegisterCallback{hpack_filter}
\setbox0=\hbox{Hello \devanagarifam पिताजी}
\box0
\NodetreeUnregisterCallback{hpack_filter}
\end{document}
And for Renderer=HarfBuzz (CTAN’s version of nodetree hits a bug with Renderer=HarfBuzz, so it has been commented out; you can uncomment it if you download and use the latest version from its GitHub repository):
% >>lualatex testdevaharf.tex
\documentclass{article}
\usepackage[lmargin=0.5in,tmargin=0.5in,rmargin=0.5in,bmargin=0.5in]{geometry}
\usepackage{fontspec}
% \usepackage[callback={}]{nodetree}
\newfontfamily{\devanagarifam}{Noto Sans Devanagari}[Script=Devanagari, Scale=1, Renderer=HarfBuzz]
\begin{document}
% \NodetreeRegisterCallback{hpack_filter}
\setbox0=\hbox{Hello \devanagarifam पिताजी}
\box0
% \NodetreeUnregisterCallback{hpack_filter}
\end{document}
Update: I am attaching the relevant excerpt from chapter 12 of the Unicode Standard, version 13.0, which confirms the violation committed by the Node renderer. The link to the document was pointed out by user davislor.
[1] Looking at the component glyphs that form a unit of sound (let’s call it a consonant-vowel pair going forward), a person trained in reading Latin-based scripts might think that the vowel must have been typed before the consonant, but that’s not the case.
[2] Below we will discuss a way that does seem, at least in this test example, to produce a pdf good for copying Devanagari text, though it might need more testing.