TeX - LaTeX Asked on April 3, 2021
While experimenting with ways to extract UTF-8 character strings from TeX boxes, I found a post from user michal-h21 here: UTF-8 text extraction. After looking at how glyph data is stored in the node list, I modified the code to see whether another approach works. In my approach, I traverse the components of composed glyphs and discretionary (disc) nodes to extract the constituent characters. His approach seems to pass complex glyphs such as ligatures to a function that decomposes them. The output printed by both versions (for the test string in the example) looks the same. Could someone please review them and say whether both approaches are functionally equivalent? (I am aware that my code requires special handling of TeX ligatures; please ignore that.) If they are, which one would perform better? (I can cache unicode.utf8.char in my code as he does; please ignore that discrepancy in any comment on performance.)
Here’s the output text written to terminal and to an output file hello.txt: Příliš žluťoučký kůň úpěl ďábelské ódy difference diffierence.
His complete code is at UTF-8 text extraction. Our codes differ in that I don't use his function get_unicode (shown below); I simply apply unicode.utf8.char(glyphnodename.char) to the glyph components. As far as I understand, he applies get_unicode to decompose complex glyphs instead of digging one level deeper into the glyph node to get the constituent glyphs.
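-- `identifiers` (a table of loaded fonts indexed by font id) and `char`
-- (the cached unicode.utf8.char) are locals defined elsewhere in his code.
local function get_unicode(xchar,font_id)
  local current = {}
  -- tounicode is a hex string, read here four hex digits at a time
  local uchar = identifiers[font_id].characters[xchar].tounicode
  for i= 1, string.len(uchar), 4 do
    local cchar = string.sub(uchar, i, i + 3)
    print(xchar,uchar,cchar, font_id, i)
    table.insert(current,char(tonumber(cchar,16)))
  end
  return current
end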
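To make the comparison concrete, here is a minimal standalone sketch of the decoding step that get_unicode performs, written in plain Lua 5.3+ so it runs without LuaTeX. The helper name decode_tounicode is hypothetical. It assumes the tounicode field is a UTF-16BE hex string, in which case a character outside the Basic Multilingual Plane would occupy eight hex digits (a surrogate pair) rather than four; the sketch combines such pairs, which the four-digit loop above does not do.

```lua
-- Standalone sketch (plain Lua 5.3+, no LuaTeX needed).
-- decode_tounicode is a hypothetical helper, not part of any library.
local function decode_tounicode(hex)
  local out = {}
  local i = 1
  while i <= #hex do
    local unit = assert(tonumber(hex:sub(i, i + 3), 16), "not a hex string")
    i = i + 4
    -- Assumption: the string is UTF-16BE hex, so a high surrogate
    -- (0xD800-0xDBFF) is followed by a low surrogate (0xDC00-0xDFFF).
    if unit >= 0xD800 and unit <= 0xDBFF and i + 3 <= #hex then
      local low = tonumber(hex:sub(i, i + 3), 16)
      if low and low >= 0xDC00 and low <= 0xDFFF then
        unit = 0x10000 + (unit - 0xD800) * 0x400 + (low - 0xDC00)
        i = i + 4
      end
    end
    out[#out + 1] = utf8.char(unit)
  end
  return out
end

print(table.concat(decode_tounicode("00660066")))     -- ff
print(table.concat(decode_tounicode("006600660069"))) -- ffi
```

If the assumption about surrogate pairs holds, this is one place where the two approaches could differ for fonts whose ligatures or glyphs map to characters beyond U+FFFF.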
\documentclass{article}
\usepackage[lmargin=0.5in,tmargin=0.5in,rmargin=0.5in,bmargin=0.5in]{geometry}
\usepackage{fontspec}
\usepackage{microtype}
\usepackage[english]{babel}
\usepackage{blindtext}
\begin{document}
\setbox0=\hbox{Příliš žluťoučký \textit{kůň} úpěl \hbox{ďábelské} ódy difference diffierence.}
\directlua{
  local glyph_id = node.id("glyph")
  local disc_id = node.id("disc")
  local glue_id = node.id("glue")
  local hlist_id = node.id("hlist")
  local vlist_id = node.id("vlist")
  local minglue = tex.sp("0.2em")
  local function nodeText(n)
    local t = {}
    for x in node.traverse(n) do
      % glyph node
      if x.id == glyph_id then
        if bit32.band(x.subtype,2) \csstring\~=0 and unicode.utf8.char(x.char) \csstring\~="“" and unicode.utf8.char(x.char) \csstring\~="”" then %
          for g in node.traverse_id(glyph_id,x.components) do
            if bit32.band(g.subtype, 2) \csstring\~=0 then
              for gc in node.traverse_id(glyph_id,g.components) do
                table.insert(t,unicode.utf8.char(gc.char))
              end
            else
              table.insert(t,unicode.utf8.char(g.char))
            end
          end
        else
          table.insert(t,unicode.utf8.char(x.char))
        end
      % disc node
      elseif x.id == disc_id then
        for g in node.traverse_id(glyph_id,x.replace) do
          if bit32.band(g.subtype, 2) \csstring\~=0 then
            for gc in node.traverse_id(glyph_id,g.components) do
              table.insert(t,unicode.utf8.char(gc.char))
            end
          else
            table.insert(t,unicode.utf8.char(g.char))
          end
        end
      % glue node
      elseif x.id == glue_id and node.getglue(x) > minglue then
        table.insert(t," ")
      elseif x.id == hlist_id or x.id == vlist_id then
        table.insert(t,nodeText(x.head))
      end
    end
    return table.concat(t)
  end
  local n = tex.getbox(0)
  print(nodeText(n.head))
  local f = io.open("hello.txt","w")
  f:write(nodeText(n.head))
  f:close()
}
\box0
\end{document}
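The glyph and disc branches above each unroll exactly two levels of components. As an illustration only, the same idea can be written recursively so that any nesting depth is handled by one function. The sketch below is plain Lua 5.3+ and uses ordinary tables as stand-ins for glyph nodes: the field names char, components and next mirror the node fields used above, while glyph and collect_chars are hypothetical helpers. It assumes, as the two-level loops suggest, that a component can itself be a ligature carrying its own components list; it tests for the presence of components rather than the subtype bit.

```lua
-- Plain Lua 5.3+ sketch; tables stand in for LuaTeX glyph nodes.
-- glyph and collect_chars are hypothetical helpers for illustration.
local function glyph(char, components, next_node)
  return { char = char, components = components, next = next_node }
end

local function collect_chars(head, out)
  out = out or {}
  local n = head
  while n do
    if n.components then
      collect_chars(n.components, out)   -- descend to any depth
    else
      out[#out + 1] = utf8.char(n.char)  -- plain glyph: emit its character
    end
    n = n.next
  end
  return out
end

-- A mock "ffi" ligature built as (f + f) + i:
-- its first component is itself a ligature.
local i_glyph = glyph(0x69)
local ff = glyph(0xFB00, glyph(0x66, nil, glyph(0x66)), i_glyph)
local ffi = glyph(0xFB03, ff)
print(table.concat(collect_chars(ffi)))  -- ffi
```

In real LuaTeX code the while loop would be replaced by node.traverse_id over the components list, as in the document above.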