TeX - LaTeX Asked by reportaman on November 29, 2020
Devanagari UTF-8 text extraction code does not produce a lua error when using fontspec's Node renderer, but it still does not produce the correct result, due to a possible bug in the Node renderer. I have filed that as a separate question: LuaTeX: Devanagari glyph order is reversed in tex's internal nodelist, how to recover the correct order while traversing glyph nodes?
While experimenting with different techniques for UTF-8 text extraction from TeX boxes, I found that the two techniques that do not produce any lua error when extracting Devanagari text with fontspec's Node renderer both produce a lua error when using fontspec's HarfBuzz renderers (Renderer=Harfbuzz, Renderer=OpenType).
The two techniques are detailed here: technique-1 (using michal-h21's get_unicode function) and here: technique-2 (just applying unicode.utf8.char to the constituent components of complex glyphs). I tried multiple Devanagari fonts; all result in the same behavior.
Complete test code for both techniques, and their respective error signatures, is listed in the blocks below. For my example, I used the freely available Noto Sans Devanagari (regular weight), available here: link to google fonts GitHub for Noto Sans Devanagari
Technique-1 with Devanagari and HarfBuzz (no lua error if compiled with Node renderer):
\documentclass{article}
\usepackage[lmargin=0.5in,tmargin=0.5in,rmargin=0.5in,bmargin=0.5in]{geometry}
\usepackage{fontspec}
\usepackage{microtype}
%\newfontscript{Devanagari}{deva,dev2}
\newfontfamily{\devanagarifam}{Noto Sans Devanagari}[Script=Devanagari, Scale=1, Renderer=HarfBuzz]
\begin{document}
% Devanagari text is at the right end of following line
% of code, you might have to scroll right to read it
\setbox0=\hbox{Příliš žluťoučký \textit{kůň} úpěl \hbox{ďábelské} ódy difference diffierence. \devanagarifam एक गांव -- में मोहन नाम का लड़का रहता था। उसके पिताजी एक मामूली मजदूर थे।}
\directlua{
% local fontstyles = require "l4fontstyles"
local char = unicode.utf8.char
local glyph_id = node.id("glyph")
local glue_id = node.id("glue")
local hlist_id = node.id("hlist")
local vlist_id = node.id("vlist")
local disc_id = node.id("disc")
local minglue = tex.sp("0.2em")
local usedcharacters = {}
local identifiers = fonts.hashes.identifiers
local function get_unicode(xchar,font_id)
local current = {}
local uchar = identifiers[font_id].characters[xchar].tounicode
for i= 1, string.len(uchar), 4 do
local cchar = string.sub(uchar, i, i + 3)
print(xchar,uchar,cchar, font_id, i)
table.insert(current,char(tonumber(cchar,16)))
end
return current
end
local function nodeText(n)
local t = {}
for x in node.traverse(n) do
% glyph node
if x.id == glyph_id then
% local currentchar = fonts.hashes.identifiers[x.font].characters[x.char].tounicode
local chars = get_unicode(x.char,x.font)
for _, current_char in ipairs(chars) do
table.insert(t,current_char)
end
% glue node
elseif x.id == glue_id and node.getglue(x) > minglue then
table.insert(t," ")
% discretionaries
elseif x.id == disc_id then
table.insert(t, nodeText(x.replace))
% recursively process hlist and vlist nodes
elseif x.id == hlist_id or x.id == vlist_id then
table.insert(t,nodeText(x.head))
end
end
return table.concat(t)
end
local n = tex.getbox(0)
print(nodeText(n.head))
local f = io.open("hello.txt","w")
f:write(nodeText(n.head))
f:close()
}
\box0
\end{document}
Error signature for Technique-1 (HarfBuzz renderer):
[directlua]:1: bad argument #1 to 'len' (string expected, got nil)
stack traceback:
[C]: in function 'string.len'
[directlua]:1: in upvalue 'get_unicode'
[directlua]:1: in local 'nodeText'
[directlua]:1: in main chunk.
l.62 }
Technique-2 with Devanagari and HarfBuzz (no lua error if compiled with Node renderer):
\documentclass{article}
\usepackage[lmargin=0.5in,tmargin=0.5in,rmargin=0.5in,bmargin=0.5in]{geometry}
\usepackage{fontspec}
\usepackage{microtype}
%\newfontscript{Devanagari}{deva,dev2}
\newfontfamily{\devanagarifam}{Noto Sans Devanagari}[Script=Devanagari, Scale=1, Renderer=HarfBuzz]
\begin{document}
% Devanagari text is at the right end of following line
% of code, you might have to scroll right to read it
\setbox0=\hbox{Příliš žluťoučký \textit{kůň} úpěl \hbox{ďábelské} ódy difference diffierence. \devanagarifam एक गांव -- में मोहन नाम का लड़का रहता था। उसके पिताजी एक मामूली मजदूर थे।}
\directlua{
local glyph_id = node.id("glyph")
local disc_id = node.id("disc")
local glue_id = node.id("glue")
local hlist_id = node.id("hlist")
local vlist_id = node.id("vlist")
local minglue = tex.sp("0.2em")
local function nodeText(n)
local t = {}
for x in node.traverse(n) do
% glyph node
if x.id == glyph_id then
if bit32.band(x.subtype,2) \csstring~=0 and unicode.utf8.char(x.char) \csstring~="“" and unicode.utf8.char(x.char) \csstring~="”" then %
for g in node.traverse_id(glyph_id,x.components) do
if bit32.band(g.subtype, 2) \csstring~=0 then
for gc in node.traverse_id(glyph_id,g.components) do
table.insert(t,unicode.utf8.char(gc.char))
end
else
table.insert(t,unicode.utf8.char(g.char))
end
end
else
table.insert(t,unicode.utf8.char(x.char))
end
% disc node
elseif x.id == disc_id then
for g in node.traverse_id(glyph_id,x.replace) do
if bit32.band(g.subtype, 2) \csstring~=0 then
for gc in node.traverse_id(glyph_id,g.components) do
table.insert(t,unicode.utf8.char(gc.char))
end
else
table.insert(t,unicode.utf8.char(g.char))
end
end
% glue node
elseif x.id == glue_id and node.getglue(x) > minglue then
table.insert(t," ")
elseif x.id == hlist_id or x.id == vlist_id then
table.insert(t,nodeText(x.head))
end
end
return table.concat(t)
end
local n = tex.getbox(0)
print(nodeText(n.head))
local f = io.open("hello.txt","w")
f:write(nodeText(n.head))
f:close()
}
\box0
\end{document}
Error signature for Technique-2 (HarfBuzz renderer):
[directlua]:1: bad argument #1 to 'char' (invalid value)
stack traceback:
[C]: in field 'char'
[directlua]:1: in local 'nodeText'
[directlua]:1: in main chunk.
l.64 }
In node mode, restoring the full text is not generally possible because you get shaped output, and shaped glyphs cannot be uniquely mapped back to input text. You can only approximate it by using tounicode values. These map to the actual PDF file ToUnicode CMap entries and therefore follow their restricted model of glyph-to-Unicode mapping: every glyph is equivalent to a fixed sequence of Unicode codepoints. These mappings are then concatenated in rendering order. As you have seen, this model is not sufficient to map Devanagari glyphs back to the input text.
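To see what these tounicode values look like for a given box, a minimal diagnostic sketch along the following lines can be used (the box number 0, the non-recursive traversal and the logging via texio.write_nl are arbitrary choices for the example):
local glyph_id = node.id("glyph")
local identifiers = fonts.hashes.identifiers
-- walk the top-level glyphs of box 0 (does not recurse into nested boxes)
-- and print the raw tounicode string stored for each glyph
for g in node.traverse_id(glyph_id, tex.getbox(0).head) do
  local fontdata = identifiers[g.font]
  local chardata = fontdata and fontdata.characters[g.char]
  texio.write_nl(string.format("slot %d -> tounicode %s",
    g.char, tostring(chardata and chardata.tounicode)))
end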
You can use harf mode instead to avoid the issue: harf mode is not affected by this limited model because it does not just give you a shaped list of glyphs, but additionally creates PDF marked-content ActualText entries which override the ToUnicode mappings for sequences that cannot be correctly modeled through ToUnicode. The data necessary for this mapping can be queried from Lua code using the glyph_data property. (This is an undocumented implementation detail and might change in the future.)
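A minimal sketch of that query, assuming the property field is called glyph_info as in the module below (since this is undocumented, the field name may differ between luaotfload versions):
-- n is a glyph node taken from a harf-shaped list
local props = node.getproperty(n)
-- may legitimately be an empty string; only a missing value means
-- "no harf data, fall back to tounicode"
local harf_text = props and props.glyph_info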
If you want to extract as much as possible from any text, you can combine this property-based approach and the ToUnicode-based approach in your Lua code:
Create a file extracttext.lua with
local type = type
local char = utf8.char
local unpack = table.unpack
local getproperty = node.getproperty
local getfont = font.getfont
local is_glyph = node.is_glyph
-- tounicode is UTF-16 in hex, so we need to handle surrogate pairs...
local utf16hex_to_utf8 do -- Untested, but should more or less work
local l = lpeg
local tonumber = tonumber
local hex = l.R('09', 'af', 'AF')
local byte = hex * hex
local simple = byte * byte / function(s) return char(tonumber(s, 16)) end
local surrogate = l.S'Dd' * l.C(l.R('89', 'AB', 'ab') * byte)
* l.S'Dd' * l.C(l.R('CF', 'cf') * byte) / function(high, low)
return char(0x10000 + ((tonumber(high, 16) & 0x3FF) << 10 | (tonumber(low, 16) & 0x3FF)))
end
utf16hex_to_utf8 = l.Cs((surrogate + simple)^0)
end
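-- For illustration (hypothetical inputs, not taken from a real font):
-- utf16hex_to_utf8:match("0915")     --> "क"  (single BMP codepoint U+0915)
-- utf16hex_to_utf8:match("D835DC00") --> "𝐀"  (surrogate pair for U+1D400)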
-- First the non-harf case
-- Standard caching setup
local identity_table = setmetatable({}, {__index = function(_, id) return char(id) end})
local cached_text = setmetatable({}, {__index = function(t, fid)
local fontdir = getfont(fid)
local characters = fontdir and fontdir.tounicode == 1 and fontdir.characters
local font_cache = characters and setmetatable({}, {__index = function(tt, slot)
local character = characters[slot]
local text = character and character.tounicode or slot
-- At this point we have the tounicode value in text. This can have different forms.
-- The order in the if ... elseif chain is based on how likely it is to encounter them.
-- This is a small performance optimization.
local t = type(text)
if t == 'string' then
text = utf16hex_to_utf8:match(text)
elseif t == 'number' then
text = char(text)
elseif t == 'table' then
text = char(unpack(text)) -- I haven't tested this case, but it should work
end
tt[slot] = text
return text
end}) or identity_table
t[fid] = font_cache
return font_cache
end})
-- Now the tounicode case just has to look up the value
local function from_tounicode(n)
local slot, fid = is_glyph(n)
return cached_text[fid][slot]
end
-- Now the traversing stuff. Nothing interesting to see here except for the
-- glyph case
local traverse = node.traverse
local glyph, glue, disc, hlist, vlist = node.id'glyph', node.id'glue', node.id'disc', node.id'hlist', node.id'vlist'
local extract_text_vlist
-- We could replace i by #t+1 but this should be slightly faster
local function real_extract_text(head, t, i)
for n, id in traverse(head) do
if id == glyph then
-- First handle harf mode: Look for a glyph_info property. If that does not exist,
-- use from_tounicode. glyph_info will sometimes/often be an empty string. That's
-- intentional and it should *not* trigger a fallback. The actual mapping will be
-- contained in surrounding chars.
local props = getproperty(n)
t[i] = props and props.glyph_info or from_tounicode(n)
i = i + 1
elseif id == glue then
if n.width > 1001 then -- 1001 is arbitrary but sufficiently high to be bigger than most weird glue uses
t[i] = ' '
i = i + 1
end
elseif id == disc then
i = real_extract_text(n.replace, t, i)
elseif id == hlist then
i = real_extract_text(n.head, t, i)
elseif id == vlist then
i = extract_text_vlist(n.head, t, i)
end
end
return i
end
function extract_text_vlist(head, t, i) -- glue should not become a space here
for n, id in traverse(head) do
if id == hlist then
i = real_extract_text(n.head, t, i)
elseif id == vlist then
i = extract_text_vlist(n.head, t, i)
end
end
return i
end
return function(list)
local t = {}
real_extract_text(list.head, t, 1)
return table.concat(t)
end
This can be used as a normal Lua module:
\documentclass{article}
\usepackage{fontspec}
\newfontfamily{\devharf}{Noto Sans Devanagari}[Script=Devanagari, Renderer=HarfBuzz]
\newfontfamily{\devnode}{Noto Sans Devanagari}[Script=Devanagari, Renderer=Node]
\begin{document}
% Devanagari text is at the right end of following line
% of code, you might have to scroll right to read it
\setbox0=\hbox{Příliš žluťoučký \textit{kůň} úpěl \hbox{ďábelské} ódy difference diffierence. \devharf एक गांव -- में मोहन नाम का लड़का रहता था। उसके पिताजी एक मामूली मजदूर थे।}
\setbox1=\hbox{Příliš žluťoučký \textit{kůň} úpěl \hbox{ďábelské} ódy difference diffierence. \devnode एक गांव -- में मोहन नाम का लड़का रहता था। उसके पिताजी एक मामूली मजदूर थे।}
\directlua{
local extracttext = require'extracttext'
local f = io.open("hello.harf.txt","w") % Can reproduce the full input text
f:write(extracttext(tex.getbox(0)))
f:close()
f = io.open("hello.node.txt","w") % In node mode, we only get an approximation
f:write(extracttext(tex.getbox(1)))
f:close()
}
\box0
\box1
\end{document}
A more general note: As you can see, there is some work involved when it comes to getting text from a shaped list, especially in the ToUnicode case where we have to map surrogate pairs and so on. This is mostly because shaped text is not intended for such use. As soon as the glyph nodes are protected (i.e. subtype(n) >= 256, or not is_char(n) is true), the .char entries no longer contain Unicode values but internal identifiers, the .font entry might no longer be the value you expect, and some glyphs might not be represented as glyphs at all. In most cases in which you want to actually access the text behind a box, and not just its visual display, you really want to intercept the list before it gets shaped in the first place.
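As a minimal sketch of such a check (assuming it runs in code that still sees the unshaped list, for example a callback registered before luaotfload's font handler): as long as node.is_char still reports a character, .char is the input codepoint and not a font-internal identifier.
local is_char = node.is_char
-- true only while the glyph is still unprotected (subtype < 256),
-- i.e. before shaping has replaced .char with an internal identifier
local function still_input_char(g)
  local c = is_char(g)
  return c ~= nil and c ~= false
end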
Correct answer by Marcel Krüger on November 29, 2020
I don't know too much about how HarfBuzz fonts are handled by Luaotfload, but I was able to find a way to get the tounicode fields, thanks to table.serialize. So my original code, adapted for HarfBuzz, looks like this:
\documentclass{article}
\usepackage[lmargin=0.5in,tmargin=0.5in,rmargin=0.5in,bmargin=0.5in]{geometry}
\usepackage{fontspec}
\usepackage{microtype}
\usepackage{luacode}
%\newfontscript{Devanagari}{deva,dev2}
\newfontfamily{\devanagarifam}{Noto Sans Devanagari}[Script=Devanagari, Scale=1, Renderer=HarfBuzz]
\newfontfamily{\arabicfam}{Amiri}[Script=Arabic, Scale=1, Renderer=HarfBuzz]
\begin{document}
\setbox0=\hbox{Příliš žluťoučký \textit{kůň} úpěl \hbox{ďábelské} ódy difference diffierence. \devanagarifam एक गांव -- में मोहन नाम का लड़का रहता था। उसके पिताजी एक मामूली मजदूर थे।}
\setbox1=\hbox{\arabicfam \textdir TRT هذه المقالة عن براغ. لتصفح عناوين مشابهة، انظر براغ (توضيح).}
\begin{luacode*}
-- local fontstyles = require "l4fontstyles"
local char = unicode.utf8.char
local glyph_id = node.id("glyph")
local glue_id = node.id("glue")
local hlist_id = node.id("hlist")
local vlist_id = node.id("vlist")
local disc_id = node.id("disc")
local minglue = tex.sp("0.2em")
local usedcharacters = {}
local identifiers = fonts.hashes.identifiers
local fontcache = {}
local function to_unicode_chars(uchar)
local uchar = uchar or ""
-- put characters into a table
local current = {}
-- each codepoint is 4 bytes long, we loop over tounicode entry and cut it into 4 bytes chunks
for i= 1, string.len(uchar), 4 do
local cchar = string.sub(uchar, i, i + 3)
-- the codepoint is a hex string, we need to convert it to a number and then to a UTF-8 char
table.insert(current,char(tonumber(cchar,16)))
end
return current
end
-- cache character lookup, to speed up things
local function get_character_from_cache(xchar, font_id)
local current_font = fontcache[font_id] or {characters = {}}
fontcache[font_id] = current_font -- initialize font cache for the current font if it doesn't exist
return current_font.characters[xchar]
end
-- save characters to cache for faster lookup
local function save_character_to_cache(xchar, font_id, replace)
fontcache[font_id].characters[xchar] = replace
-- return value
return replace
end
local function initialize_harfbuzz_cache(font_id, hb)
-- save some harfbuzz tables for faster lookup
local current_font = fontcache[font_id]
-- the unicode data can be in two places
-- 1. hb.shared.glyphs[glyphid].backmap
current_font.glyphs = current_font.glyphs or hb.shared.glyphs
-- 2. hb.shared.unicodes
-- it contains mapping between Unicode and glyph id
-- we must create new table that contains reverse mapping
if not current_font.backmap then
current_font.backmap = {}
for k,v in pairs(hb.shared.unicodes) do
current_font.backmap[v] = k
end
end
-- save it back to the font cache
fontcache[font_id] = current_font
return current_font.glyphs, current_font.backmap
end
local function get_unicode(xchar,font_id)
-- try to load character from cache first
local current_char = get_character_from_cache(xchar, font_id)
if current_char then return current_char end
-- get tounicode for non HarfBuzz fonts
local characters = identifiers[font_id].characters
local uchar = characters[xchar].tounicode
-- stop processing if tounicode exists
if uchar then return save_character_to_cache(xchar, font_id, to_unicode_chars(uchar)) end
-- detect if font is processed by Harfbuzz
local hb = identifiers[font_id].hb
-- try HarfBuzz data
if not uchar and hb then
-- get glyph index of the character
local index = characters[xchar].index
-- load HarfBuzz tables from cache
local glyphs, backmap = initialize_harfbuzz_cache(font_id, hb)
-- get tounicode field from HarfBuzz glyph info
local tounicode = glyphs[index].tounicode
if tounicode then
return save_character_to_cache(xchar, font_id, to_unicode_chars(tounicode))
end
-- if this fails, try backmap, which contains mapping between glyph index and Unicode
local backuni = backmap[index]
if backuni then
return save_character_to_cache(xchar, font_id, {char(backuni)})
end
-- if this fails too, discard this character
return save_character_to_cache(xchar, font_id, {})
end
-- return just the original char if everything else fails
return save_character_to_cache(xchar, font_id, {char(xchar)})
end
local function nodeText(n)
-- output buffer
local t = {}
for x in node.traverse(n) do
-- glyph node
if x.id == glyph_id then
-- get table with characters for current node.char
local chars = get_unicode(x.char,x.font)
for _, current_char in ipairs(chars) do
-- save characters to the output buffer
table.insert(t,current_char)
end
-- glue node
elseif x.id == glue_id and node.getglue(x) > minglue then
table.insert(t," ")
-- discretionaries
elseif x.id == disc_id then
table.insert(t, nodeText(x.replace))
-- recursively process hlist and vlist nodes
elseif x.id == hlist_id or x.id == vlist_id then
table.insert(t,nodeText(x.head))
end
end
return table.concat(t)
end
local n = tex.getbox(0)
local n1 = tex.getbox(1)
print(nodeText(n.head))
local f = io.open("hello.txt","w")
f:write(nodeText(n.head))
f:write(nodeText(n1.head))
f:close()
\end{luacode*}
\box0
\box1
\end{document}
I've also added an Arabic sample from Wikipedia. Here is the content of hello.txt:
Příliš žluťoučký kůň úpěl ďábelské ódy difference diffierence. एक गांव -- में मोहन नाम का लड़का रहता था। उसके पताजी एक मामूली मजदूर थे।هذه المقالة عن براغ. لتصفح عناوين مشابهة، انظر براغ (توضيح).
Two important functions are these:
local function to_unicode_chars(uchar)
local uchar = uchar or ""
local current = {}
for i= 1, string.len(uchar), 4 do
local cchar = string.sub(uchar, i, i + 3)
table.insert(current,char(tonumber(cchar,16)))
end
return current
end
The to_unicode_chars function splits tounicode entries into four-byte chunks, which are then converted to UTF-8 characters. It can also handle glyphs without tounicode entries; in that case it just returns an empty table.
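For illustration, with hypothetical inputs (the hex strings here are made up for the example):
-- to_unicode_chars("0905")     --> { "अ" }       (one codepoint, U+0905)
-- to_unicode_chars("09150941") --> { "क", "ु" }  (two codepoints: क plus vowel sign u)
-- to_unicode_chars(nil)        --> { }            (no tounicode entry at all)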
local function get_unicode(xchar,font_id)
-- try to load character from cache first
local current_char = get_character_from_cache(xchar, font_id)
if current_char then return current_char end
-- get tounicode for non HarfBuzz fonts
local characters = identifiers[font_id].characters
local uchar = characters[xchar].tounicode
-- stop processing if tounicode exists
if uchar then return save_character_to_cache(xchar, font_id, to_unicode_chars(uchar)) end
-- detect if font is processed by Harfbuzz
local hb = identifiers[font_id].hb
-- try HarfBuzz data
if not uchar and hb then
-- get glyph index of the character
local index = characters[xchar].index
-- load HarfBuzz tables from cache
local glyphs, backmap = initialize_harfbuzz_cache(font_id, hb)
-- get tounicode field from HarfBuzz glyph info
local tounicode = glyphs[index].tounicode
if tounicode then
return save_character_to_cache(xchar, font_id, to_unicode_chars(tounicode))
end
-- if this fails, try backmap, which contains mapping between glyph index and Unicode
local backuni = backmap[index]
if backuni then
return save_character_to_cache(xchar, font_id, {char(backuni)})
end
-- if this fails too, discard this character
return save_character_to_cache(xchar, font_id, {})
end
-- return just the original char if everything else fails
return save_character_to_cache(xchar, font_id, {char(xchar)})
end
This function first tries to load Unicode data from the current font info. If that fails, it tries a lookup in the HarfBuzz tables. Most characters have a tounicode mapping in the glyphs table. If that isn't available, it tries the backmap built from the unicodes table, which maps glyph indices back to Unicode. If even this fails, the character is discarded.
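For readers who want to inspect these tables themselves, a small diagnostic sketch (it assumes luaotfload's lualibs are loaded, which is where table.serialize comes from, and uses the current font only as an example):
local id = font.current()
local tfmdata = fonts.hashes.identifiers[id]
if tfmdata and tfmdata.hb then
  -- dump the Unicode -> glyph-index table that the backmap above is built from
  texio.write_nl(table.serialize(tfmdata.hb.shared.unicodes, "unicodes"))
end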
Answered by michal.h21 on November 29, 2020