
LuaTeX: UTF-8 character extraction — is one way more accurate and/or preferable than another?

TeX - LaTeX Asked on April 3, 2021

While experimenting with ways to extract UTF-8 character strings from TeX boxes, I found a post by user michal-h21 here: UTF-8 text extraction. After looking at how glyph data is stored in the node list, I modified the code to try another approach. In my approach, I traverse the components of composed complex glyphs and discretionaries to extract the constituent characters. In his approach, he passes complex glyphs such as ligatures to a function that decomposes them. The output printed by both codes (for the test string in the example) looks the same. Can someone please review them and say whether both approaches are functionally equivalent and correct? (I am aware that my code requires special handling of TeX ligatures; please ignore that.) And if so, which one is better for performance? (I could cache unicode.utf8.char in my code as he does, so please ignore that discrepancy in any comment on performance.)
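To make the contrast concrete, here is a minimal sketch of the two strategies side by side. This is LuaTeX-only code (the node and unicode libraries do not exist in standalone Lua), `lig` is assumed to be a ligature glyph node, and `get_unicode` refers to the helper quoted further down:

    -- Sketch only; runs inside LuaTeX, not in standalone Lua.
    local glyph_id = node.id("glyph")

    -- Approach 1 (mine): walk the ligature's component glyph list.
    local function from_components(lig, t)
      for g in node.traverse_id(glyph_id, lig.components) do
        table.insert(t, unicode.utf8.char(g.char))
      end
    end

    -- Approach 2 (michal-h21): ignore the components and instead decode
    -- the font's tounicode mapping for the ligature glyph.
    local function from_tounicode(lig, t)
      for _, c in ipairs(get_unicode(lig.char, lig.font)) do
        table.insert(t, c)
      end
    end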

Here’s the output text written to the terminal and to an output file hello.txt: Příliš žluťoučký kůň úpěl ďábelské ódy difference diffierence. His complete code is at UTF-8 text extraction. The place where our codes differ is that I don’t use his function get_unicode (below) and just stick with unicode.utf8.char(glyphnode.char) applied to the glyph components, whereas he applies get_unicode to decompose complex glyphs instead of digging a level deeper into the glyph node to get the constituent glyphs (as far as I understand).

local identifiers = fonts.hashes.identifiers -- font data; defined earlier in his code
local char = unicode.utf8.char               -- cached for speed

local function get_unicode(xchar, font_id)
  local current = {}
  -- tounicode is a string of 4-digit hex values, one per UTF-16 code unit
  local uchar = identifiers[font_id].characters[xchar].tounicode
  for i = 1, string.len(uchar), 4 do
    local cchar = string.sub(uchar, i, i + 3)
    print(xchar, uchar, cchar, font_id, i)
    table.insert(current, char(tonumber(cchar, 16)))
  end
  return current
end
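As an aside on what get_unicode consumes: the tounicode value is a string of 4-digit hex UTF-16 code units, so an "ffi" ligature would typically carry something like "006600660069", which the loop slices into "0066", "0066", "0069". A standalone-Lua sketch of just that slicing step (the hex string is an illustrative value, not read from a real font; utf8.char from Lua 5.3+ stands in for LuaTeX's unicode.utf8.char):

    -- Standalone sketch of the tounicode decoding loop.
    local uchar = "006600660069"   -- hypothetical tounicode for an "ffi" ligature
    local current = {}
    for i = 1, string.len(uchar), 4 do
      local cchar = string.sub(uchar, i, i + 3)
      table.insert(current, utf8.char(tonumber(cchar, 16)))
    end
    print(table.concat(current))   -- prints "ffi" as three separate letters

Note that code points outside the BMP appear in tounicode as surrogate pairs, which this simple 4-digit slicing would mis-decode; for the Latin test string here that does not matter.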
\documentclass{article}
\usepackage[lmargin=0.5in,tmargin=0.5in,rmargin=0.5in,bmargin=0.5in]{geometry}
\usepackage{fontspec}
\usepackage{microtype}
\usepackage[english]{babel}
\usepackage{blindtext}

\begin{document}

\setbox0=\hbox{Příliš žluťoučký \textit{kůň} úpěl \hbox{ďábelské} ódy difference diffierence.}

\directlua{
  local glyph_id = node.id("glyph")
  local disc_id = node.id("disc")
  local glue_id  = node.id("glue")
  local hlist_id = node.id("hlist")
  local vlist_id = node.id("vlist")
  local minglue = tex.sp("0.2em")
  local function nodeText(n)
    local t =  {}
    for x in node.traverse(n) do
      % glyph node
      if x.id == glyph_id then
        if bit32.band(x.subtype,2) \csstring~=0 and unicode.utf8.char(x.char) \csstring~="“" and unicode.utf8.char(x.char) \csstring~="”" then %
          for g in node.traverse_id(glyph_id,x.components) do
            if bit32.band(g.subtype, 2) \csstring~=0 then
              for gc in node.traverse_id(glyph_id,g.components) do
                table.insert(t,unicode.utf8.char(gc.char))
              end
            else
              table.insert(t,unicode.utf8.char(g.char))
            end
          end
        else
          table.insert(t,unicode.utf8.char(x.char))
        end
      % disc node
      elseif x.id == disc_id then
        for g in node.traverse_id(glyph_id,x.replace) do
          if bit32.band(g.subtype, 2) \csstring~=0 then
            for gc in node.traverse_id(glyph_id,g.components) do
              table.insert(t,unicode.utf8.char(gc.char))
            end
          else
            table.insert(t,unicode.utf8.char(g.char))
          end
        end
      % glue node
      elseif x.id == glue_id and node.getglue(x) > minglue then
        table.insert(t," ")
      elseif x.id == hlist_id or x.id == vlist_id then
        table.insert(t,nodeText(x.head))
      end
    end
    return table.concat(t)
  end
  local n = tex.getbox(0)
  print(nodeText(n.head))
  local f = io.open("hello.txt","w")
  f:write(nodeText(n.head))
  f:close()

}

\box0

\end{document}
