TeX - LaTeX: asked by theuses on June 5, 2021
I have an idea for a tool to conveniently skim through arXiv papers/articles. In short, it ought to filter out parts of a document based on their type (images, tables, formulas, paragraphs, etc.). Obviously this would be orders of magnitude easier to do on HTML than on PDF. AFAIK PDF doesn't contain any info about document structure; it's basically a set of instructions like "draw this glyph there", so restoring the structure becomes an (unnecessary) non-trivial OCR problem.
The straightforward solution, then, is to convert the paper's .tex source to HTML. Available options are:
- lwarp: seems to be intended for writing with an HTML target in mind from the outset; requires a lot of source patching
- latex2html: pretty robust, but some content is converted to images, which is not great for responsiveness/a11y
- pandoc: ditched by the arxiv-vanity/engrafo devs; I couldn't find any comparisons with other tools
- make4ht/tex4ht: not sure I understand its internal translation process (.tex -> DVI -> HTML?); overall it performs fine, with occasional artifacts which I have no idea how to fix (hooking into LuaTeX internals? see the post-processing sketch after this list), but some papers fail to render
- latexml: there are efforts to process the entire arXiv to HTML, and looking at the numbers it's not bad, but I haven't tested it myself yet; it's what arxiv-vanity uses, and from a user's perspective the conversion quality is on par with tex4ht
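For what it's worth, make4ht artifacts can often be cleaned up without touching the TeX run at all, by post-processing the generated DOM from a Lua build file. A minimal sketch, assuming make4ht's documented build-file interface; joincharacters is a real make4ht domfilter, though whether it fixes any particular artifact is another matter:

```lua
-- build.lua: post-process the HTML that make4ht produces,
-- rather than hooking into LuaTeX internals.
local domfilter = require "make4ht-domfilter"

-- "joincharacters" merges adjacent <span> elements that carry
-- identical attributes, a common source of visual artifacts.
local process = domfilter { "joincharacters" }

-- Apply the filter to every generated HTML file.
Make:match("html$", process)
```

It would be run as `make4ht -e build.lua paper.tex`.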
Most likely I will use make4ht or latexml after additional testing, but it occurred to me that I could extract relatively "high-level" objects directly, as if all typesetting were thrown away: whatever the TeX engine gets after the LaTeX content macros are processed. LuaTeX glyph nodes seem to be too low-level for my purpose (see the sketch below for what I mean).
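To illustrate "too low-level": this is roughly how one can peek at what LuaTeX actually operates on. A minimal LuaLaTeX sketch, assuming the luacode package; post_linebreak_filter and the node library are real LuaTeX APIs, while the dump format is just illustrative. By this point every macro has been expanded away, and all that is left is a flat list of boxes, glue, and glyphs:

```latex
\documentclass{article}
\usepackage{luacode}
\begin{luacode}
-- Inspect what LuaTeX hands us after a paragraph is broken into
-- lines: a linear node list (hlist/glue/penalty/glyph...), with no
-- trace of the sectioning or markup structure of the source.
local HLIST = node.id("hlist")
local GLYPH = node.id("glyph")
luatexbase.add_to_callback("post_linebreak_filter",
  function (head)
    for line in node.traverse_id(HLIST, head) do
      local chars = {}
      for glyph in node.traverse_id(GLYPH, line.head) do
        chars[#chars + 1] = utf8.char(glyph.char)
      end
      texio.write_nl("line: " .. table.concat(chars))
    end
    return true -- leave the node list untouched
  end, "dump-glyphs")
\end{luacode}
\begin{document}
A short paragraph, just to have something to dump.
\end{document}
```

The output in the .log is just the characters of each line; even spaces are glue nodes rather than glyphs, which is exactly the kind of reconstruction work I want to avoid.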
So my question is: does LuaTeX/TeX internally operate on anything akin to an abstract syntax tree of the document?