TeX - LaTeX: asked by theuses on June 5, 2021
I have an idea for a tool to conveniently skim through arXiv papers/articles. In short, it ought to filter out parts of a document based on their type (images, tables, formulas, paragraphs, etc.). Obviously this would be orders of magnitude easier to do on HTML than on PDF. AFAIK PDF doesn't contain any info about document structure; it's basically a set of instructions like "draw this glyph there", so restoring the structure becomes an (unnecessary) non-trivial OCR problem.
The straightforward solution, then, is to convert the paper's .tex source to HTML. Available options are:
- lwarp: seems to be intended for writing with an HTML target in mind from the outset; requires a lot of source patching
- latex2html: pretty robust, but some content is converted to images, which is not great for responsiveness/a11y
- pandoc: ditched by the arxiv-vanity/engrafo devs; I couldn't find any comparisons with other tools
- make4ht/tex4ht: not sure I understand its internal translation process (.tex -> DVI -> HTML?); overall it performs fine, with occasional artifacts which I have no idea how to fix (hooking into LuaTeX internals? see the post-processing sketch after this list), but some papers fail to render
- latexml: there are efforts to process the entire arXiv to HTML, and looking at the numbers it's not bad, but I haven't tested it myself yet; it's what arxiv-vanity uses, and from a user's perspective the conversion quality is on par with tex4ht
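For what it's worth, make4ht artifacts can often be cleaned up without touching the TeX run at all, by post-processing the generated DOM from a Lua build file. A minimal sketch, assuming make4ht's documented build-file interface; joincharacters is a real make4ht domfilter, though whether it fixes any particular artifact is another matter:

```lua
-- build.lua: post-process the HTML that make4ht produces,
-- rather than hooking into LuaTeX internals.
local domfilter = require "make4ht-domfilter"

-- "joincharacters" merges adjacent <span> elements that carry
-- identical attributes, a common source of visual artifacts.
local process = domfilter { "joincharacters" }

-- Apply the filter to every generated HTML file.
Make:match("html$", process)
```

It would be run as `make4ht -e build.lua paper.tex`.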
Most likely I will use make4ht or latexml after additional testing, but it occurred to me that I could extract relatively "high-level" objects directly, as if all typesetting were thrown away: whatever the TeX engine gets after the LaTeX content macros are processed. LuaTeX glyph nodes seem to be too low-level for my purpose (see the sketch below for what I mean).
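To illustrate "too low-level": this is roughly how one can peek at what LuaTeX actually operates on. A minimal LuaLaTeX sketch, assuming the luacode package; post_linebreak_filter and the node library are real LuaTeX APIs, while the dump format is just illustrative. By this point every macro has been expanded away, and all that is left is a flat list of boxes, glue, and glyphs:

```latex
\documentclass{article}
\usepackage{luacode}
\begin{luacode}
-- Inspect what LuaTeX hands us after a paragraph is broken into
-- lines: a linear node list (hlist/glue/penalty/glyph...), with no
-- trace of the sectioning or markup structure of the source.
local HLIST = node.id("hlist")
local GLYPH = node.id("glyph")
luatexbase.add_to_callback("post_linebreak_filter",
  function (head)
    for line in node.traverse_id(HLIST, head) do
      local chars = {}
      for glyph in node.traverse_id(GLYPH, line.head) do
        chars[#chars + 1] = utf8.char(glyph.char)
      end
      texio.write_nl("line: " .. table.concat(chars))
    end
    return true -- leave the node list untouched
  end, "dump-glyphs")
\end{luacode}
\begin{document}
A short paragraph, just to have something to dump.
\end{document}
```

The output in the .log is just the characters of each line; even spaces are glue nodes rather than glyphs, which is exactly the kind of reconstruction work I want to avoid.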
So my question is: does LuaTeX/TeX internally operate on anything akin to an abstract syntax tree of the document?