TeX - LaTeX Asked on April 9, 2021
Given a file with an arbitrary name and a possibly incorrect extension, how can I figure out, programmatically and/or with command-line tools, whether it is a LaTeX file?
I was playing around with data mining source files from arXiv for some hobby ML work, but I ran into a problem.
The source files are packaged as big tar.gz blobs in Amazon S3, which you can download after paying a fee and decompress using tar -xvf.
Once decompressed, you get a folder structure like this:
FolderNameWithMonth/
file1.gz
file2.gz
file3.pdf
file4.gz
file5.pdf
...
So basically, when someone uploads a pre-print to arXiv and uploads nothing other than a PDF, their file will be a PDF in the folder I described above. If they upload a collection of files (source.tex, images, etc.), it gets packed into a gzip archive.
But here’s the kicker: some of the gzip files I listed above aren’t actually gzip files at all! They just happen to be named "filename.gz", and if you were to open them in Notepad (or any other text editor of your choice) you’d be shocked to find they are actually LaTeX files. They’ve merely been misnamed with the .gz extension.
So that leads to our problem: how do I programmatically weed out the gzip imposters here (i.e. check whether their contents are secretly valid LaTeX and therefore don’t need to be unzipped)?
Given a file, you could try to unzip it; if that fails, try to compile it (what compiler should I use?); if that also fails, the file is probably some other weird format that I haven’t considered. But compiling each file is a very expensive operation if you have a lot of files.
Is there a keyword that one can search for? I was thinking of scanning the file as a string of characters and looking for \end{document}, but it turns out the genuine gzip files sometimes have that in their contents as well, so this is less effective than I hoped.
Send an angry email to arXiv asking them to label their file extensions correctly, but they are already very stressed and underpaid, so this might not be an effective strategy.
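For illustration, the first two ideas can be combined without compiling anything. The following is only a rough sketch under stated assumptions, not a definitive test. It relies on the fact that a real gzip stream starts with the two bytes 0x1f 0x8b and a PDF starts with "%PDF-". The function names `classify` and `classify_file`, the marker list, and the 1 MiB read limit are all hypothetical choices made for this example. Because the gzip check runs first, a keyword that happens to appear inside a genuine gzip file no longer causes a false positive.

```python
import re

# A gzip stream always begins with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"

# Assumed heuristic: most LaTeX sources contain at least one of these markers.
# Plain TeX files or included fragments may contain none of them.
LATEX_MARKERS = re.compile(
    rb"\\documentclass|\\documentstyle|\\begin\s*\{document\}"
)


def classify(data: bytes) -> str:
    """Guess the type of a file from its raw bytes.

    Returns one of 'gzip', 'pdf', 'latex', or 'unknown'.
    """
    if data[:2] == GZIP_MAGIC:
        return "gzip"
    if data[:5] == b"%PDF-":
        return "pdf"
    # Text files should not contain NUL bytes; skip binary blobs.
    if b"\x00" not in data and LATEX_MARKERS.search(data):
        return "latex"
    return "unknown"


def classify_file(path: str, max_bytes: int = 1 << 20) -> str:
    """Classify a file on disk, reading at most max_bytes (assumed limit)."""
    with open(path, "rb") as fh:
        return classify(fh.read(max_bytes))
```

A file reported as 'gzip' can still be corrupt, so actually decompressing it remains the final check. Anything reported as 'unknown' (for example plain TeX without \documentclass) would still need a closer look.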