Most efficient method to strip all of the LaTeX code from a document?

Question

Can anyone suggest the most efficient method to strip all of the LaTeX code from a document?

The best method that comes to mind, though I have no idea how to do it, would be some sort of latexmk_flat_file command that generates a flat text file (without code) instead of a *.pdf.

Running optical character recognition on the *.pdf would produce lots of errors and require a substantial amount of manual cleanup.

Selecting and copying text from the resulting *.pdf file gives unwanted line breaks, and most viewers don't let me select all of the text across multiple pages.

I used a trial version of Tex2Word by Chikrii, but it was unable to properly handle the type of LaTeX business letter that I am currently using.

catdvi appears to have been last updated in 2002, and the kpathsea library presently used by TeX Live for Mac OS X does not have what is required to install the universal distribution of catdvi-0.14 -- i.e., lkpathsea is missing (and perhaps others).

I would like to keep the tabs, spaces, and original line endings.

This is a task that will need to be completed by me several times each month.

With respect to the working draft perl script written by cmhughes, these are the most common codes (modified for the perl script) that are contained within my LaTeX documents:

s/\\begin\{.*?\}(\[.*?\])?(\{.*?\})?//g;
s/\\end\{.*?\}//g;
s/\\hspace\*\{.*?\}//g;
s/\\vspace\*\{.*?\}//g;
s/\\tab //g;
s/~\\\\//g;
s/\\>//g;
s/\\=//g;
s/\\textit\{//g;
s/\\newpage//g;
s/\{\\bf \\underline\{//g;
s/\{\\bsi\{//g;
s/\\uuline\{//g;
s/\\underline\{//g;
s/\}//g;
s/\\\\//g;
s/~//g;

cmhughes · Answer

Here's a little Perl script that might get you started. You can use it as perl removelatexcode.pl myfile.tex myfile1.tex and can call it with as many files as you like (as written, it reads only the files named on the command line).

It does the following:
  loops through each line in the file, and only starts working once it hits \begin{document}
  once it is in the main document, it matches patterns such as \begin{...}, \end{...} and \command{...}; you can add to it as you see fit
  when run with -o, copies your input file, myfile.tex, to myfile.tex.bak just in case something goes wrong

The way the code stands, it won't overwrite the original file. Once you're happy with it, and have tested it to your liking, feel free to go ahead and use the file as perl removelatexcode.pl -o myfile.tex which will overwrite myfile.tex.

Always be careful when using scripts like this. There was no malicious intent here, but you should test it thoroughly before using it on live files.

If there are some commands for which you wish to keep the argument, for example \underline{keep this argument}, then simply populate my %keeparguments=("textit"=>1, "underline"=>1, ); with the appropriate commands.
removelatexcode.pl

#!/usr/bin/perl

use strict;
use warnings;
use File::Copy;
use Getopt::Std;

# get the options
my %options=();
getopts("o", \%options);

my $inpreamble=1;   # switch for in the preamble or not
my $filename;
my @lines=();       # @lines: stores the new lines without commands

# commands for which we want to keep the arguments- populate
# as necessary
my %keeparguments=("textit"=>1,
                   "underline"=>1,
                  );

while (@ARGV)
{
    # get filename from arguments
    $filename = shift @ARGV;

    # open the file for reading
    open(INPUTFILE,"<",$filename) or die "Can't open $filename";

    # reset the preamble switch
    $inpreamble=1;

    # reset the lines array
    @lines=();

    # loop through the lines in the INPUT file
    while(<INPUTFILE>)
    {
        # check that the document has begun
        if($_ =~ m/\\begin\{document.*/)
        {
            $inpreamble=0;
        }

        # ignore the preamble, and make string substitutions in
        # the main document
        if(!$inpreamble)
        {
            # remove \begin{...}[...]{...}
            s/\\begin\{.*?\}(\[.*?\])?(\{.*?\})?//g;

            # remove \end{...}
            s/\\end\{.*?\}//g;

            # remove \command{with argument}
            while ($_ =~ m/\\(.*?)\{.*?\}/)
            {
                if($keeparguments{$1})
                {
                    # keep the argument, drop the command
                    s/\\.*?\{(.*?)\}/$1/;
                }
                else
                {
                    # drop the command and its argument
                    s/\\.*?\{.*?\}//;
                }
            }

            # print the current line (if we're not overwriting the current file)
            print $_ if(!$options{o});
            push(@lines,$_);
        }
    }

    # close the file
    close(INPUTFILE);

    # if we want to overwrite the current file
    if ($options{o})
    {
        # make a backup of each file
        my $backupfile= "$filename.bak";
        copy($filename,$backupfile);

        # reopen the input file to overwrite it
        open(INPUTFILE,">",$filename) or die "Can't open $filename";
        print INPUTFILE @lines;
        close(INPUTFILE);

        # output to terminal
        print "Backed up original file to $filename.bak\n";
        print "Overwritten original file without commands\n";
    }
}

exit

Here's a little test case:

myfile.tex

\documentclass{article}
% in the preamble
% in the preamble
% in the preamble
\begin{document}
\begin{myenvironment}
text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text
text text
\end{myenvironment}
\mycommand{argument} more text after it
\anothercommand{another argument}
\textit{keep this argument} more text after it
\anothercommand{another argument} yet more text
\anothercommand{another argument} yet more text
\textit{keep this argument} more text after it
\begin{anotherenvironment}[optional arguments]
could have text here
other other other other other other other other other other other other other other other other other other other other other other other other other other other other other other other other other other other other other other other other
\end{anotherenvironment}
\begin{anotherenvironment}[optional arguments]{mandatory args}
could have text here
another another another another another another another another another another another another another another another another another another another another another another another another
\end{anotherenvironment}
can have text here
\end{document}

and the output of perl removelatexcode.pl myfile.tex (the empty lines left behind by removed commands are omitted here):

Output

text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text text
text text
 more text after it
keep this argument more text after it
 yet more text
 yet more text
keep this argument more text after it
could have text here
other other other other other other other other other other other other other other other other other other other other other other other other other other other other other other other other other other other other other other other other
could have text here
another another another another another another another another another another another another another another another another another another another another another another another another
can have text here

A few words about regexp

You'll notice the script uses lines such as

s/\\begin\{.*?\}(\[.*?\])?(\{.*?\})?//g;

This matches \begin{...}, \begin{...}[...] and \begin{...}[...]{...}, but it does so in a non-greedy way. The .*? makes the match non-greedy, and the ? after each grouping () makes that group optional. If these matches were greedy (which they would be without the ?), each .* would run on to the last closing brace or bracket on the line, and you would get a lot of potentially unwanted results.
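To make the greedy versus non-greedy difference concrete, here is a minimal shell sketch using perl one-liners. The sample line is made up for illustration, and literal braces are backslash-escaped, which newer Perl versions require.

```shell
# Hypothetical sample line (not from the test file above)
line='\begin{center} keep me {also keep}'

# Non-greedy: .*? stops at the first closing brace
nongreedy=$(printf '%s\n' "$line" | perl -pe 's/\\begin\{.*?\}//')

# Greedy: .* runs on to the last closing brace on the line
greedy=$(printf '%s\n' "$line" | perl -pe 's/\\begin\{.*\}//')

printf 'non-greedy: [%s]\n' "$nongreedy"   # prints: non-greedy: [ keep me {also keep}]
printf 'greedy:     [%s]\n' "$greedy"      # prints: greedy:     []
```

With the greedy pattern the whole line is swallowed, because the last closing brace on the line belongs to unrelated text.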

Pandoc User · Answer

Pandoc accepts many different input formats including LaTeX and can produce a variety of outputs including plain text. To try Pandoc online, visit the Try pandoc! site.

As stated on the Pandoc website:

If you need to convert files from one markup format into another, pandoc is your swiss-army knife. Pandoc can convert documents in markdown, reStructuredText, textile, HTML, DocBook, LaTeX, or MediaWiki markup to

  HTML formats: XHTML, HTML5, and HTML slide shows using Slidy, Slideous, S5, or DZSlides.
  Word processor formats: Microsoft Word docx, OpenOffice/LibreOffice ODT, OpenDocument XML
  Ebooks: EPUB version 2 or 3, FictionBook2
  Documentation formats: DocBook, GNU TexInfo, Groff man pages
  TeX formats: LaTeX, ConTeXt, LaTeX Beamer slides
  PDF via LaTeX
  Lightweight markup formats: Markdown, reStructuredText, AsciiDoc, MediaWiki markup, Emacs Org-Mode, Textile
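For the plain-text case asked about here, a conversion might look like the following sketch. The file names and the sample letter are made up for illustration. Pandoc normalizes whitespace when it reads LaTeX, so tabs and runs of spaces will probably not survive exactly, and whether it copes with a particular letter class would need testing.

```shell
# Hypothetical sample input; myletter.tex is an assumed file name
cat > myletter.tex <<'EOF'
\documentclass{article}
\begin{document}
Dear \textit{Sir or Madam},

thank you for your \underline{letter}.
\end{document}
EOF

if command -v pandoc >/dev/null 2>&1; then
    # -f latex: read LaTeX, -t plain: write plain text, -o: output file
    # --wrap=preserve asks pandoc to keep the source line breaks where it can
    pandoc -f latex -t plain --wrap=preserve -o myletter.txt myletter.tex
    cat myletter.txt
else
    echo "pandoc is not installed" >&2
fi
```

The plain writer may still mark emphasis in its own way (for example with underscores), so a final clean-up pass could be needed.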

mbork · Answer

In the spirit of the Pandoc answer, I'd like to suggest the excellent Org-mode for the Emacs editor. Once you are comfortable with Emacs (which might take a few days, but if you want to edit lots of text files efficiently, this is a wise investment), Org-mode is very easy to start with. It has powerful export options (including LaTeX, ODT, HTML, and more), is wholly based on plain text files, and comes with task and time management systems and much more.

Disclaimer: Org-mode is a free tool and I'm not affiliated with it;).

vonbrand · Answer

The command detex (on CTAN) fits the bill and is included in TeX Live, but it is marked "obsolete", with untex and a few others suggested instead (none of which is included in TeX Live/MiKTeX as far as I can see).
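A typical invocation, sketched here with a made-up sample file, writes the stripped text to standard output, so it can be redirected into a file. The file names are assumptions, and the exact spacing of the result depends on the detex version installed.

```shell
# Hypothetical sample input; sample.tex is an assumed file name
cat > sample.tex <<'EOF'
\documentclass{article}
\begin{document}
Some \textit{plain} words.
\end{document}
EOF

if command -v detex >/dev/null 2>&1; then
    # detex reads the named file and prints the text without LaTeX commands
    detex sample.tex > sample.txt
    cat sample.txt
else
    echo "detex is not installed" >&2
fi
```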
