TransWikia.com

Most efficient method to strip all of the LaTeX code from a document?

TeX - LaTeX Asked on December 10, 2020

Can some people please give me some suggestions for the most efficient method to strip all of the LaTeX code from a document?

The best method that comes to my mind, which I have no clue about how to do, is to use some sort of a latexmk_flat_file command that generates a flat text file (without code) instead of a *.pdf.

Running an optical character recognition on the *.pdf will also result in lots of errors and require a substantial amount of manual clean up.

Blocking and copying the resultant *.pdf file gives unwanted line breaks and doesn’t normally permit select all text spanning multiple pages.

I used a trial version of Tex2Word by Chikrii, but it was unable to properly handle the type of LaTeX business letter that I am currently using.

catdvi appears to have been last updated in the year 2002, and the kpathsea library presently used by TexLive for Mac/OSX does not have what is required to install the universal distribution of catdvi-0.14 — i.e., lkpathsea is missing (and perhaps others).

I would like to keep the tabs, spaces, and original line endings.

This is a task that will need to be completed by me several times each month.

With respect to the working draft perl script written by cmhughes, these are the most common codes (modified for the perl script) that are contained within my LaTeX documents:

s/begin{.*?}([.*?])?({.*?})?//g;
s/end{.*?}//g;
s/hspace*{.*?}//g;
s/vspace*{.*?}//g;
s/tab //g;
s/~\//g;
s/>//g;
s/=//g;
s/textit{//g;
s/newpage//g;
s/{bf underline{//g;
s/{bsi{//g;
s/uuline{//g;
s/underline{//g;
s/}//g;
s///g;
s/~//g;

4 Answers

Here's a little perl script that might get you started. You can use it as

 perl removelatexcode.pl myfile.tex myfile1.tex

and can call it with as many files as you like (or you could pipe into it too).

It does the following:

  • copies your input file, myfile.tex to myfile.tex.bak just in case something goes wrong
  • loops through each line in the file, and only starts working once it hits begin{document}
  • once it is in the main document, it matches patterns such as begin{<myenvironmentname>}, end{environmentname}, <name of command> you can add to it as you see fit.

The way the code stands it won't overwrite the original file. Once you're happy with it, and have tested it to your liking, feel free to go ahead and use the file as

 perl removelatexcode.pl -o myfile.tex

which will overwrite myfile.tex.

Always be careful when using scripts like this- there was no malicious intent here, but, you should test it thoroughly before using it on live files.

If there are some commands for which you wish to keep the argument, for example, underline{keep this argument} then simply populate

my %keeparguments=("textit"=>1,
                        "underline"=>1,
                        );

with the appropriate commands.

removelatexcode.pl

#!/usr/bin/perl 

use strict;
use warnings;
use File::Copy;
use Getopt::Std;

# get the options
my %options=();
getopts("o", %options);


my $inpreamble=1; # switch for in the preamble or not
my $filename;
my @lines=();     # @lines: stores the new lines without commands

# commands for which we want to keep the arguments- populate 
# as necessary
my %keeparguments=("textit"=>1,
                        "underline"=>1,
                        );

while (@ARGV)
{
      # get filename from arguments
      $filename = shift @ARGV; 

      # open the file
      open(INPUTFILE,$filename) or die "Can't open $filename";

      # reset the preamble switch
      $inpreamble=1;

      # reset the lines array
      @lines=();     

      # loop through the lines in the INPUT file
      while(<INPUTFILE>)
      {
          # check that the document has begun
          if($_ =~ m/begin{document.*/)
          {
              $inpreamble=0;   
          }
          # ignore the preamble, and make string substitutions in 
          # the main document
         if(!$inpreamble) 
         {
             # remove begin{<stuff>}[<optional arguments>]
             s/begin{.*?}([.*?])?({.*?})?//g;
             # remove end{<stuff>}
             s/end{.*?}//g;
             # remove <commandname>{with argument}
             while ($_ =~ m/(.*?){.*?}/)
             {
                if($keeparguments{$1})
                {
                  s/.*?{(.*?)}/$1/;
                }
                else
                {
                  s/.*?{.*?}//;
                }
             }
             # print the current line (if we're not overwritting the current file)
             print $_ if(!$options{o});
             push(@lines,$_);
         }
     }

     # close the file
     close(INPUTFILE);

     # if we want to over write the current file
     if ($options{o})
     {
         # make a backup of each file
         my $backupfile= "$filename.bak";
         copy($filename,$backupfile);

         # reopen the input file to overwrite it
         open(INPUTFILE,">",$filename) or die "Can't open $filename";
         print INPUTFILE @lines;
         close(INPUTFILE);

         # output to terminal
         print "Backed up original file to $filename.bakn";
         print "Overwritten original file without commands";
     }
}

exit 

Here's a little test case:

myfile.tex

documentclass{article}
% in the preamble
% in the preamble
% in the preamble
begin{document}

begin{myenvironment}
  text text text text text text text text text text 
  text text text text text text text text text text 
  text text text text text text text text text text 
  text text text text text text text text text text 
end{myenvironment}

mycommand{argument} more text after it anothercommand{another argument}

textit{keep this argument} more text after it anothercommand{another argument} yet more text

anothercommand{another argument} yet more text textit{keep this argument} more text after it 

begin{anotherenvironment}[optional arguments] could have text here
  other other other other other other other other other other 
  other other other other other other other other other other 
  other other other other other other other other other other 
  other other other other other other other other other other 
end{anotherenvironment}

begin{anotherenvironment}[optional arguments]{mandatory args} could have text here
  another another another another another another 
  another another another another another another 
  another another another another another another 
  another another another another another another 
end{anotherenvironment} can have text here

end{document}

and the output of

perl removelatexcode.pl myfile.tex

Output

  text text text text text text text text text text 
  text text text text text text text text text text 
  text text text text text text text text text text 
  text text text text text text text text text text 


 more text after it 

keep this argument more text after it  yet more text

 yet more text keep this argument more text after it 

 could have text here
  other other other other other other other other other other 
  other other other other other other other other other other 
  other other other other other other other other other other 
  other other other other other other other other other other 


 could have text here
  another another another another another another 
  another another another another another another 
  another another another another another another 
  another another another another another another 
 can have text here

A few words about regexp

You'll notice the script uses lines such as

s/begin{.*?}([.*?])?({.*?})?//g;

This matches

  • begin{<environmentname>}
  • begin{<environmentname>}[<optional arguments>]
  • begin{<environmentname>}[<optional arguments>]{<mandatory arguments>}

but it does so in a non-greedy way. The .*? makes it no-greedy, and the ? after the grouping () make them optional. If these matches were greedy (which they would be without the ?) then you would get a lot of potentially unwanted results.

Answered by cmhughes on December 10, 2020

Pandoc accepts many different input formats including LaTeX and can produce a variety of outputs including plain text. To try Pandoc online, visit the Try pandoc! site.

As stated on the Pandoc website:

If you need to convert files from one markup format into another, pandoc is your swiss-army knife. Pandoc can convert documents in markdown, reStructuredText, textile, HTML, DocBook, LaTeX, or MediaWiki markup to

  • HTML formats: XHTML, HTML5, and HTML slide shows using Slidy, Slideous, S5, or DZSlides.
  • Word processor formats: Microsoft Word docx, OpenOffice/LibreOffice ODT, OpenDocument XML
  • Ebooks: EPUB version 2 or 3, FictionBook2
  • Documentation formats: DocBook, GNU TexInfo, Groff man pages
  • TeX formats: LaTeX, ConTeXt, LaTeX Beamer slides
  • PDF via LaTeX
  • Lightweight markup formats: Markdown, reStructuredText, AsciiDoc, MediaWiki markup, Emacs Org-Mode, Textile

Answered by Pandoc User on December 10, 2020

In the spirit of the Pandoc answer, I'd like to suggest the excellent Org-mode for the Emacs editor. Once you are comfortable with Emacs (which might take a few days, but if you want edit lots of text files efficiently, this is a wise investment), Org-mode is very easy to start with, and contains not only powerful export options (including LaTeX, ODT, HTML, and more), is wholly based on plain text files, and comes with task and time management systems and much more.

Disclaimer: Org-mode is a free tool and I'm not affiliated with it;).

Answered by mbork on December 10, 2020

The command detex (on CTAN fits the bill (included in TeXlive), but it is marked "obsolete" and suggests untex and a few others (bot none included in TeXlive/MikTeX as far as I see).

Answered by vonbrand on December 10, 2020

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP