
Input plain text from Word to LaTeX

TeX - LaTeX Asked by qwesix on April 11, 2021

I’m trying to read some text from a Word document into my LaTeX files. I just want the plain text, without math or formatting.

I tried with \input{}, but that doesn’t recognize all UTF-8 characters:

Package inputenc: Unicode character  (U+0003)
(inputenc)  not set up for use with LaTeX.
Text line contains an invalid character.
PK
\documentclass[ngerman, fontsize=12pt]{scrbook}

\usepackage[ngerman]{babel}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{lmodern}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage[hidelinks]{hyperref}

\usepackage[baselinestretch,linenumbers,lines=30,chars=60,noindent]{stdpage}


\begin{document}

    \input{test.docx}

\end{document}

One Answer

A .docx file is actually a binary file, more precisely a ZIP archive containing several files that are compressed/decompressed on the spot.

For instance, if I run the following from the command line,

file /usr/local/texlive/2020/texmf-dist/doc/fonts/tex-gyre-math/test-word-texgyre_termes_math.docx
unzip -l /usr/local/texlive/2020/texmf-dist/doc/fonts/tex-gyre-math/test-word-texgyre_termes_math.docx

just to examine a file included in TeX Live, I get

/usr/local/texlive/2020/texmf-dist/doc/fonts/tex-gyre-math/test-word-texgyre_termes_math.docx: Microsoft Word 2007+

Archive:  /usr/local/texlive/2020/texmf-dist/doc/fonts/tex-gyre-math/test-word-texgyre_termes_math.docx
  Length      Date    Time    Name
---------  ---------- -----   ----
     1554  01-01-1980 00:00   [Content_Types].xml
      590  01-01-1980 00:00   _rels/.rels
     1290  01-01-1980 00:00   word/_rels/document.xml.rels
    63800  01-01-1980 00:00   word/document.xml
     7105  01-01-1980 00:00   word/theme/theme1.xml
     3222  01-01-1980 00:00   word/settings.xml
    17027  01-01-1980 00:00   word/stylesWithEffects.xml
      296  01-01-1980 00:00   customXml/_rels/item1.xml.rels
    16274  01-01-1980 00:00   word/styles.xml
      341  01-01-1980 00:00   customXml/itemProps1.xml
      631  01-01-1980 00:00   docProps/core.xml
      218  01-01-1980 00:00   customXml/item1.xml
     2218  01-01-1980 00:00   word/fontTable.xml
      428  01-01-1980 00:00   word/webSettings.xml
      998  01-01-1980 00:00   docProps/app.xml
---------                     -------
   115992                     15 files
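
The same inspection can be sketched in Python, which may help where file and unzip are not available. This is only an illustration, not part of the original answer: the function names looks_like_zip and list_members are made up for the example. A non-empty ZIP archive starts with the bytes P, K, 0x03, 0x04, which matches the "PK" and the complaint about U+0003 in the error log of the question.

```python
import zipfile

# Local file header signature found at the start of a non-empty ZIP archive.
ZIP_SIGNATURE = b"PK\x03\x04"


def looks_like_zip(path):
    """Return True if the file begins with the ZIP local file header signature."""
    with open(path, "rb") as handle:
        return handle.read(4) == ZIP_SIGNATURE


def list_members(path):
    """Return the names of the files stored in the archive, like `unzip -l`."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()
```

Calling list_members("test.docx") on a Word file should return names such as word/document.xml, the same entries that unzip -l prints.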

The document text is somewhere in those .xml files, specifically in word/document.xml, but it cannot be input in TeX in a straightforward way. I tried with a file containing just abcdef; a small extract from its document.xml file is

<w:body>
  <w:p w14:paraId="47EF316A" w14:textId="128C1C44" w:rsidR="004807B4" w:rsidRPr="004807B4" w:rsidRDefault="004807B4">
    <w:pPr>
      <w:rPr>
        <w:lang w:val="en-US"/>
      </w:rPr>
    </w:pPr>
    <w:r>
      <w:rPr>
        <w:lang w:val="en-US"/>
      </w:rPr>
      <w:t>
        abcdef
      </w:t>
    </w:r>
  </w:p>
  <w:sectPr w:rsidR="004807B4" w:rsidRPr="004807B4">
    <w:pgSz w:w="11906" w:h="16838"/>
    <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="708" w:footer="708" w:gutter="0"/>
    <w:cols w:space="708"/>
    <w:docGrid w:linePitch="360"/>
  </w:sectPr>
</w:body>
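
If the text has to be pulled out without opening Word, the structure above suggests one possible approach: read word/document.xml from the archive and join the contents of the w:t elements, paragraph by paragraph. The following is a minimal sketch under stated assumptions, not the method recommended in this answer. The function names docx_to_plain_text and convert are hypothetical, and the sketch ignores tabs, manual line breaks, footnotes, headers and table layout.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace bound to the "w:" prefix in WordprocessingML documents.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_plain_text(docx_path):
    """Return the text of word/document.xml, one blank line between paragraphs."""
    with zipfile.ZipFile(docx_path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)

    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph may be split into several runs; join their text pieces.
        pieces = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(pieces))
    return "\n\n".join(paragraphs)


def convert(docx_path, txt_path):
    """Write the extracted text as UTF-8 so that inputenc with utf8 can read it."""
    with open(txt_path, "w", encoding="utf-8") as out:
        out.write(docx_to_plain_text(docx_path) + "\n")
```

After convert("test.docx", "test.txt"), the document would use \input{test.txt} instead of \input{test.docx}. Characters that are special in LaTeX, such as %, &, #, _ and $, are not escaped by this sketch and would still need handling.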

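Save your document as “text only” from Word, choosing UTF-8 as the encoding if Word asks for one, since the preamble loads inputenc with the utf8 option. Then it is a plain text file and you can \input it.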

Correct answer by egreg on April 11, 2021
