TransWikia.com

Is there any reason to use inputenc?

TeX - LaTeX Asked on February 6, 2021

My LaTeX text editor is GNU Emacs 25.1.1, which encodes text files in UTF-8. Is there any reason to specify

\usepackage[utf8]{inputenc}

in the preamble? Even if I migrate to a different computer with a different TeX installation, is there any risk that the migrated LaTeX files will be misinterpreted if I leave out this line?

6 Answers

The basic LaTeX/TeX engine expects (or perhaps is meant to process) pure ASCII input. Whenever your file uses any other characters, the inputenc package comes to the rescue, specifying to the engine how to process the symbols you're typing.

So whenever you use Unicode (non-ASCII) characters, you need the inputenc package to get meaningful output (or sometimes even to get a successful (La)TeX run).

The difference comes with the "naturally UTF-8 compliant" engines, such as LuaTeX and XeTeX, which automatically interpret input files as UTF-8 and won't accept other input encodings: in that case \usepackage[utf8]{inputenc} can be omitted, since it does basically nothing (and is not used internally anyway).

To put it in other terms, the programs do not check whether the file's characters comply with the ASCII standard; they simply interpret them as such.

Answered by Moriambar on February 6, 2021

With the 2018 release of LaTeX the test file below produces

[image: typeset output of the test file]

as UTF-8 is assumed as the default input encoding unless you specify a different encoding to inputenc and the BOM at the start of the file is handled gracefully (ignored in this case).


Original answer

With inputenc commented out I get

[image: typeset output of the test file]

despite typing the input in Emacs.

\documentclass{article}

\usepackage[T1]{fontenc}
%\usepackage[utf8]{inputenc}

\begin{document}

© David Carlisle and cost £2000.
\end{document}

Since there seems to be some discussion about the BOM:

If the above file is saved with the byte order mark (or any printable character) before \documentclass, then you get the error

! LaTeX Error: Missing \begin{document}.

but this is not built into the TeX engine; it is just the default setting of the characters' category codes, which can be changed depending on how you call LaTeX.

The command line

pdflatex '\catcode"EF=9\catcode"BB=9\catcode"BF=9 \input' testfile

would declare the BOM safe, and LaTeX would then process the file without error and give the same bad output as shown above. The presence of the BOM in no way implies UTF-8 encoding to the system.

Answered by David Carlisle on February 6, 2021

There are two types of TeX engines: ones that expect UTF-8 (e.g. LuaTeX and XeTeX), and ones that don't (e.g. pdfTeX).

If a TeX engine that expects UTF-8 is fed a UTF-8-encoded .tex file, the \usepackage[utf8]{inputenc} command can be omitted. In fact, a warning will be issued if it's not.

If a TeX engine doesn't expect UTF-8, and the TeX file contains non-ASCII characters, and the file doesn't contain a suitable inputenc command, strange output may result, as demonstrated by David's answer.
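One way to keep a single source file working on both types of engine is to branch on the engine in the preamble. The following is only a sketch: it assumes the iftex package is installed in a version that provides the \iftutex test (true under XeTeX and LuaTeX), and loading fontspec in the Unicode branch is my choice rather than something required by the answer above.

```latex
\documentclass{article}
\usepackage{iftex}            % assumption: a version that defines \iftutex
\iftutex
  % XeTeX or LuaTeX: the file is read as UTF-8 natively, no inputenc needed
  \usepackage{fontspec}
\else
  % pdfTeX (8-bit engine): declare font and input encodings explicitly
  \usepackage[T1]{fontenc}
  \usepackage[utf8]{inputenc} % already the default on LaTeX 2018 or later
\fi
\begin{document}
© David Carlisle and cost £2000.
\end{document}
```

The same file can then be passed to pdflatex, xelatex or lualatex, provided it is saved as UTF-8.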

Answered by Evan Aad on February 6, 2021

Here are some examples to make explicit some detail (implicit in the other answers), which may help clear up any remaining confusion.

Consider the following Unicode text añ©ⱥ which consists of four characters: a (U+0061), ñ (U+00F1), © (U+00A9) and ⱥ (U+2C65).

A byte is a number from 0 to 255 in decimal, or 00 to FF in hexadecimal. So when encoded with UTF-8, the above "four-character" string corresponds to, in the file, the 8 bytes 61 C3 B1 C2 A9 E2 B1 A5.

tex/pdftex WITHOUT inputenc

The engine sees the input as a stream of bytes (8 bytes in the above example). It treats each of them as a character, and typesets whatever character occupies that slot in the current font encoding: OT1 (the default), T1 (Cork), or whatever else is set up. Examples:

OT1: OT1 is a 7-bit encoding and has no characters for the bytes from 80 upward, so nothing gets typeset for them.

T1: I hope you can see what's happening: each of the 8 bytes is treated as a character and output; e.g. the byte C3 is “Ã” in T1.

tex/pdftex WITH inputenc

With usepackage[utf8]{inputenc}, TeX correctly sees every sequence of UTF-8 bytes as a Unicode character. For example, when TeX sees the byte sequence C3 B1, it understands that you mean the Unicode character U+00F1. (The way this is done is that bytes larger than 127 (80 to FF in hexadecimal) are set up to be active characters that expect further input — this is possible because of a useful design of UTF-8. See texdoc utf8ienc for details.)

TeX still needs to know what to do with that Unicode character. A big bunch of definitions (such as \DeclareUnicodeCharacter{00F1}{\~n} saying what to do with the character U+00F1) are included in the TeX distribution (file texmf-dist/tex/latex/base/utf8.def on TeX Live). So using \usepackage[utf8]{inputenc} will help if your characters have such definitions (again, see texdoc utf8ienc for the full list), or if you're willing to define them yourself.
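As a sketch of "defining them yourself" under pdfLaTeX: the example below assumes that U+2C65 (ⱥ) has no predefined mapping in your installation, and the replacement text (an "a" with a slash overprinted on it) is only a crude stand-in I made up for illustration, not a proper glyph.

```latex
% Compile with pdflatex; the file must be saved as UTF-8.
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}  % already the default on LaTeX 2018 or later
% Tell LaTeX what to typeset when it meets the byte sequence E2 B1 A5 (U+2C65).
% Hypothetical replacement: an "a" with a slash overprinted on it.
\DeclareUnicodeCharacter{2C65}{a\llap{/}}
\begin{document}
añ©ⱥ
\end{document}
```

If the declaration is left out and the character really is undefined, LaTeX stops with an error saying the Unicode character is not set up for use with LaTeX.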

[image: output with inputenc]

With a Unicode-aware engine (XeTeX or LuaTeX)

You don't need inputenc. The engine will expect UTF-8 (by default), and understand the input simply as Unicode characters, and for each of them it simply typesets that character from the currently selected font.
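A minimal sketch of such a document, assuming it is compiled with xelatex or lualatex and that the default font loaded by fontspec contains the characters used:

```latex
% Compile with xelatex or lualatex, not pdflatex.
\documentclass{article}
\usepackage{fontspec}   % Unicode font selection; no inputenc or fontenc needed
% \setmainfont{...}     % optionally choose an installed font with the glyphs you need
\begin{document}
añ© and £2000
\end{document}
```

Here the limiting factor is the font rather than the input encoding: a character missing from the selected font is not typeset, and the log records a "Missing character" message.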

[image: output with XeTeX]

What about BOM?

With UTF-8 the BOM (byte order mark) isn't needed (it was meant for non-byte-oriented encodings, like UTF-16 and UTF-32), and is strongly discouraged. Typical “good” editors won't include it. Just forget about it; you aren't likely to encounter it in practice.

But if somehow your file does end up including it, then it's just a sequence of bytes EF BB BF (the UTF-8 encoding of U+FEFF), and I think you have enough information above to work out what would happen if those bytes were present in the file at what place.

What if my file contains only "normal" characters?

If you mean Latin-script characters without accents, then UTF-8 has the property that it coincides with ASCII on the range 0 to 127 (00 to 7F). So a file containing only those characters, encoded in UTF-8, is indistinguishable from one encoded in ASCII. Naturally, the output is identical too.
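If portability is the main worry, one option (a sketch, not something suggested in the answer itself) is to keep the source pure ASCII and write the non-ASCII characters with standard LaTeX commands, so the input encoding never comes into play:

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\begin{document}
% Every byte in this file is below 80 (hex), so reading it as ASCII,
% Latin-1 or UTF-8 gives exactly the same characters.
a\~n \textcopyright\ David Carlisle and cost \pounds 2000.
\end{document}
```

The price is readability of the source, which is why most people prefer to type the characters directly.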

Answered by ShreevatsaR on February 6, 2021

When using TeXShop on macOS Catalina, inputenc is not the whole story. You also have to set UTF-8 in the TeXShop preferences (Source tab, Encoding pull-down menu).

Answered by AlDante on February 6, 2021

The answer in 2020 is different from when you asked the question. The LaTeX kernel has selected \usepackage[utf8]{inputenc} by default since April 2018.

As of 2020, you only need (and should only use) inputenc if you need to compile files in another encoding on a non-UTF-8 engine. Even there, it might be a good idea to use selinput, so that the file will still work if it gets converted to another encoding.
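A rough sketch of the selinput approach, for a file stored in some legacy 8-bit encoding. Instead of naming the encoding, you list a few characters next to their glyph names, and the package selects an input encoding under which the bytes in the file match those names. The two mappings and the sample text are illustrative choices of mine.

```latex
% Compile with pdflatex.
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage{selinput}
% Each entry pairs a glyph name with that character as actually stored in this file.
\SelectInputMappings{
  adieresis={ä},
  germandbls={ß},
}
\begin{document}
Größe, Käse
\end{document}
```

Because the mappings are converted along with the rest of the file, the declaration stays correct if an editor re-saves the file in another encoding.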

Answered by Davislor on February 6, 2021
