Capitalizing strings ignoring closed class words

Question

I was just reviewing my "capitalization standards" for titles and such and was wondering if there's a macro to do the same thing I'm forced to do by hand nowadays. My personal rules (feel free to disagree/comment on them) are as follows:

the first letter (a number is a letter too for the sake of these rules) always gets capitalized,
every word gets capitalized individually,
the exception to the rule above (and just the rule above) are closed class words such as prepositions and the like.

In other words: I'd like a capitalization command (like \MakeUppercase) that will capitalize every word not included in a list of words and that will always capitalize the first word of its argument.

Doable?

PS: one such "list" of closed class words (also known as "function words") can be found here.

egreg · Accepted Answer

\documentclass[a4paper]{article}
\usepackage[latin1]{inputenc}
\usepackage{xparse}
\ExplSyntaxOn
\NewDocumentCommand{\capitalize}{>{\SplitList{~}}m}{
  \CapitalizeFirst#1\Capitalize\unskip
}
\ExplSyntaxOff
\def\Sentinel{\Capitalize}
\def\CapitalizeFirst#1{\MakeUppercase#1 \Capitalize}
\def\Capitalize#1{%
  \def\next{#1}%
  \ifx\next\Sentinel
    \expandafter\unskip
  \else
    \CheckInList{#1}\space\expandafter\Capitalize
  \fi}
\def\CheckInList#1{%
  \ifcsname List@\detokenize{#1}\endcsname
    #1%
  \else
    \MakeUppercase#1%
  \fi}
\makeatletter
\def\AppendToList#1{%
  \@for\next:=#1\do
  {\expandafter\let\csname List@\detokenize\expandafter{\next}\endcsname\empty}}
\makeatother
\AppendToList{a,is,of}

\begin{document}
\capitalize{here is a list of words école}
\end{document}

Won't work with UTF-8 (unless XeLaTeX or LuaLaTeX are used).

It won't work with UTF-8 in pdflatex because \MakeUppercase will be applied only to the first byte of a two-, three-, or four-byte sequence (for Western languages probably only two). For that to work, the whole block of bytes has to be fed to \MakeUppercase.

To be clearer: when we say \MakeUppercase, LaTeX uppercases its argument; in general the call is \MakeUppercase{word}. Here we're saying \MakeUppercase#1 instead (without braces), so only the first token (usually a character) is uppercased. This is where it fails with input such as \'ecole: the token passed to \MakeUppercase would be \', which it doesn't know what to do with. With école (and a one-byte encoding such as latin1), \MakeUppercase processes é and gives the correct result.

With UTF-8 this would fail: what we see as é on our screen when writing a LaTeX document is actually two bytes (C3 and A9, for é) and again MakeUppercase would be passed only the first one. So a more complex routine is necessary.
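The difference between the braced and unbraced calls described above can be seen in a minimal sketch. The word used is only an illustration, and it is kept to ASCII so that the encoding issue does not interfere:

```latex
\documentclass{article}
\begin{document}
% Braced: the whole word is the argument, so every letter is uppercased
\MakeUppercase{word}

% Unbraced: only the first token (the letter w) is taken as the argument
\MakeUppercase word
\end{document}
```

The first line should typeset as WORD and the second as Word, which is the behaviour the macros above rely on for capitalizing just the initial letter.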

In order to have this work with pdflatex and UTF-8, the definitions of \CheckInList and \CapitalizeFirst above can be changed into the following:

\makeatletter % the macro names below contain @
\def\CapitalizeFirst#1{\expandafter\UC@next#1 \Capitalize}
\def\CheckInList#1{%
  \ifcsname List@\detokenize{#1}\endcsname
    #1%
  \else
    \expandafter\UC@next#1%
  \fi}
% Look at the first token and pick the macro that grabs the right number of bytes
\def\UC@next#1{%
  \ifx#1\UTFviii@two@octets
     \expandafter\@firstoffour
  \else
    \ifx#1\UTFviii@three@octets
      \expandafter\expandafter\expandafter\@secondoffour
    \else
      \ifx#1\UTFviii@four@octets
        \expandafter\expandafter\expandafter\expandafter\expandafter
        \expandafter\expandafter\@thirdoffour
      \else
        \expandafter\expandafter\expandafter\expandafter\expandafter
        \expandafter\expandafter\@fourthoffour
      \fi
    \fi
  \fi
  {\UC@two}{\UC@three}{\UC@four}{\MakeUppercase}#1}
\def\UC@two#1#2#3{\MakeUppercase{#1#2#3}}
\def\UC@three#1#2#3#4{\MakeUppercase{#1#2#3#4}}
\def\UC@four#1#2#3#4#5{\MakeUppercase{#1#2#3#4#5}}
\providecommand\@firstoffour[4]{#1}
\providecommand\@secondoffour[4]{#2}
\providecommand\@thirdoffour[4]{#3}
\providecommand\@fourthoffour[4]{#4}
\makeatother

However, accent commands are not allowed (nor are they in the other version).

UPDATE

After a few years, here's a better implementation, thanks to new expl3 features; it works for all engines.

\documentclass[a4paper]{article}

\usepackage{ifxetex}

\ifxetex
  \usepackage{fontspec}
\else
  \usepackage[T1]{fontenc}
  \usepackage[utf8]{inputenc}
\fi

\usepackage{xparse}

\ExplSyntaxOn
\NewDocumentCommand{\capitalize}{>{\SplitList{~}}m}
 {
  \seq_clear:N \l_capitalize_words_seq
  \ProcessList{#1}{\CapitalizeFirst}
  \seq_use:Nn \l_capitalize_words_seq { ~ }
 }
\NewDocumentCommand{\CapitalizeFirst}{m}
 {
  \capitalize_word:n { #1 }
 }

\sys_if_engine_pdftex:TF
 {
  \cs_set_eq:Nc \capitalize_tl_set:Nn { protected@edef }
 }
 {
  \cs_set_eq:NN \capitalize_tl_set:Nn \tl_set:Nn
 }

\cs_new_protected:Nn \capitalize_word:n
 {
  \capitalize_tl_set:Nn \l_capitalize_word_tl { #1 }
  \seq_if_in:NfTF \g_capitalize_exceptions_seq { \tl_to_str:n { #1 } }
   % exception word
   { \seq_put_right:Nn \l_capitalize_words_seq { #1 } } % exception word
   % to be uppercased
   { \seq_put_right:Nx \l_capitalize_words_seq { \tl_mixed_case:V \l_capitalize_word_tl } }
 }
\cs_generate_variant:Nn \tl_mixed_case:n { V }
\NewDocumentCommand{\AppendToList}{m}
 {
  \clist_map_inline:nn { #1 }
   {
    \seq_gput_right:Nx \g_capitalize_exceptions_seq { \tl_to_str:n { ##1 } }
   }
 }
\cs_generate_variant:Nn \seq_if_in:NnTF { Nf }
\seq_new:N \l_capitalize_words_seq
\seq_new:N \g_capitalize_exceptions_seq
\ExplSyntaxOff

\AppendToList{a,is,of,óf}

\begin{document}
X\capitalize{here is a list of words óf école}X
\end{document}

Marco · Answer

A ConTeXt solution:

You can use the command \applytosplitstringwordspaced for this:

\def\IgnoredWords
  {a,is,to,of,or,and}

\define[1]\CapitalizeWithIgnoreWord
  {\doifinsetelse{#1}\IgnoredWords{#1}{\Words{#1}}}

\def\CapitalizeWithIgnore
  {\applytosplitstringwordspaced\CapitalizeWithIgnoreWord}

\starttext
  \CapitalizeWithIgnore{This is some of my input or another and to the end.}
\stoptext

The \applytosplitstringwordspaced command divides the input into words and passes each word to the macro \CapitalizeWithIgnoreWord, which takes one argument. That macro simply tests whether the given word is a member of the word list: if it is, the word is printed unchanged; otherwise it is printed capitalized.

Steven B. Segletes · Answer

The titlecaps package is newly introduced and demonstrated here: Headings in uppercase. It takes care of titling diacritical marks (e.g., umlauts) and national symbols (e.g., \oe), and is compatible with (i.e., can include in its argument) commands that change the font characteristics, such as \textit{}, \scshape, and \footnotesize. Further, it allows words to be designated as lower-cased, for example prepositions and conjunctions, which are then screened out and not titled. The presence of punctuation should not affect the package's ability either to capitalize a word or to detect it as a pre-designated lower-cased word.
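A minimal sketch of how this might look, assuming the package's \titlecap command and its \Addlcwords command for registering a space-separated list of lower-cased words. The word list and sample text here are only illustrative and are not taken from the package documentation:

```latex
\documentclass{article}
\usepackage{titlecaps}

% Illustrative list of words to leave in lower case (space-separated)
\Addlcwords{a is of}

\begin{document}
\titlecap{here is a list of words}
\end{document}
```

Unlike the comma-separated lists used in the other answers, the list given to \Addlcwords is separated by spaces.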

Nicola Talbot · Answer

The mfirstuc package provides \capitalisewords. You can specify the exceptions with \MFUnocap. For example:

\documentclass{article}

\usepackage{mfirstuc}

\begin{document}
\capitalisewords{the cat sat on the mat.}

\MFUnocap{on}
\MFUnocap{the}
\capitalisewords{the cat sat on the mat.}
\end{document}

The mfirstuc-english package (which automatically loads mfirstuc) provides some common exceptions:

\documentclass{article}

\usepackage{mfirstuc-english}

\begin{document}
\capitalisewords{the cat sat on the mat.}
\end{document}

It doesn't include disputed words or words that are exempt from case changes only under certain circumstances. You can localise \MFUnocap:

\documentclass{article}

\usepackage{mfirstuc-english}

\begin{document}
{% scope
 \MFUnocap{on}
 \capitalisewords{the cat sat on the mat.}
}

\capitalisewords{the cat sat on the mat.}
\end{document}

The switches \MFUhyphenfalse and \MFUhyphentrue determine whether or not to change the case of parts of hyphenated words. The default is \MFUhyphenfalse:

\documentclass{article}

\usepackage{mfirstuc}

\begin{document}
\capitalisewords{server-side includes}

\MFUhyphentrue
\capitalisewords{server-side includes}
\end{document}

wipet · Answer

Using csplain you can implement this in a few lines of basic macros (they use only TeX primitives):

\def\capitalize#1{\capitA#1 {} }
\def\capitA#1 {\capitW#1 \capitB}
\def\capitB#1 {\ifx\relax#1\relax \else \space
   \isinlist\extrawords{ #1 }\iftrue #1\else \capitW#1 \fi
   \expandafter \capitB \fi
}
\def\capitW#1#2 {\uppercase{#1}#2}
\def\isinlist#1#2#3{\begingroup \long\def\tmp##1#2##2\end{\def\tmp{##2}%
   \ifx\tmp\empty \endgroup \csname iffalse\expandafter\endcsname \else
                  \endgroup \csname iftrue\expandafter\endcsname \fi}%
   \expandafter\tmp#1\endlistsep#2\end
}

\def\extrawords{ a is of óf }

X\capitalize{here is a list of words óf école}X

\bye
