Command to count characters in a specified string

Question

Is there a way to count the number of characters in a specified string?

Suppose I had the following code.

\documentclass{article}
\newcommand{\numchars}[1]{\noindent The string ``#1" has ? characters.}
\begin{document}
\numchars{everything}
\numchars{that's not it!}
\numchars{weird}
\end{document}

How would I make it display the correct character count without having to do a manual count?

Steven B. Segletes · Accepted Answer

If your argument contains macros, the answer would need to change.  Spaces count as characters, though that could be adjusted if you desired.

\documentclass{article}
\usepackage{stringstrings}
\newcommand{\numchars}[1]{\noindent The string ``#1'' has \stringlength{#1} characters.}
\begin{document}
\numchars{everything}
\numchars{that's not it!}
\numchars{weird}
\end{document}

Here's a version that does not count spaces.

\documentclass{article}
\usepackage{stringstrings}
\newcommand{\numchars}[1]{%
  \convertchar[q]{#1}{ }{}%
  \noindent The string ``#1'' has \stringlength{\thestring} characters.
}
\begin{document}
\numchars{everything}
\numchars{that's not it!}
\numchars{weird}
\end{document}

And if you wanted to count only alphabetic characters (ignoring numbers, spaces and punctuation)

\documentclass{article}
\usepackage{stringstrings}
\newcommand{\numchars}[1]{%
  \convertchar[q]{#1}{ }{}%
  \alphabetic[q]{\thestring}%
  \noindent The string ``#1'' has \stringlength{\thestring} characters.
}
\begin{document}
\numchars{everything}
\numchars{that's not it!}
\numchars{weird}
\end{document}

Cão · Answer

The command \newcommand{\numchars}[1]... works well, but I encountered some issues with \stringlength in the stringstrings package. It seems to have a limit of 500 characters, returning zero if you go above that. For example, the code:

\documentclass[11pt]{amsart}
\usepackage{stringstrings}
\newcommand{\numchars}[1]{\noindent The string ``#1'' has \stringlength{#1} characters.}

\begin{document}
\numchars{Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Ut purus elit, vestibulum ut, placerat ac, adipiscing vitae, felis. Curabitur dictum gravida mauris. Nam arcu libero, nonummy eget, consectetuer id, vulputate a, magna. Donec vehicula augue eu neque. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Mauris ut leo. Cras viverra metus rhoncus sem. Nulla et lectus vestibulum urna fringilla ultrices. Phasellus eu tellus sit amet tortor gravida placerat. Integer sapien est, iaculis in, pretium quis, viverra ac, nunc. Praesent eget sem vel leo ultrices bibendum. Aenean faucibus. Morbi dolor nulla, malesuada eu, pulvinar at, mollis ac, nulla. Curabitur auctor semper nulla. Donec varius orci eget risus. Duis nibh mi, congue eu, accumsan eleifend, sagittis quis, diam. Duis eget orci sit amet orci dignissim rutrum.}
\end{document}


The command \StrLen in the xstring package seems to work better. The document:

\documentclass[11pt]{amsart}
\usepackage{xstring}
\newcommand{\numchars}[1]{\noindent The string ``#1'' has {\StrLen{#1}} characters.}

\begin{document}
\numchars{Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Ut purus elit, vestibulum ut, placerat ac, adipiscing vitae, felis. Curabitur dictum gravida mauris. Nam arcu libero, nonummy eget, consectetuer id, vulputate a, magna. Donec vehicula augue eu neque. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Mauris ut leo. Cras viverra metus rhoncus sem. Nulla et lectus vestibulum urna fringilla ultrices. Phasellus eu tellus sit amet tortor gravida placerat. Integer sapien est, iaculis in, pretium quis, viverra ac, nunc. Praesent eget sem vel leo ultrices bibendum. Aenean faucibus. Morbi dolor nulla, malesuada eu, pulvinar at, mollis ac, nulla. Curabitur auctor semper nulla. Donec varius orci eget risus. Duis nibh mi, congue eu, accumsan eleifend, sagittis quis, diam. Duis eget orci sit amet orci dignissim rutrum.}
\end{document}

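If the length is needed for further processing rather than for direct typesetting, xstring can store it in a macro through the optional trailing argument of \StrLen. The following is only a minimal sketch: the names \checklength and \mylength and the threshold of 12 are made up for illustration and are not part of the answer above.

```latex
\documentclass{article}
\usepackage{xstring}
% \checklength and \mylength are hypothetical names; 12 is an arbitrary threshold.
\newcommand{\checklength}[1]{%
  \StrLen{#1}[\mylength]% store the length in \mylength instead of typesetting it
  \noindent The string ``#1'' has \mylength\ characters%
  \ifnum\mylength>12 \space(rather long)\fi.\par
}
\begin{document}
\checklength{everything}
\checklength{that's not it!}
\end{document}
```

Because the count ends up in an ordinary macro, it can be used in \ifnum tests or arithmetic as well as in running text.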

cfr · Answer

Presumably there's an excellent reason not to use l3 syntax because egreg has not written an answer.

Also note: I do not know what I'm doing.

Caveat emptor...

The expl3 package is used to enable l3 syntax. (The LaTeX equivalent of the latest thing since sliced bread.) xparse is used to easily define a starred form of the command which excludes spaces from the character count.

Because l3 is oblivious to spaces by default, the extra work actually goes into the non-starred form of the command which converts all spaces to xs before doing the count.

Note that this solution will count accented characters as decomposed with pdfTeX. With Xe/LuaTeX, it works provided a font supporting the characters is used. Thanks to comments for discussion.

\ExplSyntaxOn  % enable l3 syntax
\tl_new:N \l_qzx_string_tl  % declare a local token list to hold qzx's string
\NewDocumentCommand \numchars { s m }{  % command optionally takes a star and requires a single argument
  \group_begin:
  \tl_set:Nn \l_qzx_string_tl { #2 }  % set the token list to the string we've been fed
  \IfBooleanF { #1 }  % if there is no star
    {  % then replace all instances of a space (~ in l3 syntax) by instances of x
      \tl_replace_all:Nnn \l_qzx_string_tl { ~ } { x }
    }
  % the count of the characters in the token list goes straight into the stream to be typeset but we need to add the spaces we want here explicitly using ~
  \noindent The~string~``#2"~has~\tl_count:N \l_qzx_string_tl{}~characters.\par  % use \par rather than \\ to avoid complaints about bad boxes
  \group_end:
}
\ExplSyntaxOff% turn l3 syntax off so everything is back to normal and giraffes are giraffes once more

Complete code:

\documentclass{article}
\usepackage{expl3,xparse}
\ExplSyntaxOn
\tl_new:N \l_qzx_string_tl
\NewDocumentCommand \numchars { s m }{
  \group_begin:
  \tl_set:Nn \l_qzx_string_tl { #2 }
  \IfBooleanF { #1 }
    {
      \tl_replace_all:Nnn \l_qzx_string_tl { ~ } { x }
    }
  \noindent The~string~``#2"~has~\tl_count:N \l_qzx_string_tl{}~characters.\par
  \group_end:
}
\ExplSyntaxOff
\begin{document}
  \numchars{everything}
  \numchars*{everything}
  \numchars{that's not it!}
  \numchars*{that's not it!}
  \numchars{weird}
  \numchars*{weird}
  \numchars{%
    Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Ut purus elit, vestibulum ut, placerat ac, adipiscing vitae, felis. Curabitur dictum gravida mauris. Nam arcu libero, nonummy eget, consectetuer id, vulputate a, magna. Donec vehicula augue eu neque. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Mauris ut leo. Cras viverra metus rhoncus sem. Nulla et lectus vestibulum urna fringilla ultrices.  Phasellus eu tellus sit amet tortor gravida placerat. Integer sapien est, iaculis in, pretium quis, viverra ac, nunc. Praesent eget sem vel leo ultrices bibendum. Aenean faucibus. Morbi dolor nulla, malesuada eu, pulvinar at, mollis ac, nulla. Curabitur auctor semper nulla. Donec varius orci eget risus. Duis nibh mi, congue eu, accumsan eleifend, sagittis quis, diam. Duis eget orci sit amet orci dignissim rutrum.}
  \numchars*{%
    Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Ut purus elit, vestibulum ut, placerat ac, adipiscing vitae, felis. Curabitur dictum gravida mauris. Nam arcu libero, nonummy eget, consectetuer id, vulputate a, magna. Donec vehicula augue eu neque. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Mauris ut leo. Cras viverra metus rhoncus sem. Nulla et lectus vestibulum urna fringilla ultrices.  Phasellus eu tellus sit amet tortor gravida placerat. Integer sapien est, iaculis in, pretium quis, viverra ac, nunc. Praesent eget sem vel leo ultrices bibendum. Aenean faucibus. Morbi dolor nulla, malesuada eu, pulvinar at, mollis ac, nulla. Curabitur auctor semper nulla. Donec varius orci eget risus. Duis nibh mi, congue eu, accumsan eleifend, sagittis quis, diam. Duis eget orci sit amet orci dignissim rutrum.}
\end{document}

This will certainly break if fed wonderful things before breakfast, is likely to feel a little fragile prior to lunch and will probably need to spend the afternoon recovering before getting an early night.

As I said, caveat emptor....
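A possible variation, offered only as a sketch and not as part of the code above: expl3's string module has \str_count:n, which counts spaces, and \str_count_ignore_spaces:n, which skips them, so the replace-spaces-by-x step could be avoided. The command name \strnumchars is hypothetical, chosen so that it does not clash with \numchars. Note that these functions work on the string representation of the argument, so macros inside it would be counted letter by letter, and under pdfTeX non-ASCII input would presumably be counted per byte.

```latex
\documentclass{article}
\usepackage{expl3,xparse}
\ExplSyntaxOn
% \strnumchars is a hypothetical name used for illustration only
\NewDocumentCommand \strnumchars { s m }
  {
    \noindent The~string~``#2''~has~
    \IfBooleanTF { #1 }
      { \str_count_ignore_spaces:n { #2 } }  % starred: spaces are skipped
      { \str_count:n { #2 } }                % unstarred: spaces are counted
    ~characters. \par
  }
\ExplSyntaxOff
\begin{document}
  \strnumchars{that's not it!}
  \strnumchars*{that's not it!}
\end{document}
```

As in the code above, the starred form is the one that excludes spaces from the count.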

Mico · Answer

Even though the OP has stated that he/she isn't interested in a LuaLaTeX-based solution, others may still value having such a solution. :-)

The following solution works with strings of UTF-8-encoded characters. Because ASCII is a subset of UTF-8, the solution also works with ASCII-encoded strings.

% !TEX TS-program = lualatex
\documentclass{article}
\usepackage{fontspec}
\usepackage{luacode} % for "\luastring" macro
\newcommand{\numchars}[1]{\noindent The string ``#1'' has
    \directlua{tex.sprint(unicode.utf8.len(\luastring{#1}))}
    characters.\par}

\begin{document}
\numchars{everything}
\numchars{öüß}
\end{document}

Aside: If the Lua-side code inappropriately used the function string.len instead of unicode.utf8.len, the macro \numchars would report that öüß has 6 characters. This happens because each of the 3 characters in öüß is encoded using 2 bytes in UTF-8. (The function string.len does a byte count rather than a character count; that's OK if each character is encoded using exactly 1 byte, which is the case for ASCII, though not for most other encodings.) Likewise, the string ø§¶®€œ¥√Ç± would incorrectly be diagnosed as having 22 [!] rather than just 10 characters, as both € and √ are encoded using 3 bytes and the remaining 8 characters are encoded using 2 bytes each. Clearly, it's important to use the function unicode.utf8.len in the present context.
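To see the difference between the two functions side by side, something like the following could be compiled with LuaLaTeX. This is a minimal sketch; the macro names \bytecount and \charcount are hypothetical and exist only for this illustration.

```latex
% !TEX TS-program = lualatex
\documentclass{article}
\usepackage{fontspec}
\usepackage{luacode} % for "\luastring" macro
% \bytecount and \charcount are hypothetical helper names
\newcommand{\bytecount}[1]{\directlua{tex.sprint(string.len(\luastring{#1}))}}
\newcommand{\charcount}[1]{\directlua{tex.sprint(unicode.utf8.len(\luastring{#1}))}}
\begin{document}
\noindent The string ``öüß'' occupies \bytecount{öüß} bytes
but consists of \charcount{öüß} characters.
\end{document}
```

For pure ASCII input the two macros should report the same number, which is why the problem only shows up with accented or other non-ASCII characters.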

Manuel · Answer

The problem Mico points out can be solved in @cfr's solution just by using LuaTeX or XeTeX. If one is bound to the pdfTeX engine, a possible solution is to use the amazingly powerful l3regex package.

Edit: As egreg pointed out, I didn't know that there were so many multibyte prefixes.

\documentclass{scrartcl}
\usepackage{xparse,l3regex}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}

\ExplSyntaxOn
\NewDocumentCommand \numchars { s m }
 {
  \group_begin:
  \tl_set:Nn \l_tmpa_tl { #2 }
  \IfBooleanF { #1 } { \tl_replace_all:Nnn \l_tmpa_tl { ~ } { x } }
  % collapse each multibyte UTF-8 sequence (lead byte plus continuation bytes) into a single x
  \regex_replace_all:nnN { [\x{C2}-\x{DF}].   } { x } \l_tmpa_tl  % 2-byte sequences
  \regex_replace_all:nnN { [\x{E0}-\x{EF}]..  } { x } \l_tmpa_tl  % 3-byte sequences
  \regex_replace_all:nnN { [\x{F0}-\x{F4}]... } { x } \l_tmpa_tl  % 4-byte sequences

  The ~ string ~ ``#2'' ~ has ~ \tl_count:N \l_tmpa_tl \space characters
  \IfBooleanT { #1 } { ~ (ignoring ~ whitespace)} .\par
  \group_end:
 }
\ExplSyntaxOff

\begin{document}
  \numchars{ßöü—} % em-dash
  \numchars{everything}
  \numchars*{everything}
  \numchars{that's not it!}
  \numchars*{that's not it!}
  \numchars{weird}
  \numchars*{weird}
\end{document}

Result

The string “ßöü—” has 4 characters.
The string “everything” has 10 characters.
The string “everything” has 10 characters (ignoring whitespace).
The string “that’s not it!” has 14 characters.
The string “that’s not it!” has 12 characters (ignoring whitespace).
The string “weird” has 5 characters.
The string “weird” has 5 characters (ignoring whitespace).

ecki · Answer

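What about the following solution?

\def\numchars #1{%
   \setbox0=\hbox{\tt#1}% the string, set in the monospaced \tt font
   \setbox1=\hbox{\tt 1}% a single character in the same font
   \count100 = \wd0 % width of the string
   \count101 = \wd1 % width of one character
   \divide\count100 by \count101
   The string "#1" has \the\count100 characters.
}

\numchars{This is the character\"a!}

will produce

The string "This is the characterä!" has 23 characters.

The macro works as long as every character is in the font used for measuring, or can be built from characters in that font (for example cmtt10, which \tt selects in plain TeX).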
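For use inside a LaTeX document, the same width-ratio idea could be sketched as follows. This is an adaptation under assumptions, not the code above: the names \widthcount, \countbox and \unitbox are hypothetical, e-TeX's \numexpr is used for the division, and \frenchspacing is added on the assumption that the typewriter font would otherwise widen the space after sentence-ending punctuation and inflate the count.

```latex
\documentclass{article}
% \countbox, \unitbox and \widthcount are hypothetical names for this sketch
\newsavebox{\countbox}
\newsavebox{\unitbox}
\newcommand{\widthcount}[1]{%
  % \frenchspacing keeps the space after . ! ? at the normal width
  \sbox{\countbox}{\ttfamily\frenchspacing #1}%
  \sbox{\unitbox}{\ttfamily 1}%
  % box widths are coerced to integers (scaled points) inside \numexpr
  \the\numexpr \wd\countbox / \wd\unitbox \relax
}
\begin{document}
\noindent The string ``everything'' has \widthcount{everything} characters.
\end{document}
```

Like the plain TeX version, this only gives sensible results when the measuring font is monospaced and contains every character in the argument.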
