TeX - LaTeX Asked by qzx on May 30, 2021
Is there a way to count the number of characters in a specified string?
Suppose I had the following code.
\documentclass{article}
\newcommand{\numchars}[1]{\noindent The string ``#1" has ? characters.}
\begin{document}
\numchars{everything}
\numchars{that's not it!}
\numchars{weird}
\end{document}
How would I make it display the correct character count without having to do a manual count?
If your argument contains macros, the answer would need to change. Spaces count as characters, though that could be adjusted if you desired.
\documentclass{article}
\usepackage{stringstrings}
\newcommand{\numchars}[1]{\noindent The string ``#1'' has \stringlength{#1} characters.}
\begin{document}
\numchars{everything}
\numchars{that's not it!}
\numchars{weird}
\end{document}
Here's a version that does not count spaces.
\documentclass{article}
\usepackage{stringstrings}
\newcommand{\numchars}[1]{%
  \convertchar[q]{#1}{ }{}%
  \noindent The string ``#1'' has \stringlength{\thestring} characters.
}
\begin{document}
\numchars{everything}
\numchars{that's not it!}
\numchars{weird}
\end{document}
And if you wanted to count only alphabetic characters (ignoring numbers, spaces and punctuation):
\documentclass{article}
\usepackage{stringstrings}
\newcommand{\numchars}[1]{%
  \convertchar[q]{#1}{ }{}%
  \alphabetic[q]{\thestring}%
  \noindent The string ``#1'' has \stringlength{\thestring} characters.
}
\begin{document}
\numchars{everything}
\numchars{that's not it!}
\numchars{weird}
\end{document}
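When the count is needed for a test rather than for printing, the quiet mode already used above with \convertchar may help. The following is only a sketch: it assumes, as I understand the stringstrings documentation, that \stringlength[q] suppresses the output and leaves the number in \theresult. The macro name \checklength and the threshold of 8 are made up for illustration.

```latex
\documentclass{article}
\usepackage{stringstrings}
% \checklength is a hypothetical helper, not part of stringstrings.
\newcommand{\checklength}[1]{%
  \stringlength[q]{#1}% quiet mode: nothing is printed here; the count is assumed to be in \theresult
  \noindent The string ``#1'' is
  \ifnum\theresult>8 long\else short\fi\ (\theresult\ characters).\par
}
\begin{document}
\checklength{everything}
\checklength{weird}
\end{document}
```

The same pattern should work after \convertchar[q] if spaces are to be ignored, by measuring \thestring instead of #1.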
Correct answer by Steven B. Segletes on May 30, 2021
The command \newcommand{\numchars}[1]... works well, but I encountered some issues with \stringlength in the stringstrings package. It seems to have a limit of 500 characters, returning zero if you go above that. For example, the code:
\documentclass[11pt]{amsart}
\usepackage{stringstrings}
\newcommand{\numchars}[1]{\noindent The string ``#1'' has \stringlength{#1} characters.}
\begin{document}
\numchars{Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Ut purus elit, vestibulum ut, placerat ac, adipiscing vitae, felis. Curabitur dictum gravida mauris. Nam arcu libero, nonummy eget, consectetuer id, vulputate a, magna. Donec vehicula augue eu neque. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Mauris ut leo. Cras viverra metus rhoncus sem. Nulla et lectus vestibulum urna fringilla ultrices. Phasellus eu tellus sit amet tortor gravida placerat. Integer sapien est, iaculis in, pretium quis, viverra ac, nunc. Praesent eget sem vel leo ultrices bibendum. Aenean faucibus. Morbi dolor nulla, malesuada eu, pulvinar at, mollis ac, nulla. Curabitur auctor semper nulla. Donec varius orci eget risus. Duis nibh mi, congue eu, accumsan eleifend, sagittis quis, diam. Duis eget orci sit amet orci dignissim rutrum.}
\end{document}
Returns: [output image not preserved]
The command \StrLen in the xstring package seems to work better. The document:
\documentclass[11pt]{amsart}
\usepackage{xstring}
\newcommand{\numchars}[1]{\noindent The string ``#1'' has {\StrLen{#1}} characters.}
\begin{document}
\numchars{Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Ut purus elit, vestibulum ut, placerat ac, adipiscing vitae, felis. Curabitur dictum gravida mauris. Nam arcu libero, nonummy eget, consectetuer id, vulputate a, magna. Donec vehicula augue eu neque. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Mauris ut leo. Cras viverra metus rhoncus sem. Nulla et lectus vestibulum urna fringilla ultrices. Phasellus eu tellus sit amet tortor gravida placerat. Integer sapien est, iaculis in, pretium quis, viverra ac, nunc. Praesent eget sem vel leo ultrices bibendum. Aenean faucibus. Morbi dolor nulla, malesuada eu, pulvinar at, mollis ac, nulla. Curabitur auctor semper nulla. Donec varius orci eget risus. Duis nibh mi, congue eu, accumsan eleifend, sagittis quis, diam. Duis eget orci sit amet orci dignissim rutrum.}
\end{document}
Returns: [output image not preserved]
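If the length is needed for something other than printing, \StrLen also accepts a trailing optional argument that stores the result in a macro instead of typesetting it. A minimal sketch follows; the name \qzxlen is made up for this example, and the singular/plural test is only one possible use of the stored number.

```latex
\documentclass{article}
\usepackage{xstring}
\newcommand{\numchars}[1]{%
  \StrLen{#1}[\qzxlen]% store the length in \qzxlen (hypothetical name) instead of printing it
  \noindent The string ``#1'' has \qzxlen\
  character\ifnum\qzxlen=1 \else s\fi.\par
}
\begin{document}
\numchars{everything}
\numchars{x}
\end{document}
```

Because the count lives in an ordinary macro, it can be compared with \ifnum or passed on to other commands.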
Answered by Cão on May 30, 2021
Presumably there's an excellent reason not to use l3 syntax because egreg has not written an answer.
Also note: I do not know what I'm doing.
The expl3 package is used to enable l3 syntax. (The LaTeX equivalent of the latest thing since sliced bread.) xparse is used to easily define a starred form of the command which excludes spaces from the character count.
Because l3 is oblivious to spaces by default, the extra work actually goes into the non-starred form of the command, which converts all spaces to x's before doing the count.
Note that this solution will count accented characters as decomposed with pdfTeX. With Xe/LuaTeX, it works provided a font supporting the characters is used. Thanks to comments for discussion.
\ExplSyntaxOn % enable l3 syntax
\tl_new:N \l_qzx_string_tl % declare a local token list to hold qzx's string
\NewDocumentCommand \numchars { s m }{ % command optionally takes a star and requires a single argument
  \group_begin:
  \tl_set:Nn \l_qzx_string_tl { #2 } % set the token list to the string we've been fed
  \IfBooleanF { #1 } % if there is no star
  { % then replace all instances of a space (~ in l3 syntax) by instances of x
    \tl_replace_all:Nnn \l_qzx_string_tl { ~ } { x }
  }
  % the count of the characters in the token list goes straight into the stream to be typeset but we need to add the spaces we want here explicitly using ~
  \noindent The~string~``#2"~has~\tl_count:N \l_qzx_string_tl{}~characters.\par % use \par rather than \\ to avoid complaints about bad boxes
  \group_end:
}
\ExplSyntaxOff% turn l3 syntax off so everything is back to normal and giraffes are giraffes once more
\documentclass{article}
\usepackage{expl3,xparse}
\ExplSyntaxOn
\tl_new:N \l_qzx_string_tl
\NewDocumentCommand \numchars { s m }{
  \group_begin:
  \tl_set:Nn \l_qzx_string_tl { #2 }
  \IfBooleanF { #1 }
  {
    \tl_replace_all:Nnn \l_qzx_string_tl { ~ } { x }
  }
  \noindent The~string~``#2"~has~\tl_count:N \l_qzx_string_tl{}~characters.\par
  \group_end:
}
\ExplSyntaxOff
\begin{document}
\numchars{everything}
\numchars*{everything}
\numchars{that's not it!}
\numchars*{that's not it!}
\numchars{weird}
\numchars*{weird}
\numchars{%
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Ut purus elit, vestibulum ut, placerat ac, adipiscing vitae, felis. Curabitur dictum gravida mauris. Nam arcu libero, nonummy eget, consectetuer id, vulputate a, magna. Donec vehicula augue eu neque. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Mauris ut leo. Cras viverra metus rhoncus sem. Nulla et lectus vestibulum urna fringilla ultrices. Phasellus eu tellus sit amet tortor gravida placerat. Integer sapien est, iaculis in, pretium quis, viverra ac, nunc. Praesent eget sem vel leo ultrices bibendum. Aenean faucibus. Morbi dolor nulla, malesuada eu, pulvinar at, mollis ac, nulla. Curabitur auctor semper nulla. Donec varius orci eget risus. Duis nibh mi, congue eu, accumsan eleifend, sagittis quis, diam. Duis eget orci sit amet orci dignissim rutrum.}
\numchars*{%
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Ut purus elit, vestibulum ut, placerat ac, adipiscing vitae, felis. Curabitur dictum gravida mauris. Nam arcu libero, nonummy eget, consectetuer id, vulputate a, magna. Donec vehicula augue eu neque. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Mauris ut leo. Cras viverra metus rhoncus sem. Nulla et lectus vestibulum urna fringilla ultrices. Phasellus eu tellus sit amet tortor gravida placerat. Integer sapien est, iaculis in, pretium quis, viverra ac, nunc. Praesent eget sem vel leo ultrices bibendum. Aenean faucibus. Morbi dolor nulla, malesuada eu, pulvinar at, mollis ac, nulla. Curabitur auctor semper nulla. Donec varius orci eget risus. Duis nibh mi, congue eu, accumsan eleifend, sagittis quis, diam. Duis eget orci sit amet orci dignissim rutrum.}
\end{document}
This will certainly break if fed wonderful things before breakfast, is likely to feel a little fragile prior to lunch and will probably need to spend the afternoon recovering before getting an early night.
As I said, caveat emptor....
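For comparison, here is a sketch of a shorter route through the l3 string functions, assuming a reasonably recent LaTeX kernel in which \ExplSyntaxOn and \NewDocumentCommand are available without loading expl3 or xparse. \str_count:n counts spaces and \str_count_ignore_spaces:n does not, so no replacement step is needed. The command name \strnumchars is invented here to avoid clashing with \numchars; note that both functions work on the string representation of the argument, so any macros inside it are counted by the characters of their names rather than by what they typeset.

```latex
\documentclass{article}
\ExplSyntaxOn
% \strnumchars is a hypothetical name; the starred form ignores spaces.
\NewDocumentCommand \strnumchars { s m }
  {
    \noindent The~string~``#2''~has~
    \IfBooleanTF { #1 }
      { \str_count_ignore_spaces:n { #2 } } % starred: spaces are not counted
      { \str_count:n { #2 } }               % unstarred: spaces are counted
    \c_space_tl characters. \par
  }
\ExplSyntaxOff
\begin{document}
\strnumchars{that's not it!}  % 14 characters
\strnumchars*{that's not it!} % 12 characters
\end{document}
```

The caveat about accented characters under pdfTeX applies here as well.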
Answered by cfr on May 30, 2021
Even though the OP has stated that he/she isn't interested in a LuaLaTeX-based solution, others may still value having such a solution. :-)
The following solution works with strings of UTF8-encoded characters. Because ASCII-encoded characters are automatically UTF8-encoded, the solution also works with ASCII-encoded strings.
% !TEX TS-program = lualatex
\documentclass{article}
\usepackage{fontspec}
\usepackage{luacode} % for "\luastring" macro
\newcommand{\numchars}[1]{\noindent The string ``#1'' has
  \directlua{tex.sprint(unicode.utf8.len(\luastring{#1}))}
  characters.\par}
\begin{document}
\numchars{everything}
\numchars{öüß}
\end{document}
Aside: If the Lua-side code inappropriately used the function string.len instead of unicode.utf8.len, the macro \numchars would report that öüß has 6 characters. This happens because each of the 3 characters in öüß is encoded using 2 bytes in UTF-8. (The function string.len does a byte count rather than a character count; that's OK if each character is encoded using exactly 1 byte, which is the case for the ASCII encoding system, though not for most others.) Likewise, the string ø§¶®€œ¥√DZ would incorrectly be diagnosed as having 22 [!] rather than just 10 characters, as both € and √ are encoded using 3 bytes and the remaining 8 characters are encoded using 2 bytes each. Clearly, it's important to use the function unicode.utf8.len in the present context.
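The difference between the two functions can be seen side by side with a small test document. This is only an illustrative sketch: the macro name \bytesvschars is made up, and the file is assumed to be saved as UTF-8 and compiled with LuaLaTeX.

```latex
% !TEX TS-program = lualatex
\documentclass{article}
\usepackage{fontspec}
\usepackage{luacode} % for the \luastring macro
% \bytesvschars is a hypothetical helper that prints both counts.
\newcommand{\bytesvschars}[1]{\noindent ``#1'':
  \directlua{tex.sprint(string.len(\luastring{#1}))} bytes,
  \directlua{tex.sprint(unicode.utf8.len(\luastring{#1}))} characters.\par}
\begin{document}
\bytesvschars{everything} % 10 bytes, 10 characters
\bytesvschars{öüß}        % 6 bytes, 3 characters
\end{document}
```

For pure ASCII input the two counts agree; they diverge as soon as a multibyte character appears.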
Answered by Mico on May 30, 2021
The problem Mico points out can be solved in @cfr's solution just by using LuaTeX or XeTeX. If one is bound to the pdfTeX engine, a possible solution is to use the amazingly powerful l3regex package.
Edit: As egreg pointed out, I didn't know that there were so many multibyte prefixes.
\documentclass{scrartcl}
\usepackage{xparse,l3regex}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\ExplSyntaxOn
\NewDocumentCommand \numchars { s m }
{
  \group_begin:
  \tl_set:Nn \l_tmpa_tl { #2 }
  \IfBooleanF { #1 } { \tl_replace_all:Nnn \l_tmpa_tl { ~ } { x } }
  \regex_replace_all:nnN { [\x{C2}-\x{DF}]. } { x } \l_tmpa_tl
  \regex_replace_all:nnN { [\x{E0}-\x{EF}].. } { x } \l_tmpa_tl
  \regex_replace_all:nnN { [\x{F0}-\x{F4}]... } { x } \l_tmpa_tl
  The ~ string ~ ``#2'' ~ has ~ \tl_count:N \l_tmpa_tl \space characters
  \IfBooleanT { #1 } { ~ (ignoring ~ whitespace)} .\par
  \group_end:
}
\ExplSyntaxOff
\begin{document}
\numchars{ßöü—} % em-dash
\numchars{everything}
\numchars*{everything}
\numchars{that's not it!}
\numchars*{that's not it!}
\numchars{weird}
\numchars*{weird}
\end{document}
Result
The string “ßöü—” has 4 characters.
The string “everything” has 10 characters.
The string “everything” has 10 characters (ignoring whitespace).
The string “that’s not it!” has 14 characters.
The string “that’s not it!” has 12 characters (ignoring whitespace).
The string “weird” has 5 characters.
The string “weird” has 5 characters (ignoring whitespace).
Answered by Manuel on May 30, 2021
What about the following solution?
\def\numchars #1{%
\setbox0=\hbox{\tt#1}
\setbox1=\hbox{\tt 1}
\count100 = \wd0
\count101 = \wd1
\divide\count100 by \count101
The string "#1" has \the\count100\ characters.
}
\numchars{This is the character\"a!}
will produce The string "This is the characterä!" has 23 characters.
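The macro will work if the characters are in, or can be produced from the characters in, the font used (for example cmtt10, which \tt selects in plain TeX). Dividing the widths relies on that font being monospaced.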
Answered by ecki on May 30, 2021