Get texcount to ignore numbers

Question

I'm trying to get texcount to stop counting numbers as words, but setting -alphabets=Latin doesn't seem to solve the problem. Any suggestions?

MWE:

\documentclass{article}
\begin{document}
testing 1 2 3 this should be five
\end{document}

texcount outputs:

➜ texcount FORWORDCOUNT.tex
  File: FORWORDCOUNT.tex
  Encoding: ascii
  Words in text: 8
  Words in headers: 0
  Words outside text (captions, etc.): 0
  Number of headers: 0
  Number of floats/tables/figures: 0
  Number of math inlines: 0
  Number of math displayed: 0

➜ texcount -alphabets=Latin FORWORDCOUNT.tex
  File: FORWORDCOUNT.tex
  Encoding: ascii
  Words in text: 8
  Words in headers: 0
  Words outside text (captions, etc.): 0
  Number of headers: 0
  Number of floats/tables/figures: 0
  Number of math inlines: 0
  Number of math displayed: 0

nick_eu · Accepted Answer

Ugh, figured it out. Needed to set encoding to unicode. This works:

➜ texcount -alphabets=Latin FORWORDCOUNT.tex -unicode
  File: FORWORDCOUNT.tex
  Words in text: 5
  Words in headers: 0
  Words outside text (captions, etc.): 0
  Number of headers: 0
  Number of floats/tables/figures: 0
  Number of math inlines: 0
  Number of math displayed: 0

Einar Rødland · Answer

Updated answer:
Starting with version 3.2, it is possible to distinguish between words and numbers.
There is a new TeXcount instruction, %TC:wordtype {original-rule} {wordtype} {new-rule}, which allows the counting rule to be modified depending on word type: number, mixed (letters and numbers), or nonum (word without any digits).
Add the rule
%TC:wordtype text number ignore

and numbers in the text will be ignored. To do the same for "header words" and "other words", also add the following:
%TC:wordtype otherword number ignore
%TC:wordtype headerword number ignore

The text rule is just an alias for the word rule; the other rules also have aliases, such as oword and hword.
Beware that this feature is somewhat experimental and may change in future versions.
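
For illustration, here is a sketch of how the MWE from the question might look with all three rules in place. Putting the %TC lines in the preamble is an assumption on my part, and since the feature is experimental, the exact behaviour may differ between versions.

```latex
\documentclass{article}
% TeXcount instructions (need TeXcount 3.2 or later):
% ignore pure numbers in text, in headers, and in other text (captions etc.)
%TC:wordtype text number ignore
%TC:wordtype otherword number ignore
%TC:wordtype headerword number ignore
\begin{document}
testing 1 2 3 this should be five
\end{document}
```

If the rules apply as described, a plain texcount run on this file should report five words in text rather than eight, without needing the -alphabets or -unicode options.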

I'm moving from the comments to an answer, although it's not yet an answer to what is going on.
As mentioned in the comments, the option -alphabets=Latin should have worked without the -unicode option. When I test this on Windows 10, it works as it should.
The only effect of the -unicode option is to ensure the file gets decoded as UTF-8 instead of as ASCII, which would be the default since the input is pure ASCII. This could influence how the string is represented internally by Perl, but unless you're running an old Perl version, the internal representation should be UTF-8.
Could you check which Perl version you run? I'd expect perl --version to report some Perl 5 version.
I did some tests with small scripts like this:
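use strict;
use warnings;
use Encode;
use Devel::Peek;
# Decode an ASCII byte string into a Perl character string
my $enc = find_encoding('ascii');
my $x = $enc->decode('test123');
# Show Perl's internal representation, including whether the UTF8 flag is set
Dump($x);
# Wrap every Latin-script letter in brackets; digits should be left untouched
$x =~ s/(\p{Latin})/[$1]/g;
print $x;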

However, no matter what I did, I couldn't get it to output anything other than [t][e][s][t]123, indicating that it had correctly identified the Latin letters.
I even tried my $x=$enc->encode('test123'); which forces a byte representation of $x (without the UTF8 flag set), but it still gave the same result. I thought the Unicode character classes, e.g. Latin, might not work if the string is not in UTF-8 representation, but that didn't seem to be a problem; maybe it would be on older or other Perl versions.
I've been running TeXcount 3.1 on Windows 10 using Perl 5 (v5.24.0), but checked TeXcount 3.0 as well to make sure there were no relevant changes.
