TeX - LaTeX Asked by nick_eu on September 6, 2021
I'm trying to get texcount to not count numbers as words, but setting alphabets=Latin doesn't seem to be solving the problem. Any suggestions?
MWE:
\begin{document}
testing 1 2 3 this should be five
\end{document}
texcount outputs:
➜ texcount FORWORDCOUNT.tex
File: FORWORDCOUNT.tex
Encoding: ascii
Words in text: 8
Words in headers: 0
Words outside text (captions, etc.): 0
Number of headers: 0
Number of floats/tables/figures: 0
Number of math inlines: 0
Number of math displayed: 0
➜ texcount -alphabets=Latin FORWORDCOUNT.tex
File: FORWORDCOUNT.tex
Encoding: ascii
Words in text: 8
Words in headers: 0
Words outside text (captions, etc.): 0
Number of headers: 0
Number of floats/tables/figures: 0
Number of math inlines: 0
Number of math displayed: 0
Ugh, figured it out. I needed to set the encoding to unicode. This works:
➜ texcount -alphabets=Latin FORWORDCOUNT.tex -unicode
File: FORWORDCOUNT.tex
Words in text: 5
Words in headers: 0
Words outside text (captions, etc.): 0
Number of headers: 0
Number of floats/tables/figures: 0
Number of math inlines: 0
Number of math displayed: 0
Correct answer by nick_eu on September 6, 2021
Updated answer:
Starting with version 3.2, it is possible to distinguish between words and numbers.
There is a new TeXcount instruction, %TC:wordtype {original-rule} {wordtype} {new-rule}, which allows the counting rule to be modified depending on word type: number, mixed (letters and numbers), or nonum (word without any digits).
Add the rule
%TC:wordtype text number ignore
and numbers in the text will be ignored. To do the same for "header words" and "other words", also add the following:
%TC:wordtype otherword number ignore
%TC:wordtype headerword number ignore
The text rule is just an alias for the word rule; the other rules also have aliases, such as oword and hword.
Beware that this feature is somewhat experimental and may change in future versions.
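To make the three word types concrete, here is a minimal Python sketch of the classification idea. It is not TeXcount's implementation: the function names word_type and count_words, and the simple \w+ tokenization, are assumptions made for illustration only.

```python
import re

def word_type(token: str) -> str:
    """Classify a token as 'number', 'mixed', or 'nonum' (hypothetical helper)."""
    has_digit = any(ch.isdigit() for ch in token)
    has_letter = any(ch.isalpha() for ch in token)
    if has_digit and not has_letter:
        return "number"   # digits only
    if has_digit and has_letter:
        return "mixed"    # letters and digits
    return "nonum"        # no digits at all

def count_words(text: str, ignore=("number",)) -> int:
    """Count tokens whose word type is not in `ignore` (illustrative only)."""
    tokens = re.findall(r"\w+", text)
    return sum(1 for t in tokens if word_type(t) not in ignore)

sample = "testing 1 2 3 this should be five"
print(count_words(sample, ignore=()))  # counts every token: 8
print(count_words(sample))             # ignores pure numbers: 5
```

On the MWE sentence from the question, this gives 8 when numbers are counted and 5 when they are ignored, which matches the two totals texcount reports above.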
I'm moving from the comments to an answer, although it's not yet an answer to what is going on.
As mentioned in the comments, the option -alphabets=Latin should have worked without the -unicode option. When I test this on Windows 10, it works as it should.
The only effect of the -unicode option is to ensure the file gets decoded as UTF-8 instead of as ASCII, which would be the default since the input is pure ASCII. This could influence how the string is represented internally by Perl, but unless you're running an old Perl version, the internal representation should be UTF-8.
Could you check which Perl version you're running? I'd expect perl --version to report that you're running some Perl 5 version.
I did some tests with small scripts like this:
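use strict;
use warnings;
use Encode;
use Devel::Peek;
# Decode a pure-ASCII string using the ascii encoding
my $enc = find_encoding('ascii');
my $x = $enc->decode('test123');
Dump($x);  # show Perl's internal representation, including the UTF8 flag
# Wrap every character with the Latin script property in brackets
$x =~ s/(\p{Latin})/[$1]/g;
print $x;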
However, no matter what I did, I couldn't get it to output anything other than [t][e][s][t]123, indicating that it had correctly identified the Latin letters.
I even tried my $x=$enc->encode('test123'); which forces a byte representation of $x (without the UTF8 flag set), but it still gave the same result. I thought the Unicode character classes, e.g. Latin, might not work if the string is not in UTF8 representation, but that didn't seem to be a problem; maybe it would be on an older or different Perl version.
I've been running TeXcount 3.1 on Windows 10 using Perl 5 (v5.24.0), but checked TeXcount 3.0 as well to make sure there were no relevant changes.
Answered by Einar Rødland on September 6, 2021