Stack Overflow Asked by Eamon Ryan on December 5, 2021
In my class, we were given this problem. I have no clue how to solve it.
"The program below counts the number of characters in a file, assuming the file is encoded as ASCII. Modify the program so that it counts the number of characters in a file encoded as UTF-8."
#include <stdbool.h>
#include <stdio.h>

typedef unsigned char BYTE;

int main(int argc, char *argv[])
{
    if (argc != 2)
    {
        printf("Usage: ./count INPUT\n");
        return 1;
    }

    FILE *file = fopen(argv[1], "r");
    if (!file)
    {
        printf("Could not open file.\n");
        return 1;
    }

    int count = 0;
    while (true)
    {
        BYTE b;
        fread(&b, 1, 1, file);
        if (feof(file))
        {
            break;
        }
        count++;
    }
    printf("Number of characters: %i\n", count);
}
Can anyone help me solve this?
UTF-8 is designed such that this is trivial. There's a property that's common to all continuation bytes (the bytes you want to ignore), and only found in continuation bytes. What is it?
First     Last      Number of
Code      Code      bytes in   Byte 1    Byte 2    Byte 3    Byte 4
Point     Point     encoding
--------  --------  ---------  --------  --------  --------  --------
U+000000  U+00007F  1          0xxxxxxx
U+000080  U+0007FF  2          110xxxxx  10xxxxxx
U+000800  U+00FFFF  3          1110xxxx  10xxxxxx  10xxxxxx
U+010000  U+10FFFF  4          11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
Then, it's simply a question of doing some bit arithmetic. Bitwise-AND can be used to isolate the bits you want to check. C has an operator for that.
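One way to read the hint above, as a minimal sketch rather than the answerer's own code: every continuation byte in the table starts with the bits 10, so masking a byte with 0xC0 (binary 11000000) and comparing the result against 0x80 (binary 10000000) identifies it. The names is_continuation and count_code_points are hypothetical, and the sketch assumes the input is valid UTF-8 and that "characters" means code points.

```c
#include <stddef.h>

/* Returns nonzero if b has the form 10xxxxxx, i.e. is a UTF-8
   continuation byte. 0xC0 keeps only the top two bits; for a
   continuation byte those bits are exactly 10, which is 0x80. */
int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: counts code points in a buffer by counting
   every byte that is NOT a continuation byte. Each code point has
   exactly one such byte (its first one). Assumes valid UTF-8. */
size_t count_code_points(const unsigned char *buf, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++)
    {
        if (!is_continuation(buf[i]))
        {
            count++;
        }
    }
    return count;
}
```

For the ten bytes of "héllo €" (68 C3 A9 6C 6C 6F 20 E2 82 AC in hex, where é takes two bytes and € takes three), count_code_points returns 7. The same test could be applied to each byte b read in the question's loop, so that count is incremented only for bytes that are not continuation bytes.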
Answered by ikegami on December 5, 2021