Data Science Asked by lupi5 on December 16, 2020
I am currently working on a dataset which contains a name attribute, which stands for a person’s first name. After reading the CSV file with read.csv, the variable is a factor by default (stringsAsFactors=TRUE) with ~10k levels. Since name does not reflect any group membership, I am unsure whether to leave it as a factor.
Is it necessary to convert name to character? Are there some advantages in doing (or not doing) this? Does it even matter?
Factors are stored as numbers and a table of levels. If you have categorical data, storing it as a factor may save lots of memory.
For example, if you have a vector of length 1,000 stored as character and the strings are all 100 characters long, the string data takes about 100,000 bytes if every string is distinct (R keeps identical strings only once). If you store it as a factor, the integer codes take about 4,000 bytes (4 bytes each), plus the storage for the distinct levels.
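The storage layout described above can be inspected directly. The following is a minimal sketch on made-up data (the vector contents and the names x_chr and x_fct are illustrative assumptions, not from the question); exact sizes depend on the R version and platform, so it is best to measure rather than rely on a rule of thumb.

```r
# Hypothetical data: 1,000 values drawn from 10 distinct 100-character strings
set.seed(1)
distinct_strings <- strrep(letters[1:10], 100)
x_chr <- sample(distinct_strings, 1000, replace = TRUE)
x_fct <- factor(x_chr)

# A factor is an integer vector of codes plus a "levels" attribute
typeof(x_fct)             # "integer"
head(unclass(x_fct))      # the integer codes (with the levels attached)
nlevels(x_fct)            # number of distinct values

# Compare the reported memory footprint of both representations
print(object.size(x_chr), units = "Kb")
print(object.size(x_fct), units = "Kb")
```

With only a handful of distinct values, most of the difference comes from the per-element cost: an integer code for the factor versus a pointer for the character vector.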
Comparisons with factors should be quicker too because equality is tested by comparing the numbers, not the character values.
The advantage of keeping it as character comes when you want to add new items: with a factor, a value that is not already among the levels cannot simply be assigned, so you first have to change the levels.
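To illustrate the point about new items, here is a small sketch with made-up names (the vectors f, f2, and s are hypothetical). Assigning an unknown value to a factor yields NA with a warning, while a character vector accepts it directly.

```r
# Hypothetical first names stored as a factor
f <- factor(c("Anna", "Ben", "Anna"))

# Assigning a value that is not a level produces NA (and a warning,
# suppressed here so the script runs quietly)
f2 <- f
suppressWarnings(f2[2] <- "Clara")
is.na(f2[2])              # TRUE

# The levels must be extended before the new value can be stored
levels(f) <- c(levels(f), "Clara")
f[2] <- "Clara"
as.character(f)           # "Anna" "Clara" "Anna"

# A character vector needs no such step
s <- c("Anna", "Ben", "Anna")
s[2] <- "Clara"
```

Note that "Ben" remains a level of f even though no element uses it any more; droplevels(f) would remove it.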
Store them as whatever makes the most sense for what the data represent. If name is not categorical, and it sounds like it isn't, then use character.
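If character is the better fit, there are two common ways to get there. The sketch below uses an inline string as a stand-in for the real file (the CSV content and the names df1 and df2 are assumptions for illustration). Note that from R 4.0.0 onward, stringsAsFactors already defaults to FALSE.

```r
# Hypothetical CSV content standing in for the real file
csv_text <- "id,name\n1,Anna\n2,Ben\n3,Anna\n"

# Option 1: read strings as character from the start
df1 <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Option 2: convert a column that was already read as a factor
df2 <- read.csv(text = csv_text, stringsAsFactors = TRUE)
df2$name <- as.character(df2$name)

class(df1$name)           # "character"
class(df2$name)           # "character"
```

Either way the column ends up identical; the first option just avoids building the ~10k levels in the first place.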
Correct answer by Spacedman on December 16, 2020
A few thoughts on the question above:
Happy coding!
Answered by Anna-Marie Tomm on December 16, 2020