Stack Overflow Asked by Austin Gilbert on January 17, 2021
I have the following dataframe:
user_id <- c(97, 97, 97, 97, 96, 95, 95, 94, 94)
event_id <- c(42, 15, 43, 12, 44, 32, 38, 10, 11)
plan_id <- c(NA, 38, NA, NA, 30, NA, NA, 30, 25)
treatment_id <- c(NA, 20, NA, NA, NA, 28, 41, 17, 32)
system <- c(1, 1, 1, 1, NA, 2, 2, NA, NA)
df <- data.frame(user_id, event_id, plan_id, treatment_id, system)
I would like to count the number of distinct user_id values for each column, excluding NA values. The output I am hoping for is:
user_id event_id plan_id treatment_id system
1 4 4 3 3 2
I tried to leverage mutate_all, but that was unsuccessful because my data frame is too large. In other functions, I've used the following two lines of code to get the non-null count and the distinct count for each column:
colSums(!is.na(df))
apply(df, 2, function(x) length(unique(x)))
Optimally, I would like to combine the two with an ifelse to minimize the mutations, as this will ultimately be folded into a function that is applied, along with a number of other summary statistics, to a list of data frames.
I have tried a brute-force method, where I make the values 1 if not NA and 0 otherwise, and then copy the user_id into that column wherever there is a 1. I can then use the count-distinct line from above to get my output. However, I get the wrong values when copying the id into the other columns, and the number of adjustments is suboptimal. See code:
binary <- cbind(df$user_id, !is.na(df[, 2:length(df)]))
copied <- binary %>% replace(. > 0, binary[.,1])
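For reference, a sketch of how that copy step could be written so the counts come out right, using ifelse() instead of replace() (this relies on ifelse() recycling the length-9 user_id vector column-wise over the 9x5 NA mask):

```r
user_id <- c(97, 97, 97, 97, 96, 95, 95, 94, 94)
event_id <- c(42, 15, 43, 12, 44, 32, 38, 10, 11)
plan_id <- c(NA, 38, NA, NA, 30, NA, NA, 30, 25)
treatment_id <- c(NA, 20, NA, NA, NA, 28, 41, 17, 32)
system <- c(1, 1, 1, 1, NA, 2, 2, NA, NA)
df <- data.frame(user_id, event_id, plan_id, treatment_id, system)

# Copy user_id into every non-NA cell, NA elsewhere; ifelse()
# recycles df$user_id down each column of the logical mask.
copied <- ifelse(!is.na(df), df$user_id, NA)

# Count the distinct non-NA values in each column
apply(copied, 2, function(x) length(unique(x[!is.na(x)])))
#     user_id     event_id      plan_id treatment_id       system
#           4            4            3            3            2
```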
I’d greatly appreciate your help.
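Since the stated goal is to fold this into a function applied, alongside other summary statistics, to a list of data frames, a minimal sketch of that wrapper might look like this (the function name, the list names, and the choice of second statistic are hypothetical, for illustration only):

```r
df <- data.frame(
  user_id      = c(97, 97, 97, 97, 96, 95, 95, 94, 94),
  event_id     = c(42, 15, 43, 12, 44, 32, 38, 10, 11),
  plan_id      = c(NA, 38, NA, NA, 30, NA, NA, 30, 25),
  treatment_id = c(NA, 20, NA, NA, NA, 28, 41, 17, 32),
  system       = c(1, 1, 1, 1, NA, 2, 2, NA, NA)
)

# Hypothetical wrapper: per-column distinct user_id counts (NA excluded),
# bundled with a non-missing count, for one data frame.
summarize_df <- function(d) {
  list(
    distinct_users = sapply(d, function(x) length(unique(d$user_id[!is.na(x)]))),
    nonmissing     = colSums(!is.na(d))
  )
}

# lapply() runs the same summary over every data frame in the list
results <- lapply(list(jan = df, feb = df), summarize_df)
results$jan$distinct_users
```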
1: Base
sapply(df, function(x){
length(unique(df$user_id[!is.na(x)]))
})
# user_id event_id plan_id treatment_id system
# 4 4 3 3 2
2: Base
aggregate(user_id ~ ind, unique(na.omit(cbind(stack(df), df[1]))[-1]), length)
# ind user_id
#1 user_id 4
#2 event_id 4
#3 plan_id 3
#4 treatment_id 3
#5 system 2
3: tidyverse
df %>%
mutate(key = user_id) %>%
pivot_longer(!key) %>%
filter(!is.na(value)) %>%
group_by(name) %>%
summarise(value = n_distinct(key)) %>%
pivot_wider()
# A tibble: 1 x 5
# event_id plan_id system treatment_id user_id
# <int> <int> <int> <int> <int>
#1 4 3 2 3 4
Correct answer by d.b on January 17, 2021
Using dplyr, you can use summarise with across:
library(dplyr)
df %>% summarise(across(.fns = ~n_distinct(user_id[!is.na(.x)])))
# user_id event_id plan_id treatment_id system
#1 4 4 3 3 2
Answered by Ronak Shah on January 17, 2021
A data.table option with uniqueN:
> setDT(df)[, lapply(.SD, function(x) uniqueN(user_id[!is.na(x)]))]
user_id event_id plan_id treatment_id system
1: 4 4 3 3 2
Answered by ThomasIsCoding on January 17, 2021
Thanks @dcarlson, I had misunderstood the question:
apply(df, 2, function(x){length(unique(df[!is.na(x), 1]))})
Answered by hello_friend on January 17, 2021