Stack Overflow Asked by Michael Putman on November 30, 2020
I have a dataframe that is broken down by the gender of an author, their role in a project, and an identifier (PMID) (see below).
What I need is to create a 2×2 contingency table so I can calculate an odds ratio for the association between being a female first author and having a female senior author. To get that, I need to calculate the following:
I have grouped the table (see below) by PMID, so I really just need to figure out how to count each instance of the above. Having a hard time, would greatly appreciate any help!
# A tibble: 178,056 x 3
# Groups: pmid [101,907]
gender authorship pmid
<chr> <chr> <chr>
1 male First 18958667
2 male Senior 18958667
3 male First 18958651
4 male First 18751818
5 male Senior 18751818
6 male First 18751811
7 male Senior 18751811
8 female First 18751810
9 female Senior 18751810
10 male First 18088800
11 male Senior 18088800
12 male First 17710072
13 female First 17977216
14 male Senior 17762065
15 male First 17611457
16 male First 17611433
17 male First 17532688
18 male Senior 17532688
19 female First 17405310
20 male Senior 17386862
21 female First 17319096
22 male Senior 17319096
23 female First 17300028
24 male First 17282480
25 female First 17177771
26 male First 17124681
27 female First 17093906
28 female First 17042011
29 male Senior 17042011
30 female First 17042010
31 male Senior 17042010
32 female First 17042006
33 male Senior 17042006
34 female First 17042003
35 female First 17042002
36 male Senior 17042002
37 male First 17042001
38 female First 17041999
39 male Senior 17041997
40 female First 17041995
41 female First 17041994
42 female First 17041993
43 female Senior 17041993
44 female First 17041992
45 female Senior 17041992
46 female First 17041991
47 male First 17041990
48 male Senior 17041990
49 male First 17041989
50 male Senior 17041989
Tidy solution with golf-coding:
library(tidyr)
contig_table <- mydf %>%
spread(authorship, gender) %>%
#Only need drop_na() if data is incomplete
#Given that you have a lot more rows I assume this will not be needed
drop_na() %$%
table(First, Senior)
Correct answer by Baraliuh on November 30, 2020
Here's a tidy solution by pivot_wider
library(dplyr)
library(tidyr)
newdf <-
mydf %>%
group_by(pmid) %>%
pivot_wider(names_from = authorship,
values_from = gender)
table(newdf$First, newdf$Senior)
#>
#> female male
#> female 3 5
#> male 0 7
chisq.test(table(newdf$First, newdf$Senior))
#> Warning in chisq.test(table(newdf$First, newdf$Senior)): Chi-squared
#> approximation may be incorrect
#>
#> Pearson's Chi-squared test with Yates' continuity correction
#>
#> data: table(newdf$First, newdf$Senior)
#> X-squared = 1.356, df = 1, p-value = 0.2442
newdf %>%
filter(!is.na(First) & !is.na(Senior))
#> # A tibble: 15 x 3
#> # Groups: pmid [15]
#> pmid First Senior
#> <chr> <chr> <chr>
#> 1 18958667 male male
#> 2 18751818 male male
#> 3 18751811 male male
#> 4 18751810 female female
#> 5 18088800 male male
#> 6 17532688 male male
#> 7 17319096 female male
#> 8 17042011 female male
#> 9 17042010 female male
#> 10 17042006 female male
#> 11 17042002 female male
#> 12 17041993 female female
#> 13 17041992 female female
#> 14 17041990 male male
#> 15 17041989 male male
table(newdf$First, newdf$Senior)
#>
#> female male
#> female 3 5
#> male 0 7
chisq.test(table(newdf$First, newdf$Senior))
#> Warning in chisq.test(table(newdf$First, newdf$Senior)): Chi-squared
#> approximation may be incorrect
#>
#> Pearson's Chi-squared test with Yates' continuity correction
#>
#> data: table(newdf$First, newdf$Senior)
#> X-squared = 1.356, df = 1, p-value = 0.2442
Your data
mydf <- tibble(
gender = c("male", "male", "male", "male", "male",
"male", "male", "female", "female", "male", "male", "male", "female",
"male", "male", "male", "male", "male", "female", "male", "female",
"male", "female", "male", "female", "male", "female", "female",
"male", "female", "male", "female", "male", "female", "female",
"male", "male", "female", "male", "female", "female", "female",
"female", "female", "female", "female", "male", "male", "male",
"male"),
authorship = c("First", "Senior", "First", "First",
"Senior", "First", "Senior", "First", "Senior", "First", "Senior",
"First", "First", "Senior", "First", "First", "First", "Senior",
"First", "Senior", "First", "Senior", "First", "First", "First",
"First", "First", "First", "Senior", "First", "Senior", "First",
"Senior", "First", "First", "Senior", "First", "First", "Senior",
"First", "First", "First", "Senior", "First", "Senior", "First",
"First", "Senior", "First", "Senior"),
pmid = c("18958667", "18958667",
"18958651", "18751818", "18751818", "18751811", "18751811", "18751810",
"18751810", "18088800", "18088800", "17710072", "17977216", "17762065",
"17611457", "17611433", "17532688", "17532688", "17405310", "17386862",
"17319096", "17319096", "17300028", "17282480", "17177771", "17124681",
"17093906", "17042011", "17042011", "17042010", "17042010", "17042006",
"17042006", "17042003", "17042002", "17042002", "17042001", "17041999",
"17041997", "17041995", "17041994", "17041993", "17041993", "17041992",
"17041992", "17041991", "17041990", "17041990", "17041989", "17041989")
)
Answered by Chuck P on November 30, 2020
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP