Bioinformatics Asked on June 3, 2021
I have a data frame in which one column holds sample names, like this:
Sample
TCGA-VR-A8EX-01A-11D-A36I-01
TCGA-L7-A56G-01A-21D-A25X-01
TCGA-VR-A8ER-01A-11D-A36I-01
TCGA-JY-A6FB-01A-11D-A33D-01
TCGA-2H-A9GO-01A-11D-A37B-01
TCGA-Z6-AAPN-01A-11D-A402-01
TCGA-L5-A4OP-01A-11D-A25X-01
TCGA-L5-A8NV-01A-11D-A37B-01
TCGA-R6-A8W5-01B-11D-A37B-01
TCGA-S8-A6BW-01A-11D-A31T-01
TCGA-R6-A6Y0-01B-11D-A33D-01
TCGA-R6-A8WC-01A-11D-A37B-01
TCGA-S8-A6BV-01A-21D-A31T-01
TCGA-L5-A4OJ-01A-11D-A25X-01
TCGA-LN-A7HV-01A-21D-A350-01
I want to subset the samples whose names partially match any of these IDs:
TCGA-IC-A6RE
TCGA-IG-A4QS
TCGA-JY-A6F8
TCGA-JY-A6FB
TCGA-L5-A43E
TCGA-L5-A4OG
TCGA-L5-A4OH
TCGA-L5-A4OJ
I have used
data[grepl(c("TCGA-IC-A6RE", "TCGA-IG-A4QS", "TCGA-JY-A6F8", "TCGA-JY-A6FB", "TCGA-L5-A43E", "TCGA-L5-A4OG", "TCGA-L5-A4OH", "TCGA-L5-A4OJ", "TCGA-L5-A4ON", "TCGA-L5-A4OS", "TCGA-L5-A4OW", "TCGA-L5-A4OX", "TCGA-L5-A88T", "TCGA-Q9-A6FW"),data$Sample),]
But R says:
Warning message:
In grepl(c("TCGA-IC-A6RE", "TCGA-IG-A4QS", "TCGA-JY-A6F8", "TCGA-JY-A6FB", :
argument 'pattern' has length > 1 and only the first element will be used
So apparently only the first entry of the vector is being matched.
Edit: As @StupidWolf suggested, I have tried:
> data[substr(data$Sample, 1, 12) %in%
c("TCGA-IC-A6RE", "TCGA-IG-A4QS", "TCGA-JY-A6F8", "TCGA-JY-A6FB",
"TCGA-L5-A43E", "TCGA-L5-A4OG", "TCGA-L5-A4OH", "TCGA-L5-A4OJ",
"TCGA-L5-A4ON", "TCGA-L5-A4OS", "TCGA-L5-A4OW", "TCGA-L5-A4OX",
"TCGA-L5-A88T", "TCGA-Q9-A6FW"),]
This works, thanks.
Answer from @stupidwolf, converted from comment:
You cannot pass a vector to grepl as the pattern; it must be a single pattern.
I can suggest two things that you can hopefully put together. To check whether something is found in a vector, you can use data$Sample %in% c("TCGA-IC-A6RE", "TCGA-IG-A4QS"), etc.
Of course, the full data$Sample values will not be in there, so you need a bit of manipulation. Since your samples follow a fixed pattern, you can use substr(data$Sample,....). If you do substr(data$Sample,....) %in% c(.....), this will give you the selection.
Correct answer by gringer on June 3, 2021
Not strictly bioinformatics, but a common problem: grep expects a single regular expression, not a vector of patterns.
Assuming your patterns are stored as a vector, e.g.
patt <- c("pattern1", "pattern2", "pattern3")
you first have to combine them into a single regex:
patt.regex <- paste(patt, collapse = "|")
The value of patt.regex is then "pattern1|pattern2|pattern3", which is now ready for grep or grepl:
grep(patt.regex, data$Sample)
For exact matches it would be paste(paste0("^", patt, "$"), collapse = "|") to get "^pattern1$|^pattern2$|^pattern3$".
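The same idea applied to a few of the sample IDs from the question might look like the sketch below. The data frame and the ids vector are small stand-ins built from names shown in the question, and the leading "^" is an assumption on my part: it anchors each ID to the start of the sample name, which seems to be what is wanted here, but it is not part of the answer above.

```r
# Toy data built from a few sample names in the question
data <- data.frame(
  Sample = c("TCGA-VR-A8EX-01A-11D-A36I-01",
             "TCGA-JY-A6FB-01A-11D-A33D-01",
             "TCGA-L5-A4OP-01A-11D-A25X-01",
             "TCGA-L5-A4OJ-01A-11D-A25X-01"),
  stringsAsFactors = FALSE
)

# Patient-level IDs to look for (a subset of the list in the question)
ids <- c("TCGA-JY-A6F8", "TCGA-JY-A6FB", "TCGA-L5-A4OJ")

# Build one regex: each ID anchored to the start, joined with "|"
ids.regex <- paste0("^", ids, collapse = "|")

# grepl() returns one TRUE/FALSE per row, which is used to subset the rows
hits <- data[grepl(ids.regex, data$Sample), , drop = FALSE]
print(hits)
```

These IDs contain only letters, digits and hyphens, so nothing needs escaping. If the patterns could contain regex metacharacters such as "." or "(", they would have to be escaped first, or a non-regex approach used instead.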
Answered by ATpoint on June 3, 2021
When the strings follow a fixed format, there is no need for partial matching; see this example:
# example data
data <- read.table(text = "
Sample
TCGA-VR-A8EX-01A-11D-A36I-01
TCGA-L7-A56G-01A-21D-A25X-01
TCGA-VR-A8ER-01A-11D-A36I-01
TCGA-JY-A6FB-01A-11D-A33D-01
", header = TRUE, stringsAsFactors = FALSE)
x <- c("TCGA-VR-A8EX", "TCGA-VR-A8ER")
data[ substr(data$Sample, 1, 12) %in% x, "Sample", drop = FALSE]
# Sample
# 1 TCGA-VR-A8EX-01A-11D-A36I-01
# 3 TCGA-VR-A8ER-01A-11D-A36I-01
Or, when the sample IDs are not all the same length, we can split on "-", paste the first 3 items back together, then subset as usual with %in%:
data[ sapply(strsplit(data$Sample, split = "-"),
function(i) paste(i[1:3], collapse = "-")
) %in% x, "Sample", drop = FALSE]
# Sample
# 1 TCGA-VR-A8EX-01A-11D-A36I-01
# 3 TCGA-VR-A8ER-01A-11D-A36I-01
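To see why the split-based version matters, here is a small sketch comparing the two approaches on IDs of differing length. The sample names below are hypothetical (they are not real TCGA barcodes), and vapply is used in place of sapply only so that the result is guaranteed to be a character vector.

```r
# Hypothetical sample names whose first three fields differ in length
samples <- c("AB-1-X9-01A", "ABCD-22-X10-01A", "AB-1-X8-01A")
x <- c("AB-1-X9", "ABCD-22-X10")

# Fixed-width prefix: the first 12 characters are not the ID here,
# so nothing matches
by.width <- substr(samples, 1, 12) %in% x

# Field-based prefix: keep the first three "-"-separated fields
first3 <- vapply(strsplit(samples, split = "-", fixed = TRUE),
                 function(i) paste(i[1:3], collapse = "-"),
                 character(1))
by.field <- first3 %in% x

print(by.width)  # FALSE FALSE FALSE
print(by.field)  # TRUE TRUE FALSE
```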
Answered by zx8754 on June 3, 2021