Selecting rows by partial match

Question

I have a data frame in which one column holds sample names, like this:
Sample

TCGA-VR-A8EX-01A-11D-A36I-01
TCGA-L7-A56G-01A-21D-A25X-01
TCGA-VR-A8ER-01A-11D-A36I-01
TCGA-JY-A6FB-01A-11D-A33D-01
TCGA-2H-A9GO-01A-11D-A37B-01
TCGA-Z6-AAPN-01A-11D-A402-01
TCGA-L5-A4OP-01A-11D-A25X-01
TCGA-L5-A8NV-01A-11D-A37B-01
TCGA-R6-A8W5-01B-11D-A37B-01
TCGA-S8-A6BW-01A-11D-A31T-01
TCGA-R6-A6Y0-01B-11D-A33D-01
TCGA-R6-A8WC-01A-11D-A37B-01
TCGA-S8-A6BV-01A-21D-A31T-01
TCGA-L5-A4OJ-01A-11D-A25X-01
TCGA-LN-A7HV-01A-21D-A350-01

I want to subset the samples that partially match these names:
TCGA-IC-A6RE
TCGA-IG-A4QS
TCGA-JY-A6F8
TCGA-JY-A6FB
TCGA-L5-A43E
TCGA-L5-A4OG
TCGA-L5-A4OH
TCGA-L5-A4OJ

I have used
data[grepl(c("TCGA-IC-A6RE", "TCGA-IG-A4QS", "TCGA-JY-A6F8", "TCGA-JY-A6FB", "TCGA-L5-A43E", "TCGA-L5-A4OG", "TCGA-L5-A4OH", "TCGA-L5-A4OJ", "TCGA-L5-A4ON", "TCGA-L5-A4OS", "TCGA-L5-A4OW", "TCGA-L5-A4OX", "TCGA-L5-A88T", "TCGA-Q9-A6FW"),data$Sample),]

But R gives this warning:
Warning message:
In grepl(c("TCGA-IC-A6RE", "TCGA-IG-A4QS", "TCGA-JY-A6F8", "TCGA-JY-A6FB",  :
  argument 'pattern' has length > 1 and only the first element will be used

It looks like only the first entry of the vector is being used as the pattern.

Edit: As @StupidWolf suggested, I have tried:
> data[substr(data$Sample, 1, 12) %in% 
    c("TCGA-IC-A6RE", "TCGA-IG-A4QS", "TCGA-JY-A6F8", "TCGA-JY-A6FB",
      "TCGA-L5-A43E", "TCGA-L5-A4OG", "TCGA-L5-A4OH", "TCGA-L5-A4OJ",
      "TCGA-L5-A4ON", "TCGA-L5-A4OS", "TCGA-L5-A4OW", "TCGA-L5-A4OX",
      "TCGA-L5-A88T", "TCGA-Q9-A6FW"),]

This works, thanks.

gringer · Accepted Answer

Answer from @stupidwolf, converted from comment:
You cannot pass a vector of patterns to grepl; the pattern argument must be a single pattern.
I can suggest two things that you can hopefully put together. To check whether something is found in a vector, you can use data$Sample %in% c("TCGA-IC-A6RE", "TCGA-IG-A4QS"), etc.
Of course, the full data$Sample values will not be in that vector, so you need a bit of manipulation. Since your samples follow a fixed pattern, you can use substr(data$Sample,....). If you do substr(data$Sample,....) %in% c(.....), this will give you the selection.
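
A minimal sketch of how those two pieces fit together, using a few sample names from the question. The width of 12 characters is an assumption that holds for the IDs shown here (for example TCGA-JY-A6FB); ids, prefix, full_match and keep are illustrative variable names, not part of the original comment.

```r
data <- data.frame(
  Sample = c("TCGA-VR-A8EX-01A-11D-A36I-01",
             "TCGA-JY-A6FB-01A-11D-A33D-01",
             "TCGA-L5-A4OJ-01A-11D-A25X-01"),
  stringsAsFactors = FALSE
)
ids <- c("TCGA-JY-A6FB", "TCGA-L5-A4OJ")

# The full sample names never equal the short IDs, so this is all FALSE
full_match <- data$Sample %in% ids

# Assumption: every ID is exactly 12 characters long,
# so compare only the first 12 characters of each sample name
prefix <- substr(data$Sample, 1, 12)
keep <- prefix %in% ids

# Keep the matching rows; drop = FALSE preserves the data frame
data[keep, , drop = FALSE]
```

If the IDs were not all the same length, a fixed-width substr would no longer line up and a different approach would be needed.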

ATpoint · Answer

Not strictly bioinformatics, but a common problem:

grep expects a single regular expression, not a vector of patterns.
Assuming your patterns are stored as a vector, e.g.

patt <- c("pattern1", "pattern2", "pattern3")

you first have to write them as a regex:

patt.regex <- paste(patt, collapse = "|")

The value of patt.regex is then "pattern1|pattern2|pattern3", which is now ready for grep or grepl:

grep(patt.regex, data$Sample)

For exact matches it would be paste(paste0("^", patt, "$"), collapse = "|")
to get "^pattern1$|^pattern2$|^pattern3$".
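
Applied to the sample names from the question, the regex approach above might look like the sketch below. Anchoring the alternatives with "^" is an added assumption here: it restricts matches to the start of the sample name, which fits these IDs. The IDs contain only letters, digits and hyphens, so no escaping is needed; patterns containing regex metacharacters would have to be escaped first.

```r
# A few sample names from the question
data <- data.frame(
  Sample = c("TCGA-VR-A8EX-01A-11D-A36I-01",
             "TCGA-JY-A6FB-01A-11D-A33D-01",
             "TCGA-L5-A4OP-01A-11D-A25X-01",
             "TCGA-L5-A4OJ-01A-11D-A25X-01"),
  stringsAsFactors = FALSE
)

patt <- c("TCGA-JY-A6F8", "TCGA-JY-A6FB", "TCGA-L5-A4OJ")

# Join the IDs with "|" and anchor the whole group at the start of the string
patt.regex <- paste0("^(", paste(patt, collapse = "|"), ")")

# grepl returns one TRUE/FALSE per row, which can index the data frame directly
hits <- data[grepl(patt.regex, data$Sample), , drop = FALSE]
print(hits)
```

Using grepl rather than grep gives a logical vector, so an empty result yields zero rows instead of an indexing surprise.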

zx8754 · Answer

When the strings follow a fixed format, there is no need for partial matching; see this example:

# example data
data <- read.table(text = "
Sample
TCGA-VR-A8EX-01A-11D-A36I-01
TCGA-L7-A56G-01A-21D-A25X-01
TCGA-VR-A8ER-01A-11D-A36I-01
TCGA-JY-A6FB-01A-11D-A33D-01
", header = TRUE, stringsAsFactors = FALSE)

x <- c("TCGA-VR-A8EX", "TCGA-VR-A8ER")

data[ substr(data$Sample, 1, 12) %in% x, "Sample", drop = FALSE]
#                         Sample
# 1 TCGA-VR-A8EX-01A-11D-A36I-01
# 3 TCGA-VR-A8ER-01A-11D-A36I-01

Or, when the sample IDs are not all the same length, we can split on "-", paste the first 3 items back together, and then subset as usual with %in%:

data[ sapply(strsplit(data$Sample, split = "-"), 
             function(i) paste(i[1:3], collapse = "-")
             ) %in% x, "Sample", drop = FALSE]
#                         Sample
# 1 TCGA-VR-A8EX-01A-11D-A36I-01
# 3 TCGA-VR-A8ER-01A-11D-A36I-01
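
As a further variation that is not part of the answer above, base R's startsWith (available since R 3.3.0) can test each prefix directly, which also copes with prefixes of different lengths without any regex. This is only a sketch; keep is an illustrative variable name.

```r
data <- data.frame(
  Sample = c("TCGA-VR-A8EX-01A-11D-A36I-01",
             "TCGA-L7-A56G-01A-21D-A25X-01",
             "TCGA-VR-A8ER-01A-11D-A36I-01",
             "TCGA-JY-A6FB-01A-11D-A33D-01"),
  stringsAsFactors = FALSE
)
x <- c("TCGA-VR-A8EX", "TCGA-VR-A8ER")

# Build one logical vector per prefix, then combine them with element-wise OR
keep <- Reduce(`|`, lapply(x, function(p) startsWith(data$Sample, p)))

data[keep, "Sample", drop = FALSE]
```

For a long list of prefixes this does one pass over the column per prefix, so the substr plus %in% approach is likely the better fit when all IDs share a fixed width.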
