Bioinformatics Asked on August 19, 2021
My data contain several chip-seq results. I have the peaks called by MACS2.I wanted to only look at those peaks that their size is e.g 500bp to 1000bp. How can I separate those peaks efficiently?
I know one way is to take the peaks.xls file sort them based on the length column and take out the peaks less than 1kb. But I was looking for a more efficient way to do that. Any suggestion?
Many Thanks
You can do it by awk.
awk -v OFS="t" '{ if ( ($3-$2) >= 500 && ($3-$2) <= 1000) { print } }' in.bed > out.bed
Assuming there is no header.
Correct answer by geek_y on August 19, 2021
You can use rtracklayer
from R bioconductor
to import it, filter and do much more stuff, i use the code from this blog for importing the narrow bed file. Below I use an example file since you did not provide yours:
library(rtracklayer)
fl = "zas1_default_peaks.narrowPeak"
extraCols_narrowPeak <- c(signalValue = "numeric", pValue = "numeric",
qValue = "numeric", peak = "integer")
gr <- import(fl, format = "BED",
extraCols = extraCols_narrowPeak)
head(gr)
GRanges object with 6 ranges and 6 metadata columns:
seqnames ranges strand | name score signalValue
<Rle> <IRanges> <Rle> | <character> <numeric> <numeric>
[1] AB325691 8982-9219 * | zas1_default_peak_1 18 1.31697
[2] AB325691 16685-18050 * | zas1_default_peak_2 59 1.51283
[3] I 101971-102188 * | zas1_default_peak_3 17 1.23717
[4] I 102390-103538 * | zas1_default_peak_4 19 1.24967
[5] I 152286-152496 * | zas1_default_peak_5 23 1.37631
[6] I 175025-175577 * | zas1_default_peak_6 69 1.62063
pValue qValue peak
<numeric> <numeric> <integer>
[1] 3.37358 1.89101 146
[2] 8.12841 5.92232 419
[3] 3.26012 1.79545 64
[4] 3.47875 1.97673 708
[5] 3.89921 2.32948 136
[6] 9.34429 6.9608 374
The narrowBed and peaks.xls contains the same information. If you really really need to read in the xls, you can do:
library(GenomicRanges)
peaks = read.table("zas1_default_peaks.xls",comment="#",header=TRUE)
gr = makeGRangesFromDataFrame(peaks,keep.extra.columns=TRUE)
Once you have the genomic ranges object, filtering is the same:
gr[width(gr)>500 & width(gr)<1000]
GRanges object with 208 ranges and 7 metadata columns:
seqnames ranges strand | length abs_summit pileup
<Rle> <IRanges> <Rle> | <integer> <integer> <numeric>
[1] I 175025-175577 * | 553 175399 186.99
[2] I 186807-187489 * | 683 187009 208.66
[3] I 339070-339700 * | 631 339258 222.91
[4] I 431557-432088 * | 532 431734 195.66
Answered by StupidWolf on August 19, 2021
Get help from others!
Recent Answers
Recent Questions
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP