Bioinformatics Asked by krushnach Chandra on August 22, 2021
I want to categorize promoters based upon high, intermediate and low-CpG content (high-CpG-density promoters (HCPs), intermediate-CpG-density promoters(ICPs), and low-CpG-density promoters (LCPs)).
So the data I have is for promoter which i have annotated 1000+/- around TSS and taken them as something less than 1kb is promoter region and beyond distal.
In term of tool i have used chipseeker to annotate and do the above step.
So now if i have to find the CpG density as i have mentioned above how do i get that?
Below I import the peaks call in narrowbed format from macs2 into a GRanges object. If you have a bed file, you can just use rtracklayer for it:
library(rtracklayer)
library(BSgenome.Hsapiens.UCSC.hg38)
peaks_narrowbed = "https://www.encodeproject.org/files/ENCFF846JAO/@@download/ENCFF846JAO.bed.gz"
extraCols <- c(signalValue = "numeric", pValue = "numeric",
qValue = "numeric", peak = "integer")
gr <- import(peaks_narrowbed, format = "BED",extraCols = extraCols)
head(gr)
GRanges object with 6 ranges and 6 metadata columns:
seqnames ranges strand |
<Rle> <IRanges> <Rle> |
[1] chr1 10049-10326 * |
[2] chr1 180686-181040 * |
[3] chr1 186648-186903 * |
[4] chr1 191293-191604 * |
[5] chr1 267872-268146 * |
[6] chr1 586110-586324 * |
name score signalValue
<character> <numeric> <numeric>
[1] atacseq_170613.t8_rep1.dedup.masked.sorted_peak_1 135 7.10907
[2] atacseq_170613.t8_rep1.dedup.masked.sorted_peak_2 309 11.65888
[3] atacseq_170613.t8_rep1.dedup.masked.sorted_peak_3 81 5.40289
[4] atacseq_170613.t8_rep1.dedup.masked.sorted_peak_4 56 4.54981
[5] atacseq_170613.t8_rep1.dedup.masked.sorted_peak_5 145 7.39343
[6] atacseq_170613.t8_rep1.dedup.masked.sorted_peak_6 116 6.54035
Then we extract the sequences that are within these regions, you can of course redefine the region by extending it from the peak:
library(Biostrings)
SEQ = getSeq(Hsapiens,gr)
head(SEQ)
A DNAStringSet instance of length 6
width seq
[1] 278 TAACCCTAACCCTAACCCTAACCCTAACCCTAAC...CCCCAACCCCAACCCCAACCCCAACCCCAACCC
[2] 355 CTAACCCCTAACCCCTAACCCTAACCCTACCCTA...CTCAGCCGGCCCGCCCGCCCGGGTCTGACCTGA
[3] 256 GCAACCACCTGAGCGCGGGCATCCTGTGTGCAGA...CCCCTTCTTTCCATTGGTTTAATTAGGAACGGG
[4] 312 CAATCAGCAGGGACCGTGCACTCTCTTGGAGCCA...CCCTGCCCCTGTCTCTTCCGTGCAGGAGGAGCA
[5] 275 TGGATGTGTGCATTTTCCTGAGAGGAAAGCTTTC...TTGTTTAGTTTGCGTTGTGTTTCTCCAACTTTG
[6] 215 GAGGCTGAAGGAGACTGATGTGGTTTCTCCTCAG...CCAACTTTGTGCCTCATCAGGAAAAGCTTTGGA
This is the CG, and you can decide how you want to bin your regions:
freq = dinucleotideFrequency(SEQ,as.prob=TRUE)
hist(freq[,"CG"],br=100)
Correct answer by StupidWolf on August 22, 2021
Get help from others!
Recent Answers
Recent Questions
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP