Bioinformatics Asked by Igor Filippov on July 9, 2021
I’ve got the reference genome with Python like so:
sequences_by_chr = {}
with gzip.open("data/Zea_mays.B73_RefGen_v4.dna.toplevel.fa.gz", "rt") as f:
for seq in SeqIO.parse(f, "fasta"):
sequences_by_chr[seq.id] = seq.seq
I’ve also parsed a GFF3 annotation and would like to get the gene sequence based on said annotation.
Suppose the following:
chromosome = 1
strand = "-"
start = 45286331
end = 45290076
Since it’s on the "minus" strand, should I do the following:
sequences_by_chr[chromosome].complement()[start:end]
or should I use the reverse_complement()
?
I’m completely confused by this, would appreciate any hint!
You can parse the GFF3 to generate a bed-like file with gene ranges and use that with bedtools getfasta
to obtain a multi-FASTA file with gene sequences. An example case is shown below:
## get fasta and gff3 files
wget ftp://ftp.ensembl.org//pub/current_gff3/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.100.gff3.gz
wget ftp://ftp.ensembl.org:/pub/current_fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz
## unzip fasta
gunzip Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz
## generate genes.bed
zgrep -v '^#' Saccharomyces_cerevisiae.R64-1-1.100.gff3.gz | awk 'BEGIN{FS="t";OFS="t"}($3=="gene"){print $1,$4-1,$5,"name",1000,$7}' > genes.bed
## create gene fasta
bedtools getfasta -fi Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa -bed genes.bed -s -fo genes.fa
```
Answered by vkkodali on July 9, 2021
DNA sequence is always provided on the plus strand, if the gff tells you it is on the minus strand, you need to reverse complement it, because the start of the gene in your example will be 45290076 and the end will now be 45286331.
I can give you an example below, this is a potential coding sequence from yeast, no introns and on the minus strand:
gunzip -c test.fa.gz
>1
TCATCCCCTTATATAAGATGGGAAACGAAGGTAAAAGATTAACGATAGCAT
If we read it in like you did, and if you apply the complement, this is not a coding sequence.
chromosome="1"
start = 1
end =51
sequences_by_chr[chromosome].complement()[(start-1):end]
Seq('AGTAGGGGAATATATTCTACCCTTTGCTTCCATTTTCTAATTGCTATCGTA', SingleLetterAlphabet())
Getting the sequence followed by reverse complement gives you a coding sequence, starting with ATG ending with TGA:
sequences_by_chr["1"][(start-1):end].reverse_complement()
Seq('ATGCTATCGTTAATCTTTTACCTTCGTTTCCCATCTTATATAAGGGGATGA', SingleLetterAlphabet())
Answered by StupidWolf on July 9, 2021
You should do:
if sequences_by_chrom[strand] == "-":
rev_comp( sequences_by_chr[chromosome][start:end] )
else:
sequences_by_chr[chromosome][start:end]
where
def rev_comp(sequence):
d = {'A':'T', 'T':'A','C':'G','G':'C'}
return ''.join([d[i] for i in sequence][::-1])
Answered by aerijman on July 9, 2021
Get help from others!
Recent Answers
Recent Questions
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP