Bioinformatics Asked by Kendal B. on May 29, 2021
Problem Statement
The GFF3 format is a commonly-used one in bioinformatics for representing sequence
annotation. You can find the specification here:
http://www.sequenceontology.org/gff3.shtml
The genome and annotation for Saccharomyces cerevisiae S288C is on the class server here:
/home/jorvis1/Saccharomyces_cerevisiae_S288C.annotation.gff
Note that this same file has both the annotation feature table and the FASTA sequence for the
molecules referenced. (See the ‘##FASTA’ directive in the specification.)
Within the feature table another column of note is the 9 th , where we can store any key=value
pairs relevant to that row’s feature such as ID, Ontology_term or Note.
Your task is to write a GFF3 feature exporter. A user should be able to run your script like this:
$ export_gff3_feature.py --source_gff=/path/to/some.gff3 --type=gene --attribute=ID --value=YAR003W
There are 4 arguments here that correspond to values in the GFF3 columns. In this case, your
script should read the path to a GFF3 file, find any gene (column 3) which has an ID=YAR003W
(column 9). When it finds this, it should use the coordinates for that feature (columns 4, 5 and 7)
and the FASTA sequence at the end of the document to return its FASTA sequence.
Your script should work regardless of the parameter values passed, warning the user if no
features were found that matched their query. (It should also check and warn if more than one
feature matches the query.)
The output should just be printed on STDOUT (no writing to a file is necessary.) It should have a
header which matches their query, like this:
gene:ID:YAR003W
…. sequence here …
As an extra challenge, you can format the sequence portion of the FASTA output as 60-characters
per line, which follows the standard.
So far I have this below. I’m not even sure if this is correct to start with:
#!/usr/bin/env python3
for line in open("Saccharomyces_cerevisiae_S288C.annotation.gff"):
line1 = line.rstrip()
if line.startswith("#"):
continue
column = line1.split("t")
if len(column) != 9:
continue
id = column[8]
type = column[2]
Can anyone help me with this? I’m totally lost. Thank you!
Your answer looks like a good start. You'll probably want to use sys.argv, argparse, or another argument parsing tool to read in options from the command line, then compare them to the fields (e.g. the "type" argument gets compared with the type column).
Given that attributes can be arranged in an arbitrary order, you'll probably need to read all of them from each line and store in a dict in order to determine whether the target is present. A regular expression search is another way to do that.
There are a few libraries available (e.g. from biopython) for reading and subsetting FASTA records.
Answered by gringer on May 29, 2021
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP