Bioinformatics Asked by heibai on January 21, 2021
I am a college student and just started bioinformatics. I am trying to write a script (not for school) that finds all potential NGG and TTTN protospacer adjacent motif sequences in a genome string. I wrote the following script:
TargetFile = open('file.fasta')
readTF = TargetFile.read()
useTF = "".join(readTF[66:].split())
#no_of_basepairs = len(useTF)
sites = {"ngg_strand" : [],
"ngg_antistrand" : [], #location counted from reverse direction
"tttn_strand" : [],
"tttn_antistrand" : []} #location counted from reverse direction
for i in range(0, len(useTF)-2):
gg_mer = useTF[i:i + 2]
ttt_mer = useTF[i:i + 3]
if gg_mer == "GG":
sites["ngg_strand"].append(i)
elif gg_mer == "CC":
#look for reverse complement, record location on antistrand from reverse direction
sites["ngg_antistrand"].append(len(useTF)-i-2)
elif ttt_mer == "TTT":
sites["tttn_strand"].append(i+1)
elif ttt_mer == "AAA":
# look for reverse complement, record location on antistrand from reverse direction
sites["tttn_antistrand"].append(len(useTF)-i-2)
print(len(sites["ngg_strand"]))
print(useTF.count("GG"))
At the bottom I used the .count() method to compare results. I noticed that the method returns a smaller number than my code. I believe this is due to the .count() method not counting overlap, whereas my code does. I.e. in the string "NNNGGGGNNN" my code would count 3 occurences of the "GG" substring, whereas the .count() method returns 2 occurences.
My Questions:
If I want to find all possible PAM sites, which way of counting is correct? I believe my way would be correct, since as far as I know the RNA molecule doesn’t advance in "triplet-steps". However, I am very much new to this and might be missing something.
If the reverse complement is found, is the location usually recorded from the same direction as on the strand, or in the reverse direction (as I implemented in the code)?
Like I said, I am new to this so if you see any other glaring problems I would appreciate any and all help. 🙂
Thank you!
The easiest solution is finditer
use re
# Compile you search term where [A|T|G|C] means A or G or C or T
p = re.compile("[A|T|G|C]GG")
iterator = p.finditer(useTF)
for match in iterator:
print(match.span())
** Example Output**
(0, 2)
(22, 24)
(29, 31)
Easy does it. Its worth looking at the documentation for re
here, https://docs.python.org/3/howto/regex.html for alternative approaches using match
or search
below. finditer
is definitely the solution in context.
I support your coding approach because it provides lots of flexibility. For example, you might want to target GG or TTT around a given locus and you can easily modify the code to accomodate this. Someone elses code would be less flexible.
In summary, I kinda like your substrings and do like the count method. I'm not a fan of your ==
in this instance and your C-style matching should be substituted for reg-ex which in Python is implemented via re
.
The basic issues are,
re
. One solution uses match
or search
, but the solution above uses finditer
, Reg-ex will allow to search for NGG and TTTN, which you are not doing==
would match this because thats 2 and 3 bp respectively. Just thoughts. You will have 'edge effects' at the beginning and end of sequences, e.g. TTT 3' will match but that isn't TTTN.count
method, but I think finditer
is better because it gives you location informationpandas
dataframe This way the results are output as e.g. CSV, spreadsheet flat-files for your colleagues to view.if
and elif
would be placed in a subroutine, this is good practice to convert to an object orientated approach.Correct answer by M__ on January 21, 2021
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP