Find CRISPR PAM-sites with python

Question

I am a college student and just started bioinformatics. I am trying to write a script (not for school) that finds all potential NGG and TTTN protospacer adjacent motif sequences in a genome string. I wrote the following script:
TargetFile = open('file.fasta')
readTF = TargetFile.read()
useTF = "".join(readTF[66:].split())

#no_of_basepairs = len(useTF)

sites = {"ngg_strand" : [],
         "ngg_antistrand" : [],     #location counted from reverse direction
         "tttn_strand" : [],
         "tttn_antistrand" : []}    #location counted from reverse direction

for i in range(0, len(useTF)-2):
    gg_mer = useTF[i:i + 2]
    ttt_mer = useTF[i:i + 3]

if gg_mer == "GG":
        sites["ngg_strand"].append(i)
    elif gg_mer == "CC":
        #look for reverse complement, record location on antistrand from reverse direction
        sites["ngg_antistrand"].append(len(useTF)-i-2)
    elif ttt_mer == "TTT":
        sites["tttn_strand"].append(i+1)
    elif ttt_mer == "AAA":
        # look for reverse complement, record location on antistrand from reverse direction
        sites["tttn_antistrand"].append(len(useTF)-i-2)

print(len(sites["ngg_strand"]))
print(useTF.count("GG"))

At the bottom I used the .count() method to compare results. I noticed that the method returns a smaller number than my code. I believe this is due to the .count() method not counting overlap, whereas my code does. I.e. in the string "NNNGGGGNNN" my code would count 3 occurences of the "GG" substring, whereas the .count() method returns 2 occurences.
My Questions:

If I want to find all possible PAM sites, which way of counting is correct? I believe my way would be correct, since as far as I know the RNA molecule doesn't advance in "triplet-steps". However, I am very much new to this and might be missing something.

If the reverse complement is found, is the location usually recorded from the same direction as on the strand, or in the reverse direction (as I implemented in the code)?

Like I said, I am new to this so if you see any other glaring problems I would appreciate any and all help. :-)

Thank you!

M__ · Accepted Answer

The easiest solution is finditer
use re

# Compile you search term where [A|T|G|C] means A or G or C or T 
p = re.compile("[A|T|G|C]GG")

iterator = p.finditer(useTF)
for match in iterator:
    print(match.span())

** Example Output**
(0, 2)
(22, 24)
(29, 31)

Easy does it. Its worth looking at the documentation for re here, https://docs.python.org/3/howto/regex.html for alternative approaches using match or search below. finditer is definitely the solution in context.

I support your coding approach because it provides lots of flexibility. For example, you might want to target GG or TTT around a given locus and you can easily modify the code to accomodate this. Someone elses code would be less flexible.
In summary, I kinda like your substrings and do like the count method. I'm not a fan of your == in this instance and your C-style matching should be substituted for reg-ex which in Python is implemented via re.
The basic issues are,

Biopython would be used via AlignIO or SeqIO. This will allow any given format to be analysed. Yours approach is okay but prone to bugs (below). [probably a minor issue]
Your sequence is in some sort of weird format, like extended phylip format and it is not fasta (that wouldn't work). It needs verifying because removing the first 65 characters might remove bits of sequence [probably a minor issue]
Chopping of -2 from the length ... I dunno I'd check out -1 [minor]
C-style matching. I don't like this. I would definitely use re. One solution uses match or search, but the solution above uses finditer, Reg-ex will allow to search for NGG and TTTN, which you are not doing
I think you are looping a locus of 3 and 4 bp and I'm not sure how the C-style == would match this because thats 2 and 3 bp respectively. Just thoughts. You will have 'edge effects' at the beginning and end of sequences, e.g. TTT 3' will match but that isn't TTTN.
I like your count method, but I think finditer is better because it gives you location information
The dictionary is sort of fine. Personally, I would dump the dictionary into a pandas dataframe This way the results are output as e.g. CSV, spreadsheet flat-files for your colleagues to view.
the if and elif would be placed in a subroutine, this is good practice to convert to an object orientated approach.

Find CRISPR PAM-sites with python

One Answer

Add your own answers!

Ask a Question