Bioinformatics Asked by conchoecia on June 26, 2021
I am writing a python script that requires a reverse complement function to be called on DNA strings of length 1 through around length 30. Line profiling programs indicate that my functions spend a lot of time getting the reverse complements, so I am looking to optimize.
What is the fastest way to get the reverse complement of a sequence in python? I am posting my skeleton program to test different implementations below with DNA string size 17 as an example.
#!/usr/bin/env python
import random
import timeit
global complement
complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
DNAlength = 17
#randomly generate 100k bases
int_to_basemap = {1: 'A', 2: 'C', 3: 'G', 4: 'T'}
num_strings = 500000
random.seed(90210)
DNAstrings = ["".join([int_to_basemap[random.randint(1,4)] for i in range(DNAlength)])
for j in range(num_strings)]
#get an idea of what the DNAstrings look like
print(DNAstrings[0:5])
def reverse_complement_naive(seq):
this_complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
return "".join(this_complement.get(base, base) for base in reversed(seq))
def reverse_complement(seq):
return "".join(complement.get(base, base) for base in reversed(seq))
tic=timeit.default_timer()
rcs = [reverse_complement_naive(seq) for seq in DNAstrings]
toc=timeit.default_timer()
baseline = toc - tic
namefunc = {"naive implementation": reverse_complement_naive,
"global dict implementation": reverse_complement}
for function_name in namefunc:
func = namefunc[function_name]
tic=timeit.default_timer()
rcs = [func(seq) for seq in DNAstrings]
toc=timeit.default_timer()
walltime = toc-tic
print("""{}
{:.5f}s total,
{:.1f} strings per second
{:.1f}% increase over baseline""".format(
function_name,
walltime,
num_strings/walltime,
100- ((walltime/baseline)*100) ))
By the way, I get output like this. It varies by the call, of course!
naive implementation
1.83880s total,
271916.7 strings per second
-0.7% increase over baseline
global dict implementation
1.74645s total,
286294.3 strings per second
4.3% increase over baseline
Edit: Great answers, everyone! When I get a chance in a day or two I will add all of these to a test file for the final run. When I asked the question, I had not considered whether I would allow for cython or c extensions when selecting the final answer. What do you all think?
Edit 2: Here are the results of the final simulation with everyone’s implementations. I am going to accept the highest scoring pure python code with no Cython/C. For my own sake I ended up using user172818’s c implementation. If you feel like contributing to this in the future, check out the github page I made for this question.
the runtime of reverse complement implementations.
10000 strings and 250 repetitions
╔══════════════════════════════════════════════════════╗
║ name %inc s total str per s ║
╠══════════════════════════════════════════════════════╣
║ user172818 seqpy.c 93.7% 0.002344 4266961.4 ║
║ alexreynolds Cython (v2) 93.4% 0.002468 4051583.1 ║
║ alexreynolds Cython (v1) 90.4% 0.003596 2780512.1 ║
║ devonryan string 86.1% 0.005204 1921515.6 ║
║ jackaidley bytes 84.7% 0.005716 1749622.2 ║
║ jackaidley bytesstring 83.0% 0.006352 1574240.6 ║
║ global dict 5.4% 0.035330 283046.7 ║
║ revcomp_translateSO 45.9% 0.020202 494999.4 ║
║ string_replace 37.5% 0.023345 428364.9 ║
║ revcom from SO 28.0% 0.026904 371694.5 ║
║ naive (baseline) 1.5% 0.036804 271711.5 ║
║ lambda from SO -39.9% 0.052246 191401.3 ║
║ biopython seq then rc -32.0% 0.049293 202869.7 ║
╚══════════════════════════════════════════════════════╝
I don't know if it's the fastest, but the following provides an approximately 10x speed up over your functions:
import string
tab = string.maketrans("ACTG", "TGAC")
def reverse_complement_table(seq):
return seq.translate(tab)[::-1]
The thing with hashing is that it adds a good bit of overhead for a replacement set this small.
For what it's worth, I added that to your code as "with a translation table" and here is what I got on my workstation:
global dict implementation
1.37599s total,
363374.8 strings per second
3.3% increase over baseline
naive implementation
1.44126s total,
346919.4 strings per second
-1.3% increase over baseline
with a translation table
0.16780s total,
2979755.6 strings per second
88.2% increase over baseline
If you need python 3 rather than python 2, then substitute tab = str.maketrans("ACTG", "TGAC")
for tab = string.maketrans("ACTG", "TGAC")
, since maketrans
is now a static method on the str
type.
For those wondering, using biopython is slower for this (~50% slower than the naive implementation), presumably due to the overhead of converting the strings to Seq
objects. If one were already reading sequences in using biopython, though, I wouldn't be surprised if the performance was much different.
Correct answer by Devon Ryan on June 26, 2021
The most reliable and simplest way is probably using Biopython
:
from Bio.Seq import Seq
def my_reverse_complement(seq):
return Seq(seq).reverse_complement()
print(my_reverse_complement('ATGCGTA'))
# TACGCAT
As Devon has already said here using Biopython
isn't as fast as the naive Python solution, and I also tested that shown here with ipython. However, this is because Biopython
's implementation, although similar to the naive approach, includes other features; it can reverse complement RNA as well as DNA and it will tell you if you're mixing DNA and RNA
Note that if you really want a fast way you could look at Cython
or another python extension. As I edit this now, there are several nice answers taking this approach from user172818 and Alex Reynolds.
Answered by Chris_Rands on June 26, 2021
Here's a Cython approach that might suggest a generic approach to speeding up Python work.
If you're manipulating (ASCII) character strings and performance is a design consideration, then C or Perl are probably preferred options to Python.
In any case, this Cython test uses Python 3.6.3:
$ python --version
Python 3.6.3 :: Anaconda custom (64-bit)
To install Cython, e.g.:
$ conda install -c anaconda cython
The Cython code below seems to offer about the same speed bump as the translation table — perhaps similar code is run under the hood of that.
Two files are needed, starting with setup.py
:
from distutils.core import setup
from Cython.Build import cythonize
setup(
name = 'Reverse complement C test',
ext_modules = cythonize("revcomp_c.pyx"),
)
And then a second file called revcomp_c.pyx
:
from libc.stdlib cimport malloc
cdef int seq_len = 17
cdef char *seq_dest = <char *>malloc(seq_len + 1)
seq_dest[seq_len] = ' '
def reverse_complement_c_v1(str seq):
cdef bytes py_bytes = seq.encode('ascii')
cdef char *seq_src = py_bytes
cdef int i = 0
cdef int d = 0
for i in range(seq_len):
d = seq_len - i - 1
if seq_src[i] == 'A':
seq_dest[d] = 'T'
elif seq_src[i] == 'G':
seq_dest[d] = 'C'
elif seq_src[i] == 'C':
seq_dest[d] = 'G'
elif seq_src[i] == 'T':
seq_dest[d] = 'A'
return seq_dest[:seq_len].decode('ascii')
This can be compiled into a Python module like so:
$ python setup.py build_ext --inplace
Then we can modify the test bench to include this Cython module and the relevant test method:
#!/usr/bin/env python
import random
import timeit
from revcomp_c import reverse_complement_c_v1
global complement
complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
DNAlength = 17
#randomly generate 100k bases
int_to_basemap = {1: 'A', 2: 'C', 3: 'G', 4: 'T'}
num_strings = 500000
random.seed(90210)
DNAstrings = ["".join([int_to_basemap[random.randint(1,4)] for i in range(DNAlength)])
for j in range(num_strings)]
#get an idea of what the DNAstrings look like
print(DNAstrings[0:5])
def reverse_complement_naive(seq):
this_complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}
return "".join(this_complement.get(base, base) for base in reversed(seq))
def reverse_complement(seq):
return "".join(complement.get(base, base) for base in reversed(seq))
tic=timeit.default_timer()
rcs = [reverse_complement_naive(seq) for seq in DNAstrings]
toc=timeit.default_timer()
baseline = toc - tic
namefunc = {"naive implementation": reverse_complement_naive,
"global dict implementation": reverse_complement,
"Cython implementation (v1)": reverse_complement_c_v1}
for function_name in namefunc:
func = namefunc[function_name]
tic=timeit.default_timer()
rcs = [func(seq) for seq in DNAstrings]
toc=timeit.default_timer()
walltime = toc-tic
print("""{}
{:.5f}s total,
{:.1f} strings per second
{:.1f}% increase over baseline""".format(
function_name,
walltime,
num_strings/walltime,
100- ((walltime/baseline)*100) ))
A sample run:
$ python ./revcomp_test.py
['ACTGCAATAGACCACGC', 'CAGCTTGAGCCATTAAC', 'GGCCCAAGAGTTCGAAC', 'CGACTGTCTCGAATTGT', 'ATCCGCTATGCGCCGTC']
naive implementation
1.73295s total,
288525.0 strings per second
-2.5% increase over baseline
global dict implementation
1.70896s total,
292575.5 strings per second
-1.0% increase over baseline
Cython implementation (v1)
0.21458s total,
2330158.7 strings per second
87.3% increase over baseline
One easy way to speed this up is to use a static const unsigned char
array as an ASCII lookup table, which maps a residue directly to its complement. This would replace the nest of if
statements and probably give a nice little boost (and it appears it does, making it among the best performers so far!).
Here is my fast implementation of a reverse complement function in C:
https://gist.github.com/alexpreynolds/4f75cab4350e9d937f4a
You might be able to use this directly in Python via the subprocess
library. Outsourcing the reverse complement step to a utility written in C will almost always beat the best that Python can do, and you can do nice and important things like bounds checking etc. without losing much speed.
It's unclear how "pure" the answer needs to be, but making a system call from Python seems fair if you're processing strings and your goal is performance.
Another direction to take may be to look at multithreading, if you don't need ordered output. If you have many thousands of sequences stored in memory, you could split an array of sequences up into smaller arrays by use of offsets or array indices. Each thread would work on "rc"-ing sequences in its own piece of the array.
Answered by Alex Reynolds on June 26, 2021
Use a bytearray
instead of a string
and then employ maketrans
to translate
You do not need the more advanced string encoding capabilities of string
to store a string of bases, but you're still paying for it in performance. Devon Ryan's suggestion of maketrans
is the huge improvement, 10x faster than your naive implementation. Using the same approach, but swapping everything out for byte
s allows a further 40% speed improvement, however:
# DNA sequence generation also needs to be swapped to bytes
tab_b = bytes.maketrans(b"ACTG", b"TGAC")
def reverse_complement_bytes(seq):
return seq.translate(tab_b)[::-1]
In my comparison:
Table implementation
0.24339s total,
2054315.7 strings per second
82.5% increase over baseline
Bytes table implementation
0.14985s total,
3336755.9 strings per second
89.2% increase over table
Answered by Jack Aidley on June 26, 2021
Another python extension but without cython. The source code is available at the bottom of this answer or from this gist. On Mac with Python3:
string
0.38166s total,
1310067.4 strings per second
84.4% increase over baseline
seqpy
0.20524s total,
2436182.7 strings per second
91.6% increase over baseline
On Linux with Python2 (seqpy is the first):
seqpy
0.10960s total,
4561842.5 strings per second
93.3% increase over baseline
string
0.17811s total,
2807330.9 strings per second
89.1% increase over baseline
This is file setup.py
:
from setuptools import setup, Extension
seqpy_module = Extension('seqpy', sources = ["seqpy.c"])
setup(
name = "seqpy",
description = "seqpy",
ext_modules = [seqpy_module])
This is file seqpy.c
:
#include <stdlib.h>
#include <stdint.h>
#include <Python.h>
static unsigned char seq_comp_table[256] = {
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 'T', 'V', 'G', 'H', 'E', 'F', 'C', 'D', 'I', 'J', 'M', 'L', 'K', 'N', 'O',
'P', 'Q', 'Y', 'S', 'A', 'A', 'B', 'W', 'X', 'R', 'Z', 91, 92, 93, 94, 95,
64, 't', 'v', 'g', 'h', 'e', 'f', 'c', 'd', 'i', 'j', 'm', 'l', 'k', 'n', 'o',
'p', 'q', 'y', 's', 'a', 'a', 'b', 'w', 'x', 'r', 'z', 123, 124, 125, 126, 127,
128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143,
144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159,
160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175,
176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191,
192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207,
208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223,
224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239,
240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255
};
static PyObject *seqpy_revcomp(PyObject *self, PyObject *args)
{
PyObject *r;
char *seq, *rev;
int i, len;
PyArg_ParseTuple(args, "s#", &seq, &len);
rev = (char*)malloc(len);
for (i = 0; i < len; ++i)
rev[len - i - 1] = seq_comp_table[(uint8_t)seq[i]];
r = Py_BuildValue("s#", rev, len);
free(rev);
return r;
}
static PyMethodDef seqpy_methods[] = {
{"revcomp", seqpy_revcomp, METH_VARARGS, "Reverse complement a DNA sequence"},
{NULL, NULL, 0, NULL}
};
#if PY_MAJOR_VERSION >= 3
static struct PyModuleDef seqpy_module = { PyModuleDef_HEAD_INIT, "seqpy", NULL, -1, seqpy_methods };
PyMODINIT_FUNC PyInit_seqpy(void) { return PyModule_Create(&seqpy_module); }
#else
PyMODINIT_FUNC initseqpy(void) { Py_InitModule3("seqpy", seqpy_methods, NULL); }
#endif
To compile and test:
python setup.py build_ext -i
python -c 'import seqpy; print(seqpy.revcomp("GGGTT"))'
Answered by user172818 on June 26, 2021
Here is a revision of my original Cython answer which incorporates my suggestion to use a char
lookup array:
from libc.stdlib cimport malloc
cdef int seq_len = 17
cdef char *seq_dest = <char *>malloc(seq_len + 1)
seq_dest[seq_len] = ' '
cdef char *basemap = [ ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', 'T', ' ', 'G', ' ', ' ', ' ', 'C', ' ', ' ', ' ', ' ', ' ', ' ', 'N', ' ', ' ', ' ', ' ', ' ', 'A', 'A', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', 't', ' ', 'g', ' ', ' ', ' ', 'c', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', 'a', 'a' ]
def reverse_complement_c_v2(str seq):
cdef bytes py_bytes = seq.encode('UTF-8')
cdef char *seq_src = py_bytes
cdef int i = 0
for i in range(seq_len):
seq_dest[seq_len - i - 1] = basemap[<int>seq_src[i]]
return seq_dest[:seq_len].decode('UTF-8')
In testing under OS X, Python 3.6.3:
$ python ./revcomp.py
['ACTGCAATAGACCACGC', 'CAGCTTGAGCCATTAAC', 'GGCCCAAGAGTTCGAAC', 'CGACTGTCTCGAATTGT', 'ATCCGCTATGCGCCGTC']
...
Cython implementation (v1)
0.20664s total,
2419626.4 strings per second
88.3% increase over baseline
Cython implementation (v2)
0.14170s total,
3528668.5 strings per second
92.0% increase over baseline
Using my lookup array approach ("v2") adds a very decent performance bump over using if
blocks ("v1"), and you can keep everything as a Python string.
Answered by Alex Reynolds on June 26, 2021
Since at least version 1.71
of biopython you can use Bio.Seq.reverse_complement
, which also works on plain strings natively (no conversion to Seq objects). On my mac I get 800k strings converted with that implementation ("biopython just rc") when using the benchmark.
10000 strings and 250 repetitions
percent increase over baseline seconds total strings per second
name
alexreynolds Cython (v2) 94.0% 0.002491 4015188.7
user172818 seqpy.c 93.2% 0.002833 3529944.4
alexreynolds Cython (v1) 90.7% 0.003885 2573863.7
devonryan string 86.0% 0.005804 1722983.5
jackaidley bytes 85.7% 0.005959 1678085.3
jackaidley bytesstring 82.1% 0.007459 1340693.2
biopython just rc 70.5% 0.012275 814638.6
revcomp_translateSO 49.9% 0.020822 480254.6
revcom from SO 41.7% 0.024253 412320.2
string_replace 35.2% 0.026946 371107.3
lambda from SO 10.7% 0.037111 269464.0
global dict 0.0% 0.041556 240641.7
biopython seq then rc -30.9% 0.054430 183721.9
naive (baseline) -0.9% 0.041962 238313.6
Answered by winni2k on June 26, 2021
def revcomplement(s):
s = s.upper()
c = { "A":"T","C":"G","T":"A","G":"C"}
co = str()
for i in s:
co = c[i] + co
return co
Answered by Amardeep Lokhande on June 26, 2021
Get help from others!
Recent Answers
Recent Questions
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP