Code Review Asked on December 10, 2021
I have a file with over 2 million lines. I want to get the 8th column of the file and find the pattern that contains "DP=" and "QD=" and then save all the values in an array, get the mean, median and create a figure. I wrote the following script. This script goes through each line in a loop, find the pattern and saves its value in a numpy array. This is however very slow. I was wondering if there is another algorithm that can speed up the script? Thanks!
#!/usr/bin/python
import sys
import re
import gzip
import matplotlib.pyplot as plt
import numpy as np
dp_array = np.empty(0, dtype=float)
qd_array = np.empty(0, dtype=float)
c = 0
with gzip.open(sys.argv[1], 'rb') as vcf:
for line in vcf:
line = line.decode() # decode byte into string
if not line.startswith("#"):
line = line.strip("n").split("t")
c += 1
#print(c)
#print(line)
info = line[7]
# print(info)
pattern1 = re.compile(r'DP=([^;]+)')
depth = re.search(pattern1, info).group(1)
dp_array = np.append(dp_array, depth)
#print(dp_array)
pattern2 = re.compile(r'QD=([^;]+)')
qd = re.search(pattern2, info).group(1)
qd_array = np.append(qd_array, qd)
#print(qd_array)
# Calculate the mean and median
depth_mean = np.array(dp_array)
qd_mean = np.array(qd_array)
print("mean", depth_mean)
print("median", qd_mean)
depth_median = np.array(dp_array)
qd_median = np.array(qd_array)
print("mean", depth_median)
print("median", qd_median)
# plot
plt.hist(depth_mean, bins = 20, facecolor='blue', alpha=0.5)
plt.hist(qd_mean, bins = 20, facecolor='blue', alpha=0.5)
plt.savefig("dp.qd.pdf")
Example of data:
scaffold53 1174 . C G 196.03 PASS AC=1;AF=0.100;AN=10;BaseQRankSum=-6.360e-01;ClippingRankSum=0.00;DP=2865;ExcessHet=23.2938;FS=3.961;InbreedingCoeff=-0.3494;MQ=74.21;MQRankSum=11.081;NEGATIVE_TRAIN_SITE;QD=6.53;ReadPosRankSum=0.775;SOR=0.053;VQSLOD=5.67;culprit=QD;set=variant20 GT:AD:DP:GQ:PL 0/0:761,0:761:0:0,0,15737 0/0:779,0:779:0:0,0,13278 0/0:756,0:756:0:0,0,14720 0/0:539,0:539:0:0,0,10431 0/1:17,13:30:99:217,0,357
scaffold53 1799 . A G 11721.2 PASS AC=1;AF=0.100;AN=10;BaseQRankSum=-2.752e+00;ClippingRankSum=0.00;DP=139;ExcessHet=3.7680;FS=2.569;InbreedingCoeff=-0.0576;MQ=76.95;MQRankSum=2.152;POSITIVE_TRAIN_SITE;QD=22.80;ReadPosRankSum=0.807;SOR=0.486;VQSLOD=5.76;culprit=FS;set=variant20 GT:AD:DP:GQ:PGT:PID:PL 0/0:34,0:34:99:.:.:0,99,1010 0/0:35,0:35:99:.:.:0,102,1005 0/0:31,0:31:90:.:.:0,90,1350 0/1:9,10:19:99:0|1:1793_CT_C:380,0,442 0/0:20,0:20:57:.:.:0,57,855
scaffold53 1926 . T C 15709.2 PASS AC=3;AF=0.300;AN=10;BaseQRankSum=-2.196e+00;ClippingRankSum=0.00;DP=126;ExcessHet=0.6002;FS=0.000;InbreedingCoeff=0.2059;MQ=65.52;MQRankSum=0.678;POSITIVE_TRAIN_SITE;QD=22.25;ReadPosRankSum=-0.468;SOR=0.668;VQSLOD=6.23;culprit=DP;set=variant20 GT:AD:DP:GQ:PL 0/0:30,0:30:61:0,61,872 0/1:8,18:26:99:429,0,176 0/0:30,0:30:81:0,81,1215 0/1:14,7:21:99:148,0,386 0/1:13,6:19:99:114,0,341
scaffold53 1956 . G A 11382.4 PASS AC=1;AF=0.100;AN=10;BaseQRankSum=2.67;ClippingRankSum=0.00;DP=132;ExcessHet=5.6547;FS=0.000;InbreedingCoeff=-0.1352;MQ=68.43;MQRankSum=0.000;POSITIVE_TRAIN_SITE;QD=16.81;ReadPosRankSum=-0.352;SOR=0.733;VQSLOD=8.93;culprit=MQRankSum;set=variant20 GT:AD:DP:GQ:PL 0/0:32,0:32:63:0,63,904 0/0:29,0:29:81:0,81,1215 0/0:27,0:27:78:0,78,1170 0/1:13,10:23:99:266,0,332 0/0:21,0:21:60:0,60,900
scaffold53 6213 . G A 13615 PASS AC=2;AF=0.200;AN=10;BaseQRankSum=3.09;ClippingRankSum=0.00;DP=155;ExcessHet=6.8518;FS=1.104;InbreedingCoeff=-0.1832;MQ=69.25;MQRankSum=-0.713;POSITIVE_TRAIN_SITE;QD=18.11;ReadPosRankSum=-0.138;SOR=0.606;VQSLOD=4.32;culprit=FS;set=variant20 GT:AD:DP:GQ:PL 0/0:25,0:25:23:0,23,524 0/0:36,0:36:99:0,99,957 0/0:38,0:38:0:0,0,758 0/1:11,14:25:99:368,0,215 0/1:13,18:31:99:446,0,224
scaffold53 6794 . G A 6939.88 PASS AC=2;AF=0.200;AN=10;BaseQRankSum=2.48;ClippingRankSum=0.00;DP=117;ExcessHet=9.2123;FS=5.276;InbreedingCoeff=-0.2454;MQ=76.23;MQRankSum=0.754;QD=13.74;ReadPosRankSum=0.423;SOR=0.412;VQSLOD=0.800;culprit=QD;set=variant20 GT:AD:DP:GQ:PL 0/0:25,0:25:72:0,72,747 0/0:19,0:19:54:0,54,810 0/0:27,0:27:75:0,75,837 0/1:9,11:20:99:236,0,190 0/1:14,12:26:99:283,0,282
scaffold53 7613 . G C 4715.43 PASS AC=3;AF=0.300;AN=10;BaseQRankSum=0.367;ClippingRankSum=0.00;DP=131;ExcessHet=1.7167;FS=0.000;InbreedingCoeff=0.0683;MQ=92.82;MQRankSum=0.000;QD=13.63;ReadPosRankSum=-1.640;SOR=0.626;VQSLOD=1.57;culprit=QD;set=variant20 GT:AD:DP:GQ:PL 0/1:11,11:22:99:260,0,265 0/1:15,18:33:99:433,0,366 0/1:16,12:28:99:280,0,391 0/0:23,0:23:63:0,63,945 0/0:25,0:25:72:0,72,746
scaffold53 7643 . T G 18643.4 PASS AC=7;AF=0.700;AN=10;BaseQRankSum=-2.330e+00;ClippingRankSum=0.00;DP=158;ExcessHet=1.7167;FS=1.314;InbreedingCoeff=0.0683;MQ=62.44;MQRankSum=0.000;POSITIVE_TRAIN_SITE;QD=22.41;ReadPosRankSum=0.577;SOR=0.679;VQSLOD=10.21;culprit=MQRankSum;set=variant20 GT:AD:DP:GQ:PL 0/1:10,16:26:99:384,0,238 0/1:21,23:44:99:568,0,509 0/1:12,19:31:99:471,0,292 1/1:0,27:27:81:808,81,0 1/1:0,30:30:89:836,89,0
scaffold53 8477 . T G 4933.45 PASS AC=3;AF=0.300;AN=10;BaseQRankSum=-2.717e+00;ClippingRankSum=0.00;DP=122;ExcessHet=1.7167;FS=4.887;InbreedingCoeff=0.0681;MQ=93.80;MQRankSum=0.000;QD=15.13;ReadPosRankSum=1.452;SOR=0.833;VQSLOD=2.21;culprit=SOR;set=variant20 GT:AD:DP:GQ:PL 0/1:10,11:21:99:272,0,261 0/1:15,17:32:99:437,0,397 0/1:12,12:24:99:276,0,309 0/0:25,0:25:72:0,72,1080 0/0:20,0:20:57:0,57,855
scaffold53 8654 . A G 15051 PASS AC=6;AF=0.600;AN=10;BaseQRankSum=-2.419e+00;ClippingRankSum=0.00;DP=125;ExcessHet=1.4147;FS=0.000;InbreedingCoeff=0.0971;MQ=63.92;MQRankSum=0.000;POSITIVE_TRAIN_SITE;QD=21.23;ReadPosRankSum=0.780;SOR=0.672;VQSLOD=9.32;culprit=DP;set=variant20 GT:AD:DP:GQ:PL 0/1:14,12:26:99:252,0,321 0/1:14,10:24:99:200,0,339 0/1:6,18:24:99:407,0,108 1/1:0,26:26:78:733,78,0 0/1:8,17:25:99:417,0,164
Presuming that DP always comes before QD on a line, you could do something like this:
Match fields preceded by 'DP' or 'QD'.
pattern = re.compile(r"(?:DP|QD)=([^;]+)")
Use a list comprehension to build a list of [DP value, QD value] from the lines in the file.
with gzip.open(sys.argv[1], 'rt') as vcf:
data = [pattern.findall(line) for line in vcf if not line.startswith("#")]
Convert to an numpy array.
data = np.array(data)
Use numpy functions to calculate on both columns at once.
dp_mean, qd_mean = np.mean(data, axis=1)
dp_median, qd_median = np.median(data, axis=1)
Answered by RootTwo on December 10, 2021
Get help from others!
Recent Answers
Recent Questions
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP