Bioinformatics Asked by Nitha on March 22, 2021
I have very large tab-delimited .vcf files and want to match two (or three) of these files on their position column, then print the result to a new .csv file.
File structures:
File_1: a tab-delimited file (.vcf) whose header line (line 3439) gives the column names:
#CHROM POS ID REF ALT QUAL FILTER INFO
File_2: the same columns as File_1, plus genotype columns; its header is on line 3407:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE_1 SAMPLE_2 .....
In File_2, the INFO column contains many semicolon-separated key=value substrings, for example:
AC=46,2;AF=0.958,0.042;AN=48;DP=269;ExcessHet=3.0103;FS=0.000;InbreedingCoeff=0.5411;MLEAC=92,4;MLEAF=1.00,0.083;QD=25.36;SOR=2.488
From this string, only the AC=, AF=, AN= and DP= entries have to be printed.
Desired output files to generate:
Even if I have more than 3 files, the output should be a single file. Importantly, if a position (POS) matches in only 2 of the files, the columns for the file that lacks that position should be filled with NA NA NA.
row | file_1: CHROM POS REF ALT | file_2: #CHROM POS REF ALT INFO
1 22 10511521 C T 1 22 10511521 C T AC=46,2 AF=0.958,0.042 AN=48 DP=269
2 22 10510544 G A 2 22 10510544 G A AC=49,2 AF=0.958,0.042 AN=89 DP=536
3 22 10515068 AGAT,T AGAT,AT 3 22 10515068 AAA AAAGG,A,GAA AC=100 AF=0.958,0.042 AN=62 DP=123
4 22 10515118 A G, TAA 4 22 10515118 AG, TAA AC=32 AF=0.958,0.042 AN=45 DP=500
5 22 10515118 AAAG A 5 22 10515118 AATG A AC=50 AF=0.958,0.042 AN=49 DP=129
Note: duplicates are not removed while matching, because the same position may carry an insertion in one record and a deletion in another.
Output files: File1_unique.csv, File2_unique.csv, etc.
So far I have been able to read the files, match them on position, and print the output, but I was not able to write efficient code:
import pandas as pd
df1 = pd.read_csv("File1_3.vcf", sep='\t', usecols=['POS'])  ## Reading file1
df2 = pd.read_csv("file2_3.vcf", sep='\t', usecols=['POS'])  ## Reading file2
df3 = pd.concat([df1, df2], sort=True)  ## Combining both dataframes
df4 = df3.drop_duplicates(keep=False)  ## keep=False drops every duplicated POS, leaving positions unique to one file
df4.to_csv("c3-UniquePosition_of_bothData.csv", sep="\t", index=False, header=True)  ## Writing positions unique to either file
df1_Uni_file1_c3Posi = pd.merge(df4, df1, on='POS', how='inner')  ## Positions unique to File1
df2_Uni_File2_c3Posi = pd.merge(df4, df2, on='POS', how='inner')  ## Positions unique to File2
df_File1_File2_common_c3Posi = pd.merge(df1, df2, on='POS', how='inner')  # Positions common to File1 and File2
Program 2 (reads the original files without editing them):
import pandas as pd
df1 = pd.read_csv("File1_22.vcf.gz", sep="\t", skiprows=3438, usecols=[0, 1, 2, 3, 4])
df2 = pd.read_csv("File2_22.vcf.gz", sep="\t", skiprows=3406, usecols=[0, 1, 3, 4, 7])
# writing the output files
#df1.to_csv("File1_c22.csv")
#df2.to_csv("File2_c22.csv")
# merging
df3 = pd.merge(df1, df2, on='POS', how='inner', sort=True)
df3.to_csv("common_position.csv", sep=",", index=False, header=True)
#df3 = pd.concat([df1, df2], axis=1).to_csv('check1.csv')  # joins the two outputs into a single file
Could anyone suggest an efficient python pandas script to do this?
Thanks all
One way to do this is to use a dictionary. Read the first file line by line and, for each line, use "chr:position" as the key and the line itself as the value. To handle duplicates, make each value a list and store all lines sharing a "chr:position" as elements of that list.
After constructing this dictionary, read the second file line by line. For each line, build the same "chr:position" key; with it you can look up the corresponding line(s) from the first file directly in the dictionary, and append the line from the second file to them.
After finishing the second file, you are ready to write the output. Iterate over the keys of the dictionary and write the value of each key to a new file; remember that each value is now a combined line.
I chose a dictionary because it is essentially a hash table: a key lookup is effectively constant time, instead of scanning every element of a list to find a match. This is very handy when the files are large.
Answered by Phoenix Mu on March 22, 2021