Bioinformatics Asked by Nitha on March 22, 2021
I have very large tab-delimited .vcf files and want to match two (or three) of these files on their position column, then print the result to a new .csv file.
File structures:
File_1: a tab-delimited file (.vcf) whose header line (line 3439) gives the column names:
#CHROM POS ID REF ALT QUAL FILTER INFO
File_2: the same columns as File_1, plus genotype columns; its header is on line 3407:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE_1 SAMPLE_2 .....
In File_2, the INFO column contains many semicolon-separated key=value substrings, for example:
AC=46,2;AF=0.958,0.042;AN=48;DP=269;ExcessHet=3.0103;FS=0.000;InbreedingCoeff=0.5411;MLEAC=92,4;MLEAF=1.00,0.083;QD=25.36;SOR=2.488
From this string, only the AC=, AF=, AN= and DP= entries have to be printed.
Desired output files to generate:
Even if I have more than 3 files, the output should be a single file. Importantly, if a position (POS) matches in only 2 of the files, the columns for the file that lacks that position should be filled with NA NA NA.
row | file_1: CHROM POS REF ALT | file_2: #CHROM POS REF ALT INFO
1 22 10511521 C T 1 22 10511521 C T AC=46,2 AF=0.958,0.042 AN=48 DP=269
2 22 10510544 G A 2 22 10510544 G A AC=49,2 AF=0.958,0.042 AN=89 DP=536
3 22 10515068 AGAT,T AGAT,AT 3 22 10515068 AAA AAAGG,A,GAA AC=100 AF=0.958,0.042 AN=62 DP=123
4 22 10515118 A G, TAA 4 22 10515118 AG, TAA AC=32 AF=0.958,0.042 AN=45 DP=500
5 22 10515118 AAAG A 5 22 10515118 AATG A AC=50 AF=0.958,0.042 AN=49 DP=129
Note: duplicates are not removed while matching, because the same position may carry an insertion in one record and a deletion in another.
Output files: File1_unique.csv, File2_unique.csv, etc.
So far I have been able to read the files, match them on position, and print the output, but I was not able to write efficient code:
import pandas as pd
df1 = pd.read_csv("File1_3.vcf", sep='\t', usecols=['POS'])  ## Reading file1
df2 = pd.read_csv("file2_3.vcf", sep='\t', usecols=['POS'])  ## Reading file2
df3 = pd.concat([df1, df2], sort=True)  ## Combining both dataframes
df4 = df3.drop_duplicates(keep=False)  ## keep=False drops every duplicated POS, leaving positions unique to one file
df4.to_csv("c3-UniquePosition_of_bothData.csv", sep="\t", index=False, header=True)  ## Writing positions unique to either file
df1_Uni_file1_c3Posi = pd.merge(df4, df1, on='POS', how='inner')  ## Positions unique to File1
df2_Uni_File2_c3Posi = pd.merge(df4, df2, on='POS', how='inner')  ## Positions unique to File2
df_File1_File2_common_c3Posi = pd.merge(df1, df2, on='POS', how='inner')  # Positions common to File1 and File2
Program 2 (reads the original files without editing them):
import pandas as pd
df1 = pd.read_csv("File1_22.vcf.gz", sep="\t", skiprows=3438, usecols=[0, 1, 2, 3, 4])
df2 = pd.read_csv("File2_22.vcf.gz", sep="\t", skiprows=3406, usecols=[0, 1, 3, 4, 7])
# writing the output files
#df1.to_csv("File1_c22.csv")
#df2.to_csv("File2_c22.csv")
# merging
df3 = pd.merge(df1, df2, on='POS', how='inner', sort=True)
df3.to_csv("common_position.csv", sep=",", index=False, header=True)
#df3 = pd.concat([df1, df2], axis=1).to_csv('check1.csv')  # joins the two outputs into a single file
Could anyone suggest an efficient python pandas script to do this?
Thanks all
One way to do this is to use a dictionary. Read the first file line by line and, for each line, use "chr:position" as the key and the line itself as the value. To handle duplicates, make each value a list and store all lines sharing a "chr:position" as elements of that list.
After constructing this dictionary, read the second file line by line. For each line, build the same "chr:position" key; with it you can look up the corresponding line(s) from the first file directly in the dictionary, and append the line from the second file to them.
After finishing the second file, you are ready to write the output. Iterate over the keys of the dictionary and write the value of each key to a new file; remember that each value is now a combined line.
I chose a dictionary because it is essentially a hash table: a key lookup is effectively constant time, instead of scanning every element of a list to find a match. This is very handy when the files are large.
Answered by Phoenix Mu on March 22, 2021