Bioinformatics Asked on March 20, 2021
import itertools
import os

import pandas as pd

mutations = ['A222V', 'D614G', 'E484Q', 'E780Q', 'G476S', 'L18F', 'N439K',
             'S477', 'S477N', 'T478I', 'V483A']
combinations = []
for M in range(1, len(mutations)+1):
    for subset in itertools.combinations(mutations, M):
        combinations.append(subset)
combinations = ['_'.join(sorted(x)) for x in combinations]
combinations = [x.split('_') for x in list(set(combinations))]
root = "C:"
os.chdir(root)
lineages = os.listdir('Results')
combination_labels = []
combination_counts = []
for lineage in lineages:
    df = pd.read_csv('Results/' + lineage).dropna()
    for combination in combinations:
        combination_df = df[list(df)]
        for mutation in combination:
            combination_df = combination_df[combination_df[mutation] == 1]
            #print(combination_df.shape[0])
        combination_labels.append('_'.join(combination))
        combination_counts.append(combination_df.shape[0])
        out_df = pd.DataFrame({'combination':combination_labels,
                               'count':combination_counts})
        out_df['percentage'] = (out_df['count'] / df.shape[0]) * 100
        out_df = out_df.sort_values('percentage', ascending = False)
        out_df.to_csv('Results_2/' + lineage.replace(".csv", "") + '_3.csv',
                      header = True,
                      index = False)
The input CSV
lineage,A222V,D614G,E484Q,E780Q,G476S,L18F,N439K,S477,S477N,T478I,V483A
417941,0,1,0,0,0,0,0,0,0,0,0
Output CSV
combination,count,percentage
D614G,87355,90.7084929856806
My code above counts all occurrences of combinations of spike protein mutations.
My question is: how do I ensure that once something (a row in the imported .csv file) has been counted, it is not counted a second time?
Or, possibly, how can I edit this code so that single mutations are not counted when they occur as part of a combination?
Thanks in advance
Alright, so there are a number of problematic patterns in your code - as far as I understand what you are trying to do. Next time, try to post a reproducible example that people can use and more people will be willing to help.
combination_labels = []
combination_counts = []
for lineage in lineages:
Declaring these two lists before the loop, then appending all your combination labels inside the loop means that these two lists will contain all labels and counts for all runs (all your lineages). If I am understanding your intentions correctly this is probably not something you want. The simple fix here is to just move them into the for-loop.
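A minimal sketch of the difference, using made-up counts in place of your real lineage files (the file names and numbers are invented for illustration only):

```python
# Hypothetical stand-ins for two lineage files and their per-combination counts.
fake_results = {"lineage_a.csv": [3, 1], "lineage_b.csv": [5, 2]}

# List created once, outside the loop: it keeps growing across lineages.
shared_counts = []
sizes_shared = []
for name, counts in fake_results.items():
    shared_counts.extend(counts)
    sizes_shared.append(len(shared_counts))

# List created inside the loop: every lineage starts from an empty list.
sizes_fresh = []
for name, counts in fake_results.items():
    fresh_counts = []
    fresh_counts.extend(counts)
    sizes_fresh.append(len(fresh_counts))

print(sizes_shared)  # [2, 4] -> the second lineage also carries the first one's entries
print(sizes_fresh)   # [2, 2] -> each lineage only sees its own entries
```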
for mutation in combination:
    combination_df = combination_df[combination_df[mutation] == 1]
Here each pass keeps only the rows where the current mutation is 1, and because combination_df is reassigned on every pass the filters stack: after the loop, combination_df holds the rows that carry every mutation in the combination. Nothing requires the other mutation columns to be 0, though, so a row with both A222V and D614G is counted under A222V, under D614G and again under A222V_D614G. That is why rows are counted more than once. I will get back to solving this in my solution below.
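A toy check of what this loop produces (the three-row table is invented, and count_with_nested_filter is just a helper name I made up to wrap your loop). Because each pass filters the already-filtered frame, the result keeps the rows carrying every mutation in the combination, but it does not exclude rows that carry additional mutations:

```python
import pandas as pd

# Invented three-row table; the columns mirror two of the mutation flags.
df = pd.DataFrame({"A222V": [1, 0, 1], "D614G": [1, 1, 0]})

def count_with_nested_filter(frame, combination):
    """Count rows where every mutation in `combination` equals 1."""
    subset = frame
    for mutation in combination:
        subset = subset[subset[mutation] == 1]
    return subset.shape[0]

single = count_with_nested_filter(df, ["D614G"])
pair = count_with_nested_filter(df, ["A222V", "D614G"])
print(single, pair)  # 2 1
```

The first row carries both mutations, so it contributes to the D614G count and to the A222V_D614G count.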
combination_labels.append('_'.join(combination))
combination_counts.append(combination_df.shape[0])
out_df = pd.DataFrame({'combination':combination_labels,
                       'count':combination_counts})
out_df['percentage'] = (out_df['count'] / df.shape[0]) * 100
out_df = out_df.sort_values('percentage', ascending = False)
out_df.to_csv('Results_2/' + lineage.replace(".csv", "") + '_3.csv',
              header = True,
              index = False)
Since this part is again scoped inside the for-loop, each iteration (each combination) will overwrite the same 'Results_2/' + lineage.replace(".csv", "") + '_3.csv' file. This you also want to move out of the for combination in ... loop.
A big concern is also performance. You have 2000-something combinations and 100,000 lines in your file. As you loop over the combinations, you also filter the whole DataFrame again each time, when this is a problem you can solve in a single pass over the rows. It's not much data so it's still fine, but these habits are good to develop early.
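The "2000-something" figure can be checked directly. This small sketch only counts the subsets your loop generates; it does not touch any files:

```python
import itertools

mutations = ['A222V', 'D614G', 'E484Q', 'E780Q', 'G476S', 'L18F', 'N439K',
             'S477', 'S477N', 'T478I', 'V483A']

# Count every non-empty subset, exactly as the nested loop enumerates them.
n_combinations = sum(
    1
    for M in range(1, len(mutations) + 1)
    for _ in itertools.combinations(mutations, M)
)

print(n_combinations)            # 2047
print(2 ** len(mutations) - 1)   # 2047, the closed form for non-empty subsets
```

So the original approach filters the DataFrame 2047 times per lineage file.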
Now here's an idea for a rewrite that should work better. I don't have a csv from you to test with, so there might be (probably are) still things to fix, but hopefully it'll give you a starting point:
import itertools
import os

import pandas as pd

mutations = ['A222V', 'D614G', 'E484Q', 'E780Q', 'G476S', 'L18F', 'N439K',
             'S477', 'S477N', 'T478I', 'V483A']
# Every non-empty subset of mutations, labelled as a sorted, '_'-joined string
combinations = []
for M in range(1, len(mutations)+1):
    for subset in itertools.combinations(mutations, M):
        combinations.append(subset)
combinations = ['_'.join(sorted(x)) for x in combinations]
root = "C:"
os.chdir(root)
lineages = os.listdir('Results')
for lineage in lineages:
    # Fresh counter per lineage, so counts do not leak between files
    out_df = pd.DataFrame({"Count": 0}, index=combinations)
    df = pd.read_csv('Results/' + lineage, index_col="lineage").dropna()
    # Each row is counted once, under the exact set of mutations it carries
    for index, row in df.iterrows():
        combination = '_'.join(sorted(df.columns[row==1]))
        if combination:  # rows without any mutation have no label in out_df
            out_df.loc[combination, "Count"] += 1
    out_df['Percentage'] = (out_df['Count'] / df.shape[0]) * 100
    out_df = out_df.sort_values('Percentage', ascending = False)
    out_df.to_csv('Results_2/' + lineage.replace(".csv", "") + '_3.csv', header = True,
                  index = True)
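If the row loop turns out to be slow on larger files, one alternative (not part of the code above, and only tried here on an invented four-row table with three of the mutation columns) is to group identical rows and count each group once. The "none" label for rows without any mutation is my own placeholder:

```python
import pandas as pd

# Invented rows in the same layout as the input CSV (three mutation columns only).
df = pd.DataFrame(
    {"A222V": [1, 0, 1, 0], "D614G": [1, 1, 1, 0], "L18F": [0, 0, 0, 0]},
    index=pd.Index([1, 2, 3, 4], name="lineage"),
)

# One group per distinct 0/1 pattern; size() gives the number of rows in each.
pattern_counts = df.groupby(list(df.columns)).size()

# Turn each pattern, e.g. (1, 1, 0), into a label such as "A222V_D614G".
exact_counts = {}
for pattern, n in pattern_counts.items():
    present = sorted(col for col, flag in zip(df.columns, pattern) if flag == 1)
    label = "_".join(present) or "none"  # "none" is a hypothetical placeholder
    exact_counts[label] = n

out = pd.Series(exact_counts, name="Count").sort_values(ascending=False)
print(out)
```

Every row falls into exactly one group, so the counts add up to the number of rows and no row is counted twice.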
Correct answer by Bastian Schiffthaler on March 20, 2021