Stack Overflow Asked by Soumya Ranjan Sahoo on January 17, 2021
Consider the following snippet:
import pandas as pd

data = {"col1": ["aaa", "bbb", "ccc", "aaa", "ddd", "bbb"],
        "col2": ["fff", "aaa", "ggg", "eee", "ccc", "ttt"]}
df = pd.DataFrame(data, columns=["col1", "col2"])  # my actual dataframe has
                                                   # 2,000,000 (20,00,000) such rows
list_a = ["ccc","aaa","mmm","nnn","ccc"]
list_b = ["ggg","fff","eee","ooo","ddd"]
# After doing a combination of 2 elements between the 2 lists in both orders,
# we get a list that resembles something like this:
new_list = ["ccc-ggg", "ggg-ccc", "aaa-fff", "fff-aaa", ..."ccc-fff", "fff-ccc", ...]
Given a huge dataframe and 2 lists, I want to count the number of elements in new_list whose two parts appear in the same row of the dataframe. In the above pseudo example, the result would be 3, as "aaa-fff", "ccc-ggg", and "ddd-ccc" each appear in a single row of the dataframe.
Right now, I am using a linear search algorithm but it is very slow as I have to scan through the entire dataframe.
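df['col3'] = df['col1'] + "-" + df['col2']

def count_matches(df, list_a, list_b):
    c1 = 0  # running total over all (a, b) pairs
    for a in list_a:
        for b in list_b:
            str1 = a + "-" + b
            str2 = b + "-" + a
            # each str.contains call scans the whole column, hence the slowness
            c2 = df['col3'].str.contains(str1).sum() + df['col3'].str.contains(str2).sum()
            c1 += c2
    return c1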
Can someone kindly help me implement a faster algorithm preferably with a dictionary data structure?
Note: I have to iterate through the 7,000 rows of another dataframe and create the 2 lists dynamically, and get an aggregate count for each row.
Here is another way. First, I used your definition of df (with 2 columns), list_a and list_b.
# combine two columns in the data frame
df['col3'] = df['col1'] + '-' + df['col2']
# create set with list_a and list_b pairs
s = ({ f'{a}-{b}' for a, b in zip(list_a, list_b)} |
{ f'{b}-{a}' for a, b in zip(list_a, list_b)})
# find intersection
result = set(df['col3']) & s
print(len(result), '\n', result)
3
{'ddd-ccc', 'ccc-ggg', 'aaa-fff'}
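One thing worth checking before relying on the count above: zip() pairs the two lists position by position, while the question's new_list (which contains "ccc-fff") looks like a full cross product. The sketch below compares the two readings on the sample data. It assumes the question's data and lists; pair_strings is a hypothetical helper name, not part of the answer.

```python
from itertools import product

import pandas as pd

data = {"col1": ["aaa", "bbb", "ccc", "aaa", "ddd", "bbb"],
        "col2": ["fff", "aaa", "ggg", "eee", "ccc", "ttt"]}
df = pd.DataFrame(data, columns=["col1", "col2"])
list_a = ["ccc", "aaa", "mmm", "nnn", "ccc"]
list_b = ["ggg", "fff", "eee", "ooo", "ddd"]


def pair_strings(pairs):
    """Hypothetical helper: return 'x-y' and 'y-x' for every (x, y) pair."""
    out = set()
    for x, y in pairs:
        out.add(f"{x}-{y}")
        out.add(f"{y}-{x}")
    return out


# Distinct joined rows of the dataframe
rows = set(df["col1"] + "-" + df["col2"])

# Position-wise pairing (what zip does) versus every a with every b
zip_hits = rows & pair_strings(zip(list_a, list_b))
product_hits = rows & pair_strings(product(list_a, list_b))

print(len(zip_hits), sorted(zip_hits))
print(len(product_hits), sorted(product_hits))
print("only found by the cross product:", product_hits - zip_hits)
```

On this sample the cross product also picks up "aaa-eee", so which pairing is right depends on what "combination" is supposed to mean in the question.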
UPDATE to handle duplicate values.
# build list (not set) from list_a and list_b
idx = ([ f'{a}-{b}' for a, b in zip(list_a, list_b) ] +
[ f'{b}-{a}' for a, b in zip(list_a, list_b) ])
# create `col3`, and do `value_counts()` to preserve info about duplicates
df['col3'] = df['col1'] + '-' + df['col2']
tmp = df['col3'].value_counts()
# use idx to sub-select from the value counts:
tmp[ tmp.index.isin(idx) ]
# results:
ddd-ccc 1
aaa-fff 1
ccc-ggg 1
Name: col3, dtype: int64
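Since the question asks for a dictionary and for an aggregate count over many pairs of lists, the same value-counts idea can be sketched with collections.Counter: count the rows once, then look pairs up per query. This is a minimal sketch, not the answer's code; count_matches is a hypothetical helper, and it assumes the full cross product of the two lists in both orders (swap in zip if position-wise pairing is intended).

```python
from collections import Counter
from itertools import product

data = {"col1": ["aaa", "bbb", "ccc", "aaa", "ddd", "bbb"],
        "col2": ["fff", "aaa", "ggg", "eee", "ccc", "ttt"]}
list_a = ["ccc", "aaa", "mmm", "nnn", "ccc"]
list_b = ["ggg", "fff", "eee", "ooo", "ddd"]

# Built once: maps each (col1, col2) row to how many times it occurs.
row_counts = Counter(zip(data["col1"], data["col2"]))


def count_matches(row_counts, list_a, list_b):
    """Hypothetical helper: total number of rows matching any pair, either order."""
    # Assumption: every a is paired with every b; the set removes repeated pairs.
    pairs = set(product(list_a, list_b)) | set(product(list_b, list_a))
    # A Counter returns 0 for missing keys, so no membership test is needed.
    return sum(row_counts[pair] for pair in pairs)


total = count_matches(row_counts, list_a, list_b)
print(total)

# Duplicate rows are counted as many times as they occur.
doubled = Counter(zip(data["col1"] * 2, data["col2"] * 2))
print(count_matches(doubled, list_a, list_b))
```

Each query then costs roughly the number of pairs rather than the number of rows, which matters when the lists are rebuilt for every one of the 7,000 rows mentioned in the question.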
Answered by jsmart on January 17, 2021
First join the columns once, before any looping; then, instead of looping, pass a single regex that contains all possible strings to str.contains.
joined = df.col1+ '-' + df.col2
pat = '|'.join([f'({a}-{b})' for a in list_a for b in list_b] +
[f'({b}-{a})' for a in list_a for b in list_b]) # substitute for itertools.product
ct = joined.str.contains(pat).sum()
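Two caveats with a pattern built this way: the values are interpreted as regex, so a character such as "." changes the meaning, and str.contains also matches substrings of longer values. If exact matches are wanted (an assumption, the question does not say), a hedged variant escapes each alternative and anchors the pattern. build_pattern is a hypothetical helper name, and the tiny series is made-up data for illustration only.

```python
import re
from itertools import product

import pandas as pd


def build_pattern(list_a, list_b):
    """Hypothetical helper: anchored regex matching 'a-b' or 'b-a' exactly."""
    pairs = set(product(list_a, list_b)) | set(product(list_b, list_a))
    # re.escape makes characters like '.' match literally
    alternatives = sorted(re.escape(f"{x}-{y}") for x, y in pairs)
    # non-capturing group, anchored at both ends
    return "^(?:" + "|".join(alternatives) + ")$"


# Made-up values chosen to show the two caveats
joined = pd.Series(["a.c-x", "abc-x", "x-a.c", "a.c-xy"])
pat = build_pattern(["a.c"], ["x"])

mask = joined.str.contains(pat)
print(mask.tolist())
print(int(mask.sum()))
```

Without the escaping, "abc-x" would match because "." matches any character; without the anchors, "a.c-xy" would match as a substring hit.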
To work with dicts instead of dataframes, you can use filter(re, joined), as in this question
import itertools
import re

data = {"col1": ["aaa", "bbb", "ccc", "aaa", "ddd", "bbb"],
        "col2": ["fff", "aaa", "ggg", "eee", "ccc", "ttt"]}
list_a = ["ccc", "aaa", "mmm", "nnn", "ccc"]
list_b = ["ggg", "fff", "eee", "ooo", "ddd"]

### build the regex pattern
# use itertools to generalize to many columns, remove duplicates with set()
pat_set = set('-'.join(combo) for combo in set(
    list(itertools.product(list_a, list_b)) +
    list(itertools.product(list_b, list_a))))
pat = '|'.join(pat_set)

### join the columns row-wise
joined = ['-'.join(row) for row in zip(*[vals for key, vals in data.items()])]

### filter joined
match_list = list(filter(re.compile(pat).match, joined))
ct = len(match_list)
Third option with Series.isin(), inspired by jsmart's answer
joined = df.col1 + '-' + df.col2
ct = joined.isin(pat_set).sum()
Speed testing
I repeated data 100,000 times for scalability testing. Series.isin() is the fastest, while jsmart's answer is fast but does not find all occurrences because it removes duplicates from joined.
with dicts: 400000 matches, 1.00 s
with pandas: 400000 matches, 1.77 s
with series.isin(): 400000 matches, 0.39 s
with jsmart answer: 4 matches, 0.50 s
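The timing script itself is not shown, so here is a rough sketch of how such a comparison could be set up with timeit. It is not the code that produced the numbers above: n_repeats is much smaller than 100,000 to keep it quick, the function names are made up, and timings will differ by machine, so none are quoted here.

```python
import timeit
from itertools import product

import pandas as pd

data = {"col1": ["aaa", "bbb", "ccc", "aaa", "ddd", "bbb"],
        "col2": ["fff", "aaa", "ggg", "eee", "ccc", "ttt"]}
list_a = ["ccc", "aaa", "mmm", "nnn", "ccc"]
list_b = ["ggg", "fff", "eee", "ooo", "ddd"]

n_repeats = 1_000  # assumption: scaled down from the 100,000 used in the answer
df = pd.concat([pd.DataFrame(data)] * n_repeats, ignore_index=True)
joined = df["col1"] + "-" + df["col2"]

# All 'a-b' and 'b-a' strings, duplicates removed
pat_set = {"-".join(pair)
           for pair in set(product(list_a, list_b)) | set(product(list_b, list_a))}
pat = "|".join(pat_set)


def with_isin():
    # exact membership test against a set of strings
    return int(joined.isin(pat_set).sum())


def with_regex():
    # one regex with every alternative, scanned over the column
    return int(joined.str.contains(pat).sum())


for fn in (with_isin, with_regex):
    seconds = timeit.timeit(fn, number=3) / 3
    print(f"{fn.__name__}: {fn()} matches, {seconds:.4f} s per call")
```

Checking that both functions return the same count before comparing their times guards against benchmarking two approaches that answer different questions.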
Answered by RichieV on January 17, 2021
Try this:
from itertools import product
# all combinations of the two lists as tuples
all_list_combinations = list(product(list_a, list_b))
# tuples of the two columns
dftuples = [x for x in df.itertuples(index=False, name=None)]
# take the length of the intersection of the two sets and print it
print(len(set(dftuples).intersection(set(all_list_combinations))))
yields
3
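Note that product(list_a, list_b) only yields pairs in (a, b) order, and a set intersection counts each distinct row once. If both orders and repeated rows matter (the question suggests they do, but that is an assumption), the tuple approach can be extended as in this sketch. It selects col1 and col2 explicitly in case a col3 has already been added to df.

```python
from itertools import product

import pandas as pd

data = {"col1": ["aaa", "bbb", "ccc", "aaa", "ddd", "bbb"],
        "col2": ["fff", "aaa", "ggg", "eee", "ccc", "ttt"]}
df = pd.DataFrame(data, columns=["col1", "col2"])
list_a = ["ccc", "aaa", "mmm", "nnn", "ccc"]
list_b = ["ggg", "fff", "eee", "ooo", "ddd"]

# Rows as plain (col1, col2) tuples
rows = list(df[["col1", "col2"]].itertuples(index=False, name=None))

# Pairs in (a, b) order only, and in both orders
one_way = set(product(list_a, list_b))
both_ways = one_way | set(product(list_b, list_a))

one_way_hits = set(rows) & one_way
both_way_hits = set(rows) & both_ways

# Counting row by row keeps duplicates, unlike the set intersection
total_rows = sum(row in both_ways for row in rows)

print(len(one_way_hits), sorted(one_way_hits))
print(len(both_way_hits), sorted(both_way_hits))
print(total_rows)
```

On the sample data the one-way version also reports 3, but it finds ('aaa', 'eee') and misses ('ddd', 'ccc'), so the matching count is partly a coincidence.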
Answered by tomanizer on January 17, 2021