TransWikia.com

Multiple filtering pandas columns based on values in another column

Data Science Asked by Remus Raphael on December 21, 2020

I have a pandas dataframe df1:

df1

Now, I want to filter the rows in df1 based on unique combinations of (Campaign, Merchant) from another dataframe, df2, which look like this:

enter image description here

What I tried is using .isin, with a code similar to the one below:

df1.loc[df1['Campaign'].isin(df2['Campaign']) &
        df1['Merchant'].isin(df2['Merchant'])]

The problem here is that the conditions are independent eg : I want to check if (A,1) from df2 is in df1, but with the above condition, since I am checking all the list, not row by row, it would return all rows in df1 where Campaign column is A OR Merchant column is 1.

Do you have any suggestion for this multiple pandas filtering?

3 Answers

import pandas as pd

df1 = pd.DataFrame({"Random numbers 1": pd.np.random.randn(6),
                    "Campaign": ["A"] * 5 + ["B"],
                    "Merchant": [1, 1, 1, 2, 3, 1]})

df2 = pd.DataFrame({"Random numbers 2": pd.np.random.randn(6),
                    "Campaign": ["A"] * 2 + ["B"] * 2 + ["C"] * 2,
                    "Merchant": [1, 2, 1, 2, 1, 2]})

columns_consider = ["Campaign", "Merchant"]
combined = pd.concat((df1[columns_consider].drop_duplicates(),
                      df2[columns_consider].drop_duplicates()), ignore_index=True)

identical = combined[combined.duplicated()]

print(identical)

Output:

  Campaign  Merchant
4        A         1
5        A         2
6        B         1

Answered by tuomastik on December 21, 2020

Bit late but my preferred solution to this is

# verbetim from @tuomastik

import pandas as pd

df1 = pd.DataFrame({"Random numbers 1": pd.np.random.randn(6),
                "Campaign": ["A"] * 5 + ["B"],
                "Merchant": [1, 1, 1, 2, 3, 1]})

df2 = pd.DataFrame({"Random numbers 2": pd.np.random.randn(6),
                "Campaign": ["A"] * 2 + ["B"] * 2 + ["C"] * 2,
                "Merchant": [1, 2, 1, 2, 1, 2]})

# modification

def pair_columns(df, col1, col2):
   return df[col1] + df[col2]

def paired_mask(df1, df2, col1, col2):
   return pair_columns(df1, col1, col2).isin(pair_columns(df2, col1, col2))

identical = df1.loc[paired_mask(df1, df2, "Campaign", "Merchant")]

Answered by user91338 on December 21, 2020

The way I always go about it is by creating a lookup column:

df1['lookup'] = df1['Campaign'] + "_" + df1['Merchant'].astype(str)
df2['lookup'] = df2['Campaign'] + "_" + df2['Merchant'].astype(str)

Then use loc to filter and drop the lookup columns:

df1.loc[df1['lookup'].isin(df2['lookup'])]
df1.drop(columns='lookup', inplace=True)

I'm still looking for a better solution.

Answered by Jason on December 21, 2020

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP