Finding the duplicate values between all columns and sort in new column with Pandas?

Question

I have this DataFrame:
    CL1 CL2 CL3 CL4
0   a   a   b   f
1   b   y   c   d
2   c   x   d   s
3   x   s   x   a
4   s   dx  s   s
5   a   c   d   d
6   s   dx  f   d
7   d   dc  g   g
8   f   x   s   t
9   c   x   a   d
10  x   y   y   a
11  c   a   x   y
12  f   s   d   s
13  d   d   w   a

Intention:

With help Pandas

1- I wanna search and find the similar values between all columns (CL1-CL4) and sort in a new column (SIM).
2- I wanna find the non-similar values between columns and sort in another column (NON-SIM).
What i want

How can i do that? With df.pivot_table i was not successful.

Oxbowerce · Accepted Answer

Given your input data is saved in a variable df, I count the values which occur in all 4 unique columns as follows:
import pandas as pd
import numpy as np

output = (
    df
    .melt()
    .drop_duplicates()
    .groupby("value")
    .agg(count=("value", "count"))
    .reset_index()
)
output["SIM"] = np.where(output["count"] == 4, "SIM", "NON-SIM")
output = output.pivot(columns="similarity", values="value")

print(output)

similarity  NON-SIM SIM
0           NaN     a
1           b       NaN
2           c       NaN
3           NaN     d
4           dc      NaN
5           dx      NaN
6           f       NaN
7           g       NaN
8           NaN     s
9           t       NaN
10          w       NaN
11          x       NaN
12          y       NaN

GeoMatt22 · Answer

This may be simpler using sets. Assuming your dataframe is df, first get each column's unique values as a set:
import pandas as pd
from functools import reduce
# df = pd.read_clipboard()
cols = df.agg(set)
print(cols)

This gives a pandas Series of python set objects:
CL1             {c, s, d, a, x, f, b}
CL2        {c, dc, y, s, d, a, x, dx}
CL3    {c, s, y, d, a, x, g, f, w, b}
CL4             {s, y, d, a, g, t, f}
dtype: object

Then you can combine these using set operations to get your desired results:
sim = reduce(set.intersection, cols)
non_sim = reduce(set.union, cols) - sim
print(sim, non_sim, sep='n')

This gives the results as two set objects:
{'d', 'a', 's'}
{'c', 'dc', 'y', 'x', 'g', 't', 'f', 'dx', 'w', 'b'}

Finding the duplicate values between all columns and sort in new column with Pandas?

What i want

2 Answers

Add your own answers!

Ask a Question