TransWikia.com

Finding the duplicate values between all columns and sort in new column with Pandas?

Data Science Asked by Jsmoka on June 21, 2021

I have this DataFrame:

    CL1 CL2 CL3 CL4
0   a   a   b   f
1   b   y   c   d
2   c   x   d   s
3   x   s   x   a
4   s   dx  s   s
5   a   c   d   d
6   s   dx  f   d
7   d   dc  g   g
8   f   x   s   t
9   c   x   a   d
10  x   y   y   a
11  c   a   x   y
12  f   s   d   s
13  d   d   w   a

Intention:

  • With help Pandas

1- I wanna search and find the similar values between all columns (CL1-CL4) and sort in a new column (SIM).

2- I wanna find the non-similar values between columns and sort in another column (NON-SIM).

What i want

enter image description here
How can i do that? With df.pivot_table i was not successful.

2 Answers

Given your input data is saved in a variable df, I count the values which occur in all 4 unique columns as follows:

import pandas as pd
import numpy as np

output = (
    df
    .melt()
    .drop_duplicates()
    .groupby("value")
    .agg(count=("value", "count"))
    .reset_index()
)
output["SIM"] = np.where(output["count"] == 4, "SIM", "NON-SIM")
output = output.pivot(columns="similarity", values="value")

print(output)

similarity  NON-SIM SIM
0           NaN     a
1           b       NaN
2           c       NaN
3           NaN     d
4           dc      NaN
5           dx      NaN
6           f       NaN
7           g       NaN
8           NaN     s
9           t       NaN
10          w       NaN
11          x       NaN
12          y       NaN

Correct answer by Oxbowerce on June 21, 2021

This may be simpler using sets. Assuming your dataframe is df, first get each column's unique values as a set:

import pandas as pd
from functools import reduce
# df = pd.read_clipboard()
cols = df.agg(set)
print(cols)

This gives a pandas Series of python set objects:

CL1             {c, s, d, a, x, f, b}
CL2        {c, dc, y, s, d, a, x, dx}
CL3    {c, s, y, d, a, x, g, f, w, b}
CL4             {s, y, d, a, g, t, f}
dtype: object

Then you can combine these using set operations to get your desired results:

sim = reduce(set.intersection, cols)
non_sim = reduce(set.union, cols) - sim
print(sim, non_sim, sep='n')

This gives the results as two set objects:

{'d', 'a', 's'}
{'c', 'dc', 'y', 'x', 'g', 't', 'f', 'dx', 'w', 'b'}

Answered by GeoMatt22 on June 21, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP