Get combinations based on a specific column of dataframe in python

Question

I have a dataframe with 3 columns: equivalences, class, ch. I am using Python.
equivalences                             class                                              ch

ETICA CONTABIL                           A ÉTICA CONTÁBIL                                   40.0
ETICA CONTABIL                           A ÉTICA CONTÁBIL COM ENFOQUE                       40.0
BANCO DE DADOS                           GERENCIANDO SEU BD                                 40.0
AMBIENTE WEB                             APLICAÇÕES EM NUVENS                               40.0
AMBIENTE WEB                             ALTA DISPONIBILIDADE                               40.0
TECNOLOGIAS WEB                          PÁGINAS PARA INTERNET                              40.0
TECNOLOGIAS WEB                          PROGRAMAÇÃO WEB AVANÇADA                           40.0
TECNOLOGIAS WEB                          DESENVOLVENDO COM JS                               40.0
None                                     PROGRAMAÇÃO WEB                                    40.0

I need the pairwise combinations of classes within each equivalences group, with the ch of each pair summed. It should be something like this:
equivalences      class a                   class b                                  ch

ETICA CONTABIL    A ÉTICA CONTÁBIL          A ÉTICA CONTÁBIL COM ENFOQUE            80.0
BANCO DE DADOS    GERENCIANDO SEU BD        (null)                                  40.0
AMBIENTE WEB      APLICAÇÕES EM NUVENS      ALTA DISPONIBILIDADE                    80.0
TECNOLOGIAS WEB   PÁGINAS PARA INTERNET     PROGRAMAÇÃO WEB AVANÇADA                80.0
TECNOLOGIAS WEB   PÁGINAS PARA INTERNET     DESENVOLVENDO COM JS                    80.0
TECNOLOGIAS WEB   PROGRAMAÇÃO WEB AVANÇADA  DESENVOLVENDO COM JS                    80.0
(null)            PROGRAMAÇÃO WEB           (null)                                  40.0

I think I would have to use itertools.combinations, but I have no clue how to group by equivalences to get the distinct pairs.
How can I do that?

nimbous · Answer

Let's assume df is your dataframe. First, use itertools to build the pair combinations in a separate dataframe called pairs:
import itertools

import pandas as pd

# unique classes per equivalences group
pairs = df.groupby('equivalences')['class'].unique().to_frame()
# every 2-combination for groups with more than one class; otherwise keep the single class as-is
func = lambda x: list(itertools.combinations(x, 2)) if len(x) > 1 else x
pairs['combinations'] = pairs['class'].map(func)
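
To see what the lambda produces, here is a small standalone sketch. The helper name pair_up and the sample lists are made up for illustration; it only mirrors the logic of func above, where a group with two or more classes becomes a list of 2-tuples and a single-class group is passed through unchanged.

```python
import itertools


def pair_up(classes):
    """Return all unordered pairs, or the input as-is when it has fewer than two items."""
    classes = list(classes)
    if len(classes) > 1:
        return list(itertools.combinations(classes, 2))
    return classes


# A group with three classes yields three pairs.
three = pair_up(["PÁGINAS PARA INTERNET", "PROGRAMAÇÃO WEB AVANÇADA", "DESENVOLVENDO COM JS"])
# A group with a single class is returned unchanged (no tuples).
one = pair_up(["GERENCIANDO SEU BD"])

print(three)
print(one)
```

This is why the loop that follows has to check isinstance(j, tuple): single-class groups hold plain strings rather than pairs.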

Then loop over each equivalences group and its combinations to build one output row per pair (or per single class):
records = []
for i in pairs.index:
    for j in pairs.loc[i, 'combinations']:
        if isinstance(j, tuple):
            records.append(
                {
                    'equivalences': i,
                    'class a': j[0],
                    'class b': j[1],
                    'ch': df.loc[(df['equivalences'] == i) & (df['class'].isin(j)), 'ch'].sum()
                }
            )
        else:
            records.append(
                {
                    'equivalences': i,
                    'class a': j,
                    'class b': 'null',
                    'ch': df.loc[(df['equivalences'] == i) & (df['class'] == j), 'ch'].sum()
                }
            )

pd.DataFrame(records)

Output:
   equivalences     class a                   class b                       ch
0  AMBIENTE WEB     APLICAÇÕES EM NUVENS      ALTA DISPONIBILIDADE          80
1  BANCO DE DADOS   GERENCIANDO SEU BD        null                          40
2  ETICA CONTABIL   A ÉTICA CONTÁBIL          A ÉTICA CONTÁBIL COM ENFOQUE  80
3  TECNOLOGIAS WEB  PÁGINAS PARA INTERNET     PROGRAMAÇÃO WEB AVANÇADA      80
4  TECNOLOGIAS WEB  PÁGINAS PARA INTERNET     DESENVOLVENDO COM JS          80
5  TECNOLOGIAS WEB  PROGRAMAÇÃO WEB AVANÇADA  DESENVOLVENDO COM JS          80
6  null             PROGRAMAÇÃO WEB           null                          40

On another note, remember to convert your null values to a string (or any value other than None) before applying groupby, because pandas groupby drops rows whose key is None by default. You can convert the placeholder strings back to real None when you are done. In pandas 1.1 and later, passing dropna=False to groupby keeps those rows as their own group instead.
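
The point about missing keys can be checked with a small sketch. The sample frame is a cut-down version of the question's data, and the PLACEHOLDER name is an assumption made for illustration. It compares the default groupby, a placeholder filled in before grouping, and dropna=False (available from pandas 1.1).

```python
import pandas as pd

df = pd.DataFrame({
    "equivalences": ["BANCO DE DADOS", None, "AMBIENTE WEB", "AMBIENTE WEB"],
    "class": ["GERENCIANDO SEU BD", "PROGRAMAÇÃO WEB",
              "APLICAÇÕES EM NUVENS", "ALTA DISPONIBILIDADE"],
    "ch": [40.0, 40.0, 40.0, 40.0],
})

# Default behaviour: the row whose key is None is left out of the groups.
default_sizes = df.groupby("equivalences").size()

# Option 1: substitute a placeholder string before grouping.
PLACEHOLDER = "null"  # hypothetical value; choose one that cannot clash with real data
filled = df.assign(equivalences=df["equivalences"].fillna(PLACEHOLDER))
filled_sizes = filled.groupby("equivalences").size()

# Option 2 (pandas 1.1 or later): keep missing keys as their own group.
kept_sizes = df.groupby("equivalences", dropna=False).size()

print(len(default_sizes), len(filled_sizes), len(kept_sizes))
```

With the placeholder approach, the string can be mapped back to None after the pairs are built; with dropna=False no conversion is needed, but the group key shows up as a missing value.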

Roy2012 · Answer

Here's a solution (in a few steps for clarity):
import pandas as pd

# create a cross product of classes per "equivalences"
t = pd.merge(df.assign(dummy=1), df.assign(dummy=1),
             on=["dummy", "equivalences"])

# drop items in which the left and the right class are identical
t = t[t.class_x != t.class_y]

# put each pair in a canonical order so that x,y and y,x become identical, then drop the duplicates
swap = t.class_x > t.class_y
t.loc[swap, ["class_x", "class_y"]] = t.loc[swap, ["class_x", "class_y"]].rename(
    columns={"class_x": "class_y", "class_y": "class_x"})
t = t.drop_duplicates(subset = ["equivalences", "class_x", "class_y"])

t["ch"] = t.ch_x + t.ch_y
res = t.drop(["ch_x", "dummy", "ch_y"], axis=1)
print(res)

==>

       equivalences                   class_x                       class_y    ch
1    ETICA CONTABIL          A ÉTICA CONTÁBIL  A ÉTICA CONTÁBIL COM ENFOQUE  80.0
6      AMBIENTE WEB      ALTA DISPONIBILIDADE          APLICAÇÕES EM NUVENS  80.0
10  TECNOLOGIAS WEB  PROGRAMAÇÃO WEB AVANÇADA         PÁGINAS PARA INTERNET  80.0
11  TECNOLOGIAS WEB      DESENVOLVENDO COM JS         PÁGINAS PARA INTERNET  80.0
14  TECNOLOGIAS WEB      DESENVOLVENDO COM JS      PROGRAMAÇÃO WEB AVANÇADA  80.0
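
As a variant of the same cross-product idea, the sketch below keeps only rows where class_x sorts before class_y, which removes the self-pairs and the mirrored pairs in one filter. This is not the answer's exact code; it assumes class names are unique within each equivalences group, and the sample frame is a shortened version of the question's data.

```python
import pandas as pd

df = pd.DataFrame({
    "equivalences": ["AMBIENTE WEB", "AMBIENTE WEB",
                     "TECNOLOGIAS WEB", "TECNOLOGIAS WEB", "TECNOLOGIAS WEB",
                     "BANCO DE DADOS"],
    "class": ["APLICAÇÕES EM NUVENS", "ALTA DISPONIBILIDADE",
              "PÁGINAS PARA INTERNET", "PROGRAMAÇÃO WEB AVANÇADA", "DESENVOLVENDO COM JS",
              "GERENCIANDO SEU BD"],
    "ch": [40.0] * 6,
})

# Self-merge on the group key: every class paired with every class in the same group.
t = df.merge(df, on="equivalences", suffixes=("_x", "_y"))

# Keep each unordered pair once; this also discards rows where both sides are the same class.
t = t[t["class_x"] < t["class_y"]].copy()

# Sum the hours of the two classes and drop the per-side columns.
t["ch"] = t["ch_x"] + t["ch_y"]
res = t.drop(columns=["ch_x", "ch_y"]).reset_index(drop=True)
print(res)
```

As with the output above, groups that contain a single class (here BANCO DE DADOS) produce no rows, so they would need to be appended separately if they are required in the result.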
