Create a pivot table using two columns treating them as a group

Question

I have seen plenty of answers that replicate R’s dcast functionality using a single column as the index, but I am having a hard time replicating a dcast that groups by a combination of columns and then pivots. When I try pivot_table or crosstab, I end up with either dropped columns or mixed-up names.

I have a DataFrame that looks like:

import pandas as pd
import numpy as np

df = pd.DataFrame({'class': ['History', 'History', 'English','Math','Gym', 'Gym'], 
                   'first': ['John','Mary','John', 'Charles', 'John', 'Charles'], 
                   'last': ['Smith', 'Jones', 'Smith', 'Right', 'Smith', 'Right'], 
                   'grade': ['1','2','1','3', np.nan, np.nan] })


     class    first   last grade
0  History     John  Smith     1
1  History     Mary  Jones     2
2  English     John  Smith     1
3     Math  Charles  Right     3
4      Gym     John  Smith   NaN
5      Gym  Charles  Right   NaN

When I try to create a pivot table grouped by first and last, it creates a table but does not keep the names together. Instead, it creates a row for every combination of first and last:

df2 = df.pivot_table(index=['first', 'last'], 
                     columns=['class'], 
                     aggfunc={'grade': max}, 
                     dropna=False).fillna(0)


                grade
class         English Gym History Math
first   last
Charles Jones       0   0       0    0
        Right       0   0       0    3
        Smith       0   0       0    0
John    Jones       0   0       0    0
        Right       0   0       0    0
        Smith       1   0       1    0
Mary    Jones       0   0       2    0
        Right       0   0       0    0
        Smith       0   0       0    0

I am trying to replicate the behavior of R’s dcast:

df2 <- dcast.data.table(df,first + last  ~ class, value.var ='grade')

   first   last     English  Gym       History Math
1: Charles Right    <NA>     <NA>      <NA>    3
2:    John Smith    1        <NA>      1       <NA>
3:    Mary Jones    <NA>     <NA>      2       <NA>

I realize that setting dropna=True removes the extra rows, but it also removes the columns that contain only NaN, and I do not want that. I need to preserve those columns.

crosstab dataframe pandas python r

M-- · Answer

You can replace NaN with a sentinel value such as -999, use the crosstab function, and then convert -999 back to NaN:
df1 = df.fillna(-999)

df2 = pd.crosstab(columns=df1['class'], index=[df1["first"],df1["last"]], 
                  values = df1['grade'], aggfunc={max})
df2[df2 == -999] = np.nan

max                    
class         English   Gym History  Math
first   last                             
Charles Right    None   NaN    None     3
John    Smith       1   NaN       1  None
Mary    Jones    None  None       2  None
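
If a sentinel value feels fragile, one possible alternative is to keep dropna=False in pivot_table and then reindex the result to the (first, last) pairs that actually occur in the data. This is only a sketch: converting grade with pd.to_numeric and the names observed_pairs and wide are my own assumptions, not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'class': ['History', 'History', 'English', 'Math', 'Gym', 'Gym'],
                   'first': ['John', 'Mary', 'John', 'Charles', 'John', 'Charles'],
                   'last': ['Smith', 'Jones', 'Smith', 'Right', 'Smith', 'Right'],
                   'grade': ['1', '2', '1', '3', np.nan, np.nan]})

# Assumption: grades are meant to be numeric, so convert the strings first.
df['grade'] = pd.to_numeric(df['grade'])

# dropna=False keeps the all-NaN Gym column, but it also produces one row
# for every first/last combination (the behavior shown in the question).
full = df.pivot_table(index=['first', 'last'], columns='class',
                      values='grade', aggfunc='max', dropna=False)

# Keep only the name pairs that really appear in the original data.
observed_pairs = pd.MultiIndex.from_frame(df[['first', 'last']].drop_duplicates())
wide = full.reindex(observed_pairs).sort_index()

print(wide)
```

Because the all-NaN Gym column is never dropped, no placeholder has to be swapped back afterwards.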

Code Different · Answer

How about unstack:
# Parentheses let the method chain span multiple lines.
(df.set_index(['first', 'last', 'class'])
   .unstack()
   .droplevel(0, axis=1)
   .rename_axis(None, axis=1)
   .reset_index())
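
One caveat with the unstack approach: it raises a ValueError if the same (first, last, class) combination appears more than once. A possible workaround, sketched here under the assumption that the highest grade should win (as with the max aggregation in the question), is to aggregate with groupby before unstacking. The extra John Smith History row and the numeric conversion are hypothetical additions for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question plus one hypothetical duplicate:
# a second History grade for John Smith.
df = pd.DataFrame({
    'class': ['History', 'History', 'English', 'Math', 'Gym', 'Gym', 'History'],
    'first': ['John', 'Mary', 'John', 'Charles', 'John', 'Charles', 'John'],
    'last': ['Smith', 'Jones', 'Smith', 'Right', 'Smith', 'Right', 'Smith'],
    'grade': ['1', '2', '1', '3', np.nan, np.nan, '4'],
})

# Assumption: grades are meant to be numeric.
df['grade'] = pd.to_numeric(df['grade'])

wide = (
    df.groupby(['first', 'last', 'class'])['grade']
      .max()                       # collapse duplicates; all-NaN groups stay NaN
      .unstack('class')            # one column per class, Gym included
      .rename_axis(None, axis=1)   # drop the "class" label on the column axis
      .reset_index()               # turn first/last back into ordinary columns
)

print(wide)
```

Only name pairs that occur in the data become rows, and the Gym column is kept even though every value in it is NaN.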
