Stack Overflow Asked by Chris Reiche on December 18, 2021
I have seen plenty of answers that replicate R's dcast using a single column as the index, but I am having a hard time replicating a dcast that groups on a combination of columns and then pivots. Whenever I try pivot_table or crosstab, I end up with either dropped columns or mixed-up names.
I have a DataFrame that looks like:
import pandas as pd
import numpy as np
df = pd.DataFrame({'class': ['History', 'History', 'English','Math','Gym', 'Gym'],
'first': ['John','Mary','John', 'Charles', 'John', 'Charles'],
'last': ['Smith', 'Jones', 'Smith', 'Right', 'Smith', 'Right'],
'grade': ['1','2','1','3', np.nan, np.nan] })
     class    first   last grade
0  History     John  Smith     1
1  History     Mary  Jones     2
2  English     John  Smith     1
3     Math  Charles  Right     3
4      Gym     John  Smith   NaN
5      Gym  Charles  Right   NaN
When I try to create a pivot table grouped on first and last, it creates a table but does not keep the names together. Instead it creates a row for every combination of first and last:
df2 = df.pivot_table(index=['first', 'last'],
columns=['class'],
aggfunc={'grade': max},
dropna=False).fillna(0)
                grade
class         English Gym History Math
first   last
Charles Jones       0   0       0    0
        Right       0   0       0    3
        Smith       0   0       0    0
John    Jones       0   0       0    0
        Right       0   0       0    0
        Smith       1   0       1    0
Mary    Jones       0   0       2    0
        Right       0   0       0    0
        Smith       0   0       0    0
I am trying to replicate the behavior of R’s dcast:
df2 <- dcast.data.table(df,first + last ~ class, value.var ='grade')
     first  last English  Gym History Math
1: Charles Right    <NA> <NA>    <NA>    3
2:    John Smith       1 <NA>       1 <NA>
3:    Mary Jones    <NA> <NA>       2 <NA>
I realize that setting dropna=True removes the extra rows, but it also removes the columns whose entries are all NaN, and I do not want that. I need to preserve those columns.
You can replace the NaN with a specified number like -999, use the crosstab function, and later replace -999 with NaN. See below:
df1 = df.fillna(-999)
df2 = pd.crosstab(columns=df1['class'], index=[df1["first"],df1["last"]],
values = df1['grade'], aggfunc={max})
df2[df2 == -999] = np.nan
                  max
class         English   Gym History  Math
first   last
Charles Right    None   NaN    None     3
John    Smith       1   NaN       1  None
Mary    Jones    None  None       2  None
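A self-contained sketch of the same sentinel idea, under two assumptions that are not in the snippet above: the grades are converted to numbers with pd.to_numeric first (so -999 is not mixed in with the strings '1', '2', '3'), and -999 never occurs as a real grade. The names SENTINEL, filled and wide are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'class': ['History', 'History', 'English', 'Math', 'Gym', 'Gym'],
    'first': ['John', 'Mary', 'John', 'Charles', 'John', 'Charles'],
    'last': ['Smith', 'Jones', 'Smith', 'Right', 'Smith', 'Right'],
    'grade': ['1', '2', '1', '3', np.nan, np.nan],
})

SENTINEL = -999  # assumption: this value is never a real grade

# Assumption: grades are numeric strings, so convert them before filling.
filled = df.assign(grade=pd.to_numeric(df['grade']).fillna(SENTINEL))

# Only the (first, last) pairs that actually occur become rows.
wide = pd.crosstab(
    index=[filled['first'], filled['last']],
    columns=filled['class'],
    values=filled['grade'],
    aggfunc='max',
)

# Swap the sentinel back to NaN; the Gym column is kept even though
# every entry in it is now missing.
wide = wide.replace(SENTINEL, np.nan)
print(wide)
```

Because the grades are numeric here, every missing cell comes out as NaN rather than a mix of None and NaN.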
Answered by M-- on December 18, 2021
How about unstack:
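# Wrap the chain in parentheses so it can span several lines.
(df.set_index(['first', 'last', 'class'])
   .unstack()                    # spread 'class' into columns
   .droplevel(0, axis=1)         # drop the outer 'grade' level
   .rename_axis(None, axis=1)    # remove the leftover 'class' label
   .reset_index())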
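A runnable sketch of the unstack idea on the question's data, wrapped in a small helper. The helper name dcast_like and its parameters are hypothetical, not part of pandas. It assumes each (first, last, class) combination appears at most once; with duplicates, unstack raises a ValueError and the values would need to be aggregated first.

```python
import numpy as np
import pandas as pd


def dcast_like(frame, index_cols, column_col, value_col):
    """Hypothetical helper: spread value_col over the levels of column_col."""
    return (
        frame.set_index(index_cols + [column_col])[value_col]
        .unstack(column_col)        # one column per class, all-NaN ones included
        .rename_axis(None, axis=1)  # remove the leftover 'class' label
        .reset_index()              # turn first/last back into ordinary columns
    )


df = pd.DataFrame({
    'class': ['History', 'History', 'English', 'Math', 'Gym', 'Gym'],
    'first': ['John', 'Mary', 'John', 'Charles', 'John', 'Charles'],
    'last': ['Smith', 'Jones', 'Smith', 'Right', 'Smith', 'Right'],
    'grade': ['1', '2', '1', '3', np.nan, np.nan],
})

result = dcast_like(df, ['first', 'last'], 'class', 'grade')
print(result)
```

Only the name pairs that actually occur become rows, and the Gym column is kept even though all of its entries are NaN, which is the shape of the dcast output in the question.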
Answered by Code Different on December 18, 2021