Stack Overflow Asked by np8 on December 8, 2020
Why the sorting using pd.Series.sort_index does not seem to work when using categorical index? How would one sort the index of the multi-index pd.Series using other sort order than alphabetical/numerical?
import pandas as pd
import numpy as np
d = {
'Card': [
'Visa', 'Visa', 'Master Card', 'Master Card', 'Visa', 'Master Card',
'Visa', 'Visa', 'Master Card', 'Visa', 'Master Card', 'Visa', 'Visa',
'Master Card', 'Master Card', 'Visa', 'Master Card', 'Visa', 'Visa',
'Master Card', 'Visa', 'Master Card', 'Master Card', 'Master Card',
'Master Card', 'Master Card', 'Master Card', 'Visa', 'Visa'
],
'Year': [
'Three', 'Three', 'Seven', 'Three', 'Three', 'Seven', 'Seven', 'Seven',
'Three', 'Seven', 'Three', 'Three', 'Three', 'Seven', 'Three', 'Three',
'Seven', 'Seven', 'Seven', 'Three', 'Seven', 'Three', 'Five', 'One',
'One', 'Two', 'Four', 'Six', 'Six'
],
'Value': [
45, 13, 52, 321, 31, 1231, 876, 231, 4, 213, 123, 45, 321, 1, 123, 52,
736, 35, 900, 301, 374, 9, 294, 337, 4465, 321, 755, 22, 8
]
}
df = pd.DataFrame(d)
grp_cols = ['Card', 'Year']
ser_val = df.groupby(grp_cols)['Value'].mean()
Using naively just sort_index
, the data looks like this:
In [2]: ser_val.sort_index()
Out[2]:
Card Year
Master Card Five 294.000000
Four 755.000000
One 2401.000000
Seven 505.000000
Three 146.833333
Two 321.000000
Visa Seven 438.166667
Six 15.000000
Three 84.500000
Name: Value, dtype: float64
you can see that the columns are sorted alphabetically. Now, I want to force the ordering. For that, I try:
categories_order = ['One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven']
categories = pd.Categorical(ser_val.index.levels[1].values,
categories=categories_order,
ordered=True)
ser_val.index.set_levels(categories, level='Year', inplace=True)
again, after sorting, the data looks like this (again, alphabetical order)
In [3]: ser_val.sort_index()
Out[3]:
Card Year
Master Card Five 294.000000
Four 755.000000
One 2401.000000
Seven 505.000000
Three 146.833333
Two 321.000000
Visa Seven 438.166667
Six 15.000000
Three 84.500000
Name: Value, dtype: float64
I know that if I convert the data into pandas.DataFrame and sort there, it works, like this:
df_val = ser_val.reset_index().sort_values(grp_cols)
df_val['Year'] = pd.Categorical(df_val['Year'].values,
categories_order,
ordered=True)
df_val = df_val.sort_values(grp_cols).set_index(grp_cols)
In [5]: df_val
Out[5]:
Value
Card Year
Master Card One 2401.000000
Two 321.000000
Three 146.833333
Four 755.000000
Five 294.000000
Seven 505.000000
Visa Three 84.500000
Six 15.000000
Seven 438.166667
Why won’t the pd.Series sort with categorical data?
I am using pandas 1.0.5 in Python 3.7.3 64-bit
import pandas as pd
import numpy as np
d = {
'Card': [
'Visa', 'Visa', 'Master Card', 'Master Card', 'Visa', 'Master Card',
'Visa', 'Visa', 'Master Card', 'Visa', 'Master Card', 'Visa', 'Visa',
'Master Card', 'Master Card', 'Visa', 'Master Card', 'Visa', 'Visa',
'Master Card', 'Visa', 'Master Card', 'Master Card', 'Master Card',
'Master Card', 'Master Card', 'Master Card', 'Visa', 'Visa'
],
'Year': [
'Three', 'Three', 'Seven', 'Three', 'Three', 'Seven', 'Seven', 'Seven',
'Three', 'Seven', 'Three', 'Three', 'Three', 'Seven', 'Three', 'Three',
'Seven', 'Seven', 'Seven', 'Three', 'Seven', 'Three', 'Five', 'One',
'One', 'Two', 'Four', 'Six', 'Six'
],
'Value': [
45, 13, 52, 321, 31, 1231, 876, 231, 4, 213, 123, 45, 321, 1, 123, 52,
736, 35, 900, 301, 374, 9, 294, 337, 4465, 321, 755, 22, 8
]
}
df = pd.DataFrame(d)
grp_cols = ['Card', 'Year']
ser_val = df.groupby(grp_cols)['Value'].mean()
#---------------------------------------------
#Attempt 1st
categories_order = ['One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven']
categories = pd.Categorical(ser_val.index.levels[1].values,
categories=categories_order,
ordered=True)
ser_val.index.set_levels(categories, level=1, inplace=True)
print(ser_val.index.levels[1].dtype)
#--------------------------------------------
#Attempt 2nd
categories_order = ['One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven']
categories = pd.Categorical(ser_val.index.levels[1].values,
categories=categories_order,
ordered=True)
ser_val.index = ser_val.index.set_levels(categories, level='Year')
print(ser_val.index.levels[1].dtype)
#----------------------------------
#Attempt 3rd and success
categories_order = ['One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven']
categoriesDtype = pd.CategoricalDtype(categories_order, ordered=True)
ser_val.index = ser_val.index.set_levels(ser_val.index.levels[1].astype(categoriesDtype), level='Year')
print(ser_val.index.levels[1].dtype)
Output:
object *FAILED change type using inplace*
object *FAILED change type using reassignment*
category *SUCCESS change type using pd.CategoricalDtype*
After a successful change of dype for index level 1:
ser_val.index.levels[1]
Output:
CategoricalIndex(['Five', 'Four', 'One', 'Seven', 'Six', 'Three', 'Two'], categories=['One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven'], ordered=True, name='Year', dtype='category')
Now, let's sort the dataframe using sort_index:
ser_val.sort_index()
Output (FAIL):
Card Year
Master Card Five 294.000000
Four 755.000000
One 2401.000000
Seven 505.000000
Three 146.833333
Two 321.000000
Visa Seven 438.166667
Six 15.000000
Three 84.500000
Name: Value, dtype: float64
Now, just for kicks and testing, let's swap index levels and try sort_index again.
ser_val.swaplevel(0,1).sort_index()
Output(SUCCESS):
Year Card
One Master Card 2401.000000
Two Master Card 321.000000
Three Master Card 146.833333
Visa 84.500000
Four Master Card 755.000000
Five Master Card 294.000000
Six Visa 15.000000
Seven Master Card 505.000000
Visa 438.166667
Name: Value, dtype: float64
However, if we explicitly set the sorting levels... FAIL again.
ser_val.swaplevel(0,1).sort_index(level=[0,1])
Output:
Year Card
Five Master Card 294.000000
Four Master Card 755.000000
One Master Card 2401.000000
Seven Master Card 505.000000
Visa 438.166667
Six Visa 15.000000
Three Master Card 146.833333
Visa 84.500000
Two Master Card 321.000000
Name: Value, dtype: float64
Answered by Scott Boston on December 8, 2020
TLDR: You need to set sort=False
in your groupby
, and you need to change your Categorical
to a CategoricalIndex
. Here's the full working example:
df = pd.DataFrame(d)
grp_cols = ['Card', 'Year']
ser_val = df.groupby(grp_cols, sort=False)['Value'].mean()
categories_order = ['One', 'Two', 'Three', 'Four', 'Five', 'Six', 'Seven']
categories = pd.CategoricalIndex(ser_val.index.levels[1].values,
categories=categories_order,
ordered=True)
ser_val.index.set_levels(categories, level='Year', inplace=True)
ser_val.sort_index(inplace=True)
And ser_val
is now:
Card Year
Master Card One 2401.000000
Two 321.000000
Three 146.833333
Four 755.000000
Five 294.000000
Seven 505.000000
Visa Three 84.500000
Six 15.000000
Seven 438.166667
Name: Value, dtype: float64
Longer Ramblings: You ask why what you did doesn't work, which I certainly cannot explain (I'm sure that involves some rooting around in the source), but here is how I arrived at my solution.
See that the following example works, constructing a toy MultiIndex Series from scratch:
lets = ['a','b','c']*3
ids = ['MALE']*4 + ['FEMALE']*5
s = pd.Series(range(9), index=[ids,lets])
categories_order = ['b','a','c']
categories = pd.CategoricalIndex(s.index.levels[1].values,
categories=categories_order,
ordered=True)
s.index.set_levels(categories, level=1,inplace=True)
s.sort_index(inplace=True)
s
is sorted as we want:
FEMALE b 4
b 7
a 6
c 5
c 8
MALE b 1
a 0
a 3
c 2
dtype: int64
The only significant difference between your example and mine (that I could tell) is that yours starts from groupby
. There is a sort
parameter of groupby
:
sort : bool, default True
Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.
So it seems like the groupby
sort is enforcing some order that is not being overridden by your new categorical order.
But still with sort=False
, your code as is didn't work. Just through googling I found that there were separate classes for Categorical
and CategoricalIndex
, and apparently the latter is what you need here. And sure enough my example will also fail if Categorical
is used instead of CategoricalIndex
.
So the groupby
thing seems like the more curious issue; again the underlying rules here I can't tell you, but maybe someone can elaborate.
Answered by Tom on December 8, 2020
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP