TransWikia.com

How to sum values of one column based on other columns in pandas?

Stack Overflow Asked by sak on December 3, 2021

Working with a dataframe that looks like this (text version below):
enter image description here

I am supposed to calculate which country has scored the most goals since 2010 in tournaments. So far I have managed to manipulate the dataframe by filtering out friendlies like this:

no_friendlies = df[df.tournament != "Friendly"]

Then I set the date column to be the index in order to filter out all matches before 2010:

no_friendlies_indexed = no_friendlies.set_index('date')
since_2010 = no_friendlies_indexed.loc['2010-01-01':]

I am pretty lost from this point onward as I can’t figure out how to sum goals scored by each country both home and away

Any help/advice is appreciated!

EDIT:

Text version of sample data:

date    home_team   away_team   home_score  away_score  tournament  city    country     neutral
0   1872-11-30  Scotland    England     0   0       Friendly    Glasgow     Scotland    False
1   1873-03-08  England     Scotland    4   2       Friendly    London  England     False
2   1874-03-07  Scotland    England     2   1       Friendly    Glasgow     Scotland    False
3   1875-03-06  England     Scotland    2   2       Friendly    London  England     False
4   1876-03-04  Scotland    England     3   0       Friendly    Glasgow     Scotland    False
5   1876-03-25  Scotland    Wales       4   0       Friendly    Glasgow     Scotland    False
6   1877-03-03  England     Scotland    1   3       Friendly    London  England     False
7   1877-03-05  Wales       Scotland    0   2       Friendly    Wrexham     Wales   False
8   1878-03-02  Scotland    England     7   2       Friendly    Glasgow     Scotland    False
9   1878-03-23  Scotland    Wales       9   0       Friendly    Glasgow     Scotland    False
10  1879-01-18  England     Wales       2   1       Friendly    London  England     False

EDIT 2:

I have just tried doing this:

since_2010.groupby(['home_team', 'home_score']).sum()

But it doesn’t return the sum of home goals scored by the home teams (if this worked i would just repeat it for away teams to get total)

2 Answers

Use pd.wide_to_long to reshape. The benefit is it automatically creates a 'home_or_away' indicator, but we will first change the columns so that they are 'score_home' (as opposed to 'home_score').

# Swap column stubs around `'_'`
df.columns = ['_'.join(x[::-1]) for x in df.columns.str.split('_')]

# Your code to filter, would drop everything in your provided example
# df['date'] = pd.to_datetime(df['date'])
# df[df['date'].dt.year.gt(2010) & df['tournament'].ne('Friendly')]

df = pd.wide_to_long(df, i='date', j='home_or_away',
                     stubnames=['team', 'score'], sep='_', suffix='.*')

#                          country  neutral tournament     city      team  score
#date       home_or_away                                                        
#1872-11-30 home          Scotland    False   Friendly  Glasgow  Scotland      0
#1873-03-08 home           England    False   Friendly   London   England      4
#1874-03-07 home          Scotland    False   Friendly  Glasgow  Scotland      2
#...
#1878-03-02 away          Scotland    False   Friendly  Glasgow   England      2
#1878-03-23 away          Scotland    False   Friendly  Glasgow     Wales      0
#1879-01-18 away           England    False   Friendly   London     Wales      1

So now regardless of home or away, you can get the points scored:

df.groupby('team')['score'].sum()
#team
#England     12
#Scotland    34
#Wales        1
#Name: score, dtype: int64

Answered by ALollz on December 3, 2021

.groupby and .sum() for the home team and then do the same for the away team and add the two together:

df_new = df.groupby('home_team')['home_score'].sum() + df.groupby('away_team')['away_score'].sum()

output:

England     12
Scotland    34
Wales        1

More detailed explanation (per comment):

  1. You need to only .groupby one column home_team. In your answer, you were grouping by ['home_team', 'home_score'] Your goal (no pun intended) is to get the .sum() of the home_score -- so you should NOT .groupby() it. As you can see ['home_score'] is after the part where I use .groupby, so that I can get the .sum() of it. That gets you set for the home teams.
  2. Then, you do the same for the away_team.
  3. At that point python / pandas is smart enough that since the results of the home_team and away_team groups have the same values for countries, you can simply add them together...

Answered by David Erickson on December 3, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP