Stack Overflow Asked by sak on December 3, 2021
Working with a dataframe that looks like this (text version below):
I am supposed to calculate which country has scored the most goals since 2010 in tournaments. So far I have managed to manipulate the dataframe by filtering out friendlies like this:
no_friendlies = df[df.tournament != "Friendly"]
Then I set the date column to be the index in order to filter out all matches before 2010:
no_friendlies_indexed = no_friendlies.set_index('date')
since_2010 = no_friendlies_indexed.loc['2010-01-01':]
I am pretty lost from this point onward as I can’t figure out how to sum goals scored by each country both home and away
Any help/advice is appreciated!
EDIT:
Text version of sample data:
date home_team away_team home_score away_score tournament city country neutral
0 1872-11-30 Scotland England 0 0 Friendly Glasgow Scotland False
1 1873-03-08 England Scotland 4 2 Friendly London England False
2 1874-03-07 Scotland England 2 1 Friendly Glasgow Scotland False
3 1875-03-06 England Scotland 2 2 Friendly London England False
4 1876-03-04 Scotland England 3 0 Friendly Glasgow Scotland False
5 1876-03-25 Scotland Wales 4 0 Friendly Glasgow Scotland False
6 1877-03-03 England Scotland 1 3 Friendly London England False
7 1877-03-05 Wales Scotland 0 2 Friendly Wrexham Wales False
8 1878-03-02 Scotland England 7 2 Friendly Glasgow Scotland False
9 1878-03-23 Scotland Wales 9 0 Friendly Glasgow Scotland False
10 1879-01-18 England Wales 2 1 Friendly London England False
EDIT 2:
I have just tried doing this:
since_2010.groupby(['home_team', 'home_score']).sum()
But it doesn’t return the sum of home goals scored by the home teams (if this worked i would just repeat it for away teams to get total)
Use pd.wide_to_long
to reshape. The benefit is it automatically creates a 'home_or_away'
indicator, but we will first change the columns so that they are 'score_home' (as opposed to 'home_score').
# Swap column stubs around `'_'`
df.columns = ['_'.join(x[::-1]) for x in df.columns.str.split('_')]
# Your code to filter, would drop everything in your provided example
# df['date'] = pd.to_datetime(df['date'])
# df[df['date'].dt.year.gt(2010) & df['tournament'].ne('Friendly')]
df = pd.wide_to_long(df, i='date', j='home_or_away',
stubnames=['team', 'score'], sep='_', suffix='.*')
# country neutral tournament city team score
#date home_or_away
#1872-11-30 home Scotland False Friendly Glasgow Scotland 0
#1873-03-08 home England False Friendly London England 4
#1874-03-07 home Scotland False Friendly Glasgow Scotland 2
#...
#1878-03-02 away Scotland False Friendly Glasgow England 2
#1878-03-23 away Scotland False Friendly Glasgow Wales 0
#1879-01-18 away England False Friendly London Wales 1
So now regardless of home or away, you can get the points scored:
df.groupby('team')['score'].sum()
#team
#England 12
#Scotland 34
#Wales 1
#Name: score, dtype: int64
Answered by ALollz on December 3, 2021
.groupby
and .sum()
for the home team and then do the same for the away team and add the two together:
df_new = df.groupby('home_team')['home_score'].sum() + df.groupby('away_team')['away_score'].sum()
output:
England 12
Scotland 34
Wales 1
More detailed explanation (per comment):
.groupby
one column home_team
. In your answer, you were grouping by ['home_team', 'home_score']
Your goal (no pun intended) is to get the .sum()
of the home_score
-- so you should NOT .groupby()
it. As you can see ['home_score']
is after the part where I use .groupby
, so that I can get the .sum()
of it. That gets you set for the home teams.away_team
.home_team
and away_team
groups have the same values for countries, you can simply add them together...Answered by David Erickson on December 3, 2021
Get help from others!
Recent Answers
Recent Questions
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP