TransWikia.com

How to compare data from the same column in a dataframe (Pandas)

Stack Overflow Asked by Abraham Arreola on January 8, 2021

I have a Panda’s dataframe like the following:

Countries dataframe

And i want to get the countries where its PIB in 2007 was less than in 2002, but i couldn´t code something to do that only using Pandas built in methods without use python iterations or something like that.
The most i’ve got is the following line:

df[df[df.year == 2007].PIB < df[df.year == 2002].PIB].country

But i get the following error:

ValueError: Can only compare identically-labeled Series objects

Till’ now i’ve only used Pandas to filter data from different columns, but i don’t know how to compare data from the same column, in this case the year.
Any support is welcome.

4 Answers

I suggest create Series with index by country column, but is necessary same number of countries in 2007 and 2002 for compare Series with same index values:

df = pd.DataFrame({'country': ['Afganistan', 'Zimbabwe', 'Afganistan', 'Zimbabwe'],
                  'PIB': [200, 200, 100, 300], 
                  'year': [2002, 2002, 2007, 2007]})
print (df)
      country  PIB  year
0  Afganistan  200  2002
1    Zimbabwe  200  2002
2  Afganistan  100  2007
3    Zimbabwe  300  2007

df = df.set_index('country')
print (df)
            PIB  year
country              
Afganistan  200  2002
Zimbabwe    200  2002
Afganistan  100  2007
Zimbabwe    300  2007

s1 = df.loc[df.year == 2007, 'PIB'] 
s2 = df.loc[df.year == 2002, 'PIB']
print (s1)
country
Afganistan    100
Zimbabwe      300
Name: PIB, dtype: int64

print (s2)
country
Afganistan    200
Zimbabwe      200
Name: PIB, dtype: int64

countries = s1.index[s1 < s2]
print (countries)
Index(['Afganistan'], dtype='object', name='country')

Another idea is first pivoting by DataFrame.pivot and then seelct columns by years and compare with index in boolean indexing:

df1 = df.pivot('country','year','PIB')
print (df1)
year        2002  2007
country               
Afganistan   200   100
Zimbabwe     200   300

countries = df1.index[df1[2007] < df1[2002]]
print (countries)
Index(['Afganistan'], dtype='object', name='country')

Correct answer by jezrael on January 8, 2021

Here is my dataframe :

df = pd.DataFrame([
    {"country": "a", "PIB": 2, "year": 2002},
    {"country": "b", "PIB": 2, "year": 2002},
    {"country": "a", "PIB": 1, "year": 2007},
    {"country": "b", "PIB": 3, "year": 2007},
])

If I filter the two year 2002 and 2007, I got.

df_2002 = df[df["year"] == 2007]
out : 
  country  PIB  year
0       a    2  2002
1       b    2  2002

df_2007 = df[df["year"] == 2007]
out : 
  country  PIB  year
2       a    1  2007
3       b    3  2007

You want to compare the evolution of the PIB for each country.

Pandas is unaware of that, it tries to compare values but here based on the same index. Witch is not what you want, and it's impossible because the indexes are different.

So you just need to use set_index()

df.set_index("country",  inplace=True)
df_2002 = df[df["year"] == 2007]
out : 
         PIB  year
country           
a          1  2007
b          3  2007

df_2007 = df[df["year"] == 2007]
out : 
         PIB  year
country           
a          2  2002
b          2  2002

now you can make the comparison

df_2002.PIB > df_2007.PIB
out:
country
a     True
b    False
Name: PIB, dtype: bool

# to get the list of countries
(df_2002.PIB > df_2007.PIB)[res == True].index.values.tolist()
out : 
['a']

Answered by AlexisG on January 8, 2021

My strategy is using the pivot_table. There is an assumption that there are no two rows having the same ('country','year') pair. With this assumption, aggfunc=np.sum represents the only single PIB value.

table = pd.pivot_table(df, values='PIB', index=['country'],
                    columns=['year'], aggfunc=np.sum)[[2002,2007]]
list(table[table[2002] > table[2007]].index)

The pivot_table looks like this:

The pivot_table looks like this

Answered by jhihan on January 8, 2021

Try this (considering that you want just the list of these countries):

[i for i in df.country if df[(df.country==i) & (df.year==2007)].PIB.iloc[0] < df[(df.country==i) & (df.year==2002)].PIB.iloc[0]]

Answered by IoaTzimas on January 8, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP