Stack Overflow Asked by Maddy6 on January 3, 2022
I have the DataFrame "df" below and am calculating the sum of Amount over the last 30 minutes for each unique id, in a new column called count_ncc:
import pandas as pd
from dateutil import parser
from datetime import datetime, timedelta
df = {'Date': ['2019-01-11 10:23:45', '2019-01-09 10:23:45', '2019-01-11 10:27:45',
               '2019-01-11 10:25:45', '2019-01-11 10:30:45', '2019-01-11 10:35:45',
               '2019-02-09 10:25:45'],
      'Fruit id': ['100', '200', '300', '100', '100', '100', '200'],
      'NCC': ['100', '100', '200', '100', '100', '100', '100'],
      'Amount': [200, 400, 330, 100, 300, 200, 500],
      'Sys': [1, 0, 1, 0, 1, 0, 1]}
df = pd.DataFrame(df)
df['Date'] = pd.to_datetime(df['Date'])
If you only want the newest 30-minute window (not the full data set with a million rows), then you could use pd.Timedelta():
# find Dates in the 30-minute window ending at the max Date
mask = (df['Date'].max() - df['Date']) < pd.Timedelta('30min')
df_recent = df[mask]
Now compute summary stats on df_recent
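A minimal sketch of such stats, assuming the Amount and NCC columns from the question:
# total Amount per (Fruit id, NCC) pair within the last 30 minutes
recent_totals = df_recent.groupby(['Fruit id', 'NCC'])['Amount'].sum()
print(recent_totals)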
Answered by jsmart on January 3, 2022
Let's try with groupby:
df['count_ncc'] = (df.set_index('Date')
                     .groupby(['Fruit id', 'NCC'])['Amount']
                     .transform(lambda x: x.rolling('30min', closed='left').sum())
                     .values
                  )
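With closed='left' the 30-minute window includes its left edge but excludes the current row, so each sum covers only the earlier transactions of that Fruit id/NCC pair.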
Or with cross-merge:
df['count_ncc'] = (df.merge(df.assign(Date_shift=df.Date - pd.to_timedelta('30min'),
                                      idx=df.index),
                            on=['Fruit id', 'NCC'])
                     .query('Date_shift <= Date_x < Date_y')
                     .groupby('idx')['Amount_x'].sum()
                  )
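The self-merge pairs every row (the _y side, tagged with its original index idx) with all rows sharing its Fruit id and NCC; the query keeps only partners that fall strictly within the 30 minutes before it, and the groupby sums their amounts back onto idx (rows with no partners stay NaN).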
Output:
Date Fruit id NCC Amount Sys count_ncc
0 2019-01-11 10:23:45 100 100 200 1 NaN
1 2019-01-09 10:23:45 200 100 400 0 NaN
2 2019-01-11 10:27:45 300 200 330 1 NaN
3 2019-01-11 10:25:45 100 100 100 0 200.0
4 2019-01-11 10:30:45 100 100 300 1 300.0
5 2019-01-11 10:35:45 100 100 200 0 600.0
6 2019-02-09 10:25:45 200 100 500 1 NaN
Answered by Quang Hoang on January 3, 2022
The part that jumps out is that you're filtering the whole df once for each row, even though only a small fraction of the rows falls inside each window.
I'll try to write the full code later, but here is the idea with a two-pointer sliding window:
df = df.sort_values('Date').reset_index(drop=True)  # the window scan needs sorted dates
s = []
i = 0  # left pointer: first row still inside the 30-minute window
for k in range(len(df)):
    # advance the left pointer until the window spans less than 30 minutes
    while df.Date.iloc[k] - df.Date.iloc[i] >= timedelta(seconds=1800):
        i += 1
    # group the whole window at once instead of filtering per tag;
    # should be faster because groupby is implemented in C
    count = df.iloc[i:k].groupby(['Fruit id', 'NCC']).Amount.sum()
    key = (df['Fruit id'].iloc[k], df['NCC'].iloc[k])
    s.append(count.get(key))  # None -> NaN when the window has no matching rows
df['count_ncc'] = s
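Both pointers only ever move forward, so the scan itself is linear in the number of rows; the remaining per-row cost is the groupby over just the current window.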
Answered by RichieV on January 3, 2022
pivot_table could be useful here.
df.sort_values(by='Date', inplace=True)
newdf = (pd.pivot_table(df, columns='Fruit id', index='Date',
                        values='Amount', aggfunc='sum')
           .rolling('30min', closed='left').sum()
           .sort_index())
newdf['Fruit id'] = df['Fruit id'].values
df['count_ncc_amt'] = newdf.apply(lambda row: row[row['Fruit id']], axis=1).values
print(df)
Date Fruit id NCC Amount Sys count_ncc_amt
1 2019-01-09 10:23:45 200 100 400 0 NaN
0 2019-01-11 10:23:45 100 100 200 1 NaN
3 2019-01-11 10:25:45 100 100 100 0 200.0
2 2019-01-11 10:27:45 300 200 330 1 NaN
4 2019-01-11 10:30:45 100 100 300 1 300.0
5 2019-01-11 10:35:45 100 100 200 0 600.0
6 2019-02-09 10:25:45 200 100 500 1 NaN
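Here newdf has one column of rolling sums per Fruit id; row[row['Fruit id']] picks the column matching each row's own id, which is then written back to df as count_ncc_amt.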
Answered by Rm4n on January 3, 2022