Stack Overflow Asked by Maddy6 on January 3, 2022
I have the DataFrame "df" below and am calculating the sum of Amount over the last 30 minutes for each unique id, in a new column called count_ncc:
import pandas as pd
from dateutil import parser
from datetime import datetime, timedelta
df = {'Date': ['2019-01-11 10:23:45', '2019-01-09 10:23:45', '2019-01-11 10:27:45',
               '2019-01-11 10:25:45', '2019-01-11 10:30:45', '2019-01-11 10:35:45',
               '2019-02-09 10:25:45'],
      'Fruit id': ['100', '200', '300', '100', '100', '100', '200'],
      'NCC': ['100', '100', '200', '100', '100', '100', '100'],
      'Amount': [200, 400, 330, 100, 300, 200, 500],
      'Sys': [1, 0, 1, 0, 1, 0, 1]}
df = pd.DataFrame(df)
df['Date'] = pd.to_datetime(df['Date'])
If you only want the newest 30-minute window (not the full data set with a million rows), then you could use pd.Timedelta():
# find Dates in the 30-minute window ending at the max Date
mask = (df['Date'].max() - df['Date']) < pd.Timedelta('30min')
df_recent = df[mask]
Now compute summary stats on df_recent
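A minimal sketch of such stats, assuming the Amount and NCC columns from the question:
# total Amount per (Fruit id, NCC) pair within the last 30 minutes
recent_totals = df_recent.groupby(['Fruit id', 'NCC'])['Amount'].sum()
print(recent_totals)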
Answered by jsmart on January 3, 2022
Let's try with groupby:
df['count_ncc'] = (df.set_index('Date')
                     .groupby(['Fruit id', 'NCC'])['Amount']
                     .transform(lambda x: x.rolling('30min', closed='left').sum())
                     .values
                  )
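With closed='left' the 30-minute window includes its left edge but excludes the current row, so each sum covers only the earlier transactions of that Fruit id/NCC pair.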
Or with cross-merge:
df['count_ncc'] = (df.merge(df.assign(Date_shift=df.Date - pd.to_timedelta('30min'),
                                      idx=df.index),
                            on=['Fruit id', 'NCC'])
                     .query('Date_shift <= Date_x < Date_y')
                     .groupby('idx')['Amount_x'].sum()
                  )
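The self-merge pairs every row (the _y side, tagged with its original index idx) with all rows sharing its Fruit id and NCC; the query keeps only partners that fall strictly within the 30 minutes before it, and the groupby sums their amounts back onto idx (rows with no partners stay NaN).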
Output:
Date Fruit id NCC Amount Sys count_ncc
0 2019-01-11 10:23:45 100 100 200 1 NaN
1 2019-01-09 10:23:45 200 100 400 0 NaN
2 2019-01-11 10:27:45 300 200 330 1 NaN
3 2019-01-11 10:25:45 100 100 100 0 200.0
4 2019-01-11 10:30:45 100 100 300 1 300.0
5 2019-01-11 10:35:45 100 100 200 0 600.0
6 2019-02-09 10:25:45 200 100 500 1 NaN
Answered by Quang Hoang on January 3, 2022
The part that jumps out is that you're filtering the whole df once for each row, even though only a small fraction of the rows falls inside each window.
I'll try to write the full code later, but here is the idea with a two-pointer sliding window:
df = df.sort_values('Date').reset_index(drop=True)  # the window scan needs sorted dates
s = []
i = 0  # left pointer: first row still inside the 30-minute window
for k in range(len(df)):
    # advance the left pointer until the window spans less than 30 minutes
    while df.Date.iloc[k] - df.Date.iloc[i] >= timedelta(seconds=1800):
        i += 1
    # group the whole window at once instead of filtering per tag;
    # should be faster because groupby is implemented in C
    count = df.iloc[i:k].groupby(['Fruit id', 'NCC']).Amount.sum()
    key = (df['Fruit id'].iloc[k], df['NCC'].iloc[k])
    s.append(count.get(key))  # None -> NaN when the window has no matching rows
df['count_ncc'] = s
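Both pointers only ever move forward, so the scan itself is linear in the number of rows; the remaining per-row cost is the groupby over just the current window.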
Answered by RichieV on January 3, 2022
pivot_table could be useful here.
df.sort_values(by='Date', inplace=True)
newdf = (pd.pivot_table(df, columns='Fruit id', index='Date',
                        values='Amount', aggfunc='sum')
           .rolling('30min', closed='left').sum()
           .sort_index())
newdf['Fruit id'] = df['Fruit id'].values
df['count_ncc_amt'] = newdf.apply(lambda row: row[row['Fruit id']], axis=1).values
print(df)
Date Fruit id NCC Amount Sys count_ncc_amt
1 2019-01-09 10:23:45 200 100 400 0 NaN
0 2019-01-11 10:23:45 100 100 200 1 NaN
3 2019-01-11 10:25:45 100 100 100 0 200.0
2 2019-01-11 10:27:45 300 200 330 1 NaN
4 2019-01-11 10:30:45 100 100 300 1 300.0
5 2019-01-11 10:35:45 100 100 200 0 600.0
6 2019-02-09 10:25:45 200 100 500 1 NaN
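Here newdf has one column of rolling sums per Fruit id; row[row['Fruit id']] picks the column matching each row's own id, which is then written back to df as count_ncc_amt.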
Answered by Rm4n on January 3, 2022