Stack Overflow Asked by Jordan Ford on February 1, 2021
I have a dataset of forecast data that is updated at irregular intervals over a 42-hour period. Here is a sample:
import pandas as pd

df_old = pd.DataFrame({
    'IssueDatetime': ['2010-01-01 09:00:00', '2010-01-01 09:00:00', '2010-01-01 09:00:00',
                      '2010-01-01 09:00:00', '2010-01-01 09:00:00'],
    'endtime': ['2010-01-03 03:00:00', '2010-01-03 03:00:00', '2010-01-03 03:00:00',
                '2010-01-03 03:00:00', '2010-01-03 03:00:00'],
    'Regions': ['EAST COAST-CAPE ST FRANCIS AND SOUTH', 'EAST COAST-CAPE ST FRANCIS AND SOUTH',
                'EAST COAST-CAPE ST FRANCIS AND SOUTH', 'NORTHEAST COAST', 'NORTHEAST COAST'],
    'forecastTime': ['2010-01-01 09:00:00', '2010-01-01 15:00:00', '2010-01-01 19:00:00',
                     '2010-01-01 09:00:00', '2010-01-01 12:00:00'],
    'forecast_Dir': [150, 180, 45, 45, 45],
    'windSpeed': [20, 90, 35, 45, 15]})
The issue is the gaps between the hours in df['forecastTime'] and df['endtime']. I have tried using my limited pandas knowledge to group and resample the data, but because the dates are repeated I cannot get a datetime index.
Ultimately my goal is to expand the dataframe so that the hours in between the original rows have rows of their own, right up until the end of the period.
Example of desired output:
df_new = pd.DataFrame({
    'IssueDatetime': ['2010-01-01 09:00:00'] * 7,
    'endtime': ['2010-01-03 03:00:00'] * 7,
    'Regions': ['EAST COAST-CAPE ST FRANCIS AND SOUTH'] * 7,
    'forecastTime': ['2010-01-01 09:00:00', '2010-01-01 10:00:00', '2010-01-01 11:00:00',
                     '2010-01-01 12:00:00', '2010-01-01 13:00:00', '2010-01-01 14:00:00',
                     '2010-01-01 15:00:00'],
    'forecast_Dir': [150, 150, 150, 150, 150, 150, 180],
    'windSpeed': [20, 20, 20, 20, 20, 20, 90]})
Note that for the first region, the hours between df['forecastTime'] = '2010-01-01 09:00:00' and df['forecastTime'] = '2010-01-01 15:00:00' should become rows of their own. Essentially I'm looking to upsample to fill in the missing hours.
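For reference, the hourly grid I'm after for each region runs from its first forecastTime through to its endtime. Just to illustrate with the timestamps from the sample above:

import pandas as pd

# The 42-hour window from the sample: first forecastTime to endtime.
start = pd.to_datetime('2010-01-01 09:00:00')   # first forecastTime
end = pd.to_datetime('2010-01-03 03:00:00')     # endtime, 42 hours later
hourly = pd.date_range(start, end, freq='H')
print(len(hourly))   # 43 hourly timestamps, so 43 rows per region once filled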
EDIT: Original dataframe
IssueDatetime endtime
0 2013-01-01 09:00:00 2013-01-03 03:00:00
1 2013-01-01 09:00:00 2013-01-03 03:00:00
2 2013-01-01 09:00:00 2013-01-03 03:00:00
3 2013-01-01 09:00:00 2013-01-03 03:00:00
4 2013-01-01 09:00:00 2013-01-03 03:00:00
... ... ...
53585 2016-12-30 09:00:00 2017-01-01 03:00:00
53586 2016-12-30 09:00:00 2017-01-01 03:00:00
53587 2016-12-30 09:00:00 2017-01-01 03:00:00
53588 2016-12-30 09:00:00 2017-01-01 03:00:00
53589 2016-12-30 09:00:00 2017-01-01 03:00:00
Regions forecastTime
0 SOUTH COAST 2013-01-01 09:00:00
1 SOUTH COAST 2013-01-01 18:00:00
2 SOUTH COAST 2013-01-02 06:00:00
3 SOUTH COAST 2013-01-02 13:00:00
4 EAST COAST-CAPE ST FRANCIS AND SOUTH 2013-01-01 09:00:00
... ... ...
53585 SOUTHWESTERN GRAND BANKS 2016-12-30 18:00:00
53586 SOUTHWESTERN GRAND BANKS 2016-12-31 09:00:00
53587 SOUTHWESTERN GRAND BANKS 2016-12-31 15:00:00
53588 SOUTHWESTERN GRAND BANKS 2016-12-31 18:00:00
53589 SOUTHWESTERN GRAND BANKS 2017-01-01 00:00:00
forecastHour forecast_Dir forecast_WindSpeed_low
0 0.0 270 35
1 9.0 270 25
2 21.0 225 15
3 28.0 270 35
4 0.0 270 35
... ... ... ...
53585 9.0 135 40
53586 24.0 135 40
53587 30.0 135 40
53588 33.0 315 25
53589 39.0 315 25
forecast_WindSpeed_gust forecast_WindSpeed_high
0 None None
1 None None
2 None None
3 None None
4 None None
... ... ...
53585 None 50
53586 None 50
53587 None 50
53588 None 35
53589 None None
forecast_WindSpeed_exception_1_type forecast_Dir_exception_1
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
... ... ...
53585 NaN NaN
53586 OVER NORTHWESTERN SECTIONS 315
53587 NaN NaN
53588 NaN NaN
53589 NaN NaN
forecast_WindSpeed_low_exception_1 forecast_WindSpeed_high_exception_1
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
... ... ...
53585 NaN NaN
53586 25 None
53587 NaN NaN
53588 NaN NaN
53589 NaN NaN
I've partially answered my own question. I was still getting an index error because of the duplicate dates, so I added a millisecond offset to each repeated forecastTime to make the values unique:
# Convert to datetime first, then nudge duplicate timestamps by a millisecond each.
df_old['forecastTime'] = pd.to_datetime(df_old['forecastTime'])
df_old['forecastTime'] = df_old['forecastTime'] + pd.to_timedelta(
    df_old.groupby('forecastTime').cumcount(), unit='ms')

df_new = (df_old.set_index('forecastTime')
                .groupby('Regions', as_index=False)
                .resample('H').ffill()
                .droplevel(0)
                .reset_index())
Now, I need to add an end time. Any suggestions for ending the .ffill() exactly 42 hours after the start time of each forecast?
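One direction I'm experimenting with, starting again from df_old as defined at the top of the question (before the millisecond offsets) and only checked against that small sample, is to skip resample and instead reindex each region onto an explicit hourly range that runs out to IssueDatetime plus 42 hours:

df_old['IssueDatetime'] = pd.to_datetime(df_old['IssueDatetime'])
df_old['forecastTime'] = pd.to_datetime(df_old['forecastTime'])

def fill_to_42h(g):
    # Hourly grid from the first forecast hour to 42 hours after the issue time.
    end = g['IssueDatetime'].iloc[0] + pd.Timedelta(hours=42)
    hours = pd.date_range(g['forecastTime'].min(), end, freq='H')
    return (g.set_index('forecastTime')
             .reindex(hours)        # missing hours become NaN rows...
             .ffill()               # ...then carry the last forecast forward
             .rename_axis('forecastTime'))

df_filled = (df_old.groupby('Regions', as_index=False)
                   .apply(fill_to_42h)
                   .droplevel(0)
                   .reset_index())

Within a single region the sample's forecastTime values are already unique, so the reindex avoids the duplicate-date problem without the millisecond trick; the full dataset would probably need to group by both 'Regions' and 'IssueDatetime', since each region has many issue times.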
Answered by Jordan Ford on February 1, 2021
IIUC, first you need to convert the 'forecastTime' column to datetime, then set it as the index, group by 'Regions', resample by hour, and fill the resulting NaN values with ffill:
df_old['forecastTime'] = pd.to_datetime(df_old['forecastTime'])
df_new = (df_old.set_index('forecastTime')
                .groupby('Regions', as_index=False)
                .resample('H').ffill()
                .droplevel(0)
                .reset_index())
print(df_new.head())
To avoid a possible error (non-unique index) because of duplicate dates, you can try this:
df_new = (df_old.groupby('Regions', as_index=False)
                .apply(lambda x: x.set_index('forecastTime')
                                  .resample('H').ffill())
                .droplevel(0)
                .reset_index())
Output:
forecastTime IssueDatetime endtime Regions forecast_Dir windSpeed
0 2010-01-01 09:00:00 2010-01-01 09:00:00 2010-01-03 03:00:00 EAST COAST-CAPE ST FRANCIS AND SOUTH 150 20
1 2010-01-01 10:00:00 2010-01-01 09:00:00 2010-01-03 03:00:00 EAST COAST-CAPE ST FRANCIS AND SOUTH 150 20
2 2010-01-01 11:00:00 2010-01-01 09:00:00 2010-01-03 03:00:00 EAST COAST-CAPE ST FRANCIS AND SOUTH 150 20
3 2010-01-01 12:00:00 2010-01-01 09:00:00 2010-01-03 03:00:00 EAST COAST-CAPE ST FRANCIS AND SOUTH 150 20
4 2010-01-01 13:00:00 2010-01-01 09:00:00 2010-01-03 03:00:00 EAST COAST-CAPE ST FRANCIS AND SOUTH 150 20
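If you also need the fill to run all the way to 'endtime' (42 hours after the issue time), one possible extension, only checked against the small sample here, is to append a sentinel row at each region's 'endtime' before resampling, so the hourly grid covers the whole window:

# Assumes 'forecastTime' was already converted to datetime above.
df_old['endtime'] = pd.to_datetime(df_old['endtime'])

# One sentinel row per region, stamped at that region's endtime.
sentinels = (df_old.sort_values('forecastTime')
                   .drop_duplicates('Regions', keep='last')
                   .assign(forecastTime=lambda d: d['endtime']))

df_ext = pd.concat([df_old, sentinels], ignore_index=True)

df_new_full = (df_ext.groupby('Regions', as_index=False)
                     .apply(lambda x: x.set_index('forecastTime')
                                       .resample('H').ffill())
                     .droplevel(0)
                     .reset_index())

The sentinel simply repeats the last forecast for each region, so the forward fill carries those values through to the end of the 42-hour window; it assumes 'endtime' is strictly later than the last 'forecastTime' in every region, otherwise the sentinel would duplicate an existing timestamp.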
Answered by MrNobody33 on February 1, 2021