TransWikia.com

How to filter rows in a Python pandas DataFrame with duplicate values in the columns to be filtered

Data Science Asked on March 24, 2021

Overall context:

I have a data frame that contains observations every five minutes, starting at 5 AM and ending at 8 PM, for several days. I need to keep only the observations from 9 AM to 5 PM on each day.

The input data frame looks like this:

Date Time
2019-09-20 05:00:00,..,..
2019-09-20 05:05:00,..,..
...
2019-09-20 09:00:00,..,..
...
2019-09-20 17:00:00,..,..
2019-09-20 17:05:00,..,..
...
2019-09-20 20:00:00,..,..
2019-09-21 05:00:00,..,..
2019-09-21 05:05:00,..,..
...
2019-09-21 09:00:00,..,..
...
2019-09-21 17:00:00,..,..
2019-09-21 17:05:00,..,..
...
2019-09-21 20:00:00,..,..

and the output data frame should look like this:

2019-09-20 09:00:00,..,..
...
2019-09-20 17:00:00,..,..
2019-09-21 09:00:00,..,..
...
2019-09-21 17:00:00,..,..

Steps taken so far

To extract the rows between 9 AM and 5 PM, I computed the number of seconds since midnight for every row by extracting the hours, minutes and seconds with vectorized operations, so the input data frame now has a column like this:

Date Time, Number of seconds since midnight
2019-09-20 05:00:00,xxxx,..,..
2019-09-20 05:05:00,yyyy,..,..
...
2019-09-21 05:00:00,xxxx,..,..
2019-09-21 05:05:00,yyyy,..,..

Note that for the same time on every day, the number of seconds is the same.
I was hoping to extract all the rows between 9 AM and 5 PM with

df[(df['Number of seconds since midnight'] > 9 * 3600) & (df['Number of seconds since midnight'] < 17 * 3600)]

but I get rows only from the last date between 9 AM and 5 PM.
To me, it looks as if it is ignoring all the duplicate rows with the same time.

Can anyone suggest a solution that does not iterate over each row and uses vectorized operations, as the data set is very large?

One Answer

I think you have defined midnight as today's 00:00. Therefore, the rows before today are out of your range.
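The question does not show how the seconds column was computed, so the following is only a guess at what may have gone wrong. It is a minimal sketch with made-up sample rows: the column names `secs_fixed` and `secs_per_row` are hypothetical, and the "fixed midnight" is assumed to be the midnight of the last date in the data.

```python
import pandas as pd

# Made-up sample rows covering two days
df = pd.DataFrame({
    'Date Time': pd.to_datetime([
        '2019-09-20 09:00:00', '2019-09-20 12:00:00',
        '2019-09-21 09:00:00', '2019-09-21 12:00:00',
    ])
})

# Suspected mistake: one fixed midnight (here, that of the last date) for every row,
# so rows from earlier dates get negative values
fixed_midnight = df['Date Time'].max().normalize()
df['secs_fixed'] = (df['Date Time'] - fixed_midnight).dt.total_seconds()

# Per-row midnight: each timestamp minus the midnight of its own date
df['secs_per_row'] = (df['Date Time'] - df['Date Time'].dt.normalize()).dt.total_seconds()

lo, hi = 9 * 3600, 17 * 3600
only_last = df[(df['secs_fixed'] >= lo) & (df['secs_fixed'] <= hi)]
all_days = df[(df['secs_per_row'] >= lo) & (df['secs_per_row'] <= hi)]
```

With the fixed midnight, only rows from the last date fall inside the range, which matches the symptom described in the question. With a per-row midnight, rows from both days are selected.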

I think this may work for this case:

import pandas as pd

# Convert string to datetime format
df['Date Time'] = pd.to_datetime(df['Date Time'])

# Minutes since midnight for each row, independent of the date
minutes = df['Date Time'].dt.hour * 60 + df['Date Time'].dt.minute
selected_rows = df[(minutes >= 9 * 60) & (minutes <= 17 * 60)]

The filter uses only the time of day and ignores the date, so rows from every day are kept.
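As a possible alternative, pandas also provides `DataFrame.between_time`, which selects rows by time of day when the frame has a `DatetimeIndex`. The sketch below assumes the timestamps can be moved into the index temporarily. The sample data and the `value` column are made up for illustration.

```python
import pandas as pd

# Made-up sample data: one row every five minutes across two days
df = pd.DataFrame({
    'Date Time': pd.date_range('2019-09-20 05:00', '2019-09-21 20:00', freq='5min'),
})
df['value'] = range(len(df))

# between_time needs a DatetimeIndex; both endpoints are included by default
selected = (
    df.set_index('Date Time')
      .between_time('09:00', '17:00')
      .reset_index()
)
```

This avoids any per-row Python loop, and `reset_index()` restores `Date Time` as an ordinary column afterwards.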

Answered by Felix Chan on March 24, 2021
