Data Science Asked by LeoGlt on June 9, 2021
I am working on a prediction model that must determine whether a boat will go back to the same offshore workplace after spending time in a port. When a boat is in a port, it can stay there for a few hours or a few days and then go back to its working location, or it can go somewhere else to work on another project. So I want to answer the question: ‘Will the boat go back to the same workplace after leaving the port?’
To answer this, I have a dataframe of ashore periods, with some features that can help predict what the boat will do:
Out[1]:
     boat  ashore_period_id      port  start_ashore  end_ashore  workplace_dist  same_workplace_after_ashore
0  boat 1                 1  Le Havre    2021-01-01  2021-01-05             450                        False
1  boat 2                 2   Dunkirk    2021-01-01  2021-01-02              20                         True
Each row corresponds to an ashore period. We can guess that if the boat is in a port very far from its workplace (workplace_dist column), there is little chance that it will go back to that workplace. We can also guess that if the boat stays in the port for a very long time, it will probably go to another workplace after leaving the port.
The predictions for the variable ‘same_workplace_after_ashore’ will be made at several moments during the ashore period, not only at its beginning or end. That is why I decided to take several points from each period in my training set, one point per day. This is how the dataframe looks after the transformation:
Out[2]:
     boat  ashore_period_id      port  start_ashore        date  nb_day_ashore  workplace_dist  same_workplace_after_ashore
0  boat 1                 1  Le Havre    2021-01-01  2021-01-01              0             450                        False
1  boat 1                 1  Le Havre    2021-01-01  2021-01-02              1             450                        False
2  boat 1                 1  Le Havre    2021-01-01  2021-01-03              2             450                        False
3  boat 1                 1  Le Havre    2021-01-01  2021-01-04              3             450                        False
4  boat 1                 1  Le Havre    2021-01-01  2021-01-05              4             450                        False
5  boat 2                 2   Dunkirk    2021-01-01  2021-01-01              0              20                         True
6  boat 2                 2   Dunkirk    2021-01-01  2021-01-02              1              20                         True
I now have 5 rows for the first ashore period and 2 for the second, and only the columns ‘date’ and ‘nb_day_ashore’ change within a given period. I used this second dataframe to train a random forest.
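For reference, the one-row-per-day transformation described above can be sketched with pandas as follows. This is only a minimal sketch that rebuilds the two example periods by hand; the variable names `periods` and `daily` are assumptions chosen for illustration, and the real pipeline may differ.

```python
import pandas as pd

# The two ashore periods shown in Out[1], rebuilt by hand for illustration.
periods = pd.DataFrame({
    "boat": ["boat 1", "boat 2"],
    "ashore_period_id": [1, 2],
    "port": ["Le Havre", "Dunkirk"],
    "start_ashore": pd.to_datetime(["2021-01-01", "2021-01-01"]),
    "end_ashore": pd.to_datetime(["2021-01-05", "2021-01-02"]),
    "workplace_dist": [450, 20],
    "same_workplace_after_ashore": [False, True],
})

# Build one list of daily timestamps per period (both endpoints included).
periods["date"] = [
    list(pd.date_range(start, end, freq="D"))
    for start, end in zip(periods["start_ashore"], periods["end_ashore"])
]

# Turn each list element into its own row, then restore the datetime dtype.
daily = periods.explode("date", ignore_index=True)
daily["date"] = pd.to_datetime(daily["date"])

# Number of days already spent ashore at each observation point.
daily["nb_day_ashore"] = (daily["date"] - daily["start_ashore"]).dt.days

# end_ashore is not known at prediction time, so it is dropped here.
daily = daily.drop(columns="end_ashore")
print(daily)
```

Dropping `end_ashore` mirrors the second dataframe, where that column no longer appears.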
I get pretty good results, but I wonder whether this method is correct: I don’t know if it is okay to take several points for each period. We can see in the dataframe that there are more rows for the first ashore period than for the second. This could create a bias in favour of long ashore periods, because longer periods have more rows and therefore more weight in model training and testing.
Do you know if it is okay to do it like this? Or should I process my data with another method?
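To make the imbalance concrete, the sketch below counts the rows contributed by each period and derives a per-row weight so that every period sums to the same total. It is only an illustration of the concern, not a validated fix; the column name `period_weight` is an assumption introduced here.

```python
import pandas as pd

# Minimal copy of the daily dataframe (only the columns needed here).
daily = pd.DataFrame({
    "ashore_period_id": [1, 1, 1, 1, 1, 2, 2],
    "nb_day_ashore": [0, 1, 2, 3, 4, 0, 1],
    "workplace_dist": [450] * 5 + [20] * 2,
    "same_workplace_after_ashore": [False] * 5 + [True] * 2,
})

# Number of rows each period contributes, repeated on every row of that period.
rows_per_period = daily.groupby("ashore_period_id")["nb_day_ashore"].transform("count")

# Hypothetical weight column: each row counts for 1 / (rows in its period),
# so a 5-day period and a 2-day period carry the same total weight.
daily["period_weight"] = 1.0 / rows_per_period

# Total weight per period; each should come out (approximately) equal to 1.
weight_per_period = daily.groupby("ashore_period_id")["period_weight"].sum()
print(weight_per_period)
```

Weights of this kind could, for example, be passed as `sample_weight` when fitting a scikit-learn random forest. Independently of weighting, keeping all rows of a given `ashore_period_id` on the same side of the train/test split avoids evaluating on near-duplicates of training rows.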