Take several points for each period in a machine learning model

Data Science Asked by LeoGlt on June 9, 2021

Problem presentation

I am working on a prediction model that must determine whether a boat will go back to the same offshore workplace after spending time in a port. When a boat is in a port, it can stay there for a few hours or a few days and then go back to its working location, or it can go somewhere else to work on another project. So I want to answer the question: ‘Will the boat go back to the same workplace after leaving the port?’

To answer this, I have a dataframe of ashore periods, with some features that can help to deduce what the boat will do:

Out[1]: 
     boat  ashore_period_id      port start_ashore  end_ashore  workplace_dist  same_workplace_after_ashore
0  boat 1                 1  Le Havre   2021-01-01  2021-01-05             450                        False
1  boat 2                 2   Dunkirk   2021-01-01  2021-01-02              20                         True

Each row corresponds to an ashore period. We can guess that if the boat is in a port very far from its workplace (workplace_dist column), there is little chance that it will go back to that workplace. We can also guess that if the boat stays in the port for a very long time, it will probably go to another workplace after leaving the port.

The predictions for the variable ‘same_workplace_after_ashore’ will be made at several moments during the ashore period, not only at its beginning or end. That’s why I have decided to take several points from each period in my training set, one point per day. This is how the dataframe looks after the transformation:

Out[2]: 
     boat  ashore_period_id      port start_ashore        date  nb_day_ashore  workplace_dist  same_workplace_after_ashore
0  boat 1                 1  Le Havre   2021-01-01  2021-01-01              0             450                        False
1  boat 1                 1  Le Havre   2021-01-01  2021-01-02              1             450                        False
2  boat 1                 1  Le Havre   2021-01-01  2021-01-03              2             450                        False
3  boat 1                 1  Le Havre   2021-01-01  2021-01-04              3             450                        False
4  boat 1                 1  Le Havre   2021-01-01  2021-01-05              4             450                        False
5  boat 2                 2   Dunkirk   2021-01-01  2021-01-01              0              20                         True
6  boat 2                 2   Dunkirk   2021-01-01  2021-01-02              1              20                         True

I now have 5 rows for the first ashore period and 2 for the second, and only the columns ‘date’ and ‘nb_day_ashore’ change within a period. I used this second dataframe to train a random forest.
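For reference, the per-day expansion described above could be sketched roughly as follows. This is only a minimal illustration on the two example rows, not my exact code: the helper name `expand_to_daily` is hypothetical, and it assumes the start and end columns are already parsed as datetimes and that both the first and last day ashore are kept.

```python
import pandas as pd

# Toy data copied from the first dataframe above
periods = pd.DataFrame({
    "boat": ["boat 1", "boat 2"],
    "ashore_period_id": [1, 2],
    "port": ["Le Havre", "Dunkirk"],
    "start_ashore": pd.to_datetime(["2021-01-01", "2021-01-01"]),
    "end_ashore": pd.to_datetime(["2021-01-05", "2021-01-02"]),
    "workplace_dist": [450, 20],
    "same_workplace_after_ashore": [False, True],
})


def expand_to_daily(periods: pd.DataFrame) -> pd.DataFrame:
    """Return one row per day ashore for each period (hypothetical helper)."""
    daily = periods.copy()
    # For each period, list every day from start to end (both included)
    daily["date"] = [
        list(pd.date_range(start, end, freq="D"))
        for start, end in zip(daily["start_ashore"], daily["end_ashore"])
    ]
    # explode() turns each list element into its own row and repeats
    # the other columns of the period
    daily = daily.explode("date", ignore_index=True)
    daily["date"] = pd.to_datetime(daily["date"])
    # Number of days elapsed since the boat arrived in the port
    daily["nb_day_ashore"] = (daily["date"] - daily["start_ashore"]).dt.days
    return daily.drop(columns="end_ashore")


daily = expand_to_daily(periods)
print(daily)
```

The label and the distance are simply copied onto every daily row, which is why rows from the same period are almost identical.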

I get pretty good results, but I wonder whether this method is correct: I don’t know if it is okay to take several points for each period. The dataframe has more rows for the first ashore period than for the second. This could create a bias in favour of long ashore periods: because they have more rows, they carry more weight in both model training and testing.
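To make the concern concrete, here is a small check on the example rows. It is only a sketch on the toy data shown above: it compares the share of rows each period contributes with the share of periods it represents, and the variable names are just illustrative.

```python
import pandas as pd

# Daily rows copied from the second dataframe above (relevant columns only)
daily = pd.DataFrame({
    "ashore_period_id": [1, 1, 1, 1, 1, 2, 2],
    "nb_day_ashore": [0, 1, 2, 3, 4, 0, 1],
    "same_workplace_after_ashore": [False] * 5 + [True] * 2,
})

# Fraction of all rows contributed by each ashore period
row_share = daily["ashore_period_id"].value_counts(normalize=True).sort_index()

# Rate of the positive label, counted per row and counted per period
label_rate_per_row = daily["same_workplace_after_ashore"].mean()
label_rate_per_period = (
    daily.groupby("ashore_period_id")["same_workplace_after_ashore"].first().mean()
)

print(row_share)
print(label_rate_per_row, label_rate_per_period)
```

In this toy case the long period supplies 5 of the 7 rows, so the positive label appears in 2/7 of the rows although it applies to 1 of the 2 periods.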

Question

Do you know if it is okay to do it like this? Or should I process my data with another method?

Ideas

  • Instead of taking 1 row per day for each ashore period, I could take only 1 (or several) date(s) chosen at random within each period. But that would reduce the size of my training set by a lot.
  • I could assign weights to ashore periods depending on their durations.
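The two ideas above could be sketched like this with pandas. It is a rough illustration on the toy rows, not a validated solution: the weighting scheme (each row weighted by 1 / number of rows in its period, so that every period sums to 1) is one possible choice that I am assuming here, and the variable names are hypothetical.

```python
import pandas as pd

# Daily rows copied from the second dataframe above (relevant columns only)
daily = pd.DataFrame({
    "ashore_period_id": [1, 1, 1, 1, 1, 2, 2],
    "nb_day_ashore": [0, 1, 2, 3, 4, 0, 1],
    "workplace_dist": [450, 450, 450, 450, 450, 20, 20],
    "same_workplace_after_ashore": [False] * 5 + [True] * 2,
})

# Idea 1: keep a single randomly chosen day for each ashore period
one_per_period = daily.groupby("ashore_period_id").sample(n=1, random_state=0)

# Idea 2: keep every row, but down-weight rows from long periods so that
# each period has the same total weight (assumed scheme: 1 / rows in period)
rows_per_period = daily["ashore_period_id"].map(
    daily["ashore_period_id"].value_counts()
)
weights = 1.0 / rows_per_period

print(one_per_period)
print(weights.groupby(daily["ashore_period_id"]).sum())
```

With scikit-learn, such weights could then be passed through the `sample_weight` argument of `RandomForestClassifier.fit`, which would keep all the daily rows while balancing the influence of each period.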
