Classification with feature not available at time of model creation

Question

I have a problem statement: predict the probability of a task being solved based on multiple features, e.g. when the task was created, the time needed to work on it, etc. A dummy snippet is below:

task_id  date_time_open  time_needed  day_created  time_created  status
aa       12/09/2019      20 hrs       Tuesday      3 pm          done
cc       17/10/2019      4 hrs        Friday       10 pm         not_done

I know I can run a classification model to identify the class. However, things get complicated when I add a time dimension, since the data set then gains an additional feature that strongly affects the status.

Suppose the tasks were scanned at 7 pm and a new feature, status_7pm, was added:

task_id  date_time_tsk_open  time_needed  day_created  time_created  status_7pm  status
aa       12/09/2019          20 hrs       Tuesday      3 pm          done        done
cc       17/10/2019          4 hrs        Friday       10 pm         done        not_done
dd       19/10/2019          6 hrs        Friday       2 pm          done        done
ff       19/10/2019          9 hrs        Monday       4 pm          not_done    not_done

The tasks were then scanned again at a fixed interval of 1 hr, and each scan added a new feature to the data:

task_id  date_time_tsk_open  time_needed  day_created  time_created  status_8pm  status
aa       12/09/2019          20 hrs       Tuesday      3 pm          done        done
cc       17/10/2019          4 hrs        Friday       10 pm         not_done    not_done
dd       19/10/2019          6 hrs        Friday       2 pm          done        done
ff       19/10/2019          9 hrs        Monday       4 pm          not_done    not_done

In my understanding, the final prediction of status (resolved / un_resolved) should be based on features that include status_7pm and status_8pm.

What should the data structure for training such a classification model look like in order to generate a prediction at 9 pm for, say, sample task ff?

task_id  date_time_tsk_open  time_needed  day_created  time_created  status_7pm  status_8pm  status
ff       19/10/2019          9 hrs        Monday       4 pm          not_done    not_done    not_done

I assume the classification model should be trained on all of status_1, status_2, ..., status_8pm to classify status. Or would the model be retrained in memory every time a new hourly status column is added?

lcrmorin · Answer

The simplest way to go seems to be to build one row per task at each time step after 'time created', with 'status_n-1' and 'status_n' as features. That lets you handle the notion of time in rows rather than in columns. You might also want to:

Deal with time relatively: instead of considering the status at a given clock hour, you probably want to work with the status as a function of the time elapsed since the task was created.
Deal with task weighting (longer tasks will get more rows) one way or another. You could add weights to your model, based on 1/(expected length).
Add some features: it is unclear to me which features you use for prediction. As is, you seem to be trying to predict the time of completion from the start time and the expected length. You won't learn much beyond which tasks take longer than expected and some daily effects. I think that can be achieved more efficiently and more clearly with simple statistics.
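
To make the row-per-time-step layout concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the task records, the field names (`hours_since_created`, `status_prev`, `status_now`, `weight`, `target`) and the helper `to_long_rows` are made up, a task is assumed to be 'not_done' before its first scan, and the weight is taken as 1 / (number of rows the task produces), which is only one possible reading of the 1/(expected length) idea.

```python
# Hypothetical scan log: one record per task, with the status observed at
# each hourly scan after creation. These values are illustrative only.
tasks = [
    {"task_id": "t1", "time_needed_h": 6, "day_created": "Friday",
     "hour_created": 14, "scans": ["not_done", "not_done", "done"],
     "final_status": "done"},
    {"task_id": "t2", "time_needed_h": 9, "day_created": "Monday",
     "hour_created": 16,
     "scans": ["not_done", "not_done", "not_done", "not_done"],
     "final_status": "not_done"},
]


def to_long_rows(task):
    """Expand one task into one training row per hourly scan."""
    scans = task["scans"]
    rows = []
    for step, status in enumerate(scans, start=1):
        # Assumption: before the first scan the task counts as not_done.
        prev_status = scans[step - 2] if step > 1 else "not_done"
        rows.append({
            "task_id": task["task_id"],
            "hours_since_created": step,   # relative time, not clock time
            "time_needed_h": task["time_needed_h"],
            "day_created": task["day_created"],
            "hour_created": task["hour_created"],
            "status_prev": prev_status,    # plays the role of status_n-1
            "status_now": status,          # plays the role of status_n
            "weight": 1.0 / len(scans),    # each task sums to weight 1
            "target": task["final_status"],
        })
    return rows


training_rows = [row for task in tasks for row in to_long_rows(task)]

for row in training_rows:
    print(row)
```

With this layout the model is trained once on all rows, and no retraining is needed when a new scan arrives: to score a task at its next scan you build a single new row with the same columns (minus `target`) and pass it to the fitted model. The `weight` column could be supplied as per-row sample weights, for example through the `sample_weight` argument that many scikit-learn estimators accept in `fit`.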
