TransWikia.com

Classification with feature not available at time of model creation

Data Science Asked by EricA on January 26, 2021

I have a problem statement: predict the probability of a task being solved based on multiple features, e.g. when the task was created, the time needed to work on it, etc. A dummy snippet is shown below:

task_id  date_time_open  time_needed  day_created  time_created  status

aa       12/09/2019      20 hrs       Tuesday      3 pm          done
cc       17/10/2019      4 hrs        Friday       10 pm         not_done

I know I can run a classification model to predict the class. However, things get complicated when I add a time dimension, because the data set then gains a feature that strongly affects the status.

Suppose the tasks were scanned at 7 pm and a new feature, status_7pm, was added:

task_id  date_time_tsk_open  time_needed  day_created  time_created  status_7pm  status

aa       12/09/2019          20 hrs       Tuesday      3 pm          done        done
cc       17/10/2019          4 hrs        Friday       10 pm         done        not_done
dd       19/10/2019          6 hrs        Friday       2 pm          done        done
ff       19/10/2019          9 hrs        Monday       4 pm          not_done    not_done

The tasks were then scanned again at a fixed interval of 1 hr, and each scan added a new feature to the data:

task_id  date_time_tsk_open  time_needed  day_created  time_created  status_8pm  status

aa       12/09/2019          20 hrs       Tuesday      3 pm          done        done
cc       17/10/2019          4 hrs        Friday       10 pm         not_done    not_done
dd       19/10/2019          6 hrs        Friday       2 pm          done        done
ff       19/10/2019          9 hrs        Monday       4 pm          not_done    not_done

In my understanding, the final prediction of status (resolved / un_resolved) should be based on features that include status_7pm and status_8pm.

What should the data structure for training such a classification model look like in order to generate a prediction at 9 pm, for example for sample task ff?

task_id  date_time_tsk_open  time_needed  day_created  time_created  status_7pm  status_8pm  status

ff       19/10/2019          9 hrs        Monday       4 pm          not_done    not_done    not_done

I assume the classification model should be trained on all of status_1, status_2, …, status_8pm to classify status. Or would the model have to be retrained in memory every time a new hourly status column is added?

One Answer

The simplest approach seems to be to build one row per time step after 'time created', with 'status_n-1' and 'status_n' as columns. That lets you handle the notion of time in rows. You might also want to:

  • Treat time relatively: instead of considering the status at a given clock hour, you probably want to work with the status as a function of the time elapsed since the task was created.

  • Handle task weighting one way or another, because longer tasks will get more rows. You could add weights to your model based on 1/(expected length).

  • Add some features: it is unclear to me which features you use for prediction. As it stands, you seem to be trying to predict time of completion from start time and expected length. You won't learn much beyond which tasks take longer than expected and some daily effects. I think this can be achieved more efficiently and more clearly with simple statistics.
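To make the row-per-time-step idea concrete, here is a minimal sketch in plain Python. It is only an illustration under several assumptions that are not in the question: the scans happen on the same day the task was created, a task counts as not_done at creation, and each row is weighted by 1/(number of rows for that task) as one possible stand-in for the 1/(expected length) weighting. The names `wide_rows`, `to_long_format`, `hours_since_created`, `status_prev`, `status_now` and `weight` are hypothetical.

```python
# Hypothetical wide-format input: one record per task, one status per scan hour.
# Values are taken from tasks dd and ff in the question; hours use a 24h clock.
wide_rows = [
    {"task_id": "dd", "time_needed_hrs": 6, "created_hour": 14,
     "scans": {19: "done", 20: "done"}},
    {"task_id": "ff", "time_needed_hrs": 9, "created_hour": 16,
     "scans": {19: "not_done", 20: "not_done"}},
]


def to_long_format(rows):
    """Return one row per (task, scan) with relative time and previous status."""
    long_rows = []
    for task in rows:
        scans = sorted(task["scans"].items())  # order scans by hour
        prev_status = "not_done"  # assumption: a task is open when created
        for scan_hour, status_now in scans:
            long_rows.append({
                "task_id": task["task_id"],
                "time_needed_hrs": task["time_needed_hrs"],
                # relative time instead of clock time (same-day assumption)
                "hours_since_created": scan_hour - task["created_hour"],
                "status_prev": prev_status,   # status_n-1
                "status_now": status_now,     # status_n
                # assumption: equal total weight per task, whatever its row count
                "weight": 1.0 / len(scans),
            })
            prev_status = status_now
    return long_rows


long_rows = to_long_format(wide_rows)
for row in long_rows:
    print(row)
```

With this layout a new hourly scan adds rows rather than columns, so the feature set stays fixed. Whether the target is `status_now` or the final status, and whether the `weight` column is passed to the model as a sample weight, are modelling choices that this sketch leaves open.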

Answered by lcrmorin on January 26, 2021
