Data Science Asked by EricA on January 26, 2021
I have a problem statement: predict the probability of a task being solved from multiple features, e.g. when the task was created, the time needed to work on it, etc. Here is a dummy snippet:
task_id | date_time_open | time_needed | day_created | time_created | status
aa | 12/09/2019 | 20 hrs | Tuesday | 3 pm | done
cc | 17/10/2019 | 4 hrs | Friday | 10 pm | not_done
I know I can run a classification model to identify the class. However, things get complicated when I add a time dimension, since the data set then gains an extra feature that strongly affects the status.
Suppose the tasks were scanned at 7 pm and a new feature was added for their status at 7 pm:
task_id | date_time_tsk_open | time_needed | day_created | time_created | status_7pm | status
aa | 12/09/2019 | 20 hrs | tuesday | 3pm | done | done
cc | 17/10/2019 | 4 hrs | friday | 10 pm | done | not_done
dd | 19/10/2019 | 6 hrs | friday | 2 pm | done | done
ff | 19/10/2019 | 9 hrs | Monday | 4 pm | not_done | not_done
The tasks were then scanned again at a fixed interval of 1 hr, and each scan added a new feature to the data:
task_id | date_time_tsk_open | time_needed | day_created | time_created | status_8pm | status
aa | 12/09/2019 | 20 hrs | tuesday | 3pm | done | done
cc | 17/10/2019 | 4 hrs | friday | 10 pm | not_done | not_done
dd | 19/10/2019 | 6 hrs | friday | 2 pm | done | done
ff | 19/10/2019 | 9 hrs | Monday | 4 pm | not_done | not_done
As I understand it, the final prediction of status (resolved / un_resolved) should be based on features including status_7pm and status_8pm.
What should the data structure for training such a classification model look like in order to generate a prediction at 9 pm, for example for sample task ff?
task_id | date_time_tsk_open | time_needed | day_created | time_created | status_7pm | status_8pm | status
ff | 19/10/2019 | 9 hrs | Monday | 4 pm | not_done | not_done | not_done
I assume the classification model should be trained on all of status_1, status_2, …, status_8pm to classify status. Or would the model be retrained in memory every time a new hourly status column is added?
The simplest approach seems to be to build one row per time step after 'time created', with 'status_n-1' and 'status_n' as columns. That lets you handle the notion of time in rows rather than in columns. You might also want to:
Deal with time relatively: instead of considering the status at a given clock hour, you probably want to work with the status as a function of time elapsed since the task was created.
Deal with task weighting: longer tasks will get more rows, so you will need to compensate for that one way or another. You could add sample weights to your model, based on 1/(expected length).
Add some features: to me it is unclear which features you use for prediction. As is, you seem to be trying to predict the time of completion from the start time and the expected length. You won't learn much beyond which tasks take longer than expected and some daily effects. I think this can be achieved more efficiently and more clearly with simple statistics.
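To make the row-per-time-step idea concrete, here is a minimal sketch in plain Python. The field names (`scans`, `hours_since_created`, `status_prev`, `status_now`, `weight`) and the scan histories are made up for illustration and are not taken from the question's data. The weight uses 1/(number of rows for the task) as a stand-in for the 1/(expected length) suggested above; that substitution is an assumption.

```python
# Hypothetical input: one record per task, with its hourly scans listed in
# order of hours elapsed since creation (not clock time). Values are invented.
tasks = [
    {"task_id": "aa", "time_needed_hrs": 20, "day_created": "Tuesday",
     "scans": ["not_done", "not_done", "done"], "status": "done"},
    {"task_id": "ff", "time_needed_hrs": 9, "day_created": "Monday",
     "scans": ["not_done", "not_done", "not_done", "not_done"],
     "status": "not_done"},
]


def to_long_rows(task):
    """Return one training row per hourly scan of a single task."""
    scans = task["scans"]
    rows = []
    for hour, status_now in enumerate(scans, start=1):
        rows.append({
            "task_id": task["task_id"],
            "time_needed_hrs": task["time_needed_hrs"],
            "day_created": task["day_created"],
            # Relative time: hours since the task was created.
            "hours_since_created": hour,
            # Status at the previous scan; None for the first scan.
            "status_prev": scans[hour - 2] if hour > 1 else None,
            "status_now": status_now,
            # Final status is the label to predict.
            "target": task["status"],
            # Assumed weighting: each task contributes a total weight of 1,
            # however many rows it produces.
            "weight": 1.0 / len(scans),
        })
    return rows


# Long-format training set: rows from all tasks stacked together.
long_rows = [row for task in tasks for row in to_long_rows(task)]

for row in long_rows:
    print(row)
```

With this layout a single model can be trained once on all rows and queried at any scan, since a new hourly scan adds rows rather than a new column. The `weight` values could be passed as per-sample weights to a classifier that supports them.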
Answered by lcrmorin on January 26, 2021