Data Science Asked by Kadin on September 16, 2020
I am working on a model which will allow me to predict how long it will take for a "job" to be completed, based on historical data. Each job has a handful of categorical characteristics (all independant), and some historic data might look like:
JobID Manager City Design ClientType TaskDuration
a1 George Brisbane BigKahuna Personal 10
a2 George Brisbane SmallKahuna Business 15
a3 George Perth BigKahuna Investor 7
Thus far, my model has been relatively basic, following these basic steps:
Category Value Mean Count
Manager George 10.66 3
City Brisbane 12.5 2
City Perth 7 1
Design BigKahuna 8.5 2
Design SmallKahuna 15 1
ClientType Personal 10 1
ClientType Business 15 1
ClientType Investor 7 1
JobID Manager City Design ClientType
b5 George Brisbane SmallKahuna Investor
Category Value CalculatedMean CalculatedCount Factor (Mean * Count)
Manager George 10.66 3 31.98
City Brisbane 12.5 2 25
Design SmallKahuna 15 1 15
ClientType Investor 7 1 7
TaskDuration = SUM(Factor) / SUM(CalculatedCount)
= 78.98 / 7
= 11.283
~= 11 days
After testing my model on a few hundred finished jobs from the last four months, I calculated average discrepancies ranging from -15% to +25%.
In my actual model I have 15 categories, and am drawing historical data from ~400 jobs.
I think the largest issue (amongst others) is the simplicity of my model. Are their better/well established methods for calculating a value based on categorical data? And if not, how can I improve my predictions?
So, from what I understand of the question, you are asking how to model the duration of a job, given the input (which includes City
and ClientType
).
In this case you can use something like a feedforward neural network to model this problem. You might find, that using this methods, the level of prediction error will be lower than that produced by your model, which could act as a baseline to see if these models do work better for your problem.
When representing each of the categorical variables, we use something called one-hot encoding. Then we concatenate these categorical variables into one n-dimensional input vector, which represents all the features for one example.
Answered by shepan6 on September 16, 2020
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP