Model for predicting duration based on categorical data

Question

I am working on a model that will let me predict how long a "job" will take to complete, based on historical data. Each job has a handful of categorical characteristics (all independent), and some historical data might look like:
JobID   Manager     City        Design          ClientType      TaskDuration
a1      George      Brisbane    BigKahuna       Personal        10
a2      George      Brisbane    SmallKahuna     Business        15
a3      George      Perth       BigKahuna       Investor        7

Thus far, my model has been relatively simple and follows these steps:

Aggregate the historical data by each category value, calculating the mean duration and counting how many times that value occurs. From the previous example, the result would be:

Category        Value           Mean    Count
Manager         George          10.66   3
City            Brisbane        12.5    2
City            Perth           7       1
Design          BigKahuna       8.5     2
Design          SmallKahuna     15      1
ClientType      Personal        10      1
ClientType      Business        15      1
ClientType      Investor        7       1

For each job in the system, estimate the job duration as the count-weighted average of the means for its category values. For example:

JobID   Manager     City        Design          ClientType
b5      George      Brisbane    SmallKahuna     Investor

Category        Value           CalculatedMean      CalculatedCount     Factor (Mean * Count)
Manager         George          10.66               3                   31.98
City            Brisbane        12.5                2                   25
Design          SmallKahuna     15                  1                   15
ClientType      Investor        7                   1                   7

TaskDuration    = SUM(Factor) / SUM(CalculatedCount)
                = 78.98 / 7
                = 11.283
                ~= 11 days
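
The two steps above can be sketched in Python roughly as follows. This is only a minimal illustration under assumptions: the function names `aggregate` and `predict` and the list-of-dicts layout are hypothetical, and category values with no history are simply skipped. With unrounded means the example job comes out at about 11.29 rather than 11.283, because 10.66 above is a truncated 10.667.

```python
from collections import defaultdict

# Historical jobs from the example table (layout is an assumption).
history = [
    {"Manager": "George", "City": "Brisbane", "Design": "BigKahuna",
     "ClientType": "Personal", "TaskDuration": 10},
    {"Manager": "George", "City": "Brisbane", "Design": "SmallKahuna",
     "ClientType": "Business", "TaskDuration": 15},
    {"Manager": "George", "City": "Perth", "Design": "BigKahuna",
     "ClientType": "Investor", "TaskDuration": 7},
]
CATEGORIES = ["Manager", "City", "Design", "ClientType"]


def aggregate(rows, categories):
    """Step 1: map (category, value) -> (mean duration, count)."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in rows:
        for cat in categories:
            entry = totals[(cat, row[cat])]
            entry[0] += row["TaskDuration"]
            entry[1] += 1
    return {key: (total / count, count)
            for key, (total, count) in totals.items()}


def predict(job, stats, categories):
    """Step 2: SUM(mean * count) / SUM(count) over the job's values."""
    weighted_sum = 0.0
    total_count = 0
    for cat in categories:
        key = (cat, job[cat])
        if key not in stats:
            continue  # assumption: ignore values never seen before
        mean, count = stats[key]
        weighted_sum += mean * count
        total_count += count
    if total_count == 0:
        raise ValueError("no historical data for any value of this job")
    return weighted_sum / total_count


stats = aggregate(history, CATEGORIES)
job_b5 = {"Manager": "George", "City": "Brisbane",
          "Design": "SmallKahuna", "ClientType": "Investor"}
print(round(predict(job_b5, stats, CATEGORIES), 2))  # prints 11.29
```

Written this way, it is easy to see that mean * count is just the summed duration of the matching jobs, so a frequent value (here the manager) dominates the estimate.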

After testing my model on a few hundred finished jobs from the last four months, I calculated average discrepancies ranging from -15% to +25%.
In my actual model I have 15 categories, and am drawing historical data from ~400 jobs.
I think the largest issue (amongst others) is the simplicity of my model. Are there better or well-established methods for calculating a value based on categorical data? And if not, how can I improve my predictions?
Related question here.

shepan6 · Answer

From what I understand of the question, you are asking how to model the duration of a job given its categorical inputs (such as City and ClientType).
In this case you can use something like a feedforward neural network to model the problem. You might find that, using this method, the prediction error is lower than that of your current model, which can act as a baseline for checking whether such models really do work better for your problem.
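To represent each categorical variable, we use something called one-hot encoding: a variable with k possible values becomes a k-dimensional vector with a 1 in the position of the observed value and 0 everywhere else. We then concatenate these encoded vectors into one n-dimensional input vector, which represents all the features of one example.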
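
As a rough illustration of the encoding step, here is a minimal pure-Python sketch using the category values from the question's example table. The helper names `build_vocabulary` and `one_hot` are hypothetical, and the choice to encode an unseen value as all zeros is an assumption. In practice a library encoder would usually do this for you.

```python
# Category columns from the question's example data (without durations).
history = [
    {"Manager": "George", "City": "Brisbane", "Design": "BigKahuna",
     "ClientType": "Personal"},
    {"Manager": "George", "City": "Brisbane", "Design": "SmallKahuna",
     "ClientType": "Business"},
    {"Manager": "George", "City": "Perth", "Design": "BigKahuna",
     "ClientType": "Investor"},
]
CATEGORIES = ["Manager", "City", "Design", "ClientType"]


def build_vocabulary(rows, categories):
    """Collect the sorted set of values seen for each category."""
    return {cat: sorted({row[cat] for row in rows}) for cat in categories}


def one_hot(job, vocabulary):
    """Concatenate one 0/1 block per category into a single vector.

    Assumption: a value not present in the vocabulary yields an
    all-zero block for that category.
    """
    vector = []
    for cat, values in vocabulary.items():
        vector.extend(1 if job[cat] == value else 0 for value in values)
    return vector


vocabulary = build_vocabulary(history, CATEGORIES)
job_b5 = {"Manager": "George", "City": "Brisbane",
          "Design": "SmallKahuna", "ClientType": "Investor"}

# Blocks: Manager [George] | City [Brisbane, Perth]
#         | Design [BigKahuna, SmallKahuna]
#         | ClientType [Business, Investor, Personal]
print(one_hot(job_b5, vocabulary))  # prints [1, 1, 0, 0, 1, 0, 1, 0]
```

Here n is 8 (1 + 2 + 2 + 3 values). With 15 categories n will be larger, and each such vector, paired with the known TaskDuration, would be one training example for the network.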
