Determining which categorical data is beneficial in predictive modelling

Question

I am working on a model to predict how long a "job" will take to complete, based on historical data. Each job has a handful of categorical characteristics (all independent), and some historical data might look like:
JobID   Manager     City        Design          ClientType      TaskDuration
a1      George      Brisbane    BigKahuna       Personal        10
a2      George      Brisbane    SmallKahuna     Business        15
a3      George      Perth       BigKahuna       Investor        7

Thus far, my model has been relatively basic, following these steps:

Aggregate the historical data by each category value, calculating the mean duration and counting how many times each value occurs. From the previous example, the result would be:

Category        Value           Mean    Count
Manager         George          10.66   3
City            Brisbane        12.5    2
City            Perth           7       1
Design          BigKahuna       8.5     2
Design          SmallKahuna     15      1
ClientType      Personal        10      1
ClientType      Business        15      1
ClientType      Investor        7       1

For each job in the system, calculate the job duration based on the above. For example:

JobID   Manager     City        Design          ClientType
b5      George      Brisbane    SmallKahuna     Investor

Category        Value           CalculatedMean      CalculatedCount     Factor (Mean * Count)
Manager         George          10.66               3                   31.98
City            Brisbane        12.5                2                   25
Design          SmallKahuna     15                  1                   15
ClientType      Investor        7                   1                   7

TaskDuration    = SUM(Factor) / SUM(CalculatedCount)
                = 78.98 / 7
                = 11.283
                ~= 11 days
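
For reference, the two steps above can be sketched in a few lines of Python. This is only an illustration of the calculation as described, using the three example jobs; the names `aggregate` and `predict`, and the choice to skip category values that never appear in the history, are assumptions made for the sketch.

```python
from collections import defaultdict

# Historical jobs copied from the example table in the question.
history = [
    {"Manager": "George", "City": "Brisbane", "Design": "BigKahuna",
     "ClientType": "Personal", "TaskDuration": 10},
    {"Manager": "George", "City": "Brisbane", "Design": "SmallKahuna",
     "ClientType": "Business", "TaskDuration": 15},
    {"Manager": "George", "City": "Perth", "Design": "BigKahuna",
     "ClientType": "Investor", "TaskDuration": 7},
]
CATEGORIES = ["Manager", "City", "Design", "ClientType"]


def aggregate(jobs):
    """Step 1: return {(category, value): (mean, count)} over the history."""
    totals = defaultdict(lambda: [0.0, 0])
    for job in jobs:
        for cat in CATEGORIES:
            entry = totals[(cat, job[cat])]
            entry[0] += job["TaskDuration"]
            entry[1] += 1
    return {key: (total / count, count)
            for key, (total, count) in totals.items()}


def predict(job, stats):
    """Step 2: SUM(mean * count) / SUM(count) over the job's category values.

    Assumption: values with no history are skipped rather than guessed.
    """
    weighted_sum = 0.0
    total_count = 0
    for cat in CATEGORIES:
        key = (cat, job[cat])
        if key in stats:
            mean, count = stats[key]
            weighted_sum += mean * count
            total_count += count
    if total_count == 0:
        raise ValueError("no historical data matches this job")
    return weighted_sum / total_count


stats = aggregate(history)
new_job = {"Manager": "George", "City": "Brisbane",
           "Design": "SmallKahuna", "ClientType": "Investor"}
estimate = predict(new_job, stats)
print(round(estimate, 3))
```

Without intermediate rounding the estimate is 79 / 7, or about 11.286; the 11.283 above comes from rounding the Manager mean to 10.66 before multiplying.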

After testing my model on a few hundred finished jobs from the last four months, I calculated average discrepancies ranging from -15% to +25%.
I think one of my issues is that I may be taking into account categories that actually have no effect on the build time and are skewing my results. In reality, I'm taking 15 categories into account from ~400 completed jobs, and some of these categories have values that only appear once or twice (for example, we might only have a single job in Perth).
How can I determine which categories are actually beneficial to the model, and which should be ignored?
Related question here.

10xAI · Answer

You can try two things:

Measure the association between each category and the target.
Since this is between categorical features and a continuous target, fit a regression on each category and compare the R-squared or adjusted R-squared scores. Keep the best-scoring categories, drop the lowest few, and re-test the model.
Read more - Kaggle

Calculate feature importance using a random forest.
Read here - MachineLearningMastery
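
As a rough sketch of the first suggestion, the snippet below scores one category at a time using only the standard library. It relies on the fact that regressing the target on the dummy variables of a single categorical feature predicts each job by its group mean, so R-squared equals 1 minus the within-group sum of squares over the total sum of squares. The eight jobs and the function name `r_squared_by_category` are made up for illustration and are not the asker's data.

```python
from collections import defaultdict

# Made-up jobs for illustration only: City drives the duration,
# ClientType does not.
jobs = [
    {"City": "Brisbane", "ClientType": "Personal", "TaskDuration": 10},
    {"City": "Brisbane", "ClientType": "Business", "TaskDuration": 11},
    {"City": "Brisbane", "ClientType": "Personal", "TaskDuration": 12},
    {"City": "Brisbane", "ClientType": "Business", "TaskDuration": 11},
    {"City": "Perth", "ClientType": "Personal", "TaskDuration": 20},
    {"City": "Perth", "ClientType": "Business", "TaskDuration": 21},
    {"City": "Perth", "ClientType": "Personal", "TaskDuration": 19},
    {"City": "Perth", "ClientType": "Business", "TaskDuration": 20},
]


def r_squared_by_category(rows, category, target="TaskDuration"):
    """Return (R^2, adjusted R^2) for predicting target from one category."""
    values = [row[target] for row in rows]
    n = len(values)
    grand_mean = sum(values) / n
    ss_total = sum((v - grand_mean) ** 2 for v in values)

    # Group the target by category value.
    groups = defaultdict(list)
    for row in rows:
        groups[row[category]].append(row[target])

    # Residual sum of squares around each group's own mean.
    ss_residual = 0.0
    for durations in groups.values():
        group_mean = sum(durations) / len(durations)
        ss_residual += sum((d - group_mean) ** 2 for d in durations)

    r2 = 1.0 - ss_residual / ss_total if ss_total > 0 else 0.0
    k = len(groups) - 1  # number of dummy variables in the regression
    if n - k - 1 > 0:
        adjusted = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    else:
        adjusted = float("nan")  # too few rows to adjust
    return r2, adjusted


scores = {cat: r_squared_by_category(jobs, cat)
          for cat in ["City", "ClientType"]}
# Rank categories from most to least informative by adjusted R^2.
for cat, (r2, adjusted) in sorted(scores.items(),
                                  key=lambda item: item[1][1],
                                  reverse=True):
    print(f"{cat}: R2={r2:.3f}, adjusted R2={adjusted:.3f}")
```

Adjusted R-squared is the safer score to rank by here: a category with many values that each appear only once or twice fits the historical data almost perfectly, which inflates plain R-squared without improving predictions on new jobs.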
