Data Science Asked on November 10, 2020
I am working on a model that will let me predict how long a "job" will take to complete, based on historical data. Each job has a handful of categorical characteristics (all independent), and some historic data might look like:
JobID  Manager  City      Design       ClientType  TaskDuration
a1     George   Brisbane  BigKahuna    Personal    10
a2     George   Brisbane  SmallKahuna  Business    15
a3     George   Perth     BigKahuna    Investor    7
Thus far, my model has been relatively basic, following these steps:
Aggregate the historic data to get the mean duration and the job count for each category value:
Category    Value        Mean   Count
Manager     George       10.66  3
City        Brisbane     12.5   2
City        Perth        7      1
Design      BigKahuna    8.5    2
Design      SmallKahuna  15     1
ClientType  Personal     10     1
ClientType  Business     15     1
ClientType  Investor     7      1
Take a new job whose duration needs to be predicted, for example:
JobID  Manager  City      Design       ClientType
b5     George   Brisbane  SmallKahuna  Investor
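Look up the calculated mean and count for each of the new job's category values, and multiply them to get a factor:
Category    Value        CalculatedMean  CalculatedCount  Factor (Mean * Count)
Manager     George       10.66           3                31.98
City        Brisbane     12.5            2                25
Design      SmallKahuna  15              1                15
ClientType  Investor     7               1                7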
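Divide the sum of the factors by the sum of the counts, which gives a count-weighted average of the means:
TaskDuration = SUM(Factor) / SUM(CalculatedCount)
             = 78.98 / 7
             = 11.283
            ~= 11 days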
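For reference, the calculation above can be sketched in Python with pandas. This is only an illustration of the steps as described, not the asker's actual code: the function names `build_lookup` and `predict_duration` are made up here, and skipping category values that never occur in the history is an assumption.

```python
import pandas as pd

# Historical jobs copied from the example table in the question.
history = pd.DataFrame(
    {
        "JobID": ["a1", "a2", "a3"],
        "Manager": ["George", "George", "George"],
        "City": ["Brisbane", "Brisbane", "Perth"],
        "Design": ["BigKahuna", "SmallKahuna", "BigKahuna"],
        "ClientType": ["Personal", "Business", "Investor"],
        "TaskDuration": [10, 15, 7],
    }
)
CATEGORIES = ["Manager", "City", "Design", "ClientType"]


def build_lookup(df, categories, target="TaskDuration"):
    """Return {category: table indexed by value with 'mean' and 'count'}."""
    return {c: df.groupby(c)[target].agg(["mean", "count"]) for c in categories}


def predict_duration(job, lookup):
    """Count-weighted average of the per-value means for one job (a dict)."""
    weighted_sum = 0.0
    total_count = 0
    for category, stats in lookup.items():
        value = job[category]
        if value not in stats.index:
            continue  # assumption: values never seen in the history are skipped
        weighted_sum += stats.loc[value, "mean"] * stats.loc[value, "count"]
        total_count += stats.loc[value, "count"]
    if total_count == 0:
        raise ValueError("no category value of this job appears in the history")
    return weighted_sum / total_count


lookup = build_lookup(history, CATEGORIES)
new_job = {
    "Manager": "George",
    "City": "Brisbane",
    "Design": "SmallKahuna",
    "ClientType": "Investor",
}
prediction = predict_duration(new_job, lookup)
print(round(prediction, 2))
```

The result is 79 / 7, about 11.29, slightly higher than the 11.283 above because the mean for George is kept unrounded (10.667 rather than 10.66).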
After testing my model on a few hundred finished jobs from the last four months, I calculated average discrepancies ranging from -15% to +25%.
I think one of my issues is that I may be taking into account categories that actually have no effect on the build time, and these are skewing my results. In reality, I'm taking 15 categories into account from ~400 completed jobs, and some of these categories have values that only appear once or twice (for example, we might only have a single job in Perth).
How can I determine which categories are actually beneficial to the model, and which should be ignored?
You can try two things:
First, try finding the correlation between the categories and the target.
Since this is between categorical features and a continuous target, you should get the R-squared or adjusted R-squared score of a regression on each category, see which scores are best, drop the lowest few, and try again.
Read more - Kaggle
Second, calculate feature importance using a random forest.
Read here - MachineLearningMastery
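A minimal sketch of both suggestions, assuming pandas, NumPy and scikit-learn are available. The data below is synthetic and made up purely for illustration (the manager names other than George, the city Sydney, and the effect sizes are all hypothetical): City shifts the duration by construction and Manager is pure noise, so both scores should rank City first. The helper name `adjusted_r2` is also just a placeholder.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the real job history (hypothetical values).
rng = np.random.default_rng(0)
n = 400
jobs = pd.DataFrame(
    {
        "City": rng.choice(["Brisbane", "Perth", "Sydney"], size=n),
        "Manager": rng.choice(["George", "Anna", "Raj", "Mei"], size=n),
    }
)
city_effect = jobs["City"].map({"Brisbane": 12.0, "Perth": 7.0, "Sydney": 10.0})
jobs["TaskDuration"] = city_effect + rng.normal(0.0, 1.0, size=n)


def adjusted_r2(df, feature, target):
    """Adjusted R^2 of regressing the target on one one-hot encoded category.

    With a single categorical predictor the fitted values are the group
    means, so R^2 = 1 - SS_within / SS_total.
    """
    y = df[target]
    group_means = df.groupby(feature)[target].transform("mean")
    ss_within = ((y - group_means) ** 2).sum()
    ss_total = ((y - y.mean()) ** 2).sum()
    r2 = 1.0 - ss_within / ss_total
    n_obs, n_levels = len(df), df[feature].nunique()
    if n_obs <= n_levels:
        return float("nan")  # too few jobs per level for the adjustment
    # Intercept plus (n_levels - 1) dummies leaves n_obs - n_levels
    # residual degrees of freedom.
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_levels)


features = ["City", "Manager"]
scores = {f: adjusted_r2(jobs, f, "TaskDuration") for f in features}
print(scores)

# Random forest importance, summed over the dummy columns of each category.
X = pd.get_dummies(jobs[features])
forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X, jobs["TaskDuration"])
importance = pd.Series(forest.feature_importances_, index=X.columns)
per_category = importance.groupby(lambda col: col.split("_")[0]).sum()
print(per_category.sort_values(ascending=False))
```

The adjustment matters when a category has many rarely seen values, such as a city with a single job, because plain R-squared always rises as levels are added. Impurity-based forest importances can likewise favour categories with many distinct values, so it may be worth checking the ranking against held-out prediction error before dropping anything.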
Answered by 10xAI on November 10, 2020