Data Science Asked by phatgamer69 on March 3, 2021
I’m predicting the hours that will be worked on building tasks. Because the overall sample size is low, I’ve stacked multiple related tasks together into a single model. (There may be 100 total samples in a single model, with each task having 10 to 20 samples individually.) An example would be: how long will it take a worker to complete each task associated with installing 2 different sizes of pipe in a hospital?
There are many tasks associated with installing a pipe, for example cutting, welding, bending, and riveting.
We know from experience that the more bends a pipe has, the more difficult it is to weld. But the difficulty of cutting and riveting is completely unrelated to the number of bends. Additionally, there are multiple sizes of pipe in a single model, and the above tasks are completely unrelated between different sizes of pipe. An example of the data is:
| Task       | Pipe Size | Amount | Ratio of Bends to Welds | Predicted Hours |
|------------|-----------|--------|-------------------------|-----------------|
| Cut Pipe   | 3 inches  | 5      | NULL                    | 2               |
| Weld Pipe  | 3 inches  | 10     | 2                       | 4               |
| Bend Pipe  | 3 inches  | 20     | NULL                    | 8               |
| Rivet Pipe | 3 inches  | 10     | NULL                    | 2               |
| Cut Pipe   | 10 inches | 1      | NULL                    | 1               |
| Weld Pipe  | 10 inches | 2      | 5                       | 2               |
| Bend Pipe  | 10 inches | 10     | NULL                    | 15              |
| Rivet Pipe | 10 inches | 1      | NULL                    | 0.5             |
There are many different types of these "ratio" features within a single model. My current plan is to include them and null out each feature in all tasks where it isn’t relevant. It’s the first time I’ve stacked this many classes together in a single model, and also the first time I’ve encountered features that apply to some rows and not others. I’m currently using a random forest model. Is there anything conceptually wrong with doing this?
If I understood correctly, you said that Pipe Size does not have a correlation with the Predicted Hours. If you are sure that one or more variables carry no information about the target, then drop them.
Other than that, if a feature (or several of your features) is related only to a specific task, then replace its null values with 0 in the rows for the other tasks.
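As a minimal sketch of that zero-fill step, assuming the data is loaded into a pandas DataFrame: the rows below are the 3-inch rows from the question's table, and the column names (`task`, `pipe_size_in`, `amount`, `bend_weld_ratio`) are made up for illustration.

```python
import pandas as pd

# 3-inch rows copied from the question's example table.
# Column names are hypothetical; None marks "ratio not applicable".
df = pd.DataFrame(
    {
        "task": ["Cut Pipe", "Weld Pipe", "Bend Pipe", "Rivet Pipe"],
        "pipe_size_in": [3, 3, 3, 3],
        "amount": [5, 10, 20, 10],
        "bend_weld_ratio": [None, 2.0, None, None],
    }
)

# Replace the nulls with 0 on every row where the ratio does not apply.
df["bend_weld_ratio"] = df["bend_weld_ratio"].fillna(0)

print(df)
```

After the fill, only the Weld Pipe row carries a non-zero ratio, so the feature can only influence predictions for that task as long as the model also sees which task each row belongs to.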
In your case, I think you should also try simpler algorithms such as Linear Regression. Be careful, though, about the balance between sample size and number of features. Since you have few samples, use at most 1 variable for every 15 samples, i.e. 15 subjects per variable (SPV). (In rare cases 10 can be used too, but it is not recommended.) As this source mentions:
In summarizing the findings by Schmidt, Green suggested that the minimum number of SPV ranges from 15 to 25.
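To make the rule of thumb concrete, here is a small sketch of the arithmetic, using the roughly 100 samples per model mentioned in the question. The helper name `max_features` is hypothetical.

```python
def max_features(n_samples: int, samples_per_variable: int = 15) -> int:
    """Upper bound on the number of features under a samples-per-variable rule of thumb."""
    return n_samples // samples_per_variable


# About 100 samples in one stacked model, as described in the question.
print(max_features(100))      # 15 samples per variable -> 6 features
print(max_features(100, 10))  # the looser 10-per-variable rule -> 10 features
```

So with 100 samples, the 15-per-variable guideline leaves room for about 6 features in total, which is worth keeping in mind when adding several task-specific ratio features.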
You also need to evaluate your model's performance. Since you may have an unbalanced number of samples for each Task Type, be careful when splitting your data into train and test sets: make sure the test data has enough samples from each Task Type.
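One way to get such a split is to stratify by task. The following is a hand-rolled sketch using only the standard library; the function name `stratified_split`, the 20% test fraction, and the sample counts are assumptions for illustration. If you use scikit-learn, `train_test_split` has a `stratify` parameter that serves the same purpose.

```python
import random
from collections import defaultdict


def stratified_split(tasks, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that each task type appears in the test set."""
    by_task = defaultdict(list)
    for i, task in enumerate(tasks):
        by_task[task].append(i)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_task.values():
        rng.shuffle(indices)
        # Hold out at least one sample per task while leaving at least one
        # for training (a task with a single sample stays in training).
        n_test = min(max(1, round(len(indices) * test_fraction)), len(indices) - 1)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Hypothetical, unbalanced task counts in the 10 to 20 range from the question.
tasks = ["Cut Pipe"] * 15 + ["Weld Pipe"] * 10 + ["Bend Pipe"] * 20 + ["Rivet Pipe"] * 12
train_idx, test_idx = stratified_split(tasks)
print(len(train_idx), len(test_idx))
```

Splitting within each task, rather than over the whole dataset at once, keeps a small task from ending up entirely in the training set by chance.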
Some other important things:
Correct answer by Shahriyar Mammadli on March 3, 2021