How to Include Features that Apply to Specific Classes

Question

I'm predicting the hours that will be worked on building tasks. Because the overall sample size is low, I've stacked multiple related tasks together into a single model (a single model may have about 100 samples in total, with each task contributing 10 to 20). An example would be: how long will it take a worker to complete each task associated with installing 2 different sizes of pipe in a hospital?
There are many tasks associated with installing a pipe:

Cutting the pipe
Welding the pipe
Bending the pipe
Riveting the pipe

We know from experience that the more bends a pipe has, the more difficult it is to weld. But the difficulty of cutting and riveting is completely unrelated to the number of bends. Additionally, there are multiple sizes of pipe in a single model, and the above tasks are completely unrelated between different sizes of pipe. An example of the data is:

| Task       | Pipe Size | Amount | Ratio of Bends to Welds | Predicted Hours |
|------------|-----------|--------|-------------------------|-----------------|
| Cut Pipe   | 3 inches  | 5      | NULL                    | 2               |
| Weld Pipe  | 3 inches  | 10     | 2                       | 4               |
| Bend Pipe  | 3 inches  | 20     | NULL                    | 8               |
| Rivet Pipe | 3 inches  | 10     | NULL                    | 2               |
|            |           |        |                         |                 |
| Cut Pipe   | 10 inches | 1      | NULL                    | 1               |
| Weld Pipe  | 10 inches | 2      | 5                       | 2               |
| Bend Pipe  | 10 inches | 10     | NULL                    | 15              |
| Rivet Pipe | 10 inches | 1      | NULL                    | 0.5             |

There are many different types of these "ratio" features within a single model. My current plan is to include them and null out each feature in all tasks where it isn't relevant. It's the first time I've stacked this many classes together in a single model, and also the first time I've encountered features that apply to some rows and not others. I'm currently using a random forest model. Is there anything conceptually wrong with doing this?

Shahriyar Mammadli · Accepted Answer

If I understood correctly, you said that Pipe Size does not correlate with Predicted Hours. If you are sure that one or more variables carry no information about the target, then drop them.
Other than that, if a feature (or several of your features) is related only to a specific task, then replace its null values with 0 in the rows for the other tasks.
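
A minimal sketch of that zero-fill, in plain Python. The rows and column names (`bend_weld_ratio`, `pipe_size_in`) are hypothetical stand-ins for the question's table. The sketch also assumes the task is one-hot encoded, which is not stated above but gives the model a way to tell "not applicable" apart from a genuine zero.

```python
# Hypothetical rows mirroring the question's table; None stands for NULL.
rows = [
    {"task": "Cut Pipe", "pipe_size_in": 3, "amount": 5, "bend_weld_ratio": None},
    {"task": "Weld Pipe", "pipe_size_in": 3, "amount": 10, "bend_weld_ratio": 2.0},
    {"task": "Bend Pipe", "pipe_size_in": 3, "amount": 20, "bend_weld_ratio": None},
    {"task": "Weld Pipe", "pipe_size_in": 10, "amount": 2, "bend_weld_ratio": 5.0},
]


def encode(rows, ratio_key="bend_weld_ratio"):
    """Fill a task-specific feature with 0 and one-hot encode the task."""
    tasks = sorted({r["task"] for r in rows})
    encoded = []
    for r in rows:
        out = {"pipe_size_in": r["pipe_size_in"], "amount": r["amount"]}
        # 0 here means "not applicable"; the task indicator columns below
        # are what let a model separate that from a real ratio of zero.
        out[ratio_key] = 0.0 if r[ratio_key] is None else r[ratio_key]
        for t in tasks:
            out["task=" + t] = 1 if r["task"] == t else 0
        encoded.append(out)
    return encoded


encoded = encode(rows)
for row in encoded:
    print(row)
```

With pandas the same fill is usually a single `fillna(0)` call on the column. The plain-Python version is only meant to make each step visible.
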
In your case, though, I think you should also try simpler algorithms such as linear regression. Be careful about the balance between sample size and number of features: since you have few samples, use at most 1 variable for every 15 samples (in rare cases 10 can be used too, but it is not recommended). As this source mentions:

In summarizing the findings by Schmidt, Green suggested that the minimum number of SPV ranges from 15 to 25.
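
As a rough illustration of what that rule of thumb implies for the sample sizes in the question, here is the arithmetic as a small helper. The function name `max_features` is hypothetical, and the result is a ceiling to sanity-check against, not a hard limit.

```python
def max_features(n_samples, samples_per_variable=15):
    """Largest number of predictors allowed by a samples-per-variable rule."""
    return n_samples // samples_per_variable


# About 100 stacked samples, as in the question.
print(max_features(100))      # 6 predictors at 15 samples per variable
print(max_features(100, 10))  # 10 predictors at the looser 10 per variable

# A single task with 20 samples supports only one predictor on its own.
print(max_features(20))       # 1
```

Note that one-hot task indicators and each task-specific ratio feature all count toward this budget.
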

You also need to evaluate your model's performance. Since you may have an unbalanced number of samples for each Task Type, be careful when splitting your data into train and test sets: make sure your test data has enough samples from each Task Type.
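
One way to get such a split is to sample the test rows separately within each task. The sketch below does this with the standard library only. The function name `stratified_split`, the task counts, and the 20% test fraction are all assumptions chosen for illustration.

```python
import math
import random
from collections import defaultdict


def stratified_split(rows, key="task", test_fraction=0.2, seed=0):
    """Split rows so that every task contributes at least one test row."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for task in sorted(groups):
        members = groups[task][:]
        rng.shuffle(members)
        # Round up so that small tasks still land in the test set.
        n_test = max(1, math.ceil(test_fraction * len(members)))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Hypothetical data: four tasks with 10 to 20 samples each.
counts = {"Cut Pipe": 10, "Weld Pipe": 20, "Bend Pipe": 15, "Rivet Pipe": 12}
rows = [{"task": t, "sample_id": i} for t, n in counts.items() for i in range(n)]

train, test = stratified_split(rows)
print(len(train), len(test))
```

If you are using scikit-learn, `train_test_split` with its `stratify` argument serves the same purpose. Be aware that in this sketch a task with a single sample would end up entirely in the test set.
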
Some other important things:

Try a bivariate analysis (between the target variable and each independent variable) before modeling, so you can eliminate unnecessary variables.
If you use linear regression, be careful about multicollinearity if you plan to use the coefficients to interpret your results. With multicollinearity your coefficient estimates become unreliable, although your predictions will not be spurious, i.e. they can still be trusted.
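
Both checks can be sketched with a plain Pearson correlation. The numbers below are made up and stand for the weld rows only. Restricting the check to the rows where the ratio feature applies is my assumption, since the feature is null elsewhere. For exactly two predictors the variance inflation factor reduces to 1 / (1 - r^2).

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical weld rows only (the ratio feature is null for other tasks).
bend_weld_ratio = [1.0, 2.0, 3.0, 4.0, 5.0]
amount = [10, 8, 9, 3, 2]
hours = [2.0, 4.5, 5.5, 8.0, 10.5]

# Bivariate check: does the feature move with the target?
r_target = pearson(bend_weld_ratio, hours)

# Multicollinearity check between the two predictors.
r_features = pearson(bend_weld_ratio, amount)
vif = 1.0 / (1.0 - r_features ** 2)

print(round(r_target, 3), round(r_features, 3), round(vif, 2))
```

A VIF well above 1 suggests the two predictors carry overlapping information, so their individual coefficients should be read with caution. With more than two predictors, the VIF for each one comes from regressing it on all the others.
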
