
Good approach to increase accuracy for a continuous value that is highly variable/sensitive to the inputs?

Data Science: asked by Sad CRUD Developer on April 30, 2021

I am trying to predict a continuous variable Y using a variety of algorithms and feature-engineering techniques. My issue is that Y is extremely variable, and I have reached an asymptote in accuracy.

This is the structure of my feature variables (with swapped variable names):

| Screw Width (mm) | Screw Height (mm) | Screw Angle (degrees) | Screw Type | Screw Material | Car Model Id | Car Age (Years) |
|---|---|---|---|---|---|---|
| 0.53 | 0.24 | 43 | Eye Bolt | Carbon | 1 | 3 |

My target variables are as follows:

| Speed without Screw | Speed with Screw |
|---|---|
| 24 | 29 |

These I merge into a single target variable Y:

| Speed Delta |
|---|
| 5 |

The Delta can range from -2,000 to 165,000. (Note that the variable names are swapped here as well, so I am not actually predicting speed.)

Currently my R2 score is 0.9 and my mean prediction error rate is 25%; I want to get the error down to 15%.

What I have tried so far:

  1. Constraining the input and running a sensitivity analysis on the variables for the current regressor choice

For some time I tried:

  1. Split the data (e.g. keep only rows with Screw Width < 0.5 mm)
  2. Build a model on the split subset
  3. Record the new R2 score/mean error

To my surprise, it wasn't one variable or one range of a variable that was causing the huge variance, as repeating this experiment gave me similar results. For example, filtering heavily on the variables and repeating steps 1-3 gave results like the following (a sketch of this subset-and-refit loop follows below):

[screenshot of filtered-run results omitted]
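A minimal sketch of that loop, assuming a pandas DataFrame `df` whose categorical columns are already encoded, a hypothetical target column `speed_delta`, and that the "mean error" is MAPE:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_percentage_error, r2_score
from sklearn.model_selection import train_test_split

def score_subset(df, mask):
    """Fit on one filtered slice of the data and report R2 and mean error."""
    subset = df[mask]
    # Hypothetical column names; assumes categoricals are already encoded.
    X = subset.drop(columns=["speed_delta"])
    y = subset["speed_delta"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return r2_score(y_test, pred), mean_absolute_percentage_error(y_test, pred)

# Step 1 above: keep only rows with Screw Width < 0.5 mm.
r2, mape = score_subset(df, df["screw_width_mm"] < 0.5)
print(f"R2 = {r2:.3f}, mean error = {mape:.1%}")
```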

  2. Using different regressors to predict a single continuous variable

I tried a simple linear regressor, an MLP regressor, gradient boosting, and a random forest. I haven't played much with the parameters of each regressor and am not sure if that is my downfall. Currently a RandomForestRegressor gives me the best results, but every regressor converges to a similar range.
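A minimal sketch of that comparison, assuming a prepared feature matrix `X` (categoricals already encoded) and target `y`; the hyperparameters shown are placeholders, not tuned values:

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

models = {
    "linear": LinearRegression(),
    "mlp": MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0),
    "gbm": GradientBoostingRegressor(random_state=0),
    "rf": RandomForestRegressor(n_estimators=200, random_state=0),
}
# Score every regressor on the same cross-validation folds for a fair comparison.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: R2 = {scores.mean():.3f} +/- {scores.std():.3f}")
```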

  3. Simple Feature Engineering

I tried one-hot encoding (OHE) for the non-numerical variables and log-transforming the data. The difference was noticeable but still small.
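A minimal sketch of such a pipeline, assuming hypothetical snake_case column names; note that log1p is only safe on non-negative features, and the target (which can be negative) is left untransformed here:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

categorical = ["screw_type", "screw_material", "car_model_id"]
numeric = ["screw_width_mm", "screw_height_mm", "screw_angle_deg", "car_age_years"]

preprocess = ColumnTransformer([
    # One-hot encode the non-numerical variables.
    ("ohe", OneHotEncoder(handle_unknown="ignore"), categorical),
    # log1p handles zeros; only valid for non-negative features.
    ("log", FunctionTransformer(np.log1p), numeric),
])
pipeline = Pipeline([
    ("prep", preprocess),
    ("model", RandomForestRegressor(n_estimators=200, random_state=0)),
])
pipeline.fit(X_train, y_train)
```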

  4. Running more data

I have close to 250,000 simulations (rows), and with the strategies/algorithms I deployed the error plateaued around 25% after about 150,000 simulations.
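One way to confirm that kind of plateau is a learning curve; a minimal sketch, assuming a prepared `X` and `y` covering the full ~250,000 rows:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

# Fit the same model on growing fractions of the data and track validation R2.
sizes, train_scores, val_scores = learning_curve(
    RandomForestRegressor(n_estimators=100, random_state=0),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=3,
    scoring="r2",
)
# A flat validation curve past ~150k rows suggests more data alone won't help.
print(sizes, val_scores.mean(axis=1))
```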


Summary: I need advice on how to tackle this problem. Do I just keep randomly trying methods and manipulating the data until something sticks, or is there a better path forward to higher accuracy?

What have I not tried?

2 Answers

Taking the difference (i.e. speed1 - speed2) as the target variable effectively discards any low-frequency variability and targets only high-frequency variability, even noise.

One approach would be to bin the (highly variable) target variable into fixed-range bins and take the midpoint (or any other fixed point) of each bin as the new, stabilised target variable. This will decrease variability (and noise) a bit, or even a lot. Any regression algorithm can then be used on the stabilised target, most probably with better results. The outcome depends on the number of bins: more bins, more variability; fewer bins, less variability; but there can be a balance point somewhere in between.
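A minimal sketch of the binning idea, assuming `y` is a pandas Series; the bin count is a free parameter to tune:

```python
import pandas as pd

n_bins = 100  # more bins keep more variability; fewer bins smooth more
binned = pd.cut(y, bins=n_bins)  # equal-width interval bins over y's range
# Replace each value by the midpoint of its bin to get the stabilised target.
y_stabilised = binned.apply(lambda interval: interval.mid).astype(float)

# Any regressor can now be fit on y_stabilised instead of y, e.g.:
# model.fit(X_train, y_stabilised.loc[X_train.index])
```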

Answered by Nikos M. on April 30, 2021

One option would be to transform the y/target variable to be distributed more like a Gaussian; the most common transformations are the log and quantile transformations. A Gaussian transformation often improves model fit statistics.
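A minimal sketch of the quantile transformation, assuming prepared train/test splits; scikit-learn's TransformedTargetRegressor applies the transform to y during fit and inverts it at predict time, so predictions come back on the original scale:

```python
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import QuantileTransformer

# Map y to an approximately normal distribution before fitting the regressor.
model = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=200, random_state=0),
    transformer=QuantileTransformer(output_distribution="normal", random_state=0),
)
model.fit(X_train, y_train)
pred = model.predict(X_test)  # predictions are back on the original scale
```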

Answered by Brian Spiering on April 30, 2021
