Data Science Asked on May 10, 2021
I am solving a problem with machine learning. I have data with two integer-valued independent variables and a continuous dependent variable, and I am optimising for RMSE. I had a fairly large RMSE on my validation data. I learned that my model didn’t do well on larger values of the target, so I tried removing rows with larger target values, but that didn’t help. So, in the process of understanding the mistakes, I calculated the error for each ground-truth value and its prediction on the validation set and plotted it to see where the big mistakes happened. Apparently, my model still doesn’t do well at larger values of the target.
Here is the plot showing the relationship between ground-truth values and predictions:
As you can see, my model’s predictions get worse as the target values get larger. How do I prevent this?
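For reference, a minimal sketch of this kind of diagnostic (assuming a fitted `model` and validation arrays `X_val`, `y_val`, which are placeholders here) might look like:

```python
import numpy as np
import matplotlib.pyplot as plt

# Predict on the validation set and compute the per-point squared error
y_pred = model.predict(X_val)
per_point_error = (y_val - y_pred) ** 2

# Scatter plot of ground truth vs. prediction, with a perfect-prediction reference line
plt.scatter(y_val, y_pred, s=10, alpha=0.5)
plt.plot([y_val.min(), y_val.max()],
         [y_val.min(), y_val.max()], "r--")
plt.xlabel("Ground truth")
plt.ylabel("Prediction")
plt.show()
```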
Some information about my data (only what I can reveal):
What can I do to decrease my RMSE?
If my understanding is right, you have a regression problem with high-cardinality categorical features and "outliers" (or just large values).
How have you encoded the categories? Target encoding? Another option is to encode not with the mean but with the median, which in some cases can perform better.
In this notebook, you can see an implementation and the results of this method.
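A minimal sketch of median-based target encoding (not the notebook's exact code; column and frame names are hypothetical):

```python
import pandas as pd

def fit_median_encoding(train: pd.DataFrame, col: str, target: str):
    """Learn a category -> median(target) mapping from the training data."""
    medians = train.groupby(col)[target].median()
    fallback = train[target].median()  # used for categories unseen in training
    return medians, fallback

def apply_median_encoding(df: pd.DataFrame, col: str, medians, fallback):
    """Replace the categorical column with its learned median encoding."""
    return df[col].map(medians).fillna(fallback)

# Hypothetical usage: fit on the training split, apply to both splits
# medians, fallback = fit_median_encoding(train, "category_col", "target")
# train["category_col_enc"] = apply_median_encoding(train, "category_col", medians, fallback)
# valid["category_col_enc"] = apply_median_encoding(valid, "category_col", medians, fallback)
```

Fitting the encoding only on the training split avoids leaking validation targets into the features.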
Answered by Carlos Mougan on May 10, 2021
It looks like your predictions are clamping at 750.
Be mindful that a tree-based model can't predict a regression value outside the range of targets it was trained on.
So, first of all, please ensure that your data doesn't have a trend.
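As an illustration (a synthetic sketch, not the asker's data), a tree regressor trained on targets up to roughly 750 will never predict beyond that range:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 100, size=(500, 1))
y_train = 7.5 * X_train.ravel()        # training targets span roughly 0..750

tree = DecisionTreeRegressor().fit(X_train, y_train)

X_new = np.array([[150.0], [200.0]])   # inputs beyond the training range
print(tree.predict(X_new))             # predictions stay capped near 750
```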
Answered by 10xAI on May 10, 2021