TransWikia.com

RMSE is higher for bigger values of target variable - how to decrease

Data Science Asked on May 10, 2021

I am solving a problem with machine learning and I have some data with two integer type independent variables and a continuous dependent variable. I am optimising to RMSE. I had fairly large RMSE value on my validation data. I learned that my model didn’t do good on larger values of target; so, I have tried removing rows with larger values and that didn’t help. So, now in the process of understanding the mistakes, I calculated RMSE for each ground truth value and it’s prediction from validation set and plotted it to understand where big mistakes had happened. Apparently, my model still doesn’t do good at larger values of target.

Here is the plot:
enter image description here

And here is the plot showing relationship between ground truth values and predictions:
enter image description here
As you can see, my model’s predictions got worse as the values got large. How do I prevent this?

Some information about my data(only what I can reveal):

  1. There is absolutely no linear relationship between independent variables and target. So, I have used a tree based model and random forest is giving me relatively good results.
  2. I can even went to say that both of my independent variables can be called categorical variables with very high categories.
  3. Also, there are a lot of values in independent variables, that occur only once.
  4. all the variables are highly skewed to the right.(range of IV_1: 0 to 3,700; IV_2: 0 to 40; target variable: 0 to 39,000)

How do I decrease my RMSE or doing what will decrease it?

2 Answers

If my understanding is right, you have a regression problem, with categorical features with high cardinality and "outliers" (or just big numbers).

How have you encoded categories? Target Encoding? There is another option that is not encoding with the mean but with the median that on some cases can perform better.

On this notebook , you can see an implementation adn the results of this method.

Answered by Carlos Mougan on May 10, 2021

It looks like that your prediction is clamping at 750.

Be mindful of the fact that Tree can't predict a Regression value that is outside the range it has been trained on.

So, first of all, please assure that your data doesn't have a trend.

Answered by 10xAI on May 10, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP