Data Science Asked by Elliot on August 19, 2020
I have a dataset where the target label is positively skewed with a long tail, and I currently get high residuals on the tail values when experimenting with linear, tree-based, and neural-network regression models.
I see the same problem with the Boston Housing prediction dataset, where the common recommendation is to apply a log transformation to the target label. This has given some small improvement, but not enough. Additionally, I've tried randomly duplicating values within the tail to shift the mean, although I'm not comfortable with this method.
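For reference, the log-transform approach I tried looks roughly like this (a minimal sketch with synthetic data; scikit-learn's `TransformedTargetRegressor` handles applying the transform at fit time and inverting it at predict time):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic right-skewed target (illustrative stand-in for the real data)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = np.expm1(X @ np.array([0.5, 1.0, -0.3]) + rng.normal(scale=0.2, size=500))

# Fit on log1p(y); predictions are mapped back to the original scale via expm1
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X, y)
preds = model.predict(X)
```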
Are there alternative transformations to apply, or models that can put a higher cost weighting on the tail labels where residuals are high?
Something that might work is normalizing/standardizing the target, e.g. scaling it onto [0, 1] (see min-max scaling). Changing the distribution itself isn't ideal, and I can't see how shifting the mean alone would lead to increased performance. If it makes sense for your problem, you could also experiment with moving from strict continuous regression to a categorical interpretation (see ordinal regression). Classification is generally easier than regression, so framing the problem that way sometimes helps performance, especially if the end goal is only a binary decision (e.g. buy/not buy).
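The min-max idea can be sketched with scikit-learn's `MinMaxScaler`; note the scaler must be inverted on predictions to report results on the original scale (synthetic data here, for illustration only):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(2)
y = np.expm1(rng.normal(size=200)).reshape(-1, 1)  # skewed synthetic target

scaler = MinMaxScaler()              # maps the target onto [0, 1]
y_scaled = scaler.fit_transform(y)

# ...fit any regressor on y_scaled, then invert predictions back:
y_back = scaler.inverse_transform(y_scaled)
```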
For example, when regressing income on the raw amount, the long tail dominates: the difference between a net worth of 100 billion and 500 billion is negligible relative to the mean. If a class prediction is acceptable for your use case, you can use logistic regression, an SVM, etc., and experiment with generating a classification instead of a direct numeric prediction.
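A minimal sketch of that reframing, assuming quantile-based bins (the cutoffs and synthetic "income" data below are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 2))
income = np.expm1(X[:, 0] + rng.normal(scale=0.3, size=400))  # skewed target

# Bin the continuous target into ordered classes at the median and 90th percentile
bins = np.quantile(income, [0.5, 0.9])
labels = np.digitize(income, bins)   # 0 = low, 1 = mid, 2 = high (tail)

# Predict the class instead of the raw amount
clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)
```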
Answered by Benjamin Ricard on August 19, 2020