Data Science Asked by Atte Juvonen on February 17, 2021
I have a skewed distribution that looks like this:
How can I transform it to a Gaussian distribution? The values represent ranks, so modifying the values does not cause information loss as long as the order of values remains the same. I’m doing this to test whether different distributions change the behavior of my ML models.
I’m working with Python/NumPy/Pandas/scikit-learn.
Edit: I should clarify that I have a lot of features and I’m looking to automatically transform all feature distributions. I was able to find a reasonable transformation for a single feature with a lot of experimentation, but it doesn’t generalize to other features:
normalize(np.log(0.30 + original))
** here would be image i.stack.imgur.com/uzorK.jpg but I don’t have enough rep to post more than 2 images **
normalize(np.log(0.17 + another_feature_distribution))
In this image the purple bars represent the original distribution of another feature, and the green bars represent the transformed distribution. No matter how much I tweak the constant, I can’t get the tall green bar at the far left to disappear. Also, I don’t have time to manually find a formula for each feature. Not sure if these are bell-shaped enough anyway?
You can apply a log transformation to your data with NumPy’s log function, as shown below:
import numpy as np
log_data = np.log(data)  # requires strictly positive values
This reduces right skew and, if the data are roughly log-normal, gives an approximately normal distribution; it does not guarantee normality in general. You can also try the Box-Cox transformation, which estimates the power transformation that best reduces skewness, although simply taking the natural logarithm is easier and works in many cases. More details about the Box-Cox transformation can be found here and here.
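As a rough illustration of the comparison above, here is a minimal sketch that applies both the plain log and scipy.stats.boxcox to a synthetic right-skewed sample and compares the skewness. The sample (`data`) and the random seed are assumptions made up for the example, not the asker’s feature, and Box-Cox only accepts strictly positive values.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for a skewed feature (assumption: strictly positive values).
rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=1.0, size=2000)

# Plain natural logarithm.
log_data = np.log(data)

# Box-Cox: the power parameter lambda is estimated by maximum likelihood.
boxcox_data, fitted_lambda = stats.boxcox(data)

skew_before = stats.skew(data)
skew_log = stats.skew(log_data)
skew_boxcox = stats.skew(boxcox_data)

print("fitted lambda:", fitted_lambda)
print("skewness before / log / Box-Cox:", skew_before, skew_log, skew_boxcox)
```

A fitted lambda close to 0 means Box-Cox is behaving essentially like the log transform; values further from 0 suggest a different power fits the feature better.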
Answered by enterML on February 17, 2021
For contemporary readers: scikit-learn now includes PowerTransformer (in sklearn.preprocessing) in its API, which provides a neat way of including these transforms in a workflow. See Preprocessing Transformers.
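Since the question asks for something that works across many features without hand-tuning, a minimal sketch with scikit-learn’s PowerTransformer might look like the following. The three synthetic columns in `X` are assumptions standing in for real features; the transformer fits one power parameter per column, and the Yeo-Johnson method also accepts zero and negative values (Box-Cox does not).

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import PowerTransformer

# Synthetic stand-in for a feature matrix with several skewed columns.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.lognormal(size=1000),
    rng.exponential(scale=2.0, size=1000),
    rng.gamma(shape=2.0, size=1000),
])

# One lambda is fitted per column; standardize=True also rescales each
# transformed column to zero mean and unit variance.
pt = PowerTransformer(method="yeo-johnson", standardize=True)
X_t = pt.fit_transform(X)

skew_before = stats.skew(X, axis=0)
skew_after = stats.skew(X_t, axis=0)

print("per-feature lambdas:", pt.lambdas_)
print("skewness before:", skew_before)
print("skewness after: ", skew_after)
```

Because it follows the usual fit/transform interface, the transformer can be placed in a scikit-learn Pipeline so that the lambdas are learned on the training data only.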
Answered by Rstall on February 17, 2021
If you fit a Johnson distribution to your data, the fitted a and b shape parameters (together with the location and scale) define a transformation that maps the data to an approximately normal distribution. See scipy.stats.johnsonsu or scipy.stats.johnsonsb.
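One way to read this suggestion, sketched here for the unbounded (SU) family only: fit scipy.stats.johnsonsu, then apply z = a + b * arcsinh((x - loc) / scale), which is the mapping under which a Johnson SU variable becomes standard normal in SciPy’s parameterization. The sample `data` is synthetic and assumed for illustration; on real features the fit may be imperfect, so the result is only approximately normal.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed sample (assumption: drawn from a Johnson SU distribution).
rng = np.random.default_rng(0)
data = stats.johnsonsu.rvs(a=-1.0, b=1.5, loc=0.0, scale=1.0,
                           size=2000, random_state=rng)

# Maximum-likelihood fit returns the shape parameters a, b plus loc and scale.
a, b, loc, scale = stats.johnsonsu.fit(data)

# Johnson SU to standard normal: z = a + b * arcsinh((x - loc) / scale).
z = a + b * np.arcsinh((data - loc) / scale)

print("fitted a, b, loc, scale:", a, b, loc, scale)
print("skewness before / after:", stats.skew(data), stats.skew(z))
```

For the bounded family, scipy.stats.johnsonsb uses log(y / (1 - y)) in place of arcsinh(y), with y = (x - loc) / scale.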
Answered by Josh on February 17, 2021