TransWikia.com

Transform a skewed distribution into a Gaussian distribution

Data Science Asked by Atte Juvonen on February 17, 2021

I have a skewed distribution that looks like this:

picture of a skewed distribution

How can I transform it to a Gaussian distribution? The values represent ranks, so modifying the values does not cause information loss as long as the order of values remains the same. I’m doing this to experiment if different distributions change the behavior of my ML models.

I’m working with Python/NumPy/Pandas/scikit-learn.

Edit: I should clarify that I have a lot of features and I’m looking to automatically transform all feature distributions. I was able to find a reasonable transformation for a single feature with a lot of experimentation, but it doesn’t generalize to other features:

normalize(np.log(0.30 + original)).

** here would be image i.stack.imgur.com/uzorK.jpg but I don’t have enough rep to post more than 2 images **

normalize(np.log(0.17 + another_feature_distribution)).

enter image description here

In this image the purple bars represent the original distribution of another feature, green bars represent the transformed distribution. No matter how much I tweak the constant, I don’t get the high green bar on the left extreme to disappear. Also, I don’t have time to manually find a formula for each feature. Not sure if these are bell-shaped enough anyway?

3 Answers

You can do a log transformation on your data with the help of numpy log functionality as shown below :

log_data = np.log(data)

This will transform the data into a normal distribution. Moreover, you can also try Box-Cox transformation which calculates the best power transformation of the data that reduces skewness although a simpler approach which can work in most cases would be applying the natural logarithm. More details about Box-Cox transformation can be found here and here

Answered by enterML on February 17, 2021

For contemporary viewers, an update in scikit-learn now includes the PowerTransformation in the API, providing a neat way of including these transforms in the workflow. See Preprocessing Transformers.

Answered by Rstall on February 17, 2021

If you fit a Johnson distribution to your data, the optimized a and b coefficients will transform the data to a normal distribution. See scipy.stats.johnsonsu or scipy.stats.johnsonsb

Answered by Josh on February 17, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP