Transform a skewed distribution into a Gaussian distribution

Question

I have a skewed distribution that looks like this:

How can I transform it to a Gaussian distribution? The values represent ranks, so modifying the values does not cause information loss as long as the order of values remains the same. I'm doing this to experiment if different distributions change the behavior of my ML models.
I'm working with Python/NumPy/Pandas/scikit-learn.
Edit: I should clarify that I have a lot of features and I'm looking to automatically transform all feature distributions. I was able to find a reasonable transformation for a single feature with a lot of experimentation, but it doesn't generalize to other features:
normalize(np.log(0.30 + original)).
** here would be image i.stack.imgur.com/uzorK.jpg but I don't have enough rep to post more than 2 images **
normalize(np.log(0.17 + another_feature_distribution)).

In this image the purple bars represent the original distribution of another feature, green bars represent the transformed distribution. No matter how much I tweak the constant, I don't get the high green bar on the left extreme to disappear. Also, I don't have time to manually find a formula for each feature. Not sure if these are bell-shaped enough anyway?

enterML · Answer

You can do a log transformation on your data with the help of numpy log functionality as shown below :

log_data = np.log(data)

This will transform the data into a normal distribution. Moreover, you can also try Box-Cox transformation which calculates the best power transformation of the data that reduces skewness although a simpler approach which can work in most cases would be applying the natural logarithm. More details about Box-Cox transformation can be found here and here

Rstall · Answer

For contemporary viewers, an update in scikit-learn now includes the PowerTransformation in the API, providing a neat way of including these transforms in the workflow. See Preprocessing Transformers.

Josh · Answer

If you fit a Johnson distribution to your data, the optimized a and b coefficients will transform the data to a normal distribution. See scipy.stats.johnsonsu or scipy.stats.johnsonsb

Transform a skewed distribution into a Gaussian distribution

3 Answers

Add your own answers!

Ask a Question