Problem with a feature (normal distribution + peak around 0)

Data Science Asked by Mario Tormo on April 26, 2021

I have a feature that describes a characteristic of the instances. That characteristic can be present or absent. When it is present, the values follow an almost normal distribution (actually a bit skewed to the right, but a log transformation makes it approximately normal). When the characteristic is absent, the value of the feature is just 0.

So in the end I have a distribution with a large number of instances at 0 and, some distance to the right of it, the almost-normal distribution. I would like to split it into two different features: one that shows the absence/presence of the characteristic (easy), and a second that shows only the normal distribution without the annoying peak at zero.

2 Answers

Aren't you providing the answer yourself? You can split the feature in two. If feature_to_split is the feature you're talking about, you can create feature_to_split_ispresent, which takes the value 1 or 0 depending on the presence or absence of that characteristic, and feature_to_split_value, which takes the actual value of that characteristic.
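A minimal sketch of that split, using only the Python standard library. The function name split_zero_inflated, the sample values, and the use of None for absent instances are assumptions made for illustration; the optional log step follows the question's remark that a log transformation removes the skew, and it assumes present values are strictly positive.

```python
import math


def split_zero_inflated(values, use_log=True):
    """Split a zero-inflated feature into a presence flag and a value column.

    Assumption: 0 means "characteristic absent" and every present value
    is strictly positive, so the logarithm is defined.
    """
    # 1 where the characteristic is present, 0 where it is absent.
    is_present = [1 if v != 0 else 0 for v in values]
    # Value column: (log-)value when present, None (missing) when absent.
    value = [
        (math.log(v) if use_log else v) if v != 0 else None
        for v in values
    ]
    return is_present, value


# Made-up sample data, for illustration only.
feature_to_split = [0.0, 12.5, 0.0, 8.1, 20.3, 0.0, 15.7]
feature_to_split_ispresent, feature_to_split_value = split_zero_inflated(
    feature_to_split
)
```

Most models need a number rather than None in the value column, so what to put in place of the missing entries remains a separate choice.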

Answered by Francesco Alongi on April 26, 2021

I don't have a precise answer, because it depends on what you want to do with the data. Assuming your task is supervised learning, since that is the most common case, simply extracting that feature should be enough for a model to discriminate between the different cases.

EDIT:

Models like linear regression or neural networks tend to work better when the inputs are approximately normally distributed; in this case I would try these options:

  1. Leave the zeros as they are: since 0 * w = 0, the feature contributes nothing to the weighted sum for those instances, and only the bias term remains
  2. Replace the zeros with the mean of the non-zero points, so the overall distribution looks approximately normal
  3. Scale the non-zero points to zero mean and unit variance using standardization
  4. Do 2, then 3
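The options above could be sketched as follows, using only the Python standard library. The function names and the sample values are hypothetical, and two choices are assumptions rather than part of the answer: the population standard deviation is used, and in option 4 the whole imputed column is standardized with its own mean and standard deviation.

```python
import statistics


def impute_zeros_with_mean(values):
    """Option 2: replace zeros with the mean of the non-zero points."""
    nonzero = [v for v in values if v != 0]
    mean = statistics.fmean(nonzero)
    return [v if v != 0 else mean for v in values]


def standardize_nonzero(values):
    """Option 3: standardize only the non-zero points; zeros stay at 0."""
    nonzero = [v for v in values if v != 0]
    mean = statistics.fmean(nonzero)
    std = statistics.pstdev(nonzero)  # population std, an assumed choice
    return [(v - mean) / std if v != 0 else 0.0 for v in values]


def standardize(values):
    """Standardize a whole column to zero mean and unit variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Made-up sample data, for illustration only.
x = [0.0, 2.0, 4.0, 0.0, 6.0]

option_1 = list(x)                                # leave the zeros alone
option_2 = impute_zeros_with_mean(x)              # zeros -> non-zero mean
option_3 = standardize_nonzero(x)                 # scale non-zero points
option_4 = standardize(impute_zeros_with_mean(x))  # 2, then 3
```

In this sketch the former zeros end up at 0 under both option 3 and option 4, that is, at the centre of the standardized distribution. The two differ only in the scale of the other points, because the imputed values shrink the standard deviation in option 4.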

Answered by Mikedev on April 26, 2021
