Data Science Asked on January 20, 2021
I have a dataset with 1000 rows and 4 columns, and 3 of the rows are outliers. I want to add another 7 outliers related to them, to be detected by clustering.
Example of what I did:
      Col1  Col2  Col3  Col4
Out1  a1    b1    c1    d1
Out2  a2    b2    c2    d2
Out3  a3    b3    c3    d3
I get the mean and std for 7 columns of normal data, then calculate:
Out4  normal1+mean+stdcol1  normal1+mean+stdcol2
Out5  normal2+mean+stdcol1  normal2+mean+stdcol2
Out6 ...........
I don't know if what I did is right, or whether it is a good solution.
I don't want the outliers to be too easy to detect.
Thanks
I'm assuming you want to create points where each column looks normal on its own, but the point as a whole looks like an outlier once all the columns are considered together (which is why you'd need some sort of outlier detection). Generating such an outlier therefore requires looking at all the dimensions in relation to each other. And since we are not assuming the data is normally distributed, generating these points is not straightforward.
I would recommend first fitting some kind of outlier detection method on the original dataset (something like an Isolation Forest would work).
Then you can generate random numbers (or use the numbers you already generated) and test whether they are flagged as outliers. This should be easy to do by hand, since you only want 7 points and each point has only 4 dimensions. An additional tip: test the points with a method that returns a score instead of a 0/1 prediction, so that you can make sure each one is not too obvious an outlier (since you didn't want that).
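The generate-and-test idea above can be sketched as follows. This is only an illustration under assumptions: it uses scikit-learn's `IsolationForest`, a synthetic stand-in for the real 1000 x 4 data, and a score band (between the 0.1% and 2% quantiles of the original rows' scores) that I picked arbitrarily as a definition of "outlier, but not too obvious". The candidate spread of 2.5 is also a guess that you would tune for your own data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in for the real dataset (assumption: 1000 rows, 4 columns).
X = rng.normal(loc=0.0, scale=1.0, size=(1000, 4))

# Fit the detector on the original data only.
forest = IsolationForest(n_estimators=200, random_state=0).fit(X)

# score_samples returns a continuous score; lower means more anomalous.
base_scores = forest.score_samples(X)

# Hypothetical target band: more anomalous than ~98% of the original rows,
# but not more anomalous than the most extreme original rows.
low, high = np.quantile(base_scores, [0.001, 0.02])

# Candidate points: random draws spread wider than the data (assumed scale).
candidates = rng.normal(loc=0.0, scale=2.5, size=(5000, 4))
cand_scores = forest.score_samples(candidates)

# Keep the first 7 candidates whose score falls inside the band.
in_band = (cand_scores >= low) & (cand_scores <= high)
new_outliers = candidates[in_band][:7]
new_scores = cand_scores[in_band][:7]
```

Moving `low` and `high` closer to the bulk of the scores makes the generated points harder to detect; moving them toward the minimum makes them more obvious.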
Lastly, once you have generated the points, a simple sanity check is to append them to the dataset, apply PCA to reduce the dimensions down to 2, and plot the PCA result with a separate colour for the appended outlier points. You can then check by eye whether the outliers sit apart from your dataset, but not too far apart.
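A minimal sketch of that PCA check, assuming scikit-learn and matplotlib are available. The data here is synthetic, the helper names `project_2d` and `plot_projection` are made up for this example, and standardising the columns before PCA is my own addition rather than something the check requires.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def project_2d(X, new_points):
    """Append new_points to X and project the combined rows onto 2 components."""
    combined = np.vstack([X, new_points])
    # Standardise so that no single column dominates the components (assumption).
    scaled = StandardScaler().fit_transform(combined)
    coords = PCA(n_components=2).fit_transform(scaled)
    # Boolean mask marking which rows are the appended points.
    is_new = np.arange(len(combined)) >= len(X)
    return coords, is_new


def plot_projection(coords, is_new):
    """Scatter plot with a separate colour for the appended points."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(coords[~is_new, 0], coords[~is_new, 1], s=8,
               c="tab:blue", label="original rows")
    ax.scatter(coords[is_new, 0], coords[is_new, 1], s=40,
               c="tab:red", label="appended outliers")
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    ax.legend()
    return fig


# Synthetic stand-ins for the dataset and the 7 generated points.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
new_points = rng.normal(scale=2.5, size=(7, 4))

coords, is_new = project_2d(X, new_points)
```

Calling `plot_projection(coords, is_new)` and then `matplotlib.pyplot.show()` displays the figure. Keep in mind that two components do not capture everything in 4 dimensions, so a point that is unusual only in the remaining directions may still look ordinary in this plot.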
Hope this helps and gives you some ideas.
Answered by A Kareem on January 20, 2021