TransWikia.com

Incorrect multi-variate anomaly detection - Isolation Forest Python

Data Science Asked by The AG on September 3, 2020

My data looks like the table below. It has 333 rows and 2 columns. Clearly, the first row is an anomaly.

ndf:

+----+---------+-------------+
|    | ROW_CNT |    TOT_SALE |
+----+---------+-------------+
|  0 |      45 |     1411.27 |
+----+---------+-------------+
|  1 |   47754 |  1596200.68 |
+----+---------+-------------+
|  2 |  105894 |  3750304.55 |
+----+---------+-------------+
|  3 |  372953 | 14368324.86 |
+----+---------+-------------+
|  4 |  389915 | 14899302.85 |
+----+---------+-------------+
|  5 |  379473 | 14696309.67 |
+----+---------+-------------+
|  6 |  388571 | 14679457.93 |
+----+---------+-------------+
|  7 |  234409 |  8226472.95 |
+----+---------+-------------+
|  8 |   50587 |  1673114.75 |
+----+---------+-------------+
|  9 |  383779 | 14614106.80 |
+----+---------+-------------+
| 10 |  391525 | 14907049.92 |
+----+---------+-------------+
| 11 |  392012 | 13482471.85 |
+----+---------+-------------+
| 12 |  379081 | 14324222.03 |
+----+---------+-------------+
| 13 |  383681 | 14478162.98 |
+----+---------+-------------+
| 14 |  228857 |  7994892.44 |
+----+---------+-------------+

I am using the function below to detect anomalies across the 2 columns of the dataset:

from sklearn.ensemble import IsolationForest

def outlier_func(df):
    # behaviour='new' is omitted: the argument no longer exists from scikit-learn 0.24 on
    model = IsolationForest(n_estimators=1000, max_samples='auto',
                            contamination='auto', max_features=1.0)
    model.fit(df[['ROW_CNT', 'TOT_SALE']])
    # decision_function: negative scores correspond to flagged rows
    df['scores'] = model.decision_function(df[['ROW_CNT', 'TOT_SALE']])
    # predict: -1 for anomalies, 1 for normal rows
    df['anomaly'] = model.predict(df[['ROW_CNT', 'TOT_SALE']])
    anomaly = df.loc[df['anomaly'] == -1]
    return anomaly

outlier_func(ndf)

What am I missing that makes it detect the anomaly incorrectly? Any help would be appreciated.

One Answer

One way to improve the prediction is to convert the DataFrame's data type to float32. I got a better result after the conversion.

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# First five rows of the question's data, stored as float32
df = pd.DataFrame(np.array([[45,1411.27],[47754,1596200.68],[105894,3750304.55],[372953,14368324.86],[389915,14899302.85]]),columns=['ROW_CNT','TOT_SALE'],dtype=np.float32)

def outlier_func(df):
    # behaviour='new' is omitted: the argument no longer exists from scikit-learn 0.24 on
    model = IsolationForest(n_estimators=1000, max_samples='auto',
                            contamination='auto', max_features=1.0)
    model.fit(df[['ROW_CNT', 'TOT_SALE']])
    df['scores'] = model.decision_function(df[['ROW_CNT', 'TOT_SALE']])
    # predict: -1 for anomalies, 1 for normal rows
    df['anomaly'] = model.predict(df[['ROW_CNT', 'TOT_SALE']])
    anomaly = df.loc[df['anomaly'] == -1]
    return anomaly

outlier_func(df)
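
To see why a given row is or is not flagged, it can help to look at the scores rather than only the -1/1 labels. The following is a minimal sketch, not part of the code above: it uses the 15 rows shown in the question, and the fixed random_state and the names row_cnt, tot_sale and report are assumptions added for illustration. It ranks rows by score_samples (lower means more anomalous) and prints the offset that predict uses as its cut-off when contamination='auto'.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# The 15 rows shown in the question (the full data has 333 rows).
row_cnt = [45, 47754, 105894, 372953, 389915, 379473, 388571, 234409,
           50587, 383779, 391525, 392012, 379081, 383681, 228857]
tot_sale = [1411.27, 1596200.68, 3750304.55, 14368324.86, 14899302.85,
            14696309.67, 14679457.93, 8226472.95, 1673114.75, 14614106.80,
            14907049.92, 13482471.85, 14324222.03, 14478162.98, 7994892.44]
df = pd.DataFrame({'ROW_CNT': row_cnt, 'TOT_SALE': tot_sale})
features = df[['ROW_CNT', 'TOT_SALE']]

# random_state is an assumption added so that repeated runs give the same scores.
model = IsolationForest(n_estimators=1000, contamination='auto', random_state=0)
model.fit(features)

# Raw anomaly score: the lower the value, the more anomalous the row.
raw_scores = model.score_samples(features)
# decision_function is the raw score shifted by the fitted offset_;
# predict() labels a row -1 when this shifted score is negative.
decision = model.decision_function(features)
labels = model.predict(features)

report = df.assign(raw_score=raw_scores, decision=decision, anomaly=labels)
print(report.sort_values('raw_score'))
print('offset_:', model.offset_)
```

Sorting by raw_score shows where the first row sits relative to the others, which makes it easier to tell whether the model ranks it as unusual at all or whether it only falls on the wrong side of the cut-off.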


Answered by Sebin Sunny on September 3, 2020
