TransWikia.com

K-Prototype for anomaly detection

Data Science Asked by Sridhar Iyer on January 7, 2021

I have logs of the following form (e.g. from a gym login system). A representative case looks like this:

UserName, Login time, time_spent_on_weights, time_spent_on_elliptical

Ava, 5jan 12pm, 10 mins, 20 mins
Bob, 5jan 2pm, 30 mins, 20 mins
Cecilia, 6jan 10am, 40 mins, 0 mins
...

I then converted the login-time column into day of month (DOM) and hour of day (HOD) to get:

UserName, DOM, HOD, #weights, #elliptical
Ava, 5, 12, 10, 20
Bob, 5, 14, 30, 20
Cecilia, 6, 10, 40, 0
..
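For reference, the conversion described above could be sketched as follows. The helper name `parse_login` and the assumption that every timestamp looks like `5jan 2pm` are illustrative only; real logs would probably need a proper date parser.

```python
import re

# Hypothetical pattern: "<day><3-letter month> <hour><am|pm>", e.g. "5jan 2pm".
_LOGIN_RE = re.compile(r"^(\d{1,2})([a-z]{3})\s+(\d{1,2})(am|pm)$", re.IGNORECASE)

def parse_login(text):
    """Return (day_of_month, hour_of_day) for a string such as '5jan 2pm'."""
    match = _LOGIN_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised login time: {text!r}")
    day, _month, hour, meridiem = match.groups()
    hour = int(hour) % 12          # maps 12 to 0 before the pm offset is applied
    if meridiem.lower() == "pm":
        hour += 12                 # 12pm becomes 12, 2pm becomes 14
    return int(day), hour

rows = [
    ("Ava", "5jan 12pm", 10, 20),
    ("Bob", "5jan 2pm", 30, 20),
    ("Cecilia", "6jan 10am", 40, 0),
]
# Expand each login time into the DOM and HOD columns.
converted = [(name, *parse_login(when), w, e) for name, when, w, e in rows]
```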

I treat the first three columns as categorical and the last two as numerical, and I run K-Prototypes with N=2 clusters (anomalous and non-anomalous). The resulting predictions can then be filtered per user to find anomalies specific to that username. I take the anomalous cluster to be the one with fewer elements.

However, for some users the clusters split on the login time (HOD/DOM). E.g. everything before 12am is one cluster and everything after 12am is another. That doesn’t convey any information.

What is the best way to handle these scenarios?

Is there a better way to do anomaly detection on this kind of dataset?

Update:
Types of anomalies I’m looking for:

  1. Ava did 20 mins on the elliptical, which she had never used before. On its own, this can be detected with some form of outlier analysis, or K-means (on the dataset filtered by ‘Ava’).
  2. Ava used the elliptical on a Monday morning (same thing as above, but filtered on Ava and time of day).
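As a rough illustration of the per-user outlier analysis mentioned in the two cases above, here is a minimal sketch. The function name `flag_per_key`, the minimum history of 3 sessions, the threshold of 3 standard deviations, and the sample sessions are all assumptions made for the example. Keying on the user name corresponds to case 1; keying on a (user, time bucket) pair would correspond to case 2.

```python
from collections import defaultdict
from statistics import mean, pstdev

def flag_per_key(sessions, threshold=3.0, min_history=3):
    """Flag values that deviate from the earlier values seen for the same key.

    sessions: list of (key, minutes) in chronological order, where key is
    e.g. a user name or a (user, time-of-day bucket) tuple.
    """
    history = defaultdict(list)
    flags = []
    for key, minutes in sessions:
        past = history[key]
        if len(past) < min_history:
            flags.append(False)               # too little history to judge
        else:
            mu, sigma = mean(past), pstdev(past)
            if sigma == 0:
                flags.append(minutes != mu)   # perfectly constant so far: any change is flagged
            else:
                flags.append(abs(minutes - mu) / sigma > threshold)
        past.append(minutes)
    return flags

# Made-up elliptical minutes: Ava never used it before her last session.
sessions = [("Ava", 0), ("Ava", 0), ("Ava", 0),
            ("Bob", 20), ("Bob", 25), ("Bob", 20),
            ("Ava", 20), ("Bob", 22)]
flags = flag_per_key(sessions)
```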

Individually, I can build a model for each case with reasonable success, but how do I create one model that handles both?

If I use an actual clustering algorithm like DBSCAN/HDBSCAN, how do I keep it from partitioning on the time (or some other categorical variable)?

One Answer

The problem is that you want to do supervised learning without having labels. You have a target in mind, but the underlying data may not split along it. Without any information on what is anomalous, your clustering algorithm has no reason to recover that distinction. It will give you groups of users with different behaviors (e.g. people who prefer weights vs. the elliptical), but not necessarily the split that interests you.

First, you need to define what an anomaly is for this problem: an unbalanced value? An extreme value? An impossible one? An anomalous session or an anomalous user? This isn't clear from your question, and generally speaking, answering it should help you work through the problem, possibly by labelling your data.

To solve your ML problem, after working on your dataset (see below), you might consider different approaches:

  • If you want to stay on the unsupervised path: increase N. This may give you more clusters, some of which might correspond to what you consider anomalous. You might also consider techniques like HDBSCAN, which identify outliers lying outside the main clusters.

  • Label the data, then consider a fully supervised technique (a small random forest might be enough). You will need to define exactly what is anomalous to you (someone who comes very often? Someone who badges in multiple times, producing 0-minute sessions? Someone who stays the whole day?).

  • Label only the sessions you consider anomalous and use a semi-supervised algorithm, such as a self-organizing map, to help retrieve the anomalies you missed during labelling.
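To make the labelling step in the last two options concrete, explicit rules can serve as a first pass. The sketch below is only an illustration: the function name, the label names, and the 8-hour cutoff are assumptions, and session length is approximated by the sum of the two recorded times because the logs in the question contain nothing else.

```python
# Illustrative threshold, not a value taken from the question.
WHOLE_DAY_MINUTES = 8 * 60

def label_session(weights_min, elliptical_min):
    """Return a coarse label for one session based on hand-written rules."""
    total = weights_min + elliptical_min
    if total == 0:
        return "zero_minute_session"   # e.g. badging in repeatedly without training
    if total >= WHOLE_DAY_MINUTES:
        return "whole_day"             # someone who appears to stay all day
    return "normal"

# (weights_min, elliptical_min) pairs; the first is from the question, the rest are made up.
labels = [label_session(w, e) for w, e in [(10, 20), (0, 0), (300, 200)]]
```

Labels produced this way could then feed a supervised or semi-supervised model, or simply be reviewed by hand.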

Generally speaking it seems you also need to work on your dataset beforehand:

  • You seem to have too few predictors. As is, your algorithm is unlikely to find meaningful anomalies that you couldn't find without ML. There are two paths: do some exploratory data analysis (frequency of training, distribution of time spent), which might help you understand what your anomalies are and design a non-ML solution; or get more variables, which will help you detect anomalies and give them meaning.

  • You consider each entry separately, keyed on the user name. First, this may cause problems if two users share the same name, so you might want a unique identifier per user. Second, if you want to identify anomalous users rather than anomalous sessions, you might want to group entries by user. That would require building other features, such as visit frequency, average time spent, etc.
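Grouping entries by user, as suggested in the last point, might look like the sketch below. The user IDs (`u1`, `u2`, `u3`), the feature names, and the fourth row are made up for illustration; the first three rows reuse the values from the question.

```python
from collections import defaultdict
from statistics import mean

sessions = [
    # (user_id, day_of_month, hour_of_day, weights_min, elliptical_min)
    ("u1", 5, 12, 10, 20),
    ("u2", 5, 14, 30, 20),
    ("u3", 6, 10, 40, 0),
    ("u1", 7, 12, 20, 0),   # invented extra session so that one user has two rows
]

def per_user_features(sessions):
    """Aggregate session rows into one feature dict per user."""
    grouped = defaultdict(list)
    for user_id, dom, hod, weights, elliptical in sessions:
        grouped[user_id].append((dom, hod, weights, elliptical))
    features = {}
    for user_id, rows in grouped.items():
        features[user_id] = {
            "n_sessions": len(rows),                              # visit frequency
            "distinct_days": len({dom for dom, _, _, _ in rows}),
            "avg_weights_min": mean(w for _, _, w, _ in rows),
            "avg_elliptical_min": mean(e for _, _, _, e in rows),
        }
    return features

features = per_user_features(sessions)
```

The resulting per-user table can be inspected directly as a form of exploratory analysis, or used as input to a clustering or outlier-detection step.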

Answered by lcrmorin on January 7, 2021
