Data Science Asked by cips on June 11, 2021
I am using sklearn and Python to build a "malicious" login identifier. After reading some documentation and examples, I chose the RandomForest classifier and decided to use the following features:
Timestamp (hour)
IP (this is expanded in feature creation adding geolocation info like Coordinates)
Username (hashed)
To train the model, I took some data from a real log. I’m assuming here that these are all GOOD logins. Then I need training data for the BAD (malicious) logins.
For this I took some IPs from a public blacklist and created a BAD training file in the same format.
But since I don’t know the username for the BAD data (these are not real logins), I decided to set it to 0 and then train the model.
I think that this is messing everything up. The model does not work very well and is not able to detect BAD logins unless I query it with a ‘0’ login name. In other words, if I query the model with a BAD IP and a valid username, I always get a "GOOD" result.
Maybe this is completely expected in this case, but I’d like to understand some things:
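If a piece of information is never available for one of the classes, it is not a usable indication. Here the placeholder 0 appears only in the BAD data, so the model learns "username 0 means BAD" instead of anything about the IP. It seems to me that the login name is simply irrelevant for the task, so it shouldn't be included as a feature.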
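To see this in the data itself, here is a minimal sketch on made-up rows. The field names, the example values, and the helper `class_exclusive_values` are illustrative assumptions, not taken from the question. It lists the values of a feature that occur in only one class; if the placeholder 0 shows up only under BAD, the classifier can split on it alone.

```python
from collections import defaultdict

def class_exclusive_values(rows, labels, feature):
    """Map each value of `feature` that occurs in exactly one class to that class."""
    classes_by_value = defaultdict(set)
    for row, label in zip(rows, labels):
        classes_by_value[row[feature]].add(label)
    return {value: next(iter(classes))
            for value, classes in classes_by_value.items()
            if len(classes) == 1}

# Made-up rows: GOOD rows carry a hashed username, BAD rows carry the placeholder 0.
rows = [
    {"hour": 9,  "ip": "192.0.2.10",   "user_hash": 48151},
    {"hour": 14, "ip": "192.0.2.11",   "user_hash": 62342},
    {"hour": 9,  "ip": "203.0.113.50", "user_hash": 0},
    {"hour": 3,  "ip": "203.0.113.51", "user_hash": 0},
]
labels = ["GOOD", "GOOD", "BAD", "BAD"]

exclusive = class_exclusive_values(rows, labels, "user_hash")
# True when the placeholder alone identifies the BAD class.
placeholder_leaks_label = exclusive.get(0) == "BAD"
```

When `placeholder_leaks_label` is true, dropping the column before training is the simplest fix, since a query with a real username can never look like the BAD examples the model saw.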
Mixing different sources of data can be OK in some cases, but only if the different datasets provide consistent features and are generally representative of the distribution you would expect at prediction time (another important problem). It looks like you have neither a username nor a timestamp in the "bad" dataset, do you? If so, it's impossible to use these features.
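One way to check this before training, sketched under the assumption that missing fields are stored as `None` (the helper `usable_features` and the sample rows are hypothetical): keep only the features that have a real value in every row of every source.

```python
def usable_features(datasets, missing=(None,)):
    """Names of features with a non-missing value in every row of every dataset."""
    usable = None
    for rows in datasets:
        names = set().union(*(row.keys() for row in rows))
        complete = {name for name in names
                    if all(row.get(name) not in missing for row in rows)}
        usable = complete if usable is None else usable & complete
    return usable or set()

# Made-up samples: the real log has all three fields, the blacklist only has IPs.
good_log = [
    {"hour": 9,  "ip": "192.0.2.10", "user_hash": 48151},
    {"hour": 14, "ip": "192.0.2.11", "user_hash": 62342},
]
blacklist_rows = [
    {"hour": None, "ip": "203.0.113.50", "user_hash": None},
    {"hour": None, "ip": "203.0.113.51", "user_hash": None},
]

shared = usable_features([good_log, blacklist_rows])
```

With these samples only `ip` survives, which matches the point above: the username and timestamp cannot help if one source never contains them.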
No, you cannot give priority to any feature. This wouldn't make sense, since the point of ML is to let the algorithm find patterns in the data. If you already know how the decision should be made, then you should simply write a regular deterministic program (this is called a heuristic).
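For the blacklist case, such a heuristic could look like the sketch below. It assumes the rule is simply "flag any login whose IP falls in a blacklisted network"; the function name `is_malicious_login` and the example networks (taken from the reserved documentation ranges) are illustrative.

```python
import ipaddress

def is_malicious_login(ip, blacklist):
    """Deterministic rule: flag the login when its IP is inside a blacklisted network."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in blacklist)

# Hypothetical blacklist: one whole /24 network and one single host.
blacklist = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.7/32"),
]

flagged = is_malicious_login("203.0.113.50", blacklist)
allowed = is_malicious_login("192.0.2.10", blacklist)
```

No training data is needed here, and the username never enters the decision, which is exactly the behaviour the question was trying to force on the classifier.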
Don't forget that you can decide to add any feature that you think could be relevant:
Correct answer by Erwan on June 11, 2021