adding a feature as "generic"

Question

I am using sklearn and Python to build a "malicious" login identifier. After reading some documentation and examples, I chose the RandomForest classifier and decided to use the following features:
Timestamp (hour)
IP (expanded during feature creation with geolocation info such as coordinates)
Username (hashed)

To train the model, I took some data from a real log. I'm assuming here that these are GOOD data. Then I need to train on BAD (malicious) logins.
For this I took some IPs from public blacklists and created a BAD training file in the same format.
But since I don't know the username for the BAD data (these are not real logins), I decided to set it to 0 and then train the model.
I think this is messing everything up. The model does not work very well: it cannot detect a BAD login unless I query it with a '0' login name. In other words, if I query the model with a BAD IP and a valid username, I always get a "GOOD" result.
Maybe this is completely expected in this case, but I'd like to understand a few things:

Is this way of creating the BAD list actually wrong?
Is there a way to treat a feature as a "wildcard" without replicating the line in the training data for every good username?
Is it possible to set a "priority" in feature evaluation, so that some features are more important than others? But this should be done by the algorithm itself... or not?

Erwan · Accepted Answer

If a piece of information is never available for one of the classes, it's not a usable indicator. So it seems to me that the login name is simply irrelevant for the task and shouldn't be included as a feature.
Mixing different sources of data can be OK in some cases, but only if the different datasets provide consistent features and are generally representative of the distribution you would expect (another important problem). It looks like you have neither a username nor a timestamp in the "bad" one, do you? If that's the case, it's impossible to use these features.
No, you cannot give priority to any feature. This wouldn't make sense, since the point of ML is to let the algorithm find patterns in the data. If you already know how the decision should be made, then you should simply write a regular deterministic program (this is called a heuristic).
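To see how the forest assigns importance on its own, and why the placeholder username takes over, here is a minimal sketch. The data is synthetic and invented purely for illustration (the column names, distributions, and sample sizes are assumptions, not your real logs). It fills the username with 0 for every "bad" row, as in the question, and then inspects `feature_importances_`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500

# Synthetic, made-up data for illustration only.
# Columns: hour of day, a numeric IP-derived feature, hashed username.
good = np.column_stack([
    rng.integers(0, 24, n),       # hour
    rng.normal(0.0, 1.0, n),      # IP-derived feature
    rng.integers(1, 10_000, n),   # hashed username, never 0
])
bad = np.column_stack([
    rng.integers(0, 24, n),       # hour
    rng.normal(0.5, 1.0, n),      # only weakly different from the good rows
    np.zeros(n),                  # placeholder username 0, as in the question
])

X = np.vstack([good, bad])
y = np.array([0] * n + [1] * n)   # 0 = good, 1 = bad

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# The forest decides by itself how much each feature matters.
importances = dict(zip(["hour", "ip_feature", "username"],
                       clf.feature_importances_))
print(importances)

# A row with a "bad-looking" IP feature but a real (non-zero) username.
probe = np.array([[3, 0.5, 1234]])
prediction = clf.predict(probe)
print(prediction)
```

In this toy setup the username column should carry most of the importance, because "username is 0" separates the two classes perfectly, and the probe row should come back as good. That matches the behaviour described in the question.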
Don't forget that you can decide to add any feature that you think could be relevant:

You could have a boolean feature indicating whether a login name is known or unknown, instead of setting the value to 0. But be careful: by definition, in your external dataset they are all unknown, whereas they are all known in your "good" dataset, which is a clear construction bias.
If you have a timestamp, you could think of many other features, for instance the number of repeated attempts in the past N seconds, regular time patterns for logins, etc.
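As a rough sketch of the "repeated attempts in the past N seconds" idea, the function below counts, for each login attempt, how many earlier attempts came from the same IP inside a sliding time window. The function name, the `(timestamp, ip)` event format, and the sample events are hypothetical; adapt them to whatever your log actually contains.

```python
from collections import defaultdict, deque


def attempts_in_window(events, window_seconds):
    """Count earlier attempts from the same IP within the last `window_seconds`.

    `events` is an iterable of (timestamp_in_seconds, ip) pairs.
    Returns (timestamp, ip, count) tuples in chronological order; the
    current attempt itself is not included in its own count.
    """
    recent = defaultdict(deque)  # ip -> timestamps still inside the window
    rows = []
    for ts, ip in sorted(events, key=lambda e: e[0]):
        window = recent[ip]
        # Drop timestamps that are older than the window.
        while window and ts - window[0] > window_seconds:
            window.popleft()
        rows.append((ts, ip, len(window)))
        window.append(ts)
    return rows


# Made-up events using documentation-only IP addresses.
events = [
    (0, "198.51.100.7"),
    (10, "198.51.100.7"),
    (20, "198.51.100.7"),
    (25, "192.0.2.1"),
    (100, "198.51.100.7"),
]
rows = attempts_in_window(events, window_seconds=60)
counts = [count for _, _, count in rows]
print(counts)
```

The resulting count can be added as one more column next to the hour and IP features. It only makes sense if both the "good" and the "bad" data come with real timestamps.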
