
Python: advice on which machine learning algorithm to use for a problem which involves lots of randomness

Data Science Asked on September 3, 2021

I’m new to machine learning, so I’ll summarize my problem with two examples without getting technical (because I can’t).

The dog vs. cat classification example is solvable, in the sense that a human can tell you with certainty whether an image shows a dog or a cat. Many machine learning algorithms can replicate human performance and identify dogs or cats with near certainty.

For my problem, there is no certainty, only a slightly-better-than-random prediction. I am trying to predict whether a person who was recently released from incarceration will commit a crime within the next year. Let’s assume the actual chances of re-offending are about 50/50. If I could use machine learning to make a modestly better-than-random prediction, that would be a huge win for me. More specifically, if 50/50 is a random guess, then achieving a 55% to 60% success rate would be considered wildly successful.

I know this task is possible because I have used a dataset (with around 50 features and 100,000 observations) to build a "man-made" linear regression that achieves around 52% accuracy out of sample.

I have tried scikit-learn’s logistic regression and XGBoost, but their performance has been worse than that of my man-made attempt. I assume that is because these algorithms aren’t designed for predicting an event that is mostly random.

Given that I am dealing with the prediction of an event that is mostly random and I am only looking to achieve slightly better than random predictions, is there a machine learning algorithm/strategy you could recommend to best tackle this problem?

2 Answers

It sounds like you have good data - 50 columns and 100,000 rows!

I would do exploratory data analysis (EDA) and look for variables (columns) that are correlated with the response variable (re-offending) but NOT correlated with each other. If you can find a handful (~10) of these, you can build an excellent regression model.
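Here is a minimal sketch of that screen with pandas, assuming a hypothetical file recidivism.csv with a binary 0/1 target column named reoffend (both names are placeholders, not from the question):

    import pandas as pd

    df = pd.read_csv("recidivism.csv")  # hypothetical file name

    # Correlation of each numeric feature with the response variable.
    target_corr = df.corr(numeric_only=True)["reoffend"].drop("reoffend")
    candidates = target_corr.abs().sort_values(ascending=False).head(15).index

    # Greedily keep candidates that are not strongly correlated (|r| > 0.8)
    # with anything already kept, to avoid redundant features.
    pairwise = df[candidates].corr().abs()
    keep = []
    for col in candidates:
        if all(pairwise.loc[col, k] <= 0.8 for k in keep):
            keep.append(col)

    print("Features to try in a regression:", keep)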

Other techniques to try include random forests and cluster analysis. Both run quickly in Python, so you can compare many different hyperparameter settings.
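For instance, a small grid search over random-forest hyperparameters with scikit-learn; synthetic stand-in data is generated here so the snippet runs on its own:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # Synthetic stand-in for the real 50-feature, 100,000-row dataset.
    X, y = make_classification(n_samples=10_000, n_features=50, random_state=0)

    param_grid = {
        "n_estimators": [200, 500],
        "max_depth": [4, 8, None],
        "min_samples_leaf": [1, 20, 100],
    }
    search = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid,
        scoring="roc_auc",  # rank-based; more informative than raw accuracy here
        cv=5,
        n_jobs=-1,
    )
    search.fit(X, y)
    print(search.best_params_, round(search.best_score_, 3))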

Answered by bstrain on September 3, 2021

Sorry that this isn't a concrete answer, but I can offer some advice.

It sounds like you have a problem of many weak relationships. In this case, I think XGBoost or random forests would yield better results than logistic regression.

Also remember that preprocessing your data and creating new features might help more than choosing a different algorithm (see the sketch after this list):

  • Consider different options for encoding your categorical variables as numbers. Look at Python's category_encoders package and try leave-one-out encoding, response (target) encoding, and others.
  • Consider your imputation strategy for missing data: using -99999 for missing values might work well for XGBoost, but it won't work well with your regression.
  • Consider using log loss as your optimization metric, or at least AUC, not just accuracy.
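A rough sketch pulling those three points together, again with hypothetical file and column names (recidivism.csv, offense_type, zip_code, reoffend) and the category_encoders package:

    import pandas as pd
    import category_encoders as ce
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import log_loss, roc_auc_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("recidivism.csv")  # hypothetical file name
    y = df.pop("reoffend")
    X_train, X_test, y_train, y_test = train_test_split(
        df, y, test_size=0.2, random_state=0
    )

    # Leave-one-out target encoding for categoricals (fit on train only).
    enc = ce.LeaveOneOutEncoder(cols=["offense_type", "zip_code"])
    X_train = enc.fit_transform(X_train, y_train)
    X_test = enc.transform(X_test)

    # Median imputation suits a linear model better than a -99999 sentinel.
    medians = X_train.median()
    X_train = X_train.fillna(medians)
    X_test = X_test.fillna(medians)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    print("log loss:", log_loss(y_test, proba))
    print("AUC:", roc_auc_score(y_test, proba))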

Above all else, see if you can find more data (a sketch follows this list). For example:

  • Join other freely available data: e.g., if your data has zip code, can you join economic data by zip to add more features?
  • Leverage data you "ignored": e.g., do you have free-form text data? Try parsing it into a sparse matrix using TF-IDF.
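A rough sketch of both ideas, where all file and column names (zip_economics.csv, zip_code, notes) are placeholders:

    import pandas as pd
    from scipy.sparse import csr_matrix, hstack
    from sklearn.feature_extraction.text import TfidfVectorizer

    df = pd.read_csv("recidivism.csv")       # hypothetical file name
    econ = pd.read_csv("zip_economics.csv")  # hypothetical external data

    # Join external economic indicators on zip code to add features.
    df = df.merge(econ, on="zip_code", how="left")

    # Turn free-form text into a sparse TF-IDF feature matrix.
    vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
    text_features = vectorizer.fit_transform(df["notes"].fillna(""))

    # Stack the sparse text features next to the numeric columns.
    numeric = csr_matrix(df.select_dtypes("number").to_numpy())
    X_all = hstack([numeric, text_features])
    print(X_all.shape)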

Lastly, are the XGBoost models performing badly overall, or are they overfitting and performing badly on your holdout set? Look into techniques such as k-fold cross-validation.
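One quick way to check, assuming a feature matrix X and labels y are already prepared: compare the training score against a cross-validated score; a large gap suggests the model is memorizing rather than generalizing.

    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import cross_val_score
    from xgboost import XGBClassifier

    model = XGBClassifier(n_estimators=300, max_depth=3, learning_rate=0.05)

    # Cross-validated AUC: an estimate of out-of-sample performance.
    cv_auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

    # Training AUC: how well the model fits data it has already seen.
    model.fit(X, y)
    train_auc = roc_auc_score(y, model.predict_proba(X)[:, 1])

    print(f"train AUC: {train_auc:.3f}   cv AUC: {cv_auc:.3f}")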

Answered by Josh on September 3, 2021
