What is the best machine learning algorithm for large, noisy datasets with interaction between variables?

Question

My initial thought was a neural network, but I don't see how a neural network can properly predict interactions between variables (such as x1 * x2), since each node is just a sum of its inputs.
Would a decision tree be better suited to capturing the interaction between variables?
My dataset is large, with 400 features and 5,000,000 instances. All features are expressed as percentiles, and the label is also a percentile. The dataset is also quite noisy (customer data, predicting the likelihood of becoming a return customer).

Shiv · Answer

Probabilistic Random Forest tends to work better than other algorithms on noisy datasets. But the data you are using also plays a major role in whether an algorithm will work or not.
See the Probabilistic Random Forest paper for more details. Happy learning!

Chong Lip Phang · Answer

Ensemble methods, whether boosting or bagging, often give better predictive accuracy than other methods. In my personal experience, GBM (i.e. a gradient boosting regressor over decision trees) and LightGBM (faster) often give very accurate predictions.
Check out this diagram on choosing the right estimator.
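
To make the interaction point concrete, here is a minimal sketch using scikit-learn's GradientBoostingRegressor rather than LightGBM. The data is synthetic and purely an assumption for illustration: uniform features standing in for percentiles, and a label containing an x1 * x2 term plus noise. It compares a linear model, boosted depth-1 stumps (which can only build an additive model) and boosted depth-3 trees (which can combine several features along one path).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (an assumption): features are uniform
# on [0, 1] like percentiles, and the label contains an x1 * x2 interaction.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(8000, 5))
y = X[:, 0] * X[:, 1] + rng.normal(0.0, 0.05, size=8000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

def fit_gbm(max_depth):
    """Fit a gradient boosting regressor with trees of the given depth."""
    model = GradientBoostingRegressor(
        n_estimators=200, learning_rate=0.1, max_depth=max_depth, random_state=0
    )
    return model.fit(X_train, y_train)

# R^2 on held-out data for each model.
linear_r2 = LinearRegression().fit(X_train, y_train).score(X_test, y_test)
stump_r2 = fit_gbm(1).score(X_test, y_test)  # depth 1: additive, no interactions
tree_r2 = fit_gbm(3).score(X_test, y_test)   # depth 3: can model interactions

print(f"linear: {linear_r2:.3f}  stumps: {stump_r2:.3f}  depth-3 GBM: {tree_r2:.3f}")
```

On data like this, the depth-3 model should score noticeably higher than the other two, because only it can represent the product term. The hyperparameters here are arbitrary starting values, not tuned recommendations.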

jeffhale · Answer

I would make the following models:

1. a null baseline model
2. a linear regression model with the most highly correlated features
3. a linear regression model on polynomial features, with feature selection to pick just the top 10 or 20 features
4. the same as #3, but with ridge regression
5. a LightGBM model with the original features
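
The first four models in the list above could be sketched with scikit-learn roughly as follows. This is only an illustration under stated assumptions: the data is synthetic (uniform features standing in for percentiles, with a made-up label containing an x1 * x2 term), the values k=3, k=10 and alpha=1.0 are arbitrary, and the LightGBM step is left out to keep the example free of extra dependencies.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (an assumption): 8 percentile-like features, a label with
# one interaction term, one linear term and Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(5000, 8))
y = X[:, 0] * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0.0, 0.05, size=5000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

def poly_top_k(final_estimator, k=10):
    """Degree-2 polynomial features, keep the k best by univariate F-test."""
    return make_pipeline(
        PolynomialFeatures(degree=2, include_bias=False),
        SelectKBest(f_regression, k=k),
        final_estimator,
    )

models = {
    "null": DummyRegressor(strategy="mean"),
    "linear": make_pipeline(SelectKBest(f_regression, k=3), LinearRegression()),
    "poly_linear": poly_top_k(LinearRegression()),
    "poly_ridge": poly_top_k(Ridge(alpha=1.0)),
}

# Held-out R^2 for each rung of the ladder.
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}

for name, r2 in scores.items():
    print(f"{name:12s} R^2 = {r2:.3f}")
```

Comparing each model against the null baseline shows quickly whether the extra complexity is paying for itself. With 400 real features, a full degree-2 expansion produces tens of thousands of columns, so in practice you may want to select features before expanding or fit on a subsample first.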
If you think you can still squeeze out some performance and it's worth the time/effort tradeoff, move to neural nets. As long as the network has a few layers, a decent number of nodes, and a non-linear activation (e.g. ReLU), it should be able to pick up interactions.
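
To see why the non-linearity matters, here is a small hand-built sketch (not a trained network, and not taken from any library) showing that a single hidden layer of ReLU units can approximate x1 * x2. It relies on the identity x1 * x2 = ((x1 + x2)^2 - (x1 - x2)^2) / 4 and approximates each square with a piecewise-linear sum of ReLUs. The function names and the choice of 16 segments are illustrative assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_square(t, lo, hi, n_segments=16):
    """Piecewise-linear approximation of t**2 on [lo, hi] built from ReLU units."""
    knots = np.linspace(lo, hi, n_segments + 1)
    # Slope of the chord of t**2 on each segment between neighbouring knots.
    slopes = np.diff(knots ** 2) / np.diff(knots)
    # Output weights: the first slope, then the change in slope at each knot.
    weights = np.concatenate(([slopes[0]], np.diff(slopes)))
    # One hidden unit per segment start: relu(t - knot).
    hidden = relu(t[:, None] - knots[:-1][None, :])
    return lo ** 2 + hidden @ weights

def relu_product(x1, x2):
    """Approximate x1 * x2 for inputs in [0, 1] with one hidden ReLU layer."""
    sum_sq = relu_square(x1 + x2, 0.0, 2.0)    # (x1 + x2) lies in [0, 2]
    diff_sq = relu_square(x1 - x2, -1.0, 1.0)  # (x1 - x2) lies in [-1, 1]
    return (sum_sq - diff_sq) / 4.0

rng = np.random.default_rng(0)
x1 = rng.uniform(0.0, 1.0, size=10_000)
x2 = rng.uniform(0.0, 1.0, size=10_000)

max_abs_error = np.max(np.abs(relu_product(x1, x2) - x1 * x2))
print(f"max |error| with 32 hidden units: {max_abs_error:.5f}")
```

Each hidden unit is still just a weighted sum followed by a ReLU, yet the combination approximates the product closely, and more segments tighten the fit. A trained network does not have to find this particular construction, but it shows the function is within reach of the architecture.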

If something looks promising, go in that direction.
