
Dissecting performance issues with Random Forest

Data Science Asked on March 8, 2021

My task is to identify potential situations for trading and determine whether a candidate is going to succeed or not. I have a system in place to identify candidates, but it has a high rate of false positives.

To try to reduce it, I have been training a Random Forest in the hopes of pre-emptively eliminating high-risk candidates. The problem is that I am using two separate sources of data: one is a 10-year historical record, and the other is obtained in real time and stored in a separate database. If I split the historical data into training and testing sets (say, a 0.7 – 0.3 split), the results are improbably perfect: no mistakes in the classification for categories with hundreds of observations. However, when I apply the trained model to the observations from the real-time data, the labelling accuracy is about 50%. This has persisted despite attempts to tweak hyperparameters and introduce new features.

How could I investigate whether there are substantial differences between the data from the two sources, and how could I go about remedying this situation?

One Answer

There are several things you can do, even though a detailed description of your datasets would help a lot:

The first thing you should do is thoroughly compare the two datasets:

  • Are the same features present in both datasets?

  • Is the data collected in exactly the same way? For example, if age is a feature, make sure it is not recorded differently: one dataset could use the categories (<25, 25-50, 50+) while the other records the exact age.

  • How do they differ in size?

  • Plot the distribution of each feature in both datasets: do they differ significantly, and why? You could test the hypothesis that a feature is drawn from the same distribution in both datasets (see the first sketch after this list).

  • Look for correlated features, both features correlated with each other and features correlated with the target (see the second sketch after this list).
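Here is a minimal sketch of the distribution check, assuming both tables can be loaded as pandas DataFrames with matching column names (the file names are hypothetical). It runs a two-sample Kolmogorov–Smirnov test on each numeric feature shared by the two datasets:

```python
import pandas as pd
from scipy.stats import ks_2samp

# Hypothetical file names: substitute your own historical and
# real-time feature tables.
historical = pd.read_csv("historical.csv")
realtime = pd.read_csv("realtime.csv")

# Compare every feature that exists in both datasets.
shared = historical.columns.intersection(realtime.columns)
for col in shared:
    if pd.api.types.is_numeric_dtype(historical[col]):
        # Two-sample Kolmogorov-Smirnov test: a small p-value is
        # evidence that the feature is NOT drawn from the same
        # distribution in the two datasets.
        stat, p = ks_2samp(historical[col].dropna(), realtime[col].dropna())
        print(f"{col}: KS statistic = {stat:.3f}, p-value = {p:.4f}")
```

For categorical features, a chi-squared test on the two frequency tables plays the same role.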
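And a sketch of the correlation check on the historical set, again with hypothetical names (the label column is assumed to be called `target` and to be numeric):

```python
import pandas as pd

historical = pd.read_csv("historical.csv")  # hypothetical file name
numeric = historical.select_dtypes("number")

# Feature-feature correlations: strongly correlated pairs are
# redundant and make the model's behaviour harder to interpret.
corr = numeric.drop(columns=["target"]).corr()
high = (corr.abs() > 0.9) & (corr.abs() < 1.0)
print(corr.where(high).stack())  # each pair appears twice (symmetry)

# Feature-target correlations: a near-perfect correlation with the
# label often indicates target leakage rather than a strong feature.
print(numeric.corr()["target"].sort_values(ascending=False))
```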

How to improve your model will depend entirely on the answers to these questions. If you observe significant differences between the two datasets, then your historical data will not be a good training set for this task.

Independently of that, your 100% accuracy is very suspicious and your model is probably overfitting.
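A quick way to check, assuming you are using scikit-learn's RandomForestClassifier: compare the training accuracy with the out-of-bag and held-out scores, and split the historical data chronologically rather than randomly, since a random split on time-ordered financial data lets the forest peek at the future (file and column names below are hypothetical):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("historical.csv")  # hypothetical; rows ordered by time
X = df.drop(columns=["target"])
y = df["target"]

# shuffle=False keeps the chronological order, so the test set is
# strictly "in the future" relative to the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=False
)

clf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
clf.fit(X_train, y_train)

# A large gap between the training score and the OOB / test scores
# is the classic signature of overfitting.
print("train accuracy:", clf.score(X_train, y_train))
print("OOB score:     ", clf.oob_score_)
print("test accuracy: ", clf.score(X_test, y_test))
```

If the chronological split already drops the test accuracy well below 100%, the earlier perfect score was an artifact of the random split rather than a sign of a good model.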

Answered by A Co on March 8, 2021
