
Rule of thumb for good number of features when dealing with grouped data

Data Science Asked by Davide Visentin on April 3, 2021

I have a classification problem on clinical data where I have multiple samples for each patient, so the samples belonging to the same patient are somehow dependent on each other.

I know that it is not possible to know a priori the optimal number of features to use, but there are some rules of thumb that work in many cases.

My question is: are those rules also valid in my case? In particular, should I relate the number of features to the number of instances or to the number of groups?

Thanks

3 Answers

This is a really hard question to answer. I recommend you do some reading to get a feel for what can be done and, in particular, what can be done for your specific task.

This paper is a must. But if you prefer a more practical approach, have a look at these two interesting sources:

a) ML Mastery, which also provides further reading suggestions

b) Kaggle

Good luck!

Answered by TitoOrt on April 3, 2021

I am sorry to say that I am not aware of a simple "rule of thumb", as this varies a lot with the nature of the problem. But below are some guidelines you can use to determine the "optimal" number of features for your problem.

First of all, you should use some dimensionality reduction technique in order to reduce the number of columns used as input. Dimensionality reduction techniques fall into two categories: feature transformation and feature selection.

  • Feature transformation techniques restructure the feature space and produce a new set of features based on the old ones. A very popular technique for dimensionality reduction is Principal Component Analysis (PCA), which uses an orthogonal transformation to produce a set of linearly uncorrelated variables from the initial set of variables.

  • Feature selection techniques select, from the set of existing features, the ones with the highest "importance"/influence on the output variable. Some popular techniques are the Fisher score (which assigns weights to the features based on some "importance" criterion) and Recursive Feature Elimination (which usually gives quite good results when combined with an SVM classifier). A minimal sketch of both families is given below.
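The following sketch uses scikit-learn to illustrate both families. It assumes X (a samples-by-features NumPy array) and the label vector y already exist; the number of components/features to keep (10) is arbitrary.

    # Feature transformation vs. feature selection with scikit-learn.
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import RFE
    from sklearn.svm import SVC

    # Feature transformation: project onto the first 10 principal components.
    X_pca = PCA(n_components=10).fit_transform(X)

    # Feature selection: recursively eliminate features using a linear SVM.
    rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=10)
    X_rfe = rfe.fit_transform(X, y)
    print("Kept features:", rfe.support_)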

The following material might help you select a dimensionality reduction/feature selection approach.

Now, the next step after selecting the right method and the right classification algorithm is to find the optimal number of features for your problem. A good approach is to redo the classification repeatedly, adding one extra feature each time, and observe the classification error. Provided the feature selection technique works well, you should observe something like this:

[Figure: training and validation classification error plotted against the number of features; a blue dotted line marks the minimum of the validation error.]

The blue dotted line marks the point where the classification error on the validation set reaches its minimum; this indicates the optimal number of features for your problem. Beyond this point the validation error starts increasing while the training error keeps decreasing, which is a sign of overfitting. (The curves you get from real data will most likely not be this smooth; there will be some fluctuations and the pattern will be less clear, but this is the general shape to expect.)
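One way to produce such curves is sketched below, assuming X is a NumPy feature matrix and y the labels; the features are ranked once with RFE (any ranking method would do) and then added one at a time while the training and validation errors are recorded.

    # Add features one at a time (in ranked order) and track the error curves.
    import numpy as np
    from sklearn.feature_selection import RFE
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

    # Rank all features once; ranking_ == 1 marks the best feature.
    ranking = RFE(SVC(kernel="linear"), n_features_to_select=1).fit(X_tr, y_tr).ranking_
    order = np.argsort(ranking)

    train_err, val_err = [], []
    for k in range(1, X_tr.shape[1] + 1):
        cols = order[:k]
        clf = SVC(kernel="linear").fit(X_tr[:, cols], y_tr)
        train_err.append(1 - clf.score(X_tr[:, cols], y_tr))
        val_err.append(1 - clf.score(X_val[:, cols], y_val))

    best_k = int(np.argmin(val_err)) + 1   # feature count at the validation minimum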

Keep in mind that once the optimal number of features has been determined, a separate test set should be used to evaluate the final model (since the validation set was used to tune one of the model's parameters, it cannot also be used for the evaluation).
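A minimal sketch of that final step, assuming the test set was held out before any of the above and that order and best_k come from the loop sketched earlier (all names are illustrative):

    # Hold out a test set up front and evaluate the final model on it exactly once.
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    # ... run the feature ranking and validation loop on X_trainval only ...
    cols = order[:best_k]
    final_clf = SVC(kernel="linear").fit(X_trainval[:, cols], y_trainval)
    print("Test accuracy:", final_clf.score(X_test[:, cols], y_test))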

Answered by missrg on April 3, 2021

You may want to define the problem a bit more. I think the most vital piece of information that would help answer this question is whether you are trying to classify patients or conditions within patients (i.e. "Does the patient have disease X?" vs. "Is the patient in state X?").

If you are building a model to determine whether or not a patient is in state X, then I think feature selection is not really what you should be thinking about; I would treat this as a batch-effect problem. This makes sense when you want to use as many samples as possible and therefore have multiple samples from each patient, but each patient may have a different baseline or a different amount of variation in their measurements. Changes within a patient will therefore be obscured unless the features are normalized within each batch.

Normally, batch effects refer to differences between batches produced by different lab equipment; here, however, you can think of the patients as batches. You can therefore check for batch effects by running PCA and looking at a plot of PC1 vs. PC2 with the samples colored by patient. If the samples cluster together by color, try correcting for the batch effects by standardizing the features for each patient separately, then redo the PCA and see whether the clustering disappears. A rough sketch of this check is given below.
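The sketch assumes X is a NumPy feature matrix and patient_ids an array of patient labels of the same length (both names are illustrative):

    # Check for patient-level "batch effects" with PCA, then standardize per patient.
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    def pca_by_patient(X, patient_ids, title):
        patient_ids = np.asarray(patient_ids)
        pcs = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
        for pid in np.unique(patient_ids):
            mask = patient_ids == pid
            plt.scatter(pcs[mask, 0], pcs[mask, 1], label=str(pid))
        plt.xlabel("PC1"); plt.ylabel("PC2"); plt.title(title); plt.legend(); plt.show()

    pca_by_patient(X, patient_ids, "Before per-patient standardization")

    # Treat each patient as a batch: z-score every feature within each patient.
    X_corrected = (pd.DataFrame(X)
                     .groupby(np.asarray(patient_ids))
                     .transform(lambda col: (col - col.mean()) / col.std())
                     .values)

    pca_by_patient(X_corrected, patient_ids, "After per-patient standardization")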

At that point, you can just build your classification model and use feature selection or regularization as you normally would.

In the case that you are classifying the patients (i.e. whether a patient has disease X or not), it is clear that the differences between patients are exactly what you need to build this model. I doubt that there is a rule of thumb for how many features to use depending on the number of groups or the number of samples within each group. You could try cross-validation that samples by patient, for example as sketched below.
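One way to do that, as a sketch: scikit-learn's GroupKFold keeps all samples from the same patient in a single fold, so each validation score measures generalization to unseen patients. X, y and patient_ids are assumed to exist, and the classifier is just a placeholder.

    # Grouped cross-validation: no patient appears in both training and validation folds.
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GroupKFold, cross_val_score

    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                             groups=patient_ids, cv=GroupKFold(n_splits=5))
    print("Per-fold accuracy:", scores)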

Answered by fractalnature on April 3, 2021
