How to determine a sufficient number of training examples for a linear regression classifier?

Question

How do we determine a sufficient number of training examples for a linear regression classifier?
What kind of behaviors might we expect if we use too few training examples?

Shahriyar Mammadli · Answer

As a rule of thumb, you can have 1 additional feature for every 15-25 samples. In rare cases it can be as low as 10 samples per feature; however, this is not recommended unless you know your data well and can verify that your model is not showing spurious results. For more, please read the source.
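To make the arithmetic concrete, here is a minimal sketch of the rule of thumb. The function names and the default of 15 samples per feature are my own choices for illustration, not part of any library.

```python
def max_features(n_samples: int, samples_per_feature: int = 15) -> int:
    """Largest feature count the samples-per-feature rule of thumb allows."""
    if samples_per_feature <= 0:
        raise ValueError("samples_per_feature must be positive")
    return n_samples // samples_per_feature


def follows_rule(n_samples: int, n_features: int,
                 samples_per_feature: int = 15) -> bool:
    """True if the feature count stays within the rule of thumb."""
    return n_features <= max_features(n_samples, samples_per_feature)


print(max_features(80))        # 5
print(follows_rule(80, 15))    # False: 15 features is too many for 80 samples
print(max_features(80, 10))    # 8, using the riskier 10-samples-per-feature bound
```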
Using too few samples, or breaking the above-mentioned rule by using too many features for the number of samples (e.g. 80 samples and 15 features), may cause overfitting. In this case the model will not generalize; instead it will learn your training data in detail. Those details will most likely be redundant information or random noise in your data, so your model will be useless in production. Another possibility is that it may give rise to spurious results. In other words, since your samples are few, your model can find a way to imitate your target variable by chance, with no logical relationship behind it.
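The effect can be sketched with the 80 samples and 15 features case. This is only an illustration under assumptions of my own: the features and the target are pure random noise, so any fit the model finds is spurious by construction. The helpers `fit_ols`, `r_squared` and `noise_data` are hypothetical names, written with the standard library only so the snippet runs anywhere.

```python
import random


def fit_ols(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    k = len(rows[0])
    # Augmented system [X^T X | X^T y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda i: abs(A[i][c]))
        A[c], A[p] = A[p], A[c]
        for i in range(k):
            if i != c:
                f = A[i][c] / A[c][c]
                A[i] = [a - f * b for a, b in zip(A[i], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


def r_squared(coef, X, y):
    """Coefficient of determination of the fitted model on (X, y)."""
    pred = [coef[0] + sum(b * v for b, v in zip(coef[1:], r)) for r in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot


def noise_data(n, p, rng):
    """n samples of p random features and a target unrelated to them."""
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [rng.gauss(0, 1) for _ in range(n)]
    return X, y


rng = random.Random(0)
X_train, y_train = noise_data(80, 15, rng)    # the 80 x 15 case from the answer
X_test, y_test = noise_data(2000, 15, rng)    # fresh data the model never saw

coef = fit_ols(X_train, y_train)
train_r2 = r_squared(coef, X_train, y_train)
test_r2 = r_squared(coef, X_test, y_test)
print(f"train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

The training R^2 comes out positive even though there is nothing to learn, while the R^2 on fresh data should be around zero or below. That gap is what overfitting looks like in practice.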
Thus, if you have few samples, try to follow the above-mentioned rule. If you have many features, do a bivariate analysis to find how well each feature alone explains your target variable. Another way is to use algorithms that are more robust against overfitting and can handle many features with few samples. For example, SVM is generally considered better than linear regression in such cases.
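One simple way to do such a bivariate analysis is to rank features by the absolute value of their Pearson correlation with the target. The sketch below is only one possible approach, and the toy data and the names `pearson` and `rank_features` are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no linear information
    return cov / (sx * sy)


def rank_features(columns, target):
    """Sort (name, correlation) pairs by absolute correlation, strongest first."""
    scores = {name: pearson(vals, target) for name, vals in columns.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Hypothetical toy data: "size" tracks the target closely, "noise" does not.
columns = {
    "noise": [5, 1, 4, 2, 6, 3],
    "size": [1, 2, 3, 4, 5, 6],
}
target = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1]

ranked = rank_features(columns, target)
for name, r in ranked:
    print(f"{name}: r = {r:+.3f}")
```

Features near the bottom of the ranking are candidates to drop first. Keep in mind that with few samples even these correlations can be high by chance, so treat the ranking as a guide rather than proof.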
