SVM using scikit learn runs endlessly and never completes execution

Question

I am trying to run SVR using scikit-learn (python) on a training dataset that has 595605 rows and 5 columns (features) while the test dataset has 397070 rows. The data has been pre-processed and regularized.
I am able to successfully run the test examples, but on executing using my dataset and letting it run for over an hour, I could still not see any output or termination of the program. I tried executing using a different IDE and even from the terminal, but that does not seem to be the issue.
I also tried changing the 'C' parameter value from 1 to 1e3.
I am facing similar issues with all SVM implementations using scikit.
Am I not waiting long enough for it to complete?
How much time should this execution take?
From my experience, it should not require more than a few minutes.
Here is my system configuration:
Ubuntu 14.04, 8GB RAM, lots of free memory, 4th gen i7 processor

Jaidev Deshpande · Answer

This makes sense. IIUC, the speed of execution of support vector operations is bound by number of samples, not dimensionality. In other words, it is capped by CPU time and not RAM. I'm not sure exactly how much time this should take, but I'm running some benchmarks to find out.

Answered by Jaidev Deshpande on February 21, 2021

Jessica Collins · Answer

Kernelized SVMs require the computation of a distance function between each point in the dataset, which is the dominating cost of $\mathcal{O}(n_\text{features} \times n_\text{observations}^2)$. The storage of the distances is a burden on memory, so they're recomputed on the fly. Thankfully, only the points nearest the decision boundary are needed most of the time. Frequently computed distances are stored in a cache. If the cache is getting thrashed then the running time blows up to $\mathcal{O}(n_\text{features} \times n_\text{observations}^3)$.

You can increase this cache by invoking SVR as

model = SVR(cache_size=7000)

In general, this is not going to work. But all is not lost. You can subsample the data and use the rest as a validation set, or you can pick a different model. Above the 200,000 observation range, it's wise to choose linear learners.
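To see why a bigger cache is unlikely to be enough here, a rough back-of-the-envelope calculation helps. The sketch below assumes each kernel value is stored as an 8-byte float and treats the cache size as plain megabytes; both are simplifying assumptions, not a statement about libsvm's internals.

```python
# Rough size of the full kernel matrix for the dataset in the question.
n_samples = 595_605          # training rows from the question
bytes_per_entry = 8          # assumption: one float64 per kernel value

kernel_entries = n_samples ** 2
kernel_bytes = kernel_entries * bytes_per_entry

cache_mb = 7000              # the cache_size suggested above
cache_fraction = cache_mb * 1e6 / kernel_bytes

print(f"kernel matrix: about {kernel_bytes / 1e12:.2f} TB")
print(f"a {cache_mb} MB cache covers about {cache_fraction:.2%} of it")
```

Under these assumptions the full matrix is in the terabyte range, so even a cache of several gigabytes covers well under one percent of the pairwise values.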

Kernel SVM can be approximated, by approximating the kernel matrix and feeding it to a linear SVM. This allows you to trade off between accuracy and performance in linear time.

A popular means of achieving this is to use 100 or so cluster centers found by kmeans/kmeans++ as the basis of your kernel function. The new derived features are then fed into a linear model. This works very well in practice. Tools like sophia-ml and vowpal wabbit are how Google, Yahoo and Microsoft do this. Input/output becomes the dominating cost for simple linear learners.
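A minimal sketch of the cluster-center idea, on synthetic stand-in data. The choice of an RBF similarity, the `gamma` value and the use of `Ridge` as the linear learner are my assumptions for illustration; the answer does not prescribe them.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge
from sklearn.metrics.pairwise import rbf_kernel

# Synthetic stand-in for the real data (5 features, as in the question).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=2000)

# 1. Find 100 cluster centers to act as the basis of the kernel.
kmeans = KMeans(n_clusters=100, n_init=1, random_state=0).fit(X)
centers = kmeans.cluster_centers_

# 2. Derived features: RBF similarity of each sample to each center.
gamma = 0.2  # hypothetical value; tune it on real data
Z = rbf_kernel(X, centers, gamma=gamma)  # shape (n_samples, 100)

# 3. Feed the derived features into a linear model.
model = Ridge(alpha=1.0).fit(Z, y)
print("training R^2:", round(model.score(Z, y), 3))
```

The cost of step 2 grows linearly with the number of samples, since each row is compared only against the 100 centers rather than against every other row.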

With abundant data, nonparametric models perform roughly the same for most problems, the exceptions being structured inputs such as text, images, time series and audio.

Further reading

How to implement this.
How to train an ngram neural network with dropout that scales linearly
Kernel Approximations
A formal paper on using kmeans to approximate kernel machines

Leela Prabhu · Answer

With such a huge dataset I think you'd be better off using a neural network, deep learning, random forest (they are surprisingly good), etc.

As mentioned in earlier replies, the time taken is proportional to the third power of the number of training samples. Even the prediction time is polynomial in terms of number of test vectors.

If you really must use SVM then I'd recommend using GPU speed up or reducing the training dataset size. Try with a sample (10,000 rows maybe) of the data first to see whether it's not an issue with the data format or distribution.
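One way to run that sanity check is to time a fit on a random subsample. This is only a sketch on synthetic stand-in data; `n_sub` is set lower than the 10,000 rows suggested above to keep the demo quick, and the variable names are placeholders.

```python
import time

import numpy as np
from sklearn.svm import SVR

# Synthetic stand-in for the real training set.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50_000, 5))
y_train = X_train @ rng.normal(size=5) + 0.1 * rng.normal(size=50_000)

# Draw a random subsample without replacement.
n_sub = 2_000
idx = rng.choice(len(X_train), size=n_sub, replace=False)

start = time.perf_counter()
model = SVR(C=1.0).fit(X_train[idx], y_train[idx])
elapsed = time.perf_counter() - start

print(f"fit on {n_sub} rows took {elapsed:.2f} s")
```

Repeating this for a few increasing values of `n_sub` gives a feel for how quickly the fit time grows before committing to the full dataset.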

As mentioned in other replies, linear kernels are faster.

Diego · Answer

Leave it to run overnight or better for 24 hours. 
What is your CPU utilization? If none of the cores is running at 100% then you have a problem. Probably with memory. Have you checked whether your dataset fits into 8GB at all?
Have you tried the SGDClassifier? It is one of the fastest there. Worth giving it a try first hoping it completes in an hour or so.
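On the memory question, the raw feature matrix itself is small. A quick check, assuming the features are stored as 64-bit floats in a NumPy array:

```python
import numpy as np

# Shape taken from the question: 595605 rows, 5 features.
n_rows, n_cols = 595_605, 5
X = np.zeros((n_rows, n_cols), dtype=np.float64)  # assumption: float64 storage

print(f"feature matrix: {X.nbytes / 1e6:.1f} MB")
```

So the input data fits comfortably in 8GB; if memory is a problem, it is more likely to come from what the solver computes than from the dataset.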

Ricardo Cruz · Answer

SVM solves an optimization problem of quadratic order.

I do not have anything to add that has not been said here. I just want to post a link to the sklearn page about SVC, which clarifies what is going on:

The implementation is based on libsvm. The fit time complexity is more
  than quadratic with the number of samples which makes it hard to scale
  to dataset with more than a couple of 10000 samples.

If you do not want to use kernels, and a linear SVM suffices, there is LinearSVR, which is much faster because it uses an optimization approach à la linear regression. You'll have to normalize your data though, in case you're not doing so already, because it applies regularization to the intercept coefficient, which is probably not what you want. It means that if your data average is far from zero, it will not be able to solve it satisfactorily.

What you can also use is stochastic gradient descent to solve the optimization problem. Sklearn features SGDRegressor. You have to use loss='epsilon_insensitive' to have similar results to linear SVM. See the documentation. I would only use gradient descent as a last resort though because it implies much tweaking of the hyperparameters in order to avoid getting stuck in local minima. Use LinearSVR if you can.
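Both options can be sketched side by side. The data below is synthetic, and the hyperparameter values are placeholders rather than recommendations; wrapping each model in a pipeline with `StandardScaler` is my assumption for handling the normalization mentioned above.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR

# Synthetic stand-in data whose features are not centered at zero.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(5000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=5000)

# Linear SVR; scaling first keeps the regularized intercept small.
linear_svr = make_pipeline(
    StandardScaler(),
    LinearSVR(C=1.0, max_iter=10_000, random_state=0),
)

# SGD with the epsilon-insensitive loss, i.e. a linear SVR objective.
sgd = make_pipeline(
    StandardScaler(),
    SGDRegressor(loss="epsilon_insensitive", max_iter=1000, tol=1e-3,
                 random_state=0),
)

linear_svr.fit(X, y)
sgd.fit(X, y)

print("LinearSVR R^2:", round(linear_svr.score(X, y), 3))
print("SGDRegressor R^2:", round(sgd.score(X, y), 3))
```

Both fits scale roughly linearly with the number of rows, which is what makes them practical at the dataset size in the question.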

Rishabh Gupta · Answer

You need to scale your data. Scaling standardizes each feature to zero mean and unit variance, which helps the solver converge faster.

Try using following code:

# X is your numpy data array.

from sklearn import preprocessing

X = preprocessing.scale(X)

Shelby Matlock · Answer

Did you include scaling in your pre-processing step? I had this issue when running my SVM. My dataset is ~780,000 samples (row) with 20 features (col). My training set is ~235k samples. It turns out that I just forgot to scale my data! If this is the case, try adding this bit to your code:
# scale data to [-1,1]; increase SVM speed
from sklearn.preprocessing import MinMaxScaler
scaling = MinMaxScaler(feature_range=(-1,1)).fit(X_train)
X_train = scaling.transform(X_train)
X_test = scaling.transform(X_test)

Dutse I · Answer

I recently encountered a similar problem because I forgot to scale the features in my dataset, which had earlier been used to train an ensemble model. Failure to scale the data may be the likely culprit, as pointed out by Shelby Matlock. You may try the different scalers available in sklearn, such as RobustScaler:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X = scaler.fit_transform(X)

X is now transformed/scaled and ready to be fed to your desired model.

Sujay_K · Answer

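Try standardising the data (zero mean, unit variance). I faced a similar problem and upon normalisation everything worked fine. You can normalise data easily with a scaler fitted on the training set, so that the test set is transformed with the same statistics:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)  # fit on the training data only
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)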

Martin Thoma · Answer

I just had a similar issue with a dataset which contains only 115 elements and only one single feature (international airline data). The solution was to scale the data. What I missed in answers so far was the usage of a Pipeline:

from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler

model = Pipeline([('scaler', StandardScaler()),
                  ('svr', SVR(kernel='linear'))])

You can train the model like a usual classification / regression model and evaluate it the same way. Nothing changes, only the definition of the model.
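As a usage sketch, fitting and predicting with such a pipeline looks like this. The data is synthetic (115 rows and one large-valued feature, loosely mirroring the case described above), and the 90/25 split is arbitrary.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in: 115 samples, one unscaled feature.
rng = np.random.default_rng(0)
X = np.arange(115, dtype=float).reshape(-1, 1) * 1000.0
y = 1e-5 * X.ravel() + rng.normal(scale=0.05, size=115)

model = Pipeline([("scaler", StandardScaler()),
                  ("svr", SVR(kernel="linear"))])

# fit() learns the scaling from the training rows only;
# predict() applies that same scaling before calling the SVR.
model.fit(X[:90], y[:90])
pred = model.predict(X[90:])
print(pred[:3])
```

Because the scaler lives inside the pipeline, the same object can be passed to cross-validation or grid search without leaking test-set statistics into the scaling step.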

Habib Karbasian · Answer

I have encountered this issue, and increasing cache_size as others are suggesting does not help at all. You can see this post and this one, where the main contributor suggested that you should change the code manually.

As you know, SVC and SVR are optimization problems, and they stop when the error margin is so small that further optimization is futile. There is another parameter, max_iter, with which you can set how many iterations the solver should do.

I have used sklearn in Python and e1071 in R. R gets to the result much faster without setting max_iter, while sklearn takes 2-4 times longer. The only way I could bring down the computation time in Python was to set max_iter. The right value depends on the complexity of your model, the number of features, the kernel and the hyperparameters, but for a small dataset of around 4000 data points with max_iter set to 10000, the results were not different at all, which was acceptable.
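A minimal sketch of capping the iterations, on synthetic data of about the size mentioned above. When the cap is reached before the solver's own stopping criterion, scikit-learn emits a `ConvergenceWarning`, which the snippet records so you can tell whether the result was cut short.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.svm import SVR

# Synthetic stand-in: about 4000 data points, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=4000)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Stop after at most 10000 solver iterations.
    model = SVR(max_iter=10_000).fit(X, y)

# True if the solver was cut off by max_iter before converging.
hit_cap = any(issubclass(w.category, ConvergenceWarning) for w in caught)
print("stopped at max_iter:", hit_cap)
```

If the cap is hit, it is worth comparing the scores against a run with a higher cap before trusting the truncated model.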

DEEPAK MEWADA · Answer

I also faced a similar problem with SVM training taking infinite time.
Now, the problem is resolved by preprocessing the data.
Please add the following lines in your code before training:
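from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then apply the same transform to both sets.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)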
