Data Science Asked by Ben Williams on August 13, 2020
I’m creating a basic application to predict the ‘Closing’ value of a stock for day n+1, given the features of the stock on day n, using Python and Scikit-learn.
A sample row in my dataframe looks like this (the dataframe has 2000 rows):
Open Close High Low Volume
0 537.40 537.10 541.55 530.47 52877.98
This is similar to this video, where the author uses ‘Dates’ and ‘Open Price’. In that example, the dates are the features and the open price is the target.
In my example I don’t have a ‘Dates’ column in my dataset; instead I want to use the Open, High, Low and Volume data as the features, because I thought that would make the model more accurate.
I was defining my features and targets like so:
features = df.loc[:,df.columns != 'Closing']
targets = df.loc[:,df.columns == 'Closing']
which would return dataframes looking like this:
features:
Open High Low Vol from
29 670.02 685.11 661.09 92227.36
targets:
Close
29 674.57
However, I realised that the data needs to be in a NumPy array, so I now get my features and targets like this:
features = df.loc[:,df.columns != 'Closing'].values
targets = df.loc[:,df.columns == 'Closing'].values
So now my features look like this:
[6.70020000e+02 6.85110000e+02 6.61090000e+02 9.22273600e+04
6.23944806e+07]
[7.78102000e+03 8.10087000e+03 7.67541000e+03 6.86188500e+04
5.41391322e+08]
and my targets look like this:
[ 674.57]
[ 8042.64]
I then split up my data using:
X_training, X_testing, y_training, y_testing = train_test_split(features, targets, test_size=0.8)
I tried to follow the Scikit-Learn documentation, which resulted in the following:
svr_rbf = svm.SVR(kernel='rbf', C=100.0, gamma=0.0004, epsilon= 0.01 )
svr_rbf.fit(X_training, y_training)
predictions = svr_rbf.predict(X_testing)
print(predictions)
I assumed that this would predict the y values for the testing features, which I could then plot against the actual y_testing values to see how similar they are. However, the prediction is the same value for every X_testing sample:
[3763.84681818 3763.84681818 3763.84681818 3763.84681818 3763.84681818
I’ve tried changing the values of epsilon, C and gamma, but that doesn’t change the fact that the predictions always give the same value.
I know that it might not be possible to predict stock prices accurately, but I must have done something wrong to get the same value when applying the model to various different test samples.
There are a couple of changes that I think will help.
First, a general one for all model building: I would suggest you scale your data before putting it into the model.
It might not directly solve the problem of receiving the same predicted value for every sample, but you might notice that your predictions lie somewhere in the range of your input values. As you are using unscaled volume alongside the prices, the model essentially has to work on two very different scales at the same time, which it cannot do well.
Have a look at the StandardScaler in sklearn for a way to do that.
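A minimal sketch of what that looks like, using two made-up rows in the shape of the data above (values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the unscaled features; note the volume column is
# several orders of magnitude larger than the price columns
features = np.array([[670.02, 685.11, 661.09, 92227.36],
                     [7781.02, 8100.87, 7675.41, 68618.85]])

scaler = StandardScaler()
scaled = scaler.fit_transform(features)  # zero mean, unit variance per column

print(scaled.mean(axis=0))  # ~0 for every column
print(scaled.std(axis=0))   # ~1 for every column
```

After scaling, every column contributes on the same numeric scale, so the RBF kernel is no longer dominated by the volume column.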
Next a few suggestions of things to change, specifically because you are working with stock prices:
I would normally predict the value of the stock tomorrow, not the closing price of the same day whose open/high/low/volume you are using as features. To me, predicting the same-day close only makes sense if you have high-frequency (intraday) data.
Given this, you would need to shift your y values by one step. There is a method on Pandas DataFrames to help with that, but as you don’t have a date column and you only need to shift by one timestep anyway, you can just do this:
features = df.loc[:, df.columns != 'Closing'].values[:-1]  # leave out the last step
targets = df.loc[:, df.columns == 'Closing'].values[1:]    # start one step later
You could then even predict the opening price of the following day, or keep the closing price in the feature data, as that would not introduce temporal bias.
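The Pandas method mentioned above is shift. A small sketch of using it instead of slicing, with a three-row toy frame (values taken from the sample rows earlier in the post):

```python
import pandas as pd

df = pd.DataFrame({
    "Open":  [537.40, 670.02, 7781.02],
    "Close": [537.10, 674.57, 8042.64],
})

# Target = the *next* row's Close; the last row has no "tomorrow", so drop it
df["Target"] = df["Close"].shift(-1)
df = df.dropna()

features = df[["Open", "Close"]].values
targets = df["Target"].values

print(targets)  # [ 674.57 8042.64]
```

Either approach gives the same alignment; shift just makes the intent explicit in the dataframe itself.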
Something that would require more setup would be to look at how you shuffle your data. Because you want to use historical values to predict future ones, you need to keep the relevant history together. Have a look at my other answer to this question and the diagram there, which explains more about this idea.
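One concrete consequence of this: train_test_split shuffles by default, which leaks future rows into the training set. A sketch of keeping the temporal order instead (toy data, illustrative only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # toy feature matrix, rows in time order
y = np.arange(10)                  # toy targets, one per day

# shuffle=False keeps the order: train on the past, test on the future
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False)

print(y_train)  # [0 1 2 3 4 5 6 7]
print(y_test)   # [8 9]
```

For proper cross-validation on time series, sklearn's TimeSeriesSplit generalises this idea to multiple folds.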
You should also scale y_train and y_test, so that the model knows to predict within that range. Do this using the same StandardScaler instance for both, fitted only on the training data, so as not to introduce bias. Have a look at this short tutorial. Your predictions will then lie within the same range (e.g. [-1, +1]), and you can compute errors on that range too. If you really want, you can then scale your predictions back to the original range so they look more realistic, but that isn’t really necessary to validate the model; you can simply plot the predictions against the ground truth in the scaled space.
Check out this thread, which explains a few reasons why you should use the same instance of StandardScaler on the test data.
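Putting the scaling advice together, here is a sketch of the full fit/predict flow on toy data (the variable names and the synthetic data are illustrative, not from the original post):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 4)) * 1000   # toy features on a large scale
X_test = rng.normal(size=(20, 4)) * 1000
y_train = X_train[:, 0] * 0.5               # toy target tied to the first feature

# Fit both scalers on the training data ONLY
x_scaler = StandardScaler().fit(X_train)
y_scaler = StandardScaler().fit(y_train.reshape(-1, 1))

model = SVR(kernel="rbf")
model.fit(x_scaler.transform(X_train),
          y_scaler.transform(y_train.reshape(-1, 1)).ravel())

# Reuse the *same* fitted scalers on the test data
preds_scaled = model.predict(x_scaler.transform(X_test))

# Optionally map predictions back to the original price range
preds = y_scaler.inverse_transform(preds_scaled.reshape(-1, 1)).ravel()
```

With the inputs and target on a common scale, the RBF kernel's gamma and the epsilon-tube both operate on sensible magnitudes, and the predictions are no longer collapsed to a single value.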
Correct answer by n1k31t4 on August 13, 2020