Data Science Asked on June 26, 2021
I have the sales of items from January 2013 to October 2015. I just want to predict the total sales for the next month. Just for the sake of learning, I would like to transform it into a multiple regression model coded from scratch, without any libraries. So far, I’ve been able to get the betas but I don’t know how to get the prediction for the next month.
Here is the historical monthly sales data from January 2013 to October 2015, ts:
date_block_num
0 131479.0
1 128090.0
2 147142.0
3 107190.0
4 106970.0
5 125381.0
6 116966.0
7 125291.0
8 133332.0
9 127541.0
10 130009.0
11 183342.0
12 116899.0
13 109687.0
14 115297.0
15 96556.0
16 97790.0
17 97429.0
18 91280.0
19 102721.0
20 99208.0
21 107422.0
22 117845.0
23 168755.0
24 110971.0
25 84198.0
26 82014.0
27 77827.0
28 72295.0
29 64114.0
30 63187.0
31 66079.0
32 72843.0
33 71056.0
I tried to do a simple linear regression:
$$y_t = \alpha + \beta x_t + \varepsilon$$
I first tried to estimate $\alpha$ and $\beta$ and then use predict(alpha, beta, 34). So I did:
import random

def predict(alpha, beta, x_i):
    return alpha + beta * x_i

def error(alpha, beta, x_i, y_i):
    """the error from predicting beta * x_i + alpha
    when the actual value is y_i"""
    return y_i - predict(alpha, beta, x_i)

def sum_squared_errors(alpha, beta, x, y):
    return sum(error(alpha, beta, x_i, y_i) ** 2
               for x_i, y_i in zip(x, y))

def correlation(x, y):
    stdev_x = standard_deviation(x)
    stdev_y = standard_deviation(y)
    if stdev_x > 0 and stdev_y > 0:
        return covariance(x, y) / stdev_x / stdev_y
    else:
        return 0

def least_squares_fit(x, y):
    """given training values for x and y,
    find the least-squares values of alpha and beta"""
    beta = correlation(x, y) * standard_deviation(y) / standard_deviation(x)
    alpha = mean(y) - beta * mean(x)
    return alpha, beta

def total_sum_of_squares(y):
    """the total squared variation of y_i's from their mean"""
    return sum(v ** 2 for v in de_mean(y))

def r_squared(alpha, beta, x, y):
    """the fraction of variation in y captured by the model, which equals
    1 - the fraction of variation in y not captured by the model"""
    return 1.0 - (sum_squared_errors(alpha, beta, x, y) /
                  total_sum_of_squares(y))

# example call; num_friends_good and daily_minutes_good are not defined in this post
# r_squared(alpha, beta, num_friends_good, daily_minutes_good)

def squared_error(x_i, y_i, theta):
    alpha, beta = theta
    return error(alpha, beta, x_i, y_i) ** 2

def squared_error_gradient(x_i, y_i, theta):
    alpha, beta = theta
    return [-2 * error(alpha, beta, x_i, y_i),        # partial derivative w.r.t. alpha
            -2 * error(alpha, beta, x_i, y_i) * x_i]  # partial derivative w.r.t. beta

def in_random_order(data):
    """generator that returns the elements of data in random order"""
    indexes = [i for i, _ in enumerate(data)]  # create a list of indexes
    random.shuffle(indexes)                    # shuffle them
    for i in indexes:
        yield data[i]

def minimize_stochastic(target_fn, gradient_fn, x, y, theta_0, alpha_0=0.01):
    print("x: ", x, "\ny: ", y.tolist())
    data = zip(x, y)
    theta = theta_0                            # initial guess
    alpha = alpha_0                            # initial step size
    min_theta, min_value = None, float('inf')  # the minimum so far
    iterations_with_no_improvement = 0

    # if we ever go 100 iterations with no improvement, stop
    while iterations_with_no_improvement < 100:
        value = sum(target_fn(x_i, y_i, theta) for x_i, y_i in data)
        # print("value: ", value)

        if value < min_value:
            # if we've found a new minimum, remember it
            # and go back to the original step size
            min_theta, min_value = theta, value
            iterations_with_no_improvement = 0
            alpha = alpha_0
        else:
            # otherwise we're not improving, so try shrinking the step size
            iterations_with_no_improvement += 1
            alpha *= 0.9

        # and take a gradient step for each of the data points
        # print("data: ", [x for x in data])
        # print("data: ", data)
        for x_i, y_i in in_random_order(data):
            gradient_i = gradient_fn(x_i, y_i, theta)
            theta = vector_subtract(theta, scalar_multiply(alpha, gradient_i))

    return min_theta

# choose random values to start
random.seed(0)
theta = [random.random(), random.random()]
alpha, beta = minimize_stochastic(squared_error,
                                  squared_error_gradient,
                                  ts.index.values,
                                  ts.values,
                                  theta,
                                  0.001)
print("alpha: ", alpha, "beta: ", beta)
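The listing above relies on helpers for the mean, standard deviation, covariance, and element-wise vector arithmetic whose definitions are not shown in the post. A minimal sketch of what they might look like follows. The names and the sample (n - 1) convention are assumptions, and the definitions actually used may differ:

```python
import math

# Hypothetical helper definitions; the post does not show the originals.

def mean(xs):
    return sum(xs) / len(xs)

def de_mean(xs):
    """shift xs so that its mean becomes 0"""
    x_bar = mean(xs)
    return [x - x_bar for x in xs]

def variance(xs):
    """sample variance (divides by n - 1)"""
    return sum(d ** 2 for d in de_mean(xs)) / (len(xs) - 1)

def standard_deviation(xs):
    return math.sqrt(variance(xs))

def covariance(xs, ys):
    """sample covariance (divides by n - 1)"""
    pairs = zip(de_mean(xs), de_mean(ys))
    return sum(dx * dy for dx, dy in pairs) / (len(xs) - 1)

def vector_subtract(v, w):
    """element-wise v - w"""
    return [v_i - w_i for v_i, w_i in zip(v, w)]

def scalar_multiply(c, v):
    """multiply every element of v by the scalar c"""
    return [c * v_i for v_i in v]
```

As long as the variance and covariance share the same denominator, the slope computed in least_squares_fit does not depend on whether n or n - 1 is used, because the factor cancels.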
But I got very low values for alpha and beta:
alpha: 0.8444218515250481 beta: 0.7579544029403025
So the predicted total sales for month 34 (November 2015) are 26.614871551495334, which looks impossible compared to month 33 (October 2015): 71056.0.
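One check that may help narrow this down is to compare the reported alpha and beta with the starting point produced by random.seed(0). The sketch below assumes CPython's standard random module and redefines predict as in the listing so that it runs on its own. If the two pairs match, the optimizer handed back its initial guess unchanged, and the forecast is just that line evaluated at x = 34:

```python
import random

def predict(alpha, beta, x_i):
    return alpha + beta * x_i

# Reproduce the starting point used in the post
random.seed(0)
theta_0 = [random.random(), random.random()]

# Values copied from the printed output in the post
reported_alpha, reported_beta = 0.8444218515250481, 0.7579544029403025

print("initial guess:  ", theta_0)
print("reported result:", [reported_alpha, reported_beta])
print("identical:", theta_0 == [reported_alpha, reported_beta])

# The forecast for month 34 implied by the reported coefficients
forecast = predict(reported_alpha, reported_beta, 34)
print("forecast for month 34:", forecast)
```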
Did I mess up the linear regression algorithm? My guess is that my random starting values may be too low:
theta = [random.random(), random.random()]
Yet they should keep increasing anyway until the input is exhausted, shouldn't they?
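A possible reason the values never move, assuming the code runs on Python 3: there, zip returns a one-shot iterator rather than a list. In minimize_stochastic, `data = zip(x, y)` would then be used up by the first `sum(...)`, leaving nothing for the gradient loop to iterate over. A minimal sketch of that behaviour, using made-up toy values rather than the sales data:

```python
x = [0, 1, 2]
y = [10.0, 20.0, 30.0]  # toy values, not the sales data

data = zip(x, y)
first_pass = sum(y_i for _, y_i in data)  # consumes the iterator
second_pass = [pair for pair in data]     # nothing is left to iterate over

print(first_pass)   # 60.0
print(second_pass)  # []

# Materialising the pairs makes them reusable and indexable
data = list(zip(x, y))
print(len(data), data[1])  # 3 (1, 20.0)
```

If this is what happens, the loop over in_random_order(data) never executes and theta stays at theta_0. A list is also what in_random_order needs, since it indexes data[i].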
So how should I choose the initial thetas for a simple linear regression?
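For a least-squares line the objective is convex, so in principle the starting point should affect only how long gradient descent takes, not where it ends up. One way to sanity-check any gradient-descent result is to compare it with the closed-form fit. A minimal sketch using the sales figures listed above, written without libraries; the name `closed_form_fit` is introduced here purely for illustration:

```python
# Monthly sales copied from the table in the post (date_block_num 0..33)
sales = [
    131479.0, 128090.0, 147142.0, 107190.0, 106970.0, 125381.0,
    116966.0, 125291.0, 133332.0, 127541.0, 130009.0, 183342.0,
    116899.0, 109687.0, 115297.0, 96556.0, 97790.0, 97429.0,
    91280.0, 102721.0, 99208.0, 107422.0, 117845.0, 168755.0,
    110971.0, 84198.0, 82014.0, 77827.0, 72295.0, 64114.0,
    63187.0, 66079.0, 72843.0, 71056.0,
]
months = list(range(len(sales)))

def closed_form_fit(x, y):
    """ordinary least squares for y = alpha + beta * x"""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((x_i - x_bar) * (y_i - y_bar) for x_i, y_i in zip(x, y))
    s_xx = sum((x_i - x_bar) ** 2 for x_i in x)
    beta = s_xy / s_xx              # slope: covariance over variance of x
    alpha = y_bar - beta * x_bar    # intercept: line passes through the means
    return alpha, beta

alpha, beta = closed_form_fit(months, sales)
print("alpha:", alpha, "beta:", beta)
print("forecast for month 34:", alpha + beta * 34)
```

The slope here is the same quantity as in least_squares_fit, since correlation times the ratio of standard deviations reduces to covariance over the variance of x. Note also that x spans 0 to 33 while y is on the order of 10^5, so individual gradient steps can be very large; rescaling the inputs or using a smaller step size may help gradient descent reach a comparable result.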