
Where do I start? I have a massive dataset (from web scraping) and want to predict y from 20 variables

Data Science Asked on October 13, 2020

I’m trying to recreate, as closely as I can, an algorithm that a company is using by collecting all of its outputs and adding some relevant variables (I won’t be able to recreate it perfectly, as I’m missing inventory data). I’m guessing this is a job for machine learning, but where do I start? Are there specific models I should start with? Software I should use? Courses I should dive into?

One Answer

Since you are really new to all this, here are some suggestions.

Download and install R and RStudio Desktop.

Look at the book "An Introduction to Statistical Learning" (ISLR). The book comes with code examples (in R), which are ideal for getting started. For you it will be useful to have a look at Chapter 2 to understand what statistical learning is all about and also to get a good idea of how to use R.

However, the chapters most relevant to your problem are Chapter 3 (linear regression) and Chapter 7 (moving beyond linearity).

Now you can get started. Here are some examples of how regression can be done in R:

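# Install the packages if needed
# install.packages(c("ISLR", "gam", "Metrics"))

# Load packages
library(ISLR)     # data sets from the book (mtcars itself ships with base R)
library(gam)      # gam() and s() for generalised additive models
library(Metrics)  # mae() for mean absolute error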

# Look at data
head(mtcars)
# Look at one variable in our dataframe
summary(mtcars$mpg)

# Make a test/train split
set.seed(123)
smp_size <- floor(0.75 * nrow(mtcars))
train_ind <- sample(seq_len(nrow(mtcars)), size = smp_size)
train <- mtcars[train_ind, ]
test <- mtcars[-train_ind, ]

# Linear regression
# Estimate mpg (y) dependent on ...
reg1 = lm(mpg~disp+wt+qsec+hp, data=train)
summary(reg1)

# Regression with second-degree polynomials
reg2 = lm(mpg~poly(disp,2)+poly(wt,2)+poly(qsec,2)+poly(hp,2), data=train)
summary(reg2)

# Generalised additive model (smoothing splines with 3 degrees of freedom each)
reg3 = gam(mpg~s(disp,3)+s(wt,3)+s(qsec,3)+s(hp,3),data=train)
par(mfrow=c(2,2))
plot(reg3)

# Predict mpg
pred1 = predict(reg1, newdata=test)
pred2 = predict(reg2, newdata=test)
pred3 = predict(reg3, newdata=test)

# Compare the models: mean absolute error on the held-out test set (lower is better)
mae(test$mpg, pred1)
mae(test$mpg, pred2)
mae(test$mpg, pred3)
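
With 20 variables, typing every predictor into the formula gets tedious. As a minimal sketch (not part of the original answer), the formula shorthand `y ~ .` regresses the response on every other column in the data frame. The example below uses `mtcars` as a stand-in for your own data, repeats the train/test split from above so that it runs on its own, and computes the mean absolute error by hand with base R only. The names `reg_all`, `pred_all` and `mae_all` are made up for illustration.

```r
# Stand-in data: mtcars ships with base R; replace it with your own data frame
set.seed(123)
n <- nrow(mtcars)
train_ind <- sample(seq_len(n), size = floor(0.75 * n))
train <- mtcars[train_ind, ]
test  <- mtcars[-train_ind, ]

# "mpg ~ ." means: regress mpg on all other columns of the data frame
reg_all <- lm(mpg ~ ., data = train)
summary(reg_all)

# Mean absolute error on the held-out rows, computed without extra packages
pred_all <- predict(reg_all, newdata = test)
mae_all <- mean(abs(test$mpg - pred_all))
mae_all
```

On a data set this small, a model with many predictors can easily fit the training rows well and still predict poorly, so compare `mae_all` with the test errors of the smaller models rather than relying on `summary()` alone.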

Answered by Peter on October 13, 2020
