Cross Validated. Asked by emberbillow on December 5, 2021
I am running a numerical experiment on linear regression in the presence of both continuous and categorical variables. As in classical regression practice, the categorical variable is first converted to dummy variables, some of which are retained for modeling.
The model used in the experiment is:
$$y = \beta_0 + \beta_1 x_2 + \beta_2 z + \varepsilon$$
where $\beta_0 = 0.8$, $\beta_1 = -1.2$, $\beta_2 = 1.3$. The first covariate $x$ is uniformly distributed, i.e. $x \sim U(0, 1)$. The second covariate $z$ is a dummy variable, obtained by drawing from a standard normal distribution and thresholding at 0, so $z \in \{0, 1\}$ (please see the MATLAB code given below). The error term $\varepsilon$ is drawn from a standard normal distribution.
For comparison, the first covariate $x$ was linearly transformed into a new uniform variable $x_2 \sim U(1.2, 3)$ via $x_2 = 1.8x + 1.2$.
I then generated the response $y$ from the model above (note: the model uses $x_2$, not $x$), and ran the regressions $y \sim x + z$ and $y \sim x_2 + z$ in MATLAB. I repeated the experiment many times and visualized the results in the figure. I found that with the model $y \sim x_2 + z$ the coefficient $\beta_1$ is estimated correctly, but not with the model $y \sim x + z$. For $\beta_2$, both regressions give correct estimates.
My question is: when we do linear regression, should we normalize the data? And what is the theoretical explanation for the results of the experiment above?
The following is my MATLAB code:
clear;
clc;
nbpt = 50;                     % number of sample sizes to test
res1 = zeros(nbpt, 1);         % mean slope estimate from y ~ x2 + z
res2 = zeros(nbpt, 1);         % mean slope estimate from y ~ x + z
N = 1000:1000:50000;           % sample sizes
for inbobs = 1:nbpt
    nbobs = N(inbobs);
    ntrial = 100;              % replications per sample size
    temp1 = zeros(ntrial, 1);
    temp2 = zeros(ntrial, 1);
    for i = 1:ntrial
        x = rand(nbobs, 1);          % x ~ U(0, 1)
        m = 1.2;
        x2 = 1.8*x + m;              % x2 ~ U(1.2, 3)
        z = randn(nbobs, 1) > 0;     % dummy variable, 1 with probability 0.5
        a = 0.8;                     % beta_0
        b = -1.2;                    % beta_1
        c = 1.3;                     % beta_2
        y = a + b*x2 + c*z + randn(nbobs, 1);   % y is generated from x2
        X1 = [ones(nbobs, 1), x2, z];
        b1 = regress(y, X1);         % fit y ~ x2 + z
        X2 = [ones(nbobs, 1), x, z];
        b2 = regress(y, X2);         % fit y ~ x + z
        temp1(i) = b1(2);
        temp2(i) = b2(2);
    end
    res1(inbobs, 1) = mean(temp1);
    res2(inbobs, 1) = mean(temp2);
end
figure;
subplot(1, 2, 1);
plot(N, res1, 'o-');
ylim([-4, 4]);
subplot(1, 2, 2);
plot(N, res2, 'o-');
ylim([-4, 4]);
Thank you for providing an MRE. Forgive me if I try to answer without working with it directly; it's been a while since I used MATLAB, and I never used it for statistics.
Looking at your code, I see that you define the variable x2 from x with
m = 1.2;
x2 = 1.8*x + m;
Thus, the two regressions you fit are
$$y = \beta_0 + \beta_1 x_2 + \beta_2 z + \varepsilon$$
and
$$y = \beta_0' + \beta_1' x + \beta_2' z + \varepsilon.$$
Substituting $x_2 = 1.8x + 1.2$ into the first gives
$$y = (\beta_0 + 1.2\beta_1) + 1.8\beta_1 x + \beta_2 z + \varepsilon.$$
So, if the regression is done correctly, you should get the same value for $\beta_2$, and
$$\beta_0' - \beta_0 = 1.2\,\beta_1$$
$$\beta_1'/\beta_1 = 1.8$$
With your values, that means $\beta_1' = 1.8 \times (-1.2) = -2.16$ and $\beta_0' = 0.8 + 1.2 \times (-1.2) = -0.64$.
If this is not what you are seeing, then you might have a mistake in your code.
Also, the fact that one of your plots looks constant and the other random is a bit suspect: the regression of $y$ on $x$ should still converge as $N$ grows, just to $1.8\beta_1 = -2.16$ rather than to $\beta_1$.
Here is a simple version in R. Please tell me if you think I have done the same simulation you intended to:
set.seed(1234)
N = 10000
b_0 = 0.8
b_1 = -1.2
b_2 = 1.3
x1 = runif(N, 0,1)
x2 = runif(N, 1.2, 3)
z = rnorm(N)>0
y1 = b_0 + b_1*x1 + b_2*z + rnorm(N)
y2 = b_0 + b_1*x2 + b_2*z + rnorm(N)
lm(y1 ~ x1 + z)
#>
#> Call:
#> lm(formula = y1 ~ x1 + z)
#>
#> Coefficients:
#> (Intercept) x1 zTRUE
#> 0.784 -1.203 1.344
lm(y2 ~ x2 + z)
#>
#> Call:
#> lm(formula = y2 ~ x2 + z)
#>
#> Coefficients:
#> (Intercept) x2 zTRUE
#> 0.7987 -1.1970 1.3120
Created on 2020-07-22 by the reprex package (v0.3.0)
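Note that the reprex above draws x2 independently of x1 and generates a separate response for each regression, so it checks that each fit recovers its own coefficients rather than the rescaling relation derived earlier. A minimal sketch that mirrors the MATLAB setup more closely (one response generated from the transformed covariate, then regressed on both covariates) might look like this; the seed and sample size here are arbitrary choices, not values from the original code:
set.seed(42)
N <- 10000
x  <- runif(N)                              # x ~ U(0, 1)
x2 <- 1.8 * x + 1.2                         # x2 ~ U(1.2, 3), the transform from the MATLAB code
z  <- rnorm(N) > 0                          # dummy variable
y  <- 0.8 - 1.2 * x2 + 1.3 * z + rnorm(N)   # y generated from x2
coef(lm(y ~ x2 + z))  # should recover roughly 0.8, -1.2, 1.3
coef(lm(y ~ x + z))   # should give roughly -0.64, -2.16, 1.3
Both fits recover the same coefficient on z; only the intercept and the slope on the continuous covariate are rescaled by the affine transform.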
Answered by abalter on December 5, 2021