Cross Validated. Asked by emberbillow on December 5, 2021
I am running a numerical experiment on linear regression in the presence of both continuous and categorical variables. As in classical regression practice, the categorical variable is first converted to dummy variables, some of which are retained for modeling.
The model used in the experiment is:
$$y = \beta_0 + \beta_1 x_2 + \beta_2 z + \varepsilon$$
where $\beta_0 = 0.8$, $\beta_1 = -1.2$, $\beta_2 = 1.3$. The first covariate $x$ is uniformly distributed, i.e. $x \sim U(0, 1)$. The second covariate $z$ is a dummy variable, obtained by drawing from a standard normal distribution and thresholding at 0, so $z \in \{0, 1\}$ (please see the MATLAB code given below). The error term $\varepsilon$ is drawn from a standard normal distribution.
For comparison, the first covariate $x$ was linearly transformed into a new uniform variable $x_2 \sim U(1.2, 3)$ via $x_2 = 1.8x + 1.2$.
I then generated the response $y$ from the model above (note: the model uses $x_2$, not $x$), and ran the regressions $y \sim x + z$ and $y \sim x_2 + z$ in MATLAB. I repeated the experiment many times and visualized the results in the figure. I found that with the model $y \sim x_2 + z$ the coefficient $\beta_1$ is estimated correctly, but not with the model $y \sim x + z$. For $\beta_2$, both regressions give correct estimates.
My question is: when we do linear regression, should we normalize the data? And what is the theoretical explanation for the results of the experiment above?
The following is my MATLAB code:
clear;
clc;
nbpt = 50;                     % number of sample sizes to test
res1 = zeros(nbpt, 1);         % mean slope estimate from y ~ x2 + z
res2 = zeros(nbpt, 1);         % mean slope estimate from y ~ x + z
N = 1000:1000:50000;           % sample sizes
for inbobs = 1:nbpt
    nbobs = N(inbobs);
    ntrial = 100;              % replications per sample size
    temp1 = zeros(ntrial, 1);
    temp2 = zeros(ntrial, 1);
    for i = 1:ntrial
        x = rand(nbobs, 1);          % x ~ U(0, 1)
        m = 1.2;
        x2 = 1.8*x + m;              % x2 ~ U(1.2, 3)
        z = randn(nbobs, 1) > 0;     % dummy variable, 1 with probability 0.5
        a = 0.8;                     % beta_0
        b = -1.2;                    % beta_1
        c = 1.3;                     % beta_2
        y = a + b*x2 + c*z + randn(nbobs, 1);   % y is generated from x2
        X1 = [ones(nbobs, 1), x2, z];
        b1 = regress(y, X1);         % fit y ~ x2 + z
        X2 = [ones(nbobs, 1), x, z];
        b2 = regress(y, X2);         % fit y ~ x + z
        temp1(i) = b1(2);
        temp2(i) = b2(2);
    end
    res1(inbobs, 1) = mean(temp1);
    res2(inbobs, 1) = mean(temp2);
end
figure;
subplot(1, 2, 1);
plot(N, res1, 'o-');
ylim([-4, 4]);
subplot(1, 2, 2);
plot(N, res2, 'o-');
ylim([-4, 4]);
Thank you for providing an MRE. Forgive me if I try to answer without working with it directly; it's been a while since I used MATLAB, and I never used it for statistics.
Looking at your code, I see that you define the variable x2 from x with
m = 1.2;
x2 = 1.8*x + m;
Thus, the two regressions you fit are
$$y = \beta_0 + \beta_1 x_2 + \beta_2 z + \varepsilon$$
and
$$y = \beta_0' + \beta_1' x + \beta_2' z + \varepsilon.$$
Substituting $x_2 = 1.8x + 1.2$ into the first gives
$$y = (\beta_0 + 1.2\beta_1) + 1.8\beta_1 x + \beta_2 z + \varepsilon.$$
So, if the regression is done correctly, you should get the same value for $\beta_2$, and
$$\beta_0' - \beta_0 = 1.2\,\beta_1$$
$$\beta_1'/\beta_1 = 1.8$$
With your values, that means $\beta_1' = 1.8 \times (-1.2) = -2.16$ and $\beta_0' = 0.8 + 1.2 \times (-1.2) = -0.64$.
If this is not what you are seeing, then you might have a mistake in your code.
Also, the fact that one of your plots looks constant and the other random is a bit suspect: the regression of $y$ on $x$ should still converge as $N$ grows, just to $1.8\beta_1 = -2.16$ rather than to $\beta_1$.
Here is a simple version in R. Please tell me if you think I have done the same simulation you intended to:
set.seed(1234)
N = 10000
b_0 = 0.8
b_1 = -1.2
b_2 = 1.3
x1 = runif(N, 0,1)
x2 = runif(N, 1.2, 3)
z = rnorm(N)>0
y1 = b_0 + b_1*x1 + b_2*z + rnorm(N)
y2 = b_0 + b_1*x2 + b_2*z + rnorm(N)
lm(y1 ~ x1 + z)
#>
#> Call:
#> lm(formula = y1 ~ x1 + z)
#>
#> Coefficients:
#> (Intercept) x1 zTRUE
#> 0.784 -1.203 1.344
lm(y2 ~ x2 + z)
#>
#> Call:
#> lm(formula = y2 ~ x2 + z)
#>
#> Coefficients:
#> (Intercept) x2 zTRUE
#> 0.7987 -1.1970 1.3120
Created on 2020-07-22 by the reprex package (v0.3.0)
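Note that the reprex above draws x2 independently of x1 and generates a separate response for each regression, so it checks that each fit recovers its own coefficients rather than the rescaling relation derived earlier. A minimal sketch that mirrors the MATLAB setup more closely (one response generated from the transformed covariate, then regressed on both covariates) might look like this; the seed and sample size here are arbitrary choices, not values from the original code:
set.seed(42)
N <- 10000
x  <- runif(N)                              # x ~ U(0, 1)
x2 <- 1.8 * x + 1.2                         # x2 ~ U(1.2, 3), the transform from the MATLAB code
z  <- rnorm(N) > 0                          # dummy variable
y  <- 0.8 - 1.2 * x2 + 1.3 * z + rnorm(N)   # y generated from x2
coef(lm(y ~ x2 + z))  # should recover roughly 0.8, -1.2, 1.3
coef(lm(y ~ x + z))   # should give roughly -0.64, -2.16, 1.3
Both fits recover the same coefficient on z; only the intercept and the slope on the continuous covariate are rescaled by the affine transform.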
Answered by abalter on December 5, 2021