**Prediction revisited**

**The multiple linear regression model**

**Assumptions**

**Generating a multiple regression**

**Potential problems**

**Evaluating the fit**

- The Regression ANOVA Table
- Determining statistical significance
- A measure of the model error
- How much variance is explained?

Earlier, we considered the predictive properties of the linear regression model: y = a + bx. Linear regression provides an excellent tool for comparison of one independent variable (x) with one dependent variable (y).

Unfortunately, many problems that we want to explore are more complicated than this simple, two component system. As an example, consider the variables that influence crop yield. These include weather variables (air temperature, total degree days, precipitation), soil variables (soil moisture, soil temperature, soil nutrients), and pest concentration, just to name a few. In many cases, there are enough processes that contribute significantly to the variance of the dependent variable that simple linear regression will not provide a useful predictive tool. Alternatively, you may be interested in exploring a system and may have identified a number of potentially influential factors (x) that influence y, but may not know the relative importance of each factor. How would you assess the relative importance of each of the independent x variables on the dependent y variable of interest?

Multiple linear regression provides a means of approaching these problems. This extension of linear regression allows us to compare many x (independent) variables, with one y (dependent) variable. The advantage of this approach is that it often provides better predictive capability than simple linear regression. The method also provides a crude estimate of the relative importance of each variable in the relationship. While this method has many powerful advantages over simple linear regression, there are some potentially serious disadvantages that require careful attention if the user is to avoid potential misinterpretation.

If you recall that the general form of the simple linear regression model is:

y = a + bx.

where a and b are coefficients referred to as the intercept and slope, then it should not be too difficult to see that the general form of the multiple linear regression is:

y = a + b_{1}x_{1} + b_{2}x_{2} + b_{3}x_{3} + ... + b_{k}x_{k}

This method is sometimes referred to as "multiple linear regression with non-zero intercept". An example of a multiple linear regression of this form is:

y = 5 + 15.2x_{1} + 17.7x_{2} + 46.9x_{3}

Each independent variable in the regression model has its own "slope" (b) relative to the dependent variable for the given set of independent variables. These b values are referred to as "regression coefficients" or "partial regression coefficients". The values of b_{1} through b_{k} are found by least squares, in a method similar to the way the slope, b, is found in simple linear regression. We will not delve into the specific form of this relationship, however. The intercept of the model (a) is sometimes written as b_{0}. This allows the model to be written compactly as:

y = Sum[b_{i}x_{i}] (for i = 0 to k, with x_{0} = 1)
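To make the least-squares machinery concrete, here is a minimal sketch (made-up data, numpy assumed) that recovers the coefficients of the example equation presented earlier by fitting a column of ones (for the intercept a) alongside the x variables:

```python
import numpy as np

# Synthetic illustration: three x variables generating y with known coefficients.
rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 3))                    # columns are x1, x2, x3
y = 5 + 15.2*X[:, 0] + 17.7*X[:, 1] + 46.9*X[:, 2] + rng.normal(scale=2, size=n)

# Prepend a column of ones so the intercept a (= b0) is estimated too.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # [a, b1, b2, b3] by least squares

print(coef)   # close to [5, 15.2, 17.7, 46.9]
```

The column of ones plays the role of x_{0} = 1 in the compact form of the model.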

**Partial regression coefficients**

It is important to keep in mind that the partial regression coefficient for each x term is influenced by the presence of the partial regression coefficients for each of the other x variables in the equation. (Hence the term partial regression coefficient, since its influence on y is intimately related to the full set of regression coefficients.) The degree of influence the independent variables hold over each other depends on their level of correlation. Recall that the correlation between two variables can be related to the amount of shared variance they contain. Because of this, if one of the variables is removed from the equation, the b values for the remaining terms must change. Think of this as a re-adjustment of how the variance of the x variables is related to the variance of the y variable by the multiple linear regression. This is an important point that we will return to in the section discussing weaknesses of the method.

The mean and standard deviation of the x and y variables in generalized multiple linear regression are not removed before the analysis is conducted. This allows us to predict y values on the basis of the input x values. But because of this, the magnitudes of the b values will differ greatly, which makes it impossible to compare the relative importance of the partial regression coefficients in the equation by inspection alone. For example, variable x_{1}, with a mean and variance much greater than the mean and variance of y, may have a very large b value but could contribute less predictive power to the equation than variable x_{2}, with mean and variance similar to y and a much smaller b value. Keep in mind also that the size of each partial regression coefficient will vary depending on which other terms are included in the model!

**Relative importance of partial regression
coefficients**

There are at least two ways to deal with this difficulty. The first is to multiply each partial regression coefficient by the ratio of the standard deviation of its x variable to the standard deviation of y:

B_{k} = b_{k}(s_{k}/s_{y})

In the Davis text, the symbol B_{k} is used to denote the partial regression coefficients standardized in this way.
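A small sketch (made-up data, numpy assumed) shows why this rescaling matters: below, x_{1} has a large spread and a small raw b value, while x_{2} has a small spread and a large raw b value, so the ranking by B_{k} reverses the ranking by raw b:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 2)) * [20.0, 0.5]      # x1 has a large spread, x2 a small one
y = 5 + 0.3*X[:, 0] + 4.0*X[:, 1] + rng.normal(size=n)

A = np.column_stack([np.ones(n), X])
b = np.linalg.lstsq(A, y, rcond=None)[0][1:]   # raw partial regression coefficients

# B_k = b_k * (s_k / s_y): rescale each slope by its variable's spread.
B = b * X.std(axis=0, ddof=1) / y.std(ddof=1)
print(b, B)
```

Here the raw b values suggest x_{2} dominates, but the standardized B values show that x_{1} actually carries more predictive weight.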

Alternatively, we can standardize the x and y matrices themselves by converting them to standardized z-scores as we did in the case of standardized linear regression. The equation for this z-score based multiple linear regression approach is given by:

Z_{y} = beta_{1}Z_{x1} + beta_{2}Z_{x2} + beta_{3}Z_{x3} + ... + beta_{k}Z_{xk}

Kachigan uses the word "beta" to refer to the standardized partial regression coefficients in this special case. (Don't confuse this "beta" with the true population partial regression coefficients, which are denoted by the Greek letter β.) Because the z-scores have zero mean, the y-intercept of the model is zero (i.e. a = 0) and the regression passes through the origin. Because of this feature, this form of regression is a special case of "multiple regression through the origin". (Note: it is possible to conduct a "multiple regression through the origin" in which the means are removed but the standard deviations are not scaled to 1 as they are in the z-score regression above. Thus not all cases of "multiple regression through the origin" are z-score multiple regressions.)

The big advantage of the z-score multiple linear regression method over generalized multiple linear regression is that we can compare the beta coefficients directly to assess their relative importance. If we return to the example equation that we presented earlier:

y = 5 + 15.2x_{1} + 17.7x_{2} + 46.9x_{3}

We could write this equation in terms of a z-score regression as:

Z_{y} = 0.44Z_{x1} + 0.09Z_{x2} + 0.27Z_{x3}

because the coefficients have been scaled by the ratio of standard deviations for each independent x variable relative to the dependent y variable. Notice that there is no intercept. The magnitude of the beta values indicates the relative importance of each variable in the equation. Thus x_{1} has the greatest influence, x_{3} the second greatest, and x_{2} the least. The ratio of the squares of two beta terms gives a measure of the variance explained by one variable relative to another. The z-score regression model defines the relationship between multiple linear correlation analysis and multiple linear regression. It should be clear that the beta values represent the partial correlation coefficients, just as the slope in standardized simple linear regression is equal to the correlation coefficient. The beta_{k} values are identical to the partial correlation coefficients of multiple linear correlation analysis. Soon we'll explore how multiple regression can be related to ANOVA.
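The equivalence between fitting on z-scores and rescaling the raw b values by B_{k} = b_{k}(s_{k}/s_{y}) can be checked numerically. A minimal sketch with synthetic data, assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 3)) * [20.0, 0.5, 3.0]   # very different spreads
y = 5 + 0.3*X[:, 0] + 4.0*X[:, 1] + 1.0*X[:, 2] + rng.normal(size=n)

# Convert every variable to z-scores (zero mean, unit standard deviation).
Zx = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
Zy = (y - y.mean()) / y.std(ddof=1)

# Regression through the origin: no column of ones is needed.
beta = np.linalg.lstsq(Zx, Zy, rcond=None)[0]

# Cross-check against B_k = b_k * (s_k / s_y) from the raw-scale fit.
A = np.column_stack([np.ones(n), X])
b = np.linalg.lstsq(A, y, rcond=None)[0][1:]
B = b * X.std(axis=0, ddof=1) / y.std(ddof=1)
print(np.allclose(beta, B))   # the two standardizations agree
```

Either route yields the same directly comparable standardized coefficients.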

There are several assumptions that must be met for the multiple linear regression to be valid:

1. The random x variables must be normally distributed and linearly related to y.

2. The x values (independent variables) must be free of error.

3. The variance of y (the dependent variable) as a function of the x variables must be constant. This is referred to as **homoscedasticity**.

4. The x variables must be relatively uncorrelated.

There are four possible strategies for determining which of the x variables to include in the regression model, although some of these methods perform much better than others.

The first strategy is to form a **forced equation** which includes all of the x terms. This method is fatally flawed for reasons described in the potential problems section and should never be used.

The **forward** regression model starts by regressing y against the x variable with the greatest correlation to y, to determine a and b. Then the x variable that explains the largest fraction of residual variance in y is added to the regression, and new partial regression coefficients for the x's are determined by least squares. If the magnitude of the residual variance in y decreases significantly, and the regression coefficients for both x's are still significant (as discussed below), then both terms are retained in the equation and the procedure is repeated for each additional x variable in the order in which they explain the remaining variance. The procedure stops when the addition of another term does not explain a significant amount of variance. The problem with this approach is that the final regression equation may depend on the **order** in which the x variables are entered into the equation.

The **backward** regression model suffers from the same problem. In this procedure, the starting point is the **forced regression** in which all of the x variables are included. Then, at each step, the variable that explains the least amount of y variability is removed, until the point at which removing an additional term would result in a significant decrease in the amount of variance explained by the model.

The **stepwise** or **all regression** method is a much better procedure that takes full advantage of today's fast computers. In this method the computer generates all of the possible regression equations for a given set of x variables and selects the regression that explains the greatest fraction of y-variance with significant partial regression coefficients.

The number of possible equations depends on the number of x variables, k, and is exactly 2^{k}. The number is a power of two because inclusion or exclusion of each potential term in the equation is a binary choice (i.e. a term is either in or not in the equation). For example, if there are three x variables, then the total number of possible equations is 2 x 2 x 2 = 8. When using this method the computer generates all of the possible regressions, then reports the best one-term regression, the best two-term regression, etc. The final regression reported is the model with the smallest number of significant terms needed to explain the largest fraction of variance in y.
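The enumeration over all 2^{k} candidate models can be sketched in a few lines. This is a simplified illustration with made-up data (numpy assumed), scoring each subset by adjusted R^{2} rather than by formal significance tests:

```python
import itertools
import numpy as np

def sse_of(cols, X, y):
    """SSE of a least-squares fit (with intercept) using the given columns."""
    A = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    coef = np.linalg.lstsq(A, y, rcond=None)[0]
    return np.sum((y - A @ coef)**2)

rng = np.random.default_rng(3)
n, k = 60, 3
X = rng.normal(size=(n, k))
y = 2 + 3*X[:, 0] + 1.5*X[:, 2] + rng.normal(size=n)   # the middle variable is pure noise

ssto = np.sum((y - y.mean())**2)
subsets = [c for r in range(k + 1) for c in itertools.combinations(range(k), r)]
print(len(subsets))   # 2^3 = 8: each term is either in or out of the equation

def adj_r2(cols):
    # adjusted R^2 = 1 - [(n-1)/(n-m-1)] * SSE/SSTo for a model with m terms
    return 1 - (n - 1)/(n - len(cols) - 1) * sse_of(cols, X, y)/ssto

best = max(subsets, key=adj_r2)
print(best)
```

With a strong signal, the winning subset contains the two truly informative variables (columns 0 and 2 here).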

We saw earlier that the greatest difficulty with the simple linear regression model was the fact that the assumption that x was free of error leads to considerable uncertainty in the regression coefficients a and b. We can assess the magnitude of this problem by determining the regression effect of the model, and by calculating a confidence interval for the slope b. In the case for multiple linear regression we can deal with this problem by calculating confidence intervals for each partial regression coefficient in the model.

There is however, a more dangerous problem that arises in multiple linear regression. This problem is referred to as collinearity or multicollinearity. The magnitude of this problem depends on the degree of intercorrelation between the input variables in the regression. We have already touched on this problem above. The danger of including several terms in the equation that are well correlated is that we will effectively be using the same x variance more than once to explain variance in y. The best way to avoid this potential problem is to:

(1) Be aware of the degree of correlation between the x variables. Avoid using variables that are obviously highly correlated. (Later in the course, we'll learn a method to transform a matrix of correlated variables into a matrix of orthogonal or uncorrelated variables as a way to avoid this problem.)

(2) Always test the significance level of each regression term by evaluating its confidence interval.

(3) Use a stepwise procedure for model development.

(4) Always test a regression model using independent data to determine if the terms in the equation are valid. This is often referred to as a cal-val strategy in which the data set is split into two groups. Use one data set for calibration, and a second for validation. The larger the data set, the more sophisticated the cal-val approach that can be used. For example, we might repeatedly sub-sample the data set and generate many sets of beta coefficients to determine the uncertainty in beta.
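A minimal sketch of the cal-val split described in (4), using synthetic data and numpy:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
X = rng.normal(size=(n, 3))
y = 1 + 2*X[:, 0] - 1*X[:, 1] + rng.normal(size=n)

# Split the data in two: one half to calibrate, one half to validate.
idx = rng.permutation(n)
cal, val = idx[:n // 2], idx[n // 2:]

# Fit by least squares on the calibration set only.
A_cal = np.column_stack([np.ones(len(cal)), X[cal]])
coef = np.linalg.lstsq(A_cal, y[cal], rcond=None)[0]

# Score the fitted model on data it has never seen.
A_val = np.column_stack([np.ones(len(val)), X[val]])
yhat = A_val @ coef
r2_val = 1 - np.sum((y[val] - yhat)**2) / np.sum((y[val] - y[val].mean())**2)
print(round(r2_val, 3))
```

A validation R^{2} far below the calibration R^{2} is a warning sign that the model has fit noise rather than signal.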

A third problem inherent in multiple linear regression arises from a weakness of the least squares procedure. As we add each additional term to the equation, the remaining residual variance in y must always decrease, because we minimize a sum of squared quantities (least squares, remember!), which can never be negative. Some of this decrease is due to a true relationship, but some of it is due to random chance. Our best defense against this problem is to use the approaches described in (2) and (3) above.

It turns out that regression analysis can be related to ANOVA analysis because both of these tools are least squares methods. This works in the case for simple linear regression as well as for multiple linear regression. We can thus partition the variance in the regression model by means of sums of squares in the same way that we can for ANOVA. The results are tabulated in a regression ANOVA Table. The only difference between the one-way ANOVA table and the Regression ANOVA table is in the way the degrees of freedom are reported.

| Source of variance | Sum of Squares | Degrees of freedom (df) | Mean Square | F-ratio (F-statistic) |
| --- | --- | --- | --- | --- |
| Regression | SSReg | k | MSReg = SSReg/k | F = MSReg/MSE |
| Error | SSE | N-k-1 | MSE = SSE/(N-k-1) | |
| Total | SSTo = SSReg + SSE | N-1 | | |

In the one-way ANOVA table, the "treatment" degrees of freedom are reported as k-1, where k is the total number of groups. In the regression ANOVA table, the "Regression" source of variance is analogous to the "Treatment" source of variance in the one-way ANOVA. By convention, the degrees of freedom for the regression is equal to the number of **x variables in the model**, denoted k. Keep in mind that this definition of the degrees of freedom is one less than the total number of terms in the regression (recall there is an intercept in general regression), just as the degrees of freedom is one less than the number of groups in one-way ANOVA. It might have been better to define "k" in the regression case as the **total number of terms in the equation** - that is, the sum of the number of b coefficients and the intercept, a. Then the regression degrees of freedom could be written as k-1. This would make the analogy to the one-way ANOVA exact.

The definition of the "Error" sum of squares SSE is identical in the two methods. The error degrees of freedom is given by N-k-1, where capital N is the total number of observations. (This degrees of freedom differs by one from the one-way ANOVA case for the reason described above.) The mean squares are found by dividing the sums of squares by the degrees of freedom as in ANOVA analysis, and the F-ratio is formed and evaluated in the same way. Notice that the method of analysis is also valid for simple linear regression, where k=1, and that the regression and error degrees of freedom sum to the total degrees of freedom.

We calculate the terms in the regression ANOVA table as follows:

SSReg = SSTo-SSE

SSE = Sum[(y-yhat)^{2}]

SSTo = Sum[(y-ybar)^{2}]
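These sums of squares, and the rest of the regression ANOVA table, can be computed directly. A sketch with synthetic data (numpy assumed), which also yields the standard error of the estimate S_{e} discussed below:

```python
import numpy as np

rng = np.random.default_rng(5)
N, k = 40, 2
X = rng.normal(size=(N, k))
y = 1 + 2*X[:, 0] + 0.5*X[:, 1] + rng.normal(size=N)

A = np.column_stack([np.ones(N), X])
coef = np.linalg.lstsq(A, y, rcond=None)[0]
yhat = A @ coef

sse = np.sum((y - yhat)**2)           # SSE  = Sum[(y - yhat)^2]
ssto = np.sum((y - y.mean())**2)      # SSTo = Sum[(y - ybar)^2]
ssreg = ssto - sse                    # SSReg = SSTo - SSE

ms_reg = ssreg / k                    # regression df = k
mse = sse / (N - k - 1)               # error df = N - k - 1
F = ms_reg / mse                      # F-ratio = MSReg/MSE
rmse = np.sqrt(mse)                   # S_e, the standard error of the estimate
print(F, rmse)
```

The F value would then be compared to a critical value at the chosen significance level, exactly as in ANOVA.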

Significance of the model based on the F-ratio is determined in the same way as for ANOVA analysis. (See the lecture notes on ANOVA analysis for additional information.)

Just as in simple linear regression, we can calculate the scatter of the observed y values about the best-fit multiple regression predictions yhat. This is defined as

S_{e}= RMSE = sqrt(MSE)

**How much variance is explained?**

As in simple linear regression, we can determine the fraction of variance explained by the regression. In simple linear regression this value was called the coefficient of determination and denoted r^{2}. In multiple linear regression we calculate R^{2}, the coefficient of multiple determination, which is defined as:

R^{2} = 1-[SSE/SSTo]

This calculation is simplified by use of the sums of squares from the regression ANOVA table. This definition of the variance explained by the multiple linear regression raises a question. What if we want to compare models with different numbers of terms to decide which may be more appropriate? Recall that each time we add a new term to the equation, we decrease the residual variance and thus increase the explained variance. So comparing R^{2} values for models with different numbers of terms is a bit unfair, because the R^{2} value for the model with a greater number of terms will always be at least a bit higher. To deal with this problem, the adjusted R^{2} value was invented. This statistic is a modified version of the R^{2} value that weights the unexplained variance by the ratio of the total degrees of freedom to the degrees of freedom remaining after the model's coefficients have been fit. The adjusted R^{2} value is defined as:

Adjusted R^{2} = 1 - [(N-1)/(N-k-1)]*[SSE/SSTo]
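Both statistics can be computed side by side. A small sketch with synthetic data (numpy assumed), comparing a one-variable model to the same model with a pure-noise variable added:

```python
import numpy as np

def r2_stats(y, yhat, k):
    """Return (R^2, adjusted R^2) for a model with k x variables."""
    N = len(y)
    sse = np.sum((y - yhat)**2)
    ssto = np.sum((y - np.mean(y))**2)
    return 1 - sse/ssto, 1 - (N - 1)/(N - k - 1) * sse/ssto

rng = np.random.default_rng(6)
N = 30
x1 = rng.normal(size=N)
y = 2 + x1 + rng.normal(size=N)

def fit(Xmat):
    # Least-squares predictions for a model with an intercept.
    A = np.column_stack([np.ones(N), Xmat])
    return A @ np.linalg.lstsq(A, y, rcond=None)[0]

r2_1, adj_1 = r2_stats(y, fit(x1[:, None]), k=1)

x2 = rng.normal(size=N)   # a pure-noise variable
r2_2, adj_2 = r2_stats(y, fit(np.column_stack([x1, x2])), k=2)

print(r2_2 >= r2_1)   # R^2 never decreases when a term is added
print(adj_1, adj_2)   # the adjustment penalizes the extra coefficient
```

The adjusted value is always below the raw R^{2}, and the gap widens as more coefficients are spent on the same N observations.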

As the number of terms in the regression increases, the term [(N-1)/(N-k-1)] becomes larger. This has the effect of increasing the measure of unexplained variance. As a result, for the adjusted R^{2} value of a model with k+1 terms to be greater than that of a model with only k terms, the amount by which the error variance decreases must be considerable. One oddity of the adjusted R^{2} value is that it can have negative values. Such a model is obviously flawed - the additional terms added to the model clearly do not help it to explain any additional variance.

Still, the concept of a negative fraction of explained variance is strange enough that other approaches to comparing models may be appropriate. The F test provides one way of doing this, since the degrees of freedom of the F statistic vary with the number of terms. Likewise, the method of orthogonal contrasts can be used to assess the relative importance of the various terms. (See the lecture notes on ANOVA analysis for additional information.) There are other approaches to this problem, but they are beyond the scope of this class.