6. Multiple Linear Regression

Prediction Revisited

Earlier, we considered the predictive properties of the linear regression model: y = a + bx. Linear regression provides an excellent tool for comparison of one independent variable (x) with one dependent variable (y).

Unfortunately, many problems that we want to explore are more complicated than this simple, two-component system.  As an example, consider the variables that influence crop yield.  These include weather variables (air temperature, total degree days, precipitation), soil variables (soil moisture, soil temperature, soil nutrients), and pest concentration, just to name a few.  In many cases, there are enough processes that contribute significantly to the variance of the dependent variable that simple linear regression will not provide a useful predictive tool.  Alternatively, you may be interested in exploring a system and may have identified a number of potentially influential factors (x) that may influence y, but may not know the relative importance of each factor. How would you assess the relative importance of each of the independent x variables on the dependent y variable of interest?

Multiple linear regression provides a means of approaching these problems.  This extension of linear regression allows us to compare many x (independent) variables, with one y (dependent) variable.  The advantage of this approach is that it often provides better predictive capability than simple linear regression. The method also provides a crude estimate of the relative importance of each variable in the relationship.  While this method has many powerful advantages over simple linear regression, there are some potentially serious disadvantages that require careful attention if the user is to avoid potential misinterpretation.

The Multiple Linear Regression Model

If you recall that the general form of the simple linear regression model is:

y = a + bx.

where a and b are coefficients referred to as the intercept and slope, then it should not be too difficult to see that the general form of the multiple linear regression is:

y = a + b1x1 + b2x2 + b3x3 + ... + bkxk

This method is sometimes referred to as "multiple linear regression with non-zero intercept". An example of a multiple linear regression of this form is:

y = 5 + 15.2x1 + 17.7x2 + 46.9x3

Each independent variable in the regression model has its own "slope" (b) relative to the dependent variable for the given set of independent variables.  These b values are referred to as "regression coefficients" or "partial regression coefficients". The values of b1 through bk are found by least squares, in a method similar to the way the slope, b, is found in simple linear regression. We will not delve into the specific form of this calculation here, however. The intercept of the model, a, is sometimes written as b0.  This allows the model to be written as:

y = Sum[bixi]     (for i = 0 to k, where x0 = 1)
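
As a sketch of how the coefficients might be found by least squares, here is a minimal numpy example; the variable names and data values are made up for illustration:

```python
import numpy as np

# Hypothetical example data: two independent variables (x1, x2) and one
# dependent variable y.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([3.1, 4.9, 7.2, 8.8, 11.1, 12.9])

# Build the design matrix with a leading column of ones so the intercept
# a (= b0) is estimated along with the partial regression coefficients.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares solution for [b0, b1, b2]
b, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ b   # predicted y values
```

The leading column of ones is what makes this a "multiple linear regression with non-zero intercept": b0 plays the role of a.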

Partial regression coefficients

It is important to keep in mind that the partial regression coefficient for each x term is influenced by the presence of the partial regression coefficients for each of the other x variables in the equation. (Hence the term partial regression coefficient, since its influence on y is intimately related to the full set of regression coefficients.) The degree of influence the independent variables hold over each other depends on their level of correlation. Recall that the correlation between two variables can be related to the amount of shared variance they contain. Because of this, if one of the variables is removed from the equation, the b values for the remaining terms must change.  Think of this as a re-adjustment of how the variance of the x variables is related to the variance of the y variable by the multiple linear regression.  This is an important point that we will return to in the section discussing weaknesses of the method.

The mean and standard deviation of the x and y variables in generalized multiple linear regression are not removed before the analysis is conducted. This allows us to predict y values on the basis of the input x values. But because of this, the magnitudes of the b values will differ greatly.  This makes it impossible to compare the relative importance of the partial regression coefficients in the equation by inspection alone.  For example, variable x1, with a mean and variance that is much greater than the mean and variance in y, may have a very large b value, but could contribute less predictive power to the equation than variable x2 with mean and variance similar to y and a much smaller b value. Keep in mind also that the size of each partial regression coefficient will vary depending on which other terms are included in the model!

Relative importance of partial regression coefficients

There are at least two ways to deal with this difficulty. The first is to multiply each partial regression coefficient by the ratio of the standard deviation of its x variable to the standard deviation of y:

Bk = bk(sk/sy)

In the Davis text, the symbol Bk is used to denote the partial regression coefficients standardized in this way.
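
As a minimal numeric sketch of this standardization (the coefficients and standard deviations below are invented for illustration):

```python
import numpy as np

# Illustrative (made-up) partial regression coefficients and standard
# deviations for two x variables and for y.
b   = np.array([15.2, 17.7])   # raw partial regression coefficients
s_x = np.array([0.5, 0.1])     # standard deviations of x1, x2
s_y = 2.0                      # standard deviation of y

# Standardized coefficients Bk = bk * (sk / sy); these can be compared
# directly to judge relative importance.
B = b * (s_x / s_y)
# B is approximately [3.8, 0.885]: x1 carries far more weight even
# though the raw b values are similar.
```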

Alternatively, we can standardize the x and y matrices themselves by converting them to standardized z-scores as we did in the case of standardized linear regression.  The equation for this z-score based multiple linear regression approach is given by:

Zy = beta1Zx1 + beta2Zx2 + beta3Zx3 + ... + betakZxk

Kachigan uses the word "beta" to refer to the standardized partial regression coefficients in this special case. (Don't confuse this "beta" with the true population partial regression coefficients, which are denoted by the Greek letter β.) As the z-scores have zero mean, the y-intercept of the model is the origin (i.e. a = 0).  Because of this feature, this form of regression is a special case of "multiple regression through the origin". (Note: Keep in mind that it is possible to conduct a "multiple regression through the origin" in which the mean is removed but the standard deviations are not scaled to 1 as in the z-score regression above. Thus not all cases of "multiple regression through the origin" are z-score multiple regressions.)

The big advantage of the z-score multiple linear regression method over generalized multiple linear regression is that we can compare the beta coefficients directly to assess their relative importance.  If we return to the example equation that we presented earlier:

y = 5 + 15.2x1 + 17.7x2 + 46.9x3

We could write this equation in terms of a z-score regression as:

Zy = 0.44Zx1 + 0.09Zx2 + 0.27Zx3

because the coefficients have been scaled by the ratio of standard deviations for each independent x variable relative to the dependent y variable. Notice that there is no intercept. The magnitude of the beta values indicates the relative importance of each variable in the equation.  Thus x1 has the greatest influence, x3 the second greatest, and x2 the least. The ratio of the squares of the beta terms gives a measure of the variance explained by one variable relative to another. The z-score regression model defines the relationship between multiple linear correlation analysis and multiple linear regression. It should be clear that the beta values represent the partial correlation coefficients, just as the slope in standardized simple linear regression is equal to the correlation coefficient. The betak values are identical to the partial correlation coefficients of multiple linear correlation analysis.  Soon we'll explore how multiple regression can be related to ANOVA.
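
A small sketch showing that the two standardization routes agree, using made-up data (z-scores computed with population-style standard deviations):

```python
import numpy as np

# Hypothetical data for two x variables and y.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y  = np.array([3.1, 4.9, 7.2, 8.8, 11.1, 12.9])

def zscore(v):
    # Remove the mean and scale to unit standard deviation.
    return (v - v.mean()) / v.std()

Zx = np.column_stack([zscore(x1), zscore(x2)])
Zy = zscore(y)

# No intercept column: the z-score regression passes through the origin.
beta, *_ = np.linalg.lstsq(Zx, Zy, rcond=None)

# The same betas follow from the raw fit via Bk = bk * (sk / sy).
X = np.column_stack([np.ones_like(x1), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
B = b[1:] * np.array([x1.std(), x2.std()]) / y.std()
```

Either way, the resulting beta values can be compared directly to rank the variables.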

Assumptions

There are several assumptions that must be met for the multiple linear regression to be valid:

1. The random x variables must be normally distributed and linearly related to y.

2. The x values (independent variables) must be free of error.

3. The variance of y (the dependent variable) as a function of the x variables must be constant. This is referred to as homoscedasticity.

4. The x variables must be relatively uncorrelated.

Generating a multiple regression

There are four possible strategies for determining which of the x variables to include in the regression model, although some of these methods perform much better than others.

The first strategy is to form a forced equation which includes all of the x terms.  This method is fatally flawed for reasons described in the potential problems section and should never be used.

The forward regression model starts by regressing y against the x variable with the greatest correlation to y, to determine a and b.  Then the x variable that explains the largest fraction of residual variance in y is added to the regression, and new partial regression coefficients for the x's are determined by least squares.  If the magnitude of the residual variance in y decreases significantly, and the regression coefficients for both x's are still significant (as discussed below), then both terms are retained in the equation and the procedure is repeated for each additional x variable in the order in which they explain the remaining variance. The procedure stops when the addition of another term does not explain a significant amount of variance. The problem with this approach is that the final regression equation may depend on the order in which the x variables are entered into the equation.
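
The forward procedure can be sketched as follows. The data are synthetic, and a simple threshold on the drop in residual sum of squares stands in for a formal significance test:

```python
import numpy as np

# Synthetic data: 4 candidate x variables, of which only the first two
# actually contribute to y.
rng = np.random.default_rng(1)
n, p = 50, 4
Xc = rng.normal(size=(n, p))
y = 1.0 + 2.0 * Xc[:, 0] + 0.5 * Xc[:, 1] + rng.normal(scale=0.3, size=n)

def sse_for(cols):
    # Residual sum of squares for a model with an intercept plus
    # the chosen x columns.
    X = np.column_stack([np.ones(n)] + [Xc[:, j] for j in cols])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ b) ** 2)

selected = []
current_sse = np.sum((y - y.mean()) ** 2)   # intercept-only model
while len(selected) < p:
    remaining = [j for j in range(p) if j not in selected]
    trials = {j: sse_for(selected + [j]) for j in remaining}
    best_j = min(trials, key=trials.get)
    # Stop if the best candidate reduces SSE by less than 5 percent
    # (a crude stand-in for the significance test described above).
    if current_sse - trials[best_j] < 0.05 * current_sse:
        break
    selected.append(best_j)
    current_sse = trials[best_j]
```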

The backward regression model suffers from the same problem.  In this procedure, the starting point is the forced regression in which all of the x variables are included, then at each step, the variable that explains the least amount of y variability is removed, until the point at which removing an additional term would result in a significant decrease in the amount of variance explained by the model.

The stepwise or all regression method is a much better procedure that takes full advantage of today's fast computers.  In this method the computer generates all of the possible regression equations for a given set of x variables and selects the regression that explains the greatest fraction of y-variance with significant partial correlation coefficients.
The number of possible equations depends on the number of x variables, k, and is exactly 2^k. The number is a power of two because inclusion or exclusion of each potential term in the equation is a binary choice (i.e. term one is either in or not in the equation). For example, if there are three variables, then the total number of possible equations is given by 2 x 2 x 2 = 8 possible regressions.  When using this method the computer generates all of the possible regressions, then reports the best one-term regression, the best two-term regression, etc. The final regression reported is the model that explains the largest fraction of variance in y with the smallest number of significant terms.
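
The exhaustive enumeration can be sketched as follows, with three made-up x variables (2^3 = 8 candidate models, counting the intercept-only one):

```python
import itertools
import numpy as np

# Synthetic data: y depends on variables 0 and 2 only.
rng = np.random.default_rng(0)
n = 30
Xc = rng.normal(size=(n, 3))
y = 2.0 + 1.5 * Xc[:, 0] - 0.8 * Xc[:, 2] + rng.normal(scale=0.1, size=n)

results = []
for k in range(4):
    for subset in itertools.combinations(range(3), k):
        # Design matrix: intercept plus the chosen x columns.
        X = np.column_stack([np.ones(n)] + [Xc[:, j] for j in subset])
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        sse = np.sum((y - X @ b) ** 2)
        results.append((subset, sse))

# Report the best (lowest-SSE) model of each size.
best_by_size = {}
for subset, sse in results:
    size = len(subset)
    if size not in best_by_size or sse < best_by_size[size][1]:
        best_by_size[size] = (subset, sse)
```

Note that the best model of each size always has a residual sum of squares no larger than the best smaller model, which is exactly why a significance criterion is still needed to choose among them.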

Potential Problems

We saw earlier that the greatest difficulty with the simple linear regression model was the fact that the assumption that x was free of error leads to considerable uncertainty in the regression coefficients a and b.  We can assess the magnitude of this problem by determining the regression effect of the model, and by calculating a confidence interval for the slope b. In the case of multiple linear regression we can deal with this problem by calculating confidence intervals for each partial regression coefficient in the model.

There is however, a more dangerous problem that arises in multiple linear regression.  This problem is referred to as collinearity or multicollinearity. The magnitude of this problem depends on the degree of intercorrelation between the input variables in the regression. We have already touched on this problem above.  The danger of including several terms in the equation that are well correlated is that we will effectively be using the same x variance more than once to explain variance in y.  The best way to avoid this potential problem is to:

(1) Be aware of the degree of correlation between the x variables. Avoid using variables that are obviously highly correlated. (Later in the course, we'll learn a method to transform a matrix of correlated variables into a matrix of orthogonal or uncorrelated variables as a way to avoid this problem.)

(2) Always test the significance level of each regression term by evaluating its confidence interval.

(3) Use a stepwise procedure for model development.

(4) Always test a regression model using independent data to determine if the terms in the equation are valid. This is often referred to as a cal-val strategy in which the data set is split into two groups. Use one data set for calibration, and a second for validation. The larger the data set, the more sophisticated the cal-val approach that can be used. For example, we might repeatedly sub-sample the data set and generate many sets of beta coefficients to determine the uncertainty in beta.
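
A minimal cal-val sketch on synthetic data might look like this:

```python
import numpy as np

# Synthetic (X, y) data set with one x variable plus an intercept column.
rng = np.random.default_rng(42)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = 1.0 + 2.0 * X[:, 1] + rng.normal(scale=0.2, size=n)

# Randomly split the observations into calibration and validation halves.
idx = rng.permutation(n)
cal, val = idx[: n // 2], idx[n // 2:]

# Calibrate on one half...
b, *_ = np.linalg.lstsq(X[cal], y[cal], rcond=None)

# ...then evaluate prediction error on the held-out half.
resid = y[val] - X[val] @ b
rmse_val = np.sqrt(np.mean(resid ** 2))
```

A validation RMSE much larger than the calibration RMSE is a warning that the model has been over-fit to the calibration data.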

A third problem inherent in multiple linear regression arises from a weakness of the least squares procedure. As we add each additional term to the equation, the remaining residual variance in y must always decrease, because we minimize a squared quantity (least squares, remember!), which can never be negative.  Some of this decrease is due to a true relationship, but some of it is due to random chance. Our best defense against this problem is to use the approaches described in 2 and 3 above.

Evaluating the Model Fit

It turns out that regression analysis can be related to ANOVA analysis because both of these tools are least squares methods. This works in the case for simple linear regression as well as for multiple linear regression. We can thus partition the variance in the regression model by means of sums of squares in the same way that we can for ANOVA. The results are tabulated in a regression ANOVA Table. The only difference between the one-way ANOVA table and the Regression ANOVA table is in the way the degrees of freedom are reported.

The Regression ANOVA Table

Source of variance   Sum of Squares      Degrees of freedom (df)   Mean Square         F-ratio (F-statistic)
Regression           SSReg               k                         MSReg = SSReg/k     F = MSReg/MSE
Error                SSE                 N-k-1                     MSE = SSE/(N-k-1)
Total                SSTo = SSReg+SSE    N-1

In the one-way ANOVA table, the "treatment" degrees of freedom are reported as k-1 where k is the total number of groups. In the regression ANOVA table, the "Regression" source of variance is analogous to the "Treatment" source of variance in the one-way ANOVA. By convention, the degrees of freedom for the regression is equal to the number of x variables in the model, denoted k. Keep in mind that this definition of the degrees of freedom is one less than the total number of terms in the regression (recall there is an intercept in general regression), just as the degrees of freedom is one less than the number of groups in one-way ANOVA. It might have been better to define "k" in the regression case as the total number of terms in the equation - that is the sum of the number of b coefficients and the intercept, a. Then the regression degrees of freedom could be written as k-1. This would make the analogy to the one-way ANOVA exact.

The definition of the "Error" sums of squares SSE is identical in the two methods. The error degrees of freedom is given by N-k-1, where capital N is the total number of observations. (This degree of freedom differs by one from the one-way ANOVA case for the reason described above.) The mean squares are found by dividing the sums of squares by the degrees of freedom as in ANOVA analysis, and the F-ratio is formed and evaluated in the same way. Notice that the method of analysis is also valid for simple linear regression, where k=1, and that the regression and error degrees of freedom sum to the total degrees of freedom.

We calculate the terms in the regression ANOVA table as follows:

SSReg = SSTo - SSE

SSE = Sum[(y - yhat)^2]

SSTo = Sum[(y - ybar)^2]
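
The sums of squares and the rest of the regression ANOVA table can be assembled as, for example (made-up data, k = 2):

```python
import numpy as np

# Hypothetical data: N = 8 observations, k = 2 x variables.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
y  = np.array([3.0, 5.1, 6.9, 9.2, 11.0, 13.1, 14.8, 17.2])

N, k = len(y), 2
X = np.column_stack([np.ones(N), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ b

# Sums of squares, as in the regression ANOVA table.
SSTo  = np.sum((y - y.mean()) ** 2)
SSE   = np.sum((y - yhat) ** 2)
SSReg = SSTo - SSE

# Mean squares and the F-ratio.
MSReg = SSReg / k
MSE   = SSE / (N - k - 1)
F     = MSReg / MSE

# Standard error of the estimate (RMSE), discussed below.
Se = np.sqrt(MSE)
```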

Significance of the model

Significance of the model based on the F-ratio is determined in the same way as for ANOVA analysis. (See the lecture notes on ANOVA analysis for additional information).

A measure of the model error

Just as in simple linear regression, we can calculate the scatter of the predictions yhat about the best fit multiple regression model.  This is defined as

Se = RMSE = sqrt(MSE)

How much variance is explained?

As in simple linear regression, we can determine the fraction of variance explained by the regression. In simple linear regression this value was called the coefficient of determination and denoted r2.  In multiple linear regression we calculate R2, the coefficient of multiple determination, which is defined as:

R2 = 1-[SSE/SSTo]

This calculation is simplified by use of the sums of squares from the regression ANOVA table. This definition of the variance explained by the multiple linear regression raises a question: what if we want to compare models with different numbers of terms to decide which may be more appropriate?  Recall that each time we add a new term to the equation, we decrease the residual variance and thus increase the explained variance.  Comparing R2 values for models with different numbers of terms is therefore a bit unfair, because the R2 value for the model with more terms will always be at least a bit higher. To deal with this problem, the adjusted R2 value was invented.  This statistic is a modified version of the R2 value that weights the unexplained variance by the ratio of the total degrees of freedom (N-1) to the error degrees of freedom (N-k-1).  The adjusted R2 value is defined as:

Adjusted R2 = 1 - [(N-1)/(N-k-1)]*[SSE/SSTo]
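
A quick numeric sketch of the two statistics (the sums of squares below are illustrative numbers only):

```python
# Illustrative values: N observations, k x variables, and the sums of
# squares that would come from the regression ANOVA table.
N, k = 20, 3
SSTo, SSE = 100.0, 40.0

# Ordinary coefficient of multiple determination.
R2 = 1 - SSE / SSTo            # 0.6

# Adjusted R2 penalizes the unexplained variance by (N-1)/(N-k-1).
adj_R2 = 1 - ((N - 1) / (N - k - 1)) * (SSE / SSTo)   # 0.525
```

The adjusted value is always less than R2 whenever k >= 1, which is exactly the penalty for extra terms described above.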

As the number of terms in the regression increases, the term [ (N-1)/(N-k-1)] becomes larger. This has the effect of increasing the measure of unexplained variance. As a result, for the adjusted R2 value of a model with k+1 terms to be greater than that of a model with only k terms, the amount by which the error variance decreases must be considerable. One oddity of the adjusted R2 value is that it can have negative values. Such a model is obviously flawed - the additional terms added to the model clearly do not help it to explain any additional variance.

Still, the concept of negative variance is strange enough that other approaches to compare between models may be appropriate. The F test provides one way of doing this since the degrees of freedom of the F statistic vary with the number of terms. Likewise, the method of orthogonal contrast can be used to assess the relative importance of the various terms. (See the lecture notes on ANOVA analysis for additional information). There are other approaches to this problem, but they are beyond the scope of this class.