5. Analysis of Variance

Introduction to Analysis of Variance (ANOVA)
Single-factor ANOVA with equal sample size, n
The ANOVA table
Evaluating the model: The F-distribution and F-test
Multiple comparisons
    Orthogonal contrasts
    Confidence intervals
    Partial F-tests
    Variance explained



Introduction to Analysis of Variance (ANOVA)

The univariate methods we have considered up to this point allow us to compare a sample mean with a hypothesized population mean, or to compare the means of two sampled populations. However, many of the situations we encounter in the real world require the ability to compare more than two populations.

For example, we may want to compare the mineral content of samples from four mines to determine if there is a difference in ore concentration. Or we may want to test three methods of water quality testing to see if they yield identical test results when used on replicate samples. Or we may wish to compare mean grain size from five sediment samples along a beach to see if they come from the same sediment source (i.e. have identical provenance). In fact, the number of groups we choose to compare is entirely general and thus limited only by the structure of the problem at hand and our computational resources.

All of these questions can be posed as a null hypothesis of the form,

                    Ho: mu1 = mu2 = ... = muk
against an alternative hypothesis,

                    Ha: at least two means differ

(Note the alternative hypothesis is sometimes stated as "at least one mean differs")

Here mu1 represents the true (population) mean for group 1, etc., and muk (read as "mu of k", not "muck") represents the population mean of group k.

To test this set of hypotheses using a series of two-sample t-tests would quickly get unwieldy due to the large number of comparisons that would have to be conducted (k groups require k(k-1)/2 pairwise tests).  Fortunately, there is an easier way to make all of these comparisons at once.  The method relies on comparing the variance within each group (or treatment) against the variance between groups. These two measures of variance provide slightly different estimates of the true population variance: the within-group estimate is unbiased, while the between-group estimate is biased whenever the group means differ.  If the level of bias is small, then it stands to reason that the two estimates will be very similar, in which case we will fail to reject the null hypothesis.  If the level of bias is large, then it is unlikely that all of the population means are identical and we reject the null hypothesis in favor of the alternative.

This is the essence of the statistical method referred to as "Analysis of Variance" or ANOVA for short. ANOVA can be used with either observational or experimental data, but is particularly well suited to experimental studies.

There are actually a number of related methods that fall under the general name of "Analysis of Variance". We will deal with the most basic of these methods which is strictly referred to as "Single-factor" or "One-way" ANOVA. Variations on this method include: randomized block design, "Two-factor" or "Two-way" ANOVA, Analysis of Covariance (ANCOVA), Multivariate Analysis of Covariance (MANCOVA), and Multivariate Analysis of Variance (MANOVA). These methods will not be discussed here but are mentioned so that you know that they exist.

In the sections that follow we will outline a formal Single-factor ANOVA procedure for testing the above hypothesis against its alternative, and for evaluating the statistical significance of the test. This will require the definition of several new statistics and the introduction of another important family of probability distribution functions, the F-distribution.
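As a concrete preview, the entire procedure can be run in a few lines of Python. The sketch below uses invented ore-concentration values for four hypothetical mines (the numbers are purely illustrative) and the f_oneway routine from scipy.stats to carry out a Single-factor ANOVA:

    # A minimal sketch of a one-way ANOVA using SciPy.
    # The data are invented values for four hypothetical mines.
    from scipy import stats

    mine1 = [12.1, 11.8, 12.5, 12.0, 11.9]
    mine2 = [12.6, 12.9, 12.4, 12.8, 12.7]
    mine3 = [11.9, 12.2, 12.0, 12.1, 11.8]
    mine4 = [12.3, 12.5, 12.2, 12.6, 12.4]

    # f_oneway returns the F-ratio and the p-value for Ho: mu1 = mu2 = mu3 = mu4
    F, p = stats.f_oneway(mine1, mine2, mine3, mine4)
    print("F =", round(F, 2), " p =", round(p, 4))

    # A small p-value (e.g. p < 0.05) leads us to reject Ho and conclude
    # that at least two of the mine means differ.

The sections below develop, step by step, exactly what this routine computes.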
 



Single-factor ANOVA with equal sample size, n

The foundation for "Single-factor" or "One-way" ANOVA as a means of differentiating between a series of group means is the belief that samples taken from the same population will have similar variance, and that the total variance for a set of samples can be partitioned. The basic assumptions of the method are thus: (1) that the data from each group are normally distributed, and (2) that the true population variance for the groups is equal. In the procedure described here the number of samples per group, n, is equal for all groups.  The procedure can, however, be generalized for groups with sample sizes that differ. That topic is beyond the scope of our presentation.

In practice, random errors will lead to group variances that differ from the true population variance by unknown amounts.  How different can the group variances be before we violate the second assumption? When working with small samples taken from a single population, it is not uncommon to observe sample variances that differ from each other by up to a factor of 3 (Devore and Peck, 1986).

The Single-factor ANOVA gets its name from the fact that one process, variable or "factor" operates on data from multiple groups. To return to our water quality example, we may have three different methods of estimating the concentration of a single pollutant in multiple water samples from a single well. This experimental design could be summarized by the following table: 

                  Method 1    Method 2    Method 3
                  x(1,1)      x(2,1)      x(3,1)
                  x(1,2)      x(2,2)      x(3,2)
                   ...         ...         ...
                  x(1,n)      x(2,n)      x(3,n)
    group means:  Xbar(1)     Xbar(2)     Xbar(3)

As stated above, n1=n2=n3 for this version of ANOVA. Note that while in our example k=3, there can be any number of groups or treatments specified. Our objective will be to partition the total variance of this data set into the part which is due to the method of treatment (and thus can be explained) and the part which arises from random sampling error and thus cannot be explained.  This is effectively a measure of the signal-to-noise ratio of the process or variables that we are studying. We will accomplish this by calculating two sums of squares which have specific degrees of freedom (df) associated with them. This should make sense, as we already know that least squares methods are BLUE (Best Linear Unbiased Estimators).  Once we have calculated the two sums of squares, we will normalize them by their respective degrees of freedom to obtain two mean squares, and then calculate a new test statistic to evaluate the significance level of the test. What we want to accomplish by doing this is to obtain two independent estimates of the true population variance, sigma^2.

The between group estimate of the true population variance

The first of these two measures is referred to as the mean square for treatments.  There are many different terms used to describe this quantity. We will denote it MSTr. This mean square is also known as the mean square for groups (denoted MSG or just "Groups"), or as the between group estimate of sigma^2 (Sigma_bg). It is defined as follows:

    MSTr = n * Sum[(Xbar(i) - Xdoublebar)^2] / (k-1),      i = 1, ..., k

To obtain this estimate, we first calculate the Sums of Squares for treatments (SSTr):

    SSTr = n * Sum[(Xbar(i) - Xdoublebar)^2],      i = 1, ..., k

which depends on the group average (Xbar(i)) for each of the k groups, and the grand average calculated over the entire dataset (Xdoublebar), as defined below:

    Xbar(i) = (1/n) * Sum over j of [x(i,j)],      j = 1, ..., n

    Xdoublebar = (1/N) * Sum over i,j of [x(i,j)] = (1/k) * Sum over i of [Xbar(i)]

We then normalize the SSTr by the (k-1) degrees of freedom for the groups to obtain the Mean Squares for Treatment (MSTr):

    MSTr = SSTr / (k-1) = n/(k-1) * Sum[(Xbar(i) - Xdoublebar)^2]

The MSTr thus carries a factor of n/(k-1). The sums of squares in the MSTr are divided by df=(k-1), as is normally done when calculating a variance estimate, but why is the result multiplied by n?  Recall from the Central Limit Theorem that the variance of the sample mean is equal to the population variance divided by the sample size n. It thus follows that we can re-arrange this to obtain an estimate of the true population variance that is scaled by n:

    Var(Xbar) = sigma^2 / n      =>      sigma^2 = n * Var(Xbar)

Hence, we must multiply by a factor of n in the definition of the MSTr to obtain the best estimate of the population variance that is available. Even so, the MSTr will not provide us with an unbiased estimate of the true population variance because it includes variance due to differences in the group or treatment means, as well as the random sampling errors incorporated in each observation.
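To make the role of the factor n concrete, here is a short numpy sketch, using invented data for k = 3 groups of n = 5 observations each, that computes the group means, the grand mean, and the MSTr exactly as defined above:

    # Sketch of the between-group estimate (MSTr) for k = 3 groups, n = 5 each.
    # The observations are invented for illustration.
    import numpy as np

    data = np.array([[4.2, 4.5, 4.1, 4.4, 4.3],    # group 1
                     [5.0, 5.2, 4.9, 5.1, 5.3],    # group 2
                     [4.6, 4.4, 4.7, 4.5, 4.8]])   # group 3
    k, n = data.shape

    group_means = data.mean(axis=1)          # Xbar(i)
    grand_mean  = data.mean()                # Xdoublebar

    SSTr = n * np.sum((group_means - grand_mean) ** 2)
    MSTr = SSTr / (k - 1)                    # n/(k-1) * Sum[(Xbar(i) - Xdoublebar)^2]
    print("SSTr =", round(SSTr, 4), " MSTr =", round(MSTr, 4))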

The within group estimate of the true population variance

The second of the two measures of the population variance is referred to as the mean square for error.  We will denote it MSE. This mean square is also known as the "within group" estimate of sigma^2 (denoted Sigma_w, or Within). The variance for each group provides an unbiased estimate of the true population variance. We can obtain a better estimate by pooling these separate estimates.  This procedure increases the available degrees of freedom of the resulting sum of squares.

By definition:

    MSE = (s(1)^2 + s(2)^2 + ... + s(k)^2) / k

where k is the total number of groups and the s(i)^2 are the sample variances for each group. Unlike the MSTr, the MSE provides an unbiased estimate of the true population variance. We will see that when the MSTr is significantly larger than the MSE, we have reason to reject the null hypothesis in favor of the alternative hypothesis (i.e. at least two of the means differ).
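Continuing with the same invented data as in the previous sketch, the within-group estimate follows directly from the group variances; with equal n, pooling reduces to averaging the k sample variances:

    # Sketch of the within-group estimate (MSE) for the same invented data.
    import numpy as np

    data = np.array([[4.2, 4.5, 4.1, 4.4, 4.3],
                     [5.0, 5.2, 4.9, 5.1, 5.3],
                     [4.6, 4.4, 4.7, 4.5, 4.8]])
    k, n = data.shape

    group_vars = data.var(axis=1, ddof=1)     # s(i)^2, unbiased sample variances
    MSE = group_vars.mean()                   # (s1^2 + ... + sk^2) / k
    print("MSE =", round(MSE, 4))

    # If the MSTr from the previous sketch is much larger than this MSE,
    # we will have grounds to reject Ho.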

In practice, the definitions above are awkward to calculate by hand. In the next section we will learn some alternative ways to calculate the MSTr and MSE that make ANOVA easier to work with.



The ANOVA table

The results of a Single-Factor ANOVA can be summarized using an Analysis of Variance (ANOVA) table.  The components of the ANOVA table are:

    Source of         df      Sum of Squares                                 Mean Square          F
    variation
    --------------------------------------------------------------------------------------------------------
    Treatments        k-1     SSTr = n*Sum[(Xbar(i) - Xdoublebar)^2]         MSTr = SSTr/(k-1)    F = MSTr/MSE
    (between groups)
    Error             N-k     SSE  = Sum[(n-1)*s(i)^2] = SSTo - SSTr         MSE  = SSE/(N-k)
    (within groups)
    Total             N-1     SSTo = Sum over i,j [(x(i,j) - Xdoublebar)^2]

where: k = number of groups; N = total number of observations; s(i)^2 = sample variance of group i; and Xdoublebar = the grand mean over all observations

It is important to point out that, in addition to the relationships listed above, SSTo = SSTr + SSE. (This can be seen by re-arranging the relationship given for SSE; note also that the degrees of freedom partition in the same way: N-1 = (N-k) + (k-1).) This relationship provides a useful check on our calculations and makes finding the SSE easier. It is also interesting to note that for the case where k=2, the square root of the MSE, denoted RMSE, is identical to the standard error of the regression statistic that we learned about as part of simple linear regression analysis.  That is why both of these statistics are also referred to as the RMSE. We shall see that this relationship can be generalized when we consider the multiple linear regression case.
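Putting the pieces together, the following sketch (again using the invented three-group data from the earlier sketches) builds every entry of the ANOVA table and confirms the additivity check SSTo = SSTr + SSE:

    # Sketch: assemble the Single-factor ANOVA table and verify SSTo = SSTr + SSE.
    import numpy as np

    data = np.array([[4.2, 4.5, 4.1, 4.4, 4.3],
                     [5.0, 5.2, 4.9, 5.1, 5.3],
                     [4.6, 4.4, 4.7, 4.5, 4.8]])
    k, n = data.shape
    N = k * n

    group_means = data.mean(axis=1)
    grand_mean  = data.mean()

    SSTr = n * np.sum((group_means - grand_mean) ** 2)
    SSE  = np.sum((data - group_means[:, None]) ** 2)
    SSTo = np.sum((data - grand_mean) ** 2)

    MSTr = SSTr / (k - 1)
    MSE  = SSE  / (N - k)
    F    = MSTr / MSE

    print("SSTr =", round(SSTr, 4), " SSE =", round(SSE, 4), " SSTo =", round(SSTo, 4))
    print("additivity check:", np.isclose(SSTo, SSTr + SSE))
    print("MSTr =", round(MSTr, 4), " MSE =", round(MSE, 4), " F =", round(F, 4))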

Some important relationships that simplify the calculations follow:

    SSTo = Sum over i,j [x(i,j)^2] - N*Xdoublebar^2

    SSTr = n * Sum over i [Xbar(i)^2] - N*Xdoublebar^2

    SSE  = SSTo - SSTr

    MSTr = SSTr/(k-1);     MSE = SSE/(N-k);     F = MSTr/MSE

Evaluating the model: The F-distribution and F-test

We stated above that if the MSTr is significantly larger than MSE, we have reason to reject the null hypothesis in favor of the alternative hypothesis, (i.e. at least two of the means differ). To do this, we will need to have a reliable test statistic, and we will need to understand the characteristics of its probability density function (pdf). The statistic we will use is referred to as F, the F-ratio, the F-statistic, or the F-value.

It is defined as:

    F = MSTr / MSE

As we have seen above, this is a ratio of variances. The numerator provides a measure of the variance of the signal we seek to resolve, while the denominator provides a measure of the variance of the error. We can thus think of this statistic as a measure of the signal-to-noise ratio of the ANOVA analysis! The ANOVA F-ratio has (k-1) degrees of freedom in the numerator and (N-k) degrees of freedom in the denominator. These are reported as df = [(k-1),(N-k)]. The F-distribution has both numerator and denominator degrees of freedom because it is based on the ratio of two sampling distributions, the sampling distribution of the MSTr and the sampling distribution of the MSE (each of which is itself a variance estimate).  Because it is possible to have many different combinations of numerator and denominator degrees of freedom, there is actually a very large family of F-distributions, and thus tables of the F-ratio are also generally very large. The exact shape of the pdf for the F-distribution varies with its degrees of freedom, but in general, the distribution can be described as skewed to the right.

What is the expected value of the F-ratio?  When the null hypothesis is true, the MSTr and MSE both estimate the true population variance, and in the limit as the numerator and denominator degrees of freedom become very large the F-ratio will converge on 1.  Thus the further the observed value of the F-ratio is from 1, the greater the probability that MSTr and MSE differ, and thus that at least two of the population means in the ANOVA test differ.  This is the basis for the F-test. The fact that the F-distributions are skewed, and thus have different areas in their upper and lower tails, does not pose a problem because the F-test is always performed as a single-sided test. (This must be the case because the F-ratio is a ratio of squared values and thus is positive definite.) The F-table lists critical F-values for various combinations of F-ratio degrees of freedom (typically at alpha = 0.10, 0.05, 0.01, and 0.001, although fewer alpha levels may be included in some tables).
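In place of a printed F-table, the critical value and the p-value can be obtained from the F-distribution in scipy.stats. The sketch below assumes the k, N, and F-ratio produced by the earlier illustrative sketches:

    # Sketch: evaluate an observed F-ratio against the F-distribution with
    # df = [(k-1), (N-k)].  The numbers shown are from the illustrative example.
    from scipy import stats

    k, N = 3, 15            # groups and total observations from the example
    F_obs = 32.67           # F-ratio from the earlier sketches (approximately)
    alpha = 0.05

    dfn, dfd = k - 1, N - k
    F_crit = stats.f.ppf(1 - alpha, dfn, dfd)   # upper-tail critical value
    p_val  = stats.f.sf(F_obs, dfn, dfd)        # upper-tail (one-sided) p-value

    print("critical F =", round(F_crit, 3), " p-value =", round(p_val, 6))
    if F_obs > F_crit:
        print("Reject Ho at alpha =", alpha)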

Summary of ANOVA and the F-test

The objective of ANOVA analysis is to evaluate a difference in means (due to differences in sample treatment) by comparison of two estimates of the population variance, the MSTr and the MSE.

This enables us to test the null hypothesis:

                    Ho: mu1 = mu2 = ... = muk

against an alternative hypothesis,

                    Ha: at least two means differ

The test statistic is the F-ratio, defined as F=MSTr/MSE which has df = (k-1),(N-k). The significance of the F-test is evaluated by comparison of the sample F-ratio against a critical F-ratio from the F-distribution with df = (k-1),(N-k). If the sample F-ratio exceeds the critical F-ratio, we reject the null hypothesis at the significance level of the tabulated critical F-ratios.



Multiple comparisons

ANOVA is a very useful method because it allows us to conduct a single test to compare the potential influence of many different, but related treatments. But how do we determine the importance of each treatment relative to the others? This topic can become very complicated, and as pointed out in the text, no single procedure has overall acceptance.

The case where Ho is not rejected

In the case where Ho is not rejected, as pointed out in Kachigan, there are three schools of thought on how to proceed:

(1) Do nothing further - because we failed to reject Ho, further tests are unwarranted.
(2) Any test of means or combinations of means is acceptable, provided that the tests were planned as part of the experimental design prior to the collection of the data.
(3) Any and all comparison of means or combinations of means that look promising should be investigated.

In the case where Ho is not rejected, the first approach is the most conservative, and should probably be observed. The danger with the second approach is that it potentially  imposes a non-random bias on the model design. Specifically, if we are not particularly careful with our choice of which test(s) to conduct or not conduct, the results will be influenced by our pre-conceived notion of how the processes we are studying may operate.  This, combined with random error, could lead to the occurrence of a Type I or Type II error during the ensuing hypothesis test.  The third approach is the least conservative. It is often referred to as a "fishing expedition" and should be avoided at all costs. The possibility of encountering a false positive (a Type I error) by random chance is great if we are conducting many comparisons.  Indeed, this can be used as an argument for conducting the ANOVA analysis in the first place! Since we were unable to reject Ho by means of ANOVA, it is probably best to reconsider the processes under study and collect additional data under a re-designed experimental procedure.

The case where Ho is rejected

The case where Ho is rejected is perhaps more troublesome. For now we have good reason to suspect a difference in means and thus a difference between groups or treatments. The problem is that the ANOVA does not tell us which means or combination of means is the source of the difference! When Ho is rejected, the first approach described above is not
appropriate, and the third approach remains dangerous.  We conclude that some type of planned comparison(s) is the appropriate choice.  The question becomes which comparisons are appropriate?

The method of orthogonal contrasts

Kachigan describes a procedure for multiple comparisons based upon finding "orthogonal" contrasts of means and combinations of means.  A set of weighting coefficients defines a contrast if the coefficients sum to zero.  Consider the following example from the text. If we wish to compare xbar1-xbar2, the weighting coefficients for the two means are 1 and -1, which sum to zero and thus form a valid contrast.  If we consider a contrast of xbar1 - (xbar2+xbar3)/2, this is equal to +1(xbar1) + -1/2(xbar2) + -1/2(xbar3). The weighting coefficients are thus +1, -1/2 and -1/2. These weights also sum to zero and thus form a valid contrast.  To determine if two different contrasts of means from the same experiment are orthogonal, add the products of their corresponding weights to see if they sum to zero.  If they do not sum to zero, then the two contrasts are not orthogonal and only one of them should be tested.

When using contrasts, it is convenient to multiply the weights by a constant factor to remove any fractions. For example, the weights +1, -1/2, and -1/2 described above can be written as +2, -1, -1 by multiplying through by 2. This simplifies things and will not influence the statistical treatment that will follow.

It can be shown that for k groups, there are k-1 orthogonal contrasts. Let's consider the case where k=4.  There are 3 orthogonal contrasts. If we first multiply the weights by an appropriate scaling factor, we can write the 3 orthogonal contrasts out in the form of a table:
 

                 xbar1   xbar2   xbar3   xbar4
    contrast 1     3      -1      -1      -1
    contrast 2     0       2      -1      -1
    contrast 3     0       0       1      -1

The first contrast compares the mean of group 1 against the average of the means of groups 2-4. The second contrast compares the mean of group 2 against the average of the means of groups 3 and 4. The final orthogonal contrast that can be made compares the mean of group 3 against the mean of group 4.

Once this is done, each mean has been effectively compared against all of the other means in the experiment.  There are no remaining independent ways of comparing the data. How can we generalize this procedure? Notice that the largest positive scaled weight for the first contrast is equal to k-1, the largest positive scaled weight for the second contrast is k-2, and the largest positive scaled weight for the final contrast is k-3. Notice that as we form a new contrast, we also omit the group with the positive scaled weight from the prior contrast. Finally, the scaled weights for the remaining means in a contrast
must all be equal to -1 to balance the positive scaled weight.  Using these simple rules, you should be able to construct a table of scaled contrast weights for any arbitrary group size k.
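The construction rules above are easy to automate. The sketch below builds the scaled weight table for an arbitrary k (here k = 4) and checks both conditions: each row of weights sums to zero, and every pair of rows has a zero sum of products (orthogonality):

    # Sketch: build k-1 scaled orthogonal contrasts following the rules above
    # and verify the contrast and orthogonality conditions.
    import numpy as np

    k = 4
    contrasts = np.zeros((k - 1, k))
    for c in range(k - 1):
        contrasts[c, c] = k - 1 - c       # largest positive scaled weight: k-1, k-2, ...
        contrasts[c, c + 1:] = -1         # remaining groups balance it with -1's

    print(contrasts)
    G = contrasts @ contrasts.T                            # pairwise sums of products
    print("row sums:", contrasts.sum(axis=1))              # all zero -> valid contrasts
    print("mutually orthogonal:",
          np.allclose(G - np.diag(np.diag(G)), 0))         # off-diagonal terms all zero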

It must also be pointed out that there are several potential alternative sets of orthogonal contrasts. However, the rules for forming other sets of contrasts are more involved.  You can think of these different sets as looking at the problem from different angles or rotations. Each set will provide the same information although interpretation of some sets may be easier than others. The orthogonal approach is thus attractive and efficient because it requires the minimum number of comparisons needed to test all of the independent information in the data set. The chance of making a Type I error during the full set of comparisons is thus greatly reduced.

It is possible to find the approximate significance level for the entire set of k-1 orthogonal contrasts. This is given by p = (1 - alpha/(k-1))^(k-1) ~ (1 - alpha). Thus if we want the significance level for the entire set to be alpha = 0.05 and k = 5, then each individual contrast must be performed at the alpha/(k-1) = 0.05/4 = 0.0125 significance level. Often, however, the individual contrasts are tested at alpha = 0.05 or some other level and the significance level for the entire set is allowed to float. This may not be such an unreasonable approach because the probability of committing a Type II error increases as we lower the alpha level of each individual test.
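As a quick check of the arithmetic, the approximation can be evaluated directly; with alpha = 0.05 and k = 5 the per-contrast level is 0.0125, and the probability of making no Type I error across the four contrasts comes out very close to 0.95:

    # Sketch: familywise significance level for a set of k-1 orthogonal contrasts.
    alpha, k = 0.05, 5
    per_contrast = alpha / (k - 1)                    # 0.0125
    p_no_error = (1 - per_contrast) ** (k - 1)        # ~ (1 - alpha)
    print(per_contrast, round(p_no_error, 4))         # 0.0125  0.9509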

The confidence interval for orthogonal contrasts

There are several ways in which the method of orthogonal contrasts can be applied. We will first consider the confidence interval approach.  We begin by forming a statistic l, that is
the sum of the weighted means:

l = Sum[w(i)*Xbar(i)];       where Sum[w(i)] = 0

If the values of all the means in the contrast were exactly identical, then l would equal zero. This criterion is met when the sample means come from the same true population and when the sample size n goes to infinity. In that limit, the sample means approach their expected values, E(Xbar(i)) = mu(i), the true population means. In such a case, the expected value of l, E(l), is zero. Thus, as the null hypothesis for our confidence interval, we can state:

Ho: E(l) = 0

which says that the expected value of l is zero if the various means in the contrast are identical. The alternative hypothesis is that two or more of the means differ. To test this, we will form a confidence interval for l and see whether this confidence interval includes zero.  If the confidence interval includes zero, we fail to reject the null hypothesis at the desired level alpha.

The appropriate confidence interval for l is written:

E(l) = l +/- tc*sl

where sl = (RMSE/sqrt(n)) * sqrt(sum(wi^2))

where:

    E(l)         = the expected value of l
    l            = sample estimate of l
    tc           = the critical t value for the appropriate alpha from the t-distribution
                   with df = N-k, as determined by the MSE from the ANOVA
    sl           = standard error of l
    RMSE         = the square root of the mean square error (MSE) from the ANOVA test
    n            = group sample size
    sum(wi^2)    = sum of squared contrast weights (unscaled)
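The confidence-interval calculation is short enough to script. The sketch below assumes the group means, MSE, n, and k from the earlier illustrative ANOVA sketches, and tests the unscaled contrast xbar1 - (xbar2 + xbar3)/2:

    # Sketch: confidence interval for an orthogonal contrast l = Sum[w(i)*Xbar(i)].
    # Group means, MSE, n, and k are taken from the earlier illustrative sketches.
    import numpy as np
    from scipy import stats

    group_means = np.array([4.30, 5.10, 4.60])   # Xbar(i) from the invented data
    MSE, n, k = 0.025, 5, 3
    N = n * k
    alpha = 0.05

    w = np.array([1.0, -0.5, -0.5])              # unscaled contrast weights, sum to zero
    l  = np.sum(w * group_means)
    sl = np.sqrt(MSE / n) * np.sqrt(np.sum(w ** 2))
    tc = stats.t.ppf(1 - alpha / 2, df=N - k)    # two-sided critical t with df = N-k

    lower, upper = l - tc * sl, l + tc * sl
    print("l =", round(l, 3), " CI = (", round(lower, 3), ",", round(upper, 3), ")")
    # If the interval does not include zero, reject Ho: E(l) = 0 at level alpha.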

The Partial F-test for orthogonal contrasts

We can approach the confidence interval method described above from a slightly different angle that will turn out to be much more informative. We will start with the confidence interval relationship described:

E(l) = l +/- tc*sl

If we let E(l) = 0, subtract l from both sides, and multiply both sides by -1, we can rearrange this as (the sign is immaterial, since we will square the result below):

l = tc*sl

Now if we divide both sides by sl, we see that:

tc = l/sl

Interestingly, this has the form of a t-test and could be used to test the significance of the orthogonal contrasts.  But we can simplify the calculation by substituting for sl, then taking the square of both sides. The relationship then becomes:

t^2 = l^2/[(MSE/n)(sum(wi^2))]

Now it turns out that the square of the t distribution is in fact an F distribution with 1 degree of freedom in the numerator, and denominator degrees of freedom equal to the degrees of freedom from the t-test.  If we replace t^2 by F and rearrange the denominator we obtain:

F = [n*l^2/(sum(wi^2))] / MSE = MSl/MSE

This statistic is referred to as a partial F statistic. The numerator, n*l^2/(sum(wi^2)), is the sum of squares for the contrast. It can be denoted SSl and has 1 degree of freedom. Notice that since SSl has only one degree of freedom, it is also equal to its mean square, MSl. The denominator is the MSE from the ANOVA test. We now see that the denominator degrees of freedom must be the df for the MSE from the ANOVA. Thus the degrees of freedom for the ANOVA partial F is df = 1,(N-k).  The statistic is referred to as a partial F because the sums of squares for the full set of orthogonal contrasts sum to SSTr, and thus the contrasts partition the treatment variation that enters the ANOVA F value, F = MSTr/MSE.  Likewise, the sum of the degrees of freedom for the contrasts is (k-1), which is equal to the degrees of freedom for the MSTr. The partial F thus provides us with a means of determining how each orthogonal contrast contributes to the overall significance of the ANOVA test.
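As a sketch of the full calculation, again with the invented three-group data from the earlier sketches, the partial F for each orthogonal contrast can be computed and the partition of SSTr verified:

    # Sketch: partial F for each orthogonal contrast, df = 1,(N-k).
    # Uses the invented three-group data from the earlier sketches.
    import numpy as np
    from scipy import stats

    data = np.array([[4.2, 4.5, 4.1, 4.4, 4.3],
                     [5.0, 5.2, 4.9, 5.1, 5.3],
                     [4.6, 4.4, 4.7, 4.5, 4.8]])
    k, n = data.shape
    N = k * n

    group_means = data.mean(axis=1)
    grand_mean  = data.mean()
    SSTr = n * np.sum((group_means - grand_mean) ** 2)
    MSE  = data.var(axis=1, ddof=1).mean()

    # Scaled orthogonal contrasts for k = 3: (2, -1, -1) and (0, 1, -1)
    contrasts = np.array([[2.0, -1.0, -1.0],
                          [0.0,  1.0, -1.0]])

    SSl_total = 0.0
    for w in contrasts:
        l   = np.sum(w * group_means)
        SSl = n * l ** 2 / np.sum(w ** 2)        # 1 df, so SSl = MSl
        F   = SSl / MSE                          # partial F with df = 1,(N-k)
        p   = stats.f.sf(F, 1, N - k)
        SSl_total += SSl
        print("contrast", w, " partial F =", round(F, 2), " p =", round(p, 4))

    print("Sum of SSl =", round(SSl_total, 4), " SSTr =", round(SSTr, 4))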

Variance explained by orthogonal contrasts

From the previous discussion, it should not be too difficult to surmise that if we can partition the significance associated with each contrast, then we can also partition the variance associated with each contrast. This is done by substituting the total sums of squares, SSTo, for the MSE in the F statistic above. The resulting ratio of sums of squares is:

r^2 = [n*l^2/(sum(wi^2))] / SSTo = SSl/SSTo

Note that in this context, we write SSl, rather than MSl because we are interested in comparing sums of squares, not mean squares. The statistic r^2, referred to as the coefficient of determination, is in fact equal to the square of the correlation coefficient that we discussed in correlation analysis!
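A final sketch, again using the invented three-group data, computes the variance explained by each of the orthogonal contrasts:

    # Sketch: proportion of total variance explained by each orthogonal contrast.
    import numpy as np

    data = np.array([[4.2, 4.5, 4.1, 4.4, 4.3],
                     [5.0, 5.2, 4.9, 5.1, 5.3],
                     [4.6, 4.4, 4.7, 4.5, 4.8]])
    k, n = data.shape

    group_means = data.mean(axis=1)
    SSTo = np.sum((data - data.mean()) ** 2)

    contrasts = np.array([[2.0, -1.0, -1.0],
                          [0.0,  1.0, -1.0]])
    for w in contrasts:
        l   = np.sum(w * group_means)
        SSl = n * l ** 2 / np.sum(w ** 2)
        print("contrast", w, " r^2 =", round(SSl / SSTo, 3))

    # Together the k-1 orthogonal contrasts account for SSTr/SSTo of the total variance.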