Determining group membership
The discriminant function model
Assumptions
Finding the discriminant function(s)
Potential problems
Determining statistical significance
Many potential topics of research center on determining the group or population to which a sample belongs. For example, we may have found multiple ash layers in a sedimentary sequence and want to differentiate their respective volcanic sources. If we measured the chemical composition of the ash, this could provide the basic information needed to distinguish between the sources. Or, from a financial standpoint, we might want to distinguish growth from value stocks. Measures of variables such as the price-to-earnings ratio, share price, or book value might provide some insight by which we could differentiate these two groups.
In the simplest case, measurement of a single variable might be enough to sort samples into two groups on the basis of differences in their central tendency (Figure 1). In practice, however, this is usually not the case, because the variance of the two groups may overlap significantly due to natural variability and noise (Figure 2). The obvious next step is to measure more than one variable on the same samples in the hopes that there will be less overlap between the two groups in the plane defined by the two variables (Figure 3).
Discriminant function analysis provides a means of quantifying this visual process. In the two-group case, two or more variables are combined to form a composite variable called a discriminant function. The method can be generalized to discriminate among more than two groups, which requires that more than one discriminant function be generated. In practice, for k groups, k - 1 discriminant functions are needed to differentiate the groups. Discriminant function analysis can be thought of as a rotation of axes to find the orientation that best separates the groups.
The discriminant function problem can be written in matrix format as:
D = Cov(X)*W
where D represents the vector of mean differences between the centroids (multivariate centers) of groups 1 and 2 measured on k variables, Cov(X) is the pooled covariance matrix for the variables in groups 1 and 2 (with dimensions k x k), and W represents a vector of k discriminant weights. The relationship can be solved for W by least squares. This amounts to minimizing the variance within each group while maximizing the distance that separates the centroids of the two groups.
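The solution for the weights can be sketched numerically. The following is a minimal sketch using synthetic two-group data (all means, scales, and sample sizes below are assumed for illustration); it forms the pooled covariance matrix and solves Cov(X)*W = D for W:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two illustrative groups measured on k = 2 variables (synthetic data)
g1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(30, 2))
g2 = rng.normal(loc=[2.0, 1.0], scale=1.0, size=(30, 2))
n1, n2 = len(g1), len(g2)

# Pooled covariance matrix Cov(X) of the two groups
pooled_cov = ((n1 - 1) * np.cov(g1, rowvar=False)
              + (n2 - 1) * np.cov(g2, rowvar=False)) / (n1 + n2 - 2)

# D: vector of mean differences between the two centroids
d = g1.mean(axis=0) - g2.mean(axis=0)

# Solve Cov(X) @ W = D for the discriminant weights W
w = np.linalg.solve(pooled_cov, d)
```

In practice the system is solved directly (here by `np.linalg.solve`) rather than by iteration, since Cov(X) is a small k x k matrix.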
Although the choice of symbols is not as standardized as for other methods, the general form of the solution to the discriminant function problem is:
ds = wo + w1x1 + w2x2 + w3x3 + ... + wkxk
where ds is a discriminant score for each sample (sometimes denoted with an R or L), and w1 through wk (sometimes written as b's or lambdas) represent the discriminant weights for variables x1 through xk (sometimes written as psi). Despite the variation in symbols, this relationship should look very much like the generalized multiple linear regression equation!
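In code, a discriminant score is just this weighted sum. The intercept, weights, and measurements below are hypothetical values chosen only to illustrate the arithmetic:

```python
import numpy as np

# Hypothetical intercept (wo) and weights w1..w3 for k = 3 variables
w0 = -1.5
w = np.array([0.8, -0.3, 1.2])

# Hypothetical measurements x1..x3 for one sample
x = np.array([2.0, 1.0, 0.5])

# Discriminant score: ds = wo + w1*x1 + w2*x2 + w3*x3
ds = w0 + w @ x   # -1.5 + 1.6 - 0.3 + 0.6 = 0.4
```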
Like regression, the model can be generated either with or without an intercept. Discriminant function analysis differs from regression analysis in one important respect. In regression, we seek the coefficients that best relate a single y variable to one or more x variables. In discriminant function analysis, we seek the coefficients that best define the average distance between the multivariate means of two or more groups on the basis of two or more x variables. For this reason, Kachigan describes discriminant function analysis as a regression problem with continuous x variables and a categorical y variable.
Once we have established a discriminant function, we can use it to sort new samples into the appropriate groups. This is done by comparing the discriminant score (ds) for each sample against a critical discriminant score. The centroids of the two groups have discriminant scores ds1 and ds2, found by substituting the vector of mean values for group 1 and group 2 into the discriminant function. The distance between the two centroids is known as the Mahalanobis distance, or the standardized squared distance between the groups. The critical discriminant score (dso) is defined as the midpoint between the centroids of the two groups, and is found by substituting the mean of the group means into the discriminant function.
Sorting the data on the basis of the discriminant scores is relatively easy. Samples with discriminant scores greater than dso belong to one group; samples with scores smaller than dso belong to the other. In practice, some fraction of samples tends to be misclassified due to overlap in the distributions of the two or more groups.
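The whole classification procedure can be sketched end to end. The sketch below uses synthetic two-group data (group means and sample sizes are assumed), fits the weights, computes the centroid scores ds1 and ds2, the Mahalanobis distance, and the critical score dso, then tallies the misclassification rate:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic two-group data on k = 2 variables (assumed for illustration)
g1 = rng.normal([0.0, 0.0], 1.0, size=(40, 2))
g2 = rng.normal([2.5, 1.5], 1.0, size=(40, 2))
n1, n2 = len(g1), len(g2)

# Pooled covariance and discriminant weights (no intercept in this sketch)
pooled = ((n1 - 1) * np.cov(g1, rowvar=False)
          + (n2 - 1) * np.cov(g2, rowvar=False)) / (n1 + n2 - 2)
m1, m2 = g1.mean(axis=0), g2.mean(axis=0)
w = np.linalg.solve(pooled, m1 - m2)

# Discriminant scores of the two centroids
ds1, ds2 = m1 @ w, m2 @ w

# Mahalanobis (standardized squared) distance between the groups:
# (m1 - m2)' Cov^-1 (m1 - m2), which equals (m1 - m2) @ w here
d2 = (m1 - m2) @ w

# Critical score dso: score of the mean of the group means,
# i.e. the midpoint between ds1 and ds2
dso = (ds1 + ds2) / 2.0

# Classify: scores above dso fall with group 1, below with group 2
scores = np.vstack([g1, g2]) @ w
pred = np.where(scores > dso, 1, 2)
true = np.array([1] * n1 + [2] * n2)
misclassified = np.mean(pred != true)
```

Because the weights point from centroid 2 toward centroid 1, ds1 always exceeds ds2, and the misclassification rate reflects how much the two clouds overlap.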
There are several assumptions that must be met for the discriminant function analysis to be valid:
1. The random x variables must be normally distributed for each group.
2. The prior probability of an unknown sample belonging to any group is equal.
3. The variances and covariances of the variables in each group must be equal.
4. None of the samples used to calculate the function were misclassified.
As with multiple linear regression, the discriminant functions can be found using forced or stepwise procedures (forward, backward, or all-function methods). See the multiple linear regression notes for additional detail.
The discriminant function problem suffers from drawbacks similar to those of multiple linear regression: instability of the discriminant weights and collinearity (see the multiple regression problems). Despite this, it does not appear to be customary to calculate confidence intervals for the discriminant weights or discriminant scores, nor to calculate measures of error such as the RMSE. The errors in the model are generally reported as percentages of misclassifications, that is, members of group one assigned to group two and vice versa.
The best way to avoid potential problems with the method is to:
(1) Be aware of the degree of correlation between the x variables. Avoid using variables that are obviously highly correlated.
(2) Use a stepwise procedure for model development.
(3) Always test the model using independent data to determine
if the terms in the
equation are valid. This is often referred to as a cal-val
strategy. Use one data set for calibration, and a second for validation.
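One way to implement a cal-val strategy is to split the data before fitting. The sketch below uses scikit-learn's LinearDiscriminantAnalysis on synthetic data (group means and sample sizes are assumed); the calibration subset fits the function and the held-out subset estimates the classification performance:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

# Synthetic two-group data (assumed for illustration)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(50, 2)),
               rng.normal([2.5, 1.5], 1.0, size=(50, 2))])
y = np.array([1] * 50 + [2] * 50)

# Cal-val split: one subset calibrates the function, one validates it
X_cal, X_val, y_cal, y_val = train_test_split(X, y, test_size=0.3,
                                              random_state=0)

lda = LinearDiscriminantAnalysis().fit(X_cal, y_cal)

# Fraction of independent (validation) samples correctly classified
val_accuracy = lda.score(X_val, y_val)
```

Reporting `1 - val_accuracy` on the validation set gives the misclassification percentage the notes describe, estimated on data the model never saw.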
A third problem, inherited from the least squares procedure underlying multiple linear regression, is that as we add each additional term to the equation, the remaining residual variance must always decrease. Some of this decrease reflects a true relationship, but some of it is due to random chance. Our best defense against this problem is to use the approaches described in (2) and (3) above.
The greater the distance between the multivariate means of the groups, the greater the potential significance of the discriminant function. Significance of the model is based on an F-ratio in the same way as for ANOVA or regression analysis (see the lecture notes on ANOVA and Multiple Regression Analysis for additional information). The numerator of the F-ratio has k degrees of freedom, where k is the number of variables, while the denominator has (N - k - 1) degrees of freedom, where N = n1 + n2 is the combined number of samples in both groups.
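Under the equal-covariance assumptions above, one standard route to this F-ratio for two groups goes through Hotelling's T-squared, which is built from the Mahalanobis distance between the centroids. The values of k, n1, n2, and D2 below are assumed for illustration:

```python
from scipy import stats

# Assumed setup: k variables, two groups of n1 and n2 samples, and the
# squared Mahalanobis distance D2 between their centroids
k, n1, n2 = 3, 25, 30
N = n1 + n2
D2 = 1.8

# Hotelling's T-squared for two groups, converted to an F statistic
# (one standard formulation, shown here as a hedged sketch)
T2 = (n1 * n2 / N) * D2
F = (N - k - 1) / (k * (N - 2)) * T2

# p-value from the F distribution with k and (N - k - 1) degrees of freedom,
# matching the degrees of freedom quoted above
p = stats.f.sf(F, k, N - k - 1)
```

A small p-value indicates the centroids are farther apart than chance alone would produce, i.e. the discriminant function is statistically significant.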