8. Factor analysis
 
Introduction to Factor Analysis
    R-mode Factor Analysis
    Q-mode Factor Analysis
Assumptions and Limitations




Introduction to Factor Analysis

Factor analysis is a name commonly used to describe a powerful and broad class of matrix-based methods of data analysis. The factor analysis methods employed by geologists were developed by experimental psychologists in the 1930s and are now used by researchers in numerous fields (Davis, "Statistics and Data Analysis in Geology", 1982). The methods are based on manipulation of the eigenvalues and eigenvectors of a data matrix. These tools are extremely versatile and can be used to (1) extract signal from noisy data sets, (2) simplify large data sets, and (3) group variables or samples that share common variance or information.

Familiarity with these methods is important because they are used by researchers in fields as varied as economics, psychology, engineering, geology, and oceanography. Unfortunately, because the methods are used in such a wide variety of fields, the terminology surrounding them can be obscure and confusing. This is compounded by the fact that a deep understanding of the methods requires knowledge of linear algebra, a branch of mathematics that is often overlooked in our educational system. Thus the relationships between the various methods are sometimes not clearly understood, and individuals unfamiliar with the techniques can be skeptical of the results. (A brief digression: I urge anyone interested in working in a technical or scientific field to learn more about linear algebra!)

We will discuss two basic types of factor analysis: variable-based and sample-based factor analysis. Both methods rely on a fundamental theorem of linear algebra, the Eckart-Young theorem. Specifically, the Eckart-Young theorem states that for any real n x m matrix X there exist an orthogonal n x r matrix V and an orthogonal m x r matrix U for which the product V'XU is a real diagonal r x r matrix L containing the singular values of X:

L = V'XU

The index r is the rank of the data matrix X. The significance of the V matrix is that it is composed of the principal eigenvectors of the major product matrix XX', while the U matrix is composed of the principal eigenvectors of the minor product matrix X'X. This terminology will take on greater significance shortly. The two matrices share a common set of eigenvalues by which they can be related to each other. If we rearrange the matrices to solve for X, we see that:

X = VLU'

Thus the Eckart-Young theorem provides a means of decomposing a large data matrix into three smaller matrices that describe the fundamental structure of the data set. Because the matrix L contains the singular values of X (the square roots of the eigenvalues), this method is also known as singular value decomposition.
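
As a concrete illustration, here is a minimal numpy sketch (with a small hypothetical data matrix) that performs the decomposition and verifies both the reconstruction X = VLU' and the link between the singular values and the eigenvalues of the minor product matrix. The matrix names follow the notation above; note that numpy returns the factors in the order V, the diagonal of L, and U'.

import numpy as np

# A small hypothetical data matrix: n = 6 samples, m = 3 variables
X = np.array([[2.1, 0.5, 1.3],
              [1.9, 0.7, 1.1],
              [0.4, 2.2, 0.9],
              [0.6, 2.0, 1.0],
              [1.0, 1.0, 2.5],
              [1.2, 0.9, 2.4]])

# Singular value decomposition: X = V L U'
V, singular_values, Ut = np.linalg.svd(X, full_matrices=False)
L = np.diag(singular_values)

# The three smaller matrices reproduce the original data matrix
print(np.allclose(X, V @ L @ Ut))               # True

# The squared singular values equal the eigenvalues of the minor product matrix X'X
print(singular_values**2)
print(np.linalg.eigvalsh(X.T @ X)[::-1])        # same values, sorted in descending order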



R-mode Factor Analysis

Variable-based factor analysis is referred to as R-mode factor analysis. Principal Component Analysis (PCA), which is also known as Empirical Orthogonal Function (EOF) analysis, is a closely related method, although the data are generally normalized in a slightly different way. The objective of R-mode factor analysis is to simplify a matrix of variables by forming a smaller number of composite variables, p, that are linear combinations of the original m variables in the data set. The method seeks to preserve the variance of the original variables in the new linear combinations (the R-mode factors), and thus to explain the maximum amount of variance in the data set.

R-mode factor analysis is often used when one variable is measured at multiple temporal points along one or more axes. These axes could be a physical dimension such as location, or some other physical property such as wavelength. For example, a researcher may want to compare sea surface temperature at different places in the ocean over a period of years to determine which processes generate certain spatial and temporal patterns. Or one may want to explore product sales in different regions and at different times to determine regional sales patterns. Information of this type could then be used to help refine distribution and production plans.

The R-mode terminology arises from the fact that the first important step in the method is to transform the n sample by m variable data matrix X into the associated variable covariance or correlation matrix, R. This is done by pre-multiplying the data matrix X by its transpose:

R = X'X.

The resulting minor-product matrix is a square (m x m) matrix whose dimensions are determined by the number of variables (m) in the data set. The Eckart-Young theorem can then be used to determine the eigenvectors of R (the columns of U) and the singular values of X (the square roots of the eigenvalues of R) that make up L. From these matrices it is possible to calculate the factor loadings, the new linear combinations of variables that are assumed to represent the underlying processes in the data. The R-mode factor loading matrix, Ar, is defined as the product of the eigenvectors of R and the singular values:

Ar = UL

The individual elements of the factor loading matrix range from -1 to 1. The sums of squares of each column in Ar can be normalized to define the percentage of variance explained by that factor. Adding these percentages together yields the cumulative variance explained by the model.
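
In practice these quantities are easy to compute. The minimal sketch below (the function name rmode_loadings is hypothetical) assumes the columns of X have been standardized to zero mean and unit variance, so that R is the correlation matrix; under that convention each loading is the correlation between a variable and a factor and lies between -1 and 1, and the eigenvalues convert directly into percentages of variance.

import numpy as np

def rmode_loadings(X):
    """R-mode loadings and percent variance from an n x m standardized data matrix X."""
    n, m = X.shape
    R = (X.T @ X) / (n - 1)                      # m x m correlation matrix
    eigvals, U = np.linalg.eigh(R)               # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]            # re-sort so factor 1 explains the most variance
    eigvals, U = np.clip(eigvals[order], 0, None), U[:, order]
    Ar = U * np.sqrt(eigvals)                    # loading matrix Ar = UL (m x m)
    pct_var = 100 * eigvals / eigvals.sum()      # percent of total variance per factor
    return Ar, pct_var

# Hypothetical example: 50 samples of 4 standardized variables
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
Ar, pct_var = rmode_loadings(X)
print(np.round(pct_var, 1), np.round(np.cumsum(pct_var), 1))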

To determine how much variance each factor contributes to each sample, we can project the factor loadings back onto the data matrix to determine the R-mode factor scores:

Sr = XAr

The factor scores range from -1 to 1. If we calculate the sums of squares for each row (sample) of the factor score matrix, we obtain the sample communality. Communality ranges from 0 to 1.0 and is a linear measure of the amount of information in each sample that can be explained by the factor model. This measure is analogous to the coefficient of determination (r2) statistic that we learned about in regression and ANOVA.
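
The sketch below makes the projection and communality calculations concrete. It projects the samples onto the first p eigenvectors (the scores Sr = XAr defined above differ only in that each column is additionally scaled by its singular value) and then computes, for each sample, the fraction of its sum of squares reproduced by the p-factor model; this fraction is one concrete way to obtain the sample communality. The function name sample_communality is hypothetical.

import numpy as np

def sample_communality(X, p):
    """Fraction of each sample's sum of squares explained by a p-factor model (0 to 1)."""
    _, _, Ut = np.linalg.svd(X, full_matrices=False)
    scores = X @ Ut[:p].T                        # n x p projections of the samples onto the factors
    X_p = scores @ Ut[:p]                        # reconstruction of X using only the first p factors
    return (X_p**2).sum(axis=1) / (X**2).sum(axis=1)

# Hypothetical example: communalities of 20 samples under a 2-factor model
rng = np.random.default_rng(1)
X = rng.standard_normal((20, 5))
print(np.round(sample_communality(X, p=2), 2))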

Using the factor loading and factor score matrices, we can define the basic R-mode model for the data set X as:

X = Sr Ar' + e

This is a model of the original data matrix, X, because we choose to retain only p of the m possible factors, and thus some variance must be lost to an error matrix e.

As an interesting aside, the step at which m is reduced to p is one of the primary differences between PCA and R-mode factor analysis. In PCA, the initial solution is exact because the initial condition is to set p = m, and thus all the elements of e are 0. Once the PCA solution has been determined, the factor loading matrix (called the principal component matrix in this case) is usually truncated to retain only the largest factors (called components) that explain the majority of the variance in the data set. This approach is used because there is no a priori way to determine the number of underlying factors that may be present in the data set. Alternatively, in R-mode factor analysis, it is customary to specify the amount of variance in the data set to explain (or, equivalently, to infer the amount of "noise" in the data set) and to extract as many factors as are needed to explain that fraction of the variance. This cutoff threshold is often set between 90 and 99% of the total variance, which is the same as assuming uncorrelated errors that account for 10 to 1% of the variance. The R-mode and PCA solutions are identical (aside from normalization conventions) when p = m.
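
The variance-threshold approach is easy to automate once the eigenvalues are known. The sketch below (the function name n_factors_for_threshold and the eigenvalues are hypothetical) returns the smallest number of factors whose leading eigenvalues reach a chosen cutoff.

import numpy as np

def n_factors_for_threshold(eigvals, threshold=0.95):
    """Smallest p such that the p largest eigenvalues explain at least threshold of the variance."""
    frac = np.cumsum(np.sort(eigvals)[::-1]) / eigvals.sum()
    return int(np.searchsorted(frac, threshold) + 1)

# Hypothetical eigenvalues of a 6-variable correlation matrix (trace = 6)
eigvals = np.array([2.8, 1.9, 0.8, 0.3, 0.12, 0.08])
print(n_factors_for_threshold(eigvals, 0.90))    # factors needed to explain 90% of the variance
print(n_factors_for_threshold(eigvals, 0.99))    # factors needed to explain 99% of the variance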

Uses: R-mode factor analysis is extremely useful as a data filtering method. In the most general sense, filtering refers to any method that is designed to extract a desired signal from a data set that contains the influence of multiple processes as well as undesirable noise. Davis (1976) made use of EOF analysis to determine the underlying processes that influence sea level pressure (SLP) over the North Pacific. The original data set consisted of 336 monthly fields of SLP measured at numerous locations. EOF analysis reduced these to a subset of six EOFs that accounted for ~90% of the total observed variance. The remaining variance is assumed to represent processes that were either relatively unimportant during the time span of the observations or that simply reflect observational noise.

This approach is particularly powerful when used in conjunction with other statistical methods. Earlier in the course we learned that one of the limitations of multiple linear regression (MLR) is that the use of correlated predictor variables can lead to unstable regression parameters. One strategy for dealing with this shortcoming of MLR is to first perform an R-mode factor analysis on the data set of x (predictor) variables. This allows the researcher to re-partition the variance within the data set into orthogonal functions that represent the fundamental or independent processes captured within the data set. The MLR that results from regressing the factors onto the dependent (y) variable is much less likely to suffer from collinearity than a regression built from the raw data.
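
A minimal sketch of this strategy (sometimes called principal component regression) is given below; the function name factor_regression and the synthetic data are hypothetical, and the scores are taken as unscaled projections onto the leading eigenvectors.

import numpy as np

def factor_regression(X, y, p):
    """Regress y on the first p orthogonal factor scores of X instead of the raw predictors."""
    Xc = X - X.mean(axis=0)                      # center the predictors
    _, _, Ut = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Ut[:p].T                       # n x p uncorrelated predictors
    coef, *_ = np.linalg.lstsq(scores, y - y.mean(), rcond=None)
    return coef, Ut[:p]

# Hypothetical data: two nearly collinear predictors plus an independent one
rng = np.random.default_rng(2)
x1 = rng.standard_normal(100)
x2 = x1 + 0.01 * rng.standard_normal(100)        # nearly identical to x1
x3 = rng.standard_normal(100)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + 0.5 * x3 + 0.1 * rng.standard_normal(100)
coef, factors = factor_regression(X, y, p=2)
print(np.round(coef, 2))                         # stable coefficients on the orthogonal factors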



Q-mode Factor Analysis

Sample-based factor analysis is referred to as Q-mode factor analysis in geology and as inverse factor analysis in some social sciences. Like R-mode, the objective of Q-mode factor analysis is to simplify a large matrix of variables measured on many samples. Q-mode factor analysis is often used when many variables are measured at multiple spatial or temporal points. Unlike R-mode, the Q-mode method seeks to preserve the "information" within the samples of the original data set, rather than the variance within the variables. We use the term "information" rather than variance because each sample represents a set of measurements on many variables that may differ greatly in their variance. Once the Q-mode factor scores are determined, each sample in the data set can be expressed as a linear combination of the Q-mode factors. This allows the researcher to express each sample as a linear weighting of contributions from the various factors or end members that are determined from the data.

The Q-mode terminology arises from the fact that the first important step in the method is to transform the n sample by m variable data matrix X into the sample similarity matrix, Q. This matrix (which is also called the cosine theta matrix) serves the same role as the covariance or correlation matrix in R-mode analysis. The cosine theta matrix provides a geometric measure of the similarity of the samples in the data set: it represents the multivariate angular distance between samples. The smaller the angular distance between samples, the greater their similarity. The Q matrix is formed by post-multiplying the data matrix X by its transpose:

Q = XX'.

The resulting major-product matrix is a square (n x n) matrix whose dimensions are determined by the number of samples (n) in the data set. Because there are usually many more samples than variables, the Q matrix is much larger than the R matrix. As with R-mode, the Eckart-Young theorem can be used to determine the eigenvectors of Q (the columns of V) and the singular values of X (the square roots of the eigenvalues of Q) that make up L. From these matrices it is possible to calculate the factor scores, the new linear combinations of variables that are assumed to represent the underlying assemblages or end members expressed in the samples of the data set. The Q-mode factor score matrix, Aq, is defined as the product of the eigenvectors of Q and the singular values:

Aq = VL

The individual elements of the factor score matrix range from -1 to 1. The sums of squares of each column in Aq can be normalized to define the percentage of variance explained by that factor, and adding these percentages together yields the cumulative variance explained by the model. Notice that this terminology is exactly the opposite of that used in R-mode: the matrix analogous to the Q-mode factor score matrix is the R-mode factor loading matrix. While confusing, this difference arises from the use of the pre-multiplied versus the post-multiplied transform of the data matrix in the two methods.

To determine how much information each factor contributes to each sample, we can project the factor scores back onto the transpose of the data matrix to determine the Q-mode factor loadings:

Sq = X'Aq

The Q-mode factor loadings are thus analogous to the R-mode factor scores, and they also range from -1 to 1. If we calculate the sums of squares for each row (sample) of the Q-mode factor score matrix, Aq, we obtain the sample communality. Communality ranges from 0 to 1.0 and is a linear measure of the amount of information in each sample that can be explained by the factor model. This measure is analogous to the coefficient of determination (r2) statistic that we learned about in regression and ANOVA.
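
A minimal numpy sketch of these Q-mode steps follows. It assumes each sample (row) has been normalized to unit length, so that Q is the cosine theta matrix with ones on its diagonal; under that normalization the elements of Aq lie between -1 and 1 and the row sums of squares of Aq give communalities between 0 and 1, as described above. The function name qmode is hypothetical, and published implementations differ in how they scale the score and loading matrices.

import numpy as np

def qmode(X, p):
    """Q-mode sketch: cosine theta matrix Q, factor scores Aq, loadings Sq, and communalities."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)   # normalize each sample to unit length
    Q = Xn @ Xn.T                                       # n x n cosine theta matrix (diagonal = 1)
    eigvals, V = np.linalg.eigh(Q)                      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]                   # re-sort so factor 1 is the largest
    eigvals, V = np.clip(eigvals[order], 0, None), V[:, order]
    L = np.diag(np.sqrt(eigvals[:p]))
    Aq = V[:, :p] @ L                                   # n x p factor scores, Aq = VL
    Sq = Xn.T @ Aq                                      # m x p factor loadings, Sq = X'Aq
    communality = (Aq**2).sum(axis=1)                   # row sums of squares, between 0 and 1
    return Aq, Sq, communality

# Hypothetical example: 8 samples of 4 positive-valued variables
rng = np.random.default_rng(3)
X = rng.random((8, 4))
Aq, Sq, communality = qmode(X, p=2)
print(np.round(communality, 2))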

Using the Q-mode factor loading and factor score matrices, we can define the basic Q-mode model for the data set X as we did for R-mode:

X = Aq Sq' + e

This is a model of the original data matrix, X, because we choose to retain a subset of p factors that represent groupings of the m original variables, and thus some information is lost to an error matrix e. As a general rule of thumb, for a given data set there tend to be many fewer Q-mode factors to retain than R-mode factors.

Uses: Q-mode provides an excellent means of studying how samples or objects interrelate. It can thus be used as a means of modeling sample composition as a weighted sum of contributions from several end members derived from combinations of the variables measured in the raw data matrix. The method is used in psychology to study personality traits and group them into personality types. In sociology it can be used to study how economic and social factors interrelate. In paleoclimate research, the method is used to group environmentally sensitive species into factors or "assemblages" that are indicative of specific ecological niches. This assemblage model can then be applied to downcore samples to make inferences about how the relative importance of the various assemblages (and thus climate) changed through time.



Assumptions and Limitations

Assumptions and the quality of the solution

Because the covariance or correlation matrix is the basis of the factor analysis methods, the results are influenced by the quality of the observed correlations between variables or the similarities between samples, and thus can be affected by extreme outliers or sparse data sets. As with correlation analysis, the basic assumptions are that the variables are linearly related and multivariate normal in distribution. The method appears to be relatively insensitive to violation of the assumption of multivariate normality. More stringent demands of multivariate normality apply if certain tests of statistical significance are used when determining the number of factors to retain. Determining the number of factors to retain requires some assumption regarding either the number of processes expressed by the data or the level of noise within the data set. A more philosophical assumption behind the method is that the underlying factors extracted from the data set represent physically meaningful processes.

To rotate or not to rotate?

The sections above describe the basics of the method. Variations on the method include alternative ways of extracting factors, normalizing the data before analysis, or rotating the factors after extraction. Rotations can be used to make the extracted factors easier to interpret by redistributing the variance between the factors. The most common rotations in use are the varimax and oblique rotation techniques; others include the quartimax and equimax rotations. The varimax rotation is particularly common in Q-mode factor analysis. It is an orthogonal rotation that tends to shift variance from negative factor scores to positive ones, which helps make the factors easier to interpret. Oblique rotations are not recommended by many workers. This is because there are no clear criteria for accepting one oblique rotation over another, of which there are potentially an infinite number, and, perhaps more importantly, because an oblique rotation will reintroduce collinearity into the factor solution. Such a result seems counterproductive, since combating collinearity is one of the primary reasons for conducting the analysis in the first place!
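
For reference, the sketch below implements the classical Kaiser iteration for the varimax rotation (the function name and the example loadings are hypothetical). Because the rotation is orthogonal, it redistributes variance among the factors without changing the total variance explained and without reintroducing collinearity.

import numpy as np

def varimax(A, max_iter=100, tol=1e-8):
    """Varimax rotation of a loading matrix A (variables x factors) via the classical Kaiser iteration."""
    n, p = A.shape
    R = np.eye(p)                                    # accumulated orthogonal rotation
    prev = 0.0
    for _ in range(max_iter):
        L = A @ R
        grad = A.T @ (L**3 - L * (L**2).sum(axis=0) / n)
        u, s, vt = np.linalg.svd(grad)
        R = u @ vt                                   # closest orthogonal rotation to the gradient step
        if s.sum() < prev * (1 + tol):               # stop when the criterion no longer improves
            break
        prev = s.sum()
    return A @ R

# Hypothetical unrotated loadings for 4 variables on 2 factors
A = np.array([[0.7, 0.5], [0.8, 0.4], [0.5, -0.6], [0.4, -0.7]])
A_rot = varimax(A)
print(np.round(A_rot, 2))
print(np.round((A**2).sum(), 3), np.round((A_rot**2).sum(), 3))   # total variance is unchanged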

Determining the number of factors to retain

One of the primary limitations of these methods is that it is often difficult to determine which solution is optimal and thus how many factors to retain in the final solution. A number of statistical tests have been developed, but they place stringent constraints, such as strong multivariate normality, on the data set, which may be difficult to demonstrate or may simply not apply (descriptions of these methods, such as Bartlett's test of sphericity, can be found in advanced multivariate statistics textbooks). In practice, the most common rules used to determine the number of factors to retain remain subjective.
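
As an illustration of one such test, the sketch below computes Bartlett's test of sphericity, which asks whether the correlation matrix differs significantly from an identity matrix (i.e., whether there is any shared variance worth factoring at all). The function name bartlett_sphericity and the data are hypothetical, and the test assumes multivariate normality.

import numpy as np
from scipy import stats

def bartlett_sphericity(X):
    """Bartlett's test of sphericity: chi-square statistic and p-value for R = identity."""
    n, m = X.shape
    R = np.corrcoef(X, rowvar=False)                 # m x m correlation matrix
    chi2 = -(n - 1 - (2 * m + 5) / 6) * np.log(np.linalg.det(R))
    dof = m * (m - 1) / 2
    return chi2, stats.chi2.sf(chi2, dof)

# Hypothetical data: 60 samples of 5 variables
rng = np.random.default_rng(4)
X = rng.standard_normal((60, 5))
print(bartlett_sphericity(X))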

The simplest rule is to retain only factors whose eigenvalues are greater than one. The idea behind this rule is that, since the sums of squares are normalized to unit variance, the magnitude of an eigenvalue provides a measure of its importance relative to the original variables. A factor with an eigenvalue equal to one accounts for no more variance than a single original variable and thus may not be worth retaining. A related graphical approach is to generate a semilog plot of eigenvalue rank versus magnitude, or factor rank versus percent of total variance or information explained. This so-called "scree plot" can be used to look for a break in the magnitude of the eigenvalues that separates the significant eigenvalues from the remaining ones (the "noise floor"). Another method is to divide the data set into two or more random subsets and conduct a parallel factor analysis on each part. This provides an estimate of how stable the factors are and thus how many to retain. Finally, the factors can be "mapped out" in time or space to see if the signal associated with each factor can be identified.
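
The two simplest rules are easy to apply once the eigenvalues are in hand. The sketch below uses a hypothetical set of eigenvalues from an eight-variable correlation matrix, applies the eigenvalue-greater-than-one rule, and prints the ranked table one would plot as a scree curve.

import numpy as np

# Hypothetical eigenvalues of a correlation matrix for m = 8 variables (trace = 8)
eigvals = np.array([3.4, 2.1, 1.2, 0.5, 0.35, 0.25, 0.12, 0.08])

# Rule 1: retain only factors whose eigenvalue exceeds 1 (the variance of one original variable)
print("retain", int((eigvals > 1).sum()), "factors")

# Rule 2: scree table -- look for the break separating the signal from the noise floor
pct = 100 * eigvals / eigvals.sum()
for rank, (ev, cum) in enumerate(zip(eigvals, np.cumsum(pct)), start=1):
    print(f"factor {rank}: eigenvalue = {ev:.2f}, cumulative variance = {cum:.1f}%")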

Naming factors

Another relatively subjective aspect of the method arises from the names that are applied to the factors. Kachigan provides a short but useful summary of this aspect of the method. Put simply, factors that can be named imply that they can be clearly interpreted. This can be useful as a means of synthesizing a large data set. Some workers, however, are skeptical of naming factors because of the inherent subjectivity and refer to each factor by its numerical rank only. Whether you choose to name a factor or not should be driven by careful attention to the original data, the resulting model, and the processes that are of likely importance.

The imperfect end member problem

One of the limitations of factor analysis is that the method can only provide sample estimates of the true end members of interest. This is true because the measurements must include errors and because it is extremely unlikely (if not impossible) to find a sample that represents an observation of a pure end member. While this limitation requires some consideration, it is no more severe than the limitations of other statistical methods, which also provide sample estimates of statistics rather than true population parameters.