x

Multi-Dimensional Scaling and Cluster Analysis

Home > Multi-Dimensional Scaling and Cluster Analysis
Survey Research Suite

Guide

Dimension-Reducing Methods and Cluster Analysis

In preceding sections we have frequently mentioned the phrases “objective space” and “cluster analysis.” While our discussion emphasized the use of multidimensional scaling methods in the representation of psychological responses–namely, similarities and preferences–we also indicated from time to time that the methodology can be viewed as a means of data reduction and summarization. In this latter role, our interest is on reducing the subject’s evaluation data (evaluation of each of a set of attributes on each of a set of stimuli) into a space of fewer dimensions, with little loss in information. We can, of course, turn the problem around and consider the dual question of reducing the number of subjects by partitioning them into homogeneous groups, based on their “closeness” to one another in variable space. The first approach may be called reduced space analysis (factor analysis) and the second, cluster analysis.[#N_1_ (1)] We shall often consider the use of both approaches. In this section, we first discuss these dual problems of data reduction and clustering in the context of non-psychological data. After describing some of the basic characteristics of the data matrix we discuss both metric and non-metric approaches to reduced space representation. We then turn to the topic of cluster analysis and describe the major characteristics of clustering procedures, including the choice of association coefficient, types of grouping algorithms, cluster description, and problems of statistical inference.The concluding section discusses some recent developments in reduced space and cluster analysis, particularly an overlapping clustering technique that appears particularly useful in product positioning where products/subjects may belong to several clusters simultaneously.

The Data Matrix
The key concept underlying the utilization of reduced space and cluster analysis is that in most real world situations the multiple measurements collected on a set of subjects are partly redundant, i.e., correlated in some way. The dual of this assertion is that some sets pairs of subjects, when considered across variables, are similar to each other. Geometrically, this could be observed by putting N subjects in N-1 dimensional space. We would generally not find all points equidistant from each other, thus indicating that certain sets of points are correlated. While one can often take advantage of natural association to make predictions as, for example, in regression analysis or discriminant analysis, situations occur in which no variable can be singled out as a dependent or criterion variable. In these situations, we may still wish to summarize the information provided by a whole set of subjects-by-stimuli “scores”. We would search for some more parsimonious structure which removes redundancy in the original set of data while preserving most of the information contained in the original data matrix. In either case, the “raw input” to any such analysis consists of the data matrix. The data matrix is here assumed to be a rectangular matrix of numerical entries whose informational content is to be summarized and portrayed in some way. For example, the computation of the mean and standard deviation of a unidimensional array is often done simply because we are unable to comprehend the meaning of the entire column of values. In so doing we often (willingly) forego the full information provided by the data in order to understand some of its basic characteristics, e.g., central tendency and dispersion. The problem becomes even more complex when we have multiple measurements on a set of objects. Table 5-1 shows a conceptual illustration of a data matrix. We note that the array consists of a set of subjects (the N rows) and a set of variables (the n columns). The (i,j)th cell entry represents the value of subject i on variable j. The subjects may be people, objects, concepts, or events. The variables are characteristics, properties, attributes or some other stimuli descriptive or related to the subjects. Cell values may consist of nominal, ordinal, interval, or ratio-scaled measurements or various combinations of these as we go across columns. The complete (row) vector of values is often called a subject’s profile. In many cases the investigator is able to partition the data matrix into subsets of columns (or rows) on the basis of prior information. For example, suppose the first column of Table 5-1 is the elapsed time for an automobile to travel from 0 to 100 KPH. and the other columns consist of various automobile performance and physical dimensions for the N automobiles. The researcher may wish to predict quickness or elapsed time from some linear combination of the n-1 remaining variables. If so, prior judgment is used to describe how the dependency is to be explained. In this instance, we would probably use a multiple regression of horsepower, torque, weight, and tire size to establish the hypothesized functional relationships. Quite frequently, however, we may have no reasonable basis for prior partitioning of the data matrix into criterion (dependent) or predictor (independent) variables. The purpose may merely be to group subjects into “similar” subsets (based on their similarity over the whole profile of variables) or to portray the columns of the data matrix in terms of a smaller number of new variables (e.g., linear combinations of the original set) which retain most of the information in the original data matrix. These are the cases with which this section is concerned. Both classes of procedures start with the data matrix or some set of measures derived from it.
Reduced Space Analysis
In earlier sections we saw how a set of association measures (similarities data, for example) could be “dimensionalized” by multidimensional scaling techniques. We also saw how a set of stimuli in variable space could be converted into a set of “association” measures (e.g., distances) between each stimulus pair. In general, we can go from an association matrix to a dimensional representation and vice versa. In reduced space analysis we usually, though not necessarily, start off with a data matrix like that conceptualized in Table 5-1. The objective here is to transform this matrix into one having the same number of rows but fewer columns, without serious loss of information.In finding this reduced space representation, however, we may go through the seemingly roundabout process of:– finding correlation measures;

  • rotating of the initial configuration of points (objects to a new orientation of the same dimensionality). This rotated configuration exhibits the characteristic of mutually orthogonal dimensions with sequentially maximal variance. That is, the second dimension displays the next largest variance, subject to being orthogonal to the first, and so on;
  • reducing the dimensionality of this transformed space, usually by discarding those higher dimensions that exhibit the smallest variance of point projections;
  • finding still a new orientation of the reduced space that makes the retained dimensions more interpretable from a content point of view;
  • interpreting the reoriented dimensions in terms of the variables that show high association with each dimension [Green 1978].

Historically, the methods of factor analysis have been used for a variety of purposes [Rummel 1970]:– reducing the dimensionality of a variable space;

  • scaling and the spatial representation of perception and preference data;
  • untangling complex patterns of intervariable association in multivariate data;
  • exploratory research in the identification of latent characteristics for future experimentation;
  • developing empirical typologies of variables;
  • developing a data-based unidimensional index that maximally separates individuals;
  • testing hypothesized relationships between certain variables in specific content areas;
  • transforming the matrix of predictor variables prior to applying some other technique like multiple regression or canonical correlation.

We shall briefly discuss one type of factoring–principal components factor analysis–as the major metric technique for reducing the dimensionality of space.

METRIC REDUCTION–Principal Components Analysis

Factor analysis reduces the dimensionality of a data set to produce a set of underlying dimensions, which are often referred to as components, or factors. Factor analysis then seeks a linear combination of the original variable scores–assumed here to be at least interval scaled that displays maximum variance if the subject points are projected on to it. Once this is done, we seek a second linear combination, axis or underlying dimension that is orthogonal to the first and maximizes residual variance explained. This process continues until all of the variance in the original data is explained. Having done this, we disregard those higher dimensioned linear combinations that account for small amounts of variance. As we construct the linear combinations, any set of weights, in the linear combination, same or different over each column, plus or minus, might suffice. As a matter of fact, the various types of factoring methods are differentiated in terms of the bases upon which the weights defining the factor are selected. Factor scores are obtained by merely finding for each subject the numerical value of the linear combination obtained by substituting original data values in the linear equation of, say, factor one:

F1 = a1X1 + a2X2 +…+ anXn

where the aj are the factor weights.

If we then correlate each of these new columns of the (subject x factor) factor score matrix with the original variables, we obtain a set of factor loadings. A factor loading is defined simply as the correlation (across subjects) of an original variable with a factor. Again, if the method of principal components is used, the weights are chosen in such a way that the successive principal components will account for a maximal portion of total variability in the data, subject to being mutually orthogonal to all previously obtained linear combinations. That is, the first set of weights will lead to a set of component scores which account for maximum variance. Each successive component will account for a decreasing portion of total variance in the original set of data and be orthogonal to all previously existing components. Some of the geometric aspects of principal components analysis are shown in Figures 5-1 and 5-2. We assume that the original data matrix consists of the measurements of N subjects on three variables. We note from Figure 5-1 that the positions of the actual subject data (from Table 5-1) are portrayed as a somewhat elliptical scatterplots of points. Because variables Xs1 and Xs2 are not perfectly correlated, the plot of points forms an ellipse, rather than a straight line. Figure 5-2 (a) depicts this more clearly with the principal components being represented by the new set of axes (Zi) drawn through the major and minor axes of the ellipse.Z1, the first principal axis, is the major axis of the ellipse. However, it does not exhaust all of the information in the data, necessitating a second axis Z2. Z2 is the minor axis of the ellipse. We note that most of the information would be preserved if only Z1 were retained. Figure 5-2(b) shows the analogous case for three dimensions in which a concentration ellipsoid describes the point scatter. Figure 5-2(b) would appear as a flattened balloon because length (Z1), width (Z2), and height (Z3) have less and less variance (dispersion) to account for. Figure 5-2(c) reflects the situation where the variables are uncorrelated in three dimensions. The scatterplot appears as a sphere, where each of the dimensions is orthogonal to each other.

FACTORING THE ASSOCIATION MATRIX

In practice, the factors are obtained not by factoring the data matrix itself, but a set of association measures–cross products, covariances, correlations–that are derived from the data matrix. (See Table 5-1.) If as many components are extracted as there are variables, the original association matrix (e.g., matrix of correlation coefficients) can be perfectly reproduced by a matrix product of the factor loadings. Of course, in this case no parsimony would be obtained, since one would have as many linear combinations (components) of the variables as there were variables to begin with.

INSERT Figure 5.1 Scatter Plots of Standardized Data of Table 5.1 and the First Principal Component z1

INSERT Figure 5.2 Geometric Aspects of Principal Components Analysis

METRIC APPROACH TO REDUCED SPACE ANALYSIS

In terms of our emphasis on reduced space techniques, principal components analysis represents the major metric procedure for reducing the original variable space. In the usual case where a (variable by variable) correlation matrix serves as the starting point for the component analysis (R-type factor analysis), one can compute unit-variance (factor) scores for each object on each component. If weighted squared distances in component space are computed between object pairs, using component eigenvalues as weights, such measures will be proportional to squared interpoint distance in the original variables space if all components are extracted.

NONMETRIC REDUCED SPACE ANALYSIS

In general, if the function relating interpoint distance between pairs of points to input data is nonlinear but monotonic, a metric procedure like principal components analysis will lead to a larger number of dimensions in which to portray the point locations in terms of orthogonal axes than would be the case with linear functions. However, if we replace the objective of attempting to reconstruct the actual values of the input matrix with the objective of trying to reproduce only their rank order, we will obtain a space of fewer dimensions. Moreover, it turns out that multidimensional scaling techniques discussed in this book can be used as types of nonmetric factoring procedures. The input data can consist of a symmetric matrix of any type of association measure whose values can be weakly ordered and obey the metric axioms discussed in Section 2. In addition to nonmetric programs that find a dimensional representation of a single set of points (e.g., objects) whose ranks of interpoint distances are monotonic with the original input data, other approaches leading to joint spaces can be employed. For example, the vector and unfolding models described in Section 4 can be used for the same general objective, depending upon the type of representation desired.

Cluster Analysis
Cluster analysis, like reduced space analysis, is concerned with data matrices in which the variables have not been partitioned beforehand into criterion versus predictor subsets. In reduced space analysis our interest centered on reducing the variable space to a smaller number of orthogonal dimensions which maintained most of the information–metric or ordinal– contained in the original data matrix. Emphasis was placed on the variables rather than on the subjects (rows) of the data matrix. In cluster analysis our concern is with the similarity of the subjects–that is, the resemblance of their profiles over the whole set of variables. These variables may be the original set or may consist of a representation of them in reduced space. In either case the objective of cluster analysis is to find similar groups of subjects, where “similarity” between each pair of subjects is usually construed to mean some global measure over the whole set of characteristics–either original variables or derived coordinates, if preceded by a reduced space analysis. In this section we discuss various methods of clustering and the key role that distance functions play as measures of the proximity of pairs of points. We first discuss the fundamentals of cluster analysis in terms of major questions concerning choice of proximity measure, choice of clustering technique, and descriptive measures by which the resultant clusters can be defined. We show that clustering results can be sensitive to the type of distance function used to summarize proximity between pairs of profiles.We next discuss the characteristics of various computer programs that have been proposed for grouping profiles, i.e., for partitioning the rows (subjects) of the data matrix. This is followed by brief discussions of statistics for defining clusters and the problems associated with statistical inference in this area.

BASIC QUESTIONS IN CLUSTER ANALYSIS

The most common use of cluster analysis is for classification. That is, subjects are separated into groups such that each subject is more like other subjects in its group than it is to subjects outside the group. Cluster analysis is thus concerned ultimately with classification and represents a set of techniques which are part of the field of numerical taxonomy (Frank and Green, [1968]; Punj and Stewart [1983]; Aldenderfer and Blashfield [1984]). We will initially focus on clustering procedures that result in the assignment of each subject to one and only one class. Subjects within a class are usually assumed to be indistinguishable from one another. Thus, we assume that the underlying structure of the data involves an unordered set of discrete classes. In some cases we may also view these classes as hierarchical in nature, with some classes divided into subclasses. Clustering procedures can be viewed as “preclassificatory” in the sense that the researcher has not used prior judgment to partition the subjects (rows of the data matrix). However, it is assumed that some of the objectives are heterogeneous; that is, that “clusters” exist. This presupposition of different groups is based on commonalities within the set of independent variables. This assumption is different from that made in the case of discriminant analysis or automatic interaction detection, where the dependent variable is used to formally define groups of objects and the distinction is not made on the basis of profile resemblance in the data matrix itself. Thus, given that no information on group definition is formally evaluated in advance, the major problems of cluster analysis will be discussed as follows:

  1. What measure of intersubject similarity is to be used and how is each variable to be “weighted” in the construction of such a summary measure?
  2. After intersubject similarities are obtained, how are the classes to be formed?
  3. After the classes have been formed, what summary measures of each cluster are appropriate in a descriptive sense; that is, how are the clusters to be defined?
  4. Assuming that adequate descriptions of the clusters can be obtained, what inferences can be drawn regarding their statistical significance?

CHOICE OF PROXIMITY MEASURE

The choice of proximity, similarity, association, or resemblance measure (all four terms will be used synonymously here) is an interesting problem in cluster analysis. The concept of similarity always connotes the question: similarity with respect to what? Proximity measures are usually viewed in relative terms–two objects are similar, relative to the group, if their profiles across variables are “close” or they share “many” aspects in common, relative to those which other pairs share in common. Most clustering procedures use pairwise measures of proximity. The choice of which subjects and variables to use in the first place is largely a matter for the researcher’s judgment. While these (prior) choices are important ones, they are beyond our scope of coverage. Even assuming that such choices have been made, however, the possible measures of pairwise proximity are many. Generally speaking, these measures fall into two classes: (a) distance-type measures (including correlation coefficients); and (b) matching-type measures. The characteristics of each class are discussed in turn.

DISTANCE-TYPE MEASURES

A surprisingly large number of proximity measures–including correlation measures–can be viewed as distances in some type of metric space. In Section 2 we introduced the notion of Euclidean distance between two points in a space of r dimensions. We recall that the formula was:File:Img.gifwhere xij, xjk are the projections of points i and j on dimension k; (k = 1,2,…,r). Inasmuch as the variables are often measured in different units, the above formula is usually applied after each variable has been standardized to mean zero and unit standard deviation. Our subsequent discussion will assume that this preliminary step has been taken. The Euclidean distance measure assumes that the space of (standardized) variables is orthogonal, i.e., that the variables are uncorrelated. While the Euclidean measure can still be used with correlated variables, it is useful to point out that (implicit) weighting of the components underlying the associated variables occurs with the use of the Euclidean measure:

  1. Squared Euclidean distance in the original variable space has the effect of weighting each underlying principal component by that component’s eigenvalue.
  2. Squared Euclidean distance in the component space (where all components are first standardized to unit variance) has the effect of assigning equal weights to all components.
  3. In terms of the geometry of the configuration, in the first case all points are rotated to orthogonal axes with no change in squared interpoint distance. The general effect is to portray the original configuration as a hyperellipsoid with principal components serving as axes of that figure. Equating all axes to equal length has the effect of transforming the hyperellipsoid into a hypersphere where all “axes” are of equal length.

The above considerations can be represented in terms of the following squared distance model:File:Img1.gifwhere: yik, yjk denote unit variance components of profiles i and j on component axis k (k = 1,2,…,r). If one weights the component scores according to the variances of the components (before standardization) the expression is:File:Img2.gifwhere is the k-th component variance, or eigenvalue. This expression is equivalent to d2ij expressed in original variable space. The above relationships assume that all principal components are extracted. As described earlier, if such is not the case, squared interpoint distances will be affected by the fact that they are computed in a component space of lower dimensionality than the original variable space. In summary, both the Euclidean distance measure in original variable space and the Euclidean distance in component space (assuming all components have been extracted) preserve all of the information in the original data matrix. Finally it should be pointed out that if (in addition to being standardized to mean zero and unit variance) the original variables are uncorrelated, both d2ij and *d2ij will be equivalent.

OTHER EUCLIDEAN DISTANCE MEASURES

Two other measures have often been proposed as proximity measures. Both of these measures derive from historical clustering methods which used Q-type factor analysis to cluster subjects. In Q-type factor analysis–as described briefly in our discussion of reduced space analysis–the correlation (or covariance) matrix to be factored consists of intersubject rather than intervariable proximities. In these methods the weights k are left intact.The effect of Q-type component analysis of either covariance or correlation matrices, has been shown by Cronbach and Gleser [1953], to reduce the dimensionality of the space underlying the computation of proximity measures. Both procedures reduce the dimensionality of the original space to one less dimension by equating all profiles to zero mean. As such, profile differences in elevation are removed. In addition, a Q-type analysis applied to the intersubject correlation matrix will remove interprofile variation due to differences in dispersion. The result is projection of points representing each profile into two fewer dimensions.Figures 5-3 and 5-4 show these effects geometrically. In the case of either covariance or correlation matrices the profile mean is subtracted from each vector component which, in the 2-component case of Figure 5-3, results in a centroid with (new) origin located at point X on the figure. Figure 5-4 shows the effect of removing profile dispersion. If we assume that the profiles were originally positioned in three-space, removal of each profile’s mean reduces their dimensionality to two-space. By using a correlation matrix we further reduce dimensionality by projecting all points on to the unit circle, since the standard deviation of a profile can be represented by the distance of the profile point from the origin (the centroid of the points first having been translated to the origin). Thus, profiles a, b, and c would all be identical after the transformation, as would profiles d and e. The cosine of the angle separating the two vectors represents the Q-correlation between them.Of course, there may be cases where the researcher is not interested in profile differences due to either mean and/or dispersion. If so, a Q-type analysis applied to covariance or correlation matrices, as the case may be, is perfectly sensible even though information is (willingly) discarded. In general, however, we should expect differences in the derived squared distance measure computed from these procedures–both between themselves and between those computed by the techniques previously discussed. While the authors have a predilection for the information-preserving measures d2ij and *d2ij, it is well to point out that no universally applicable distance measure exists. The choice of which measure to use depends upon which aspect of the data is worth “preserving.” A wide variety of distance-type measures are available for cluster analysis; several of which are compared by Aldenderfer and Blashfield (1984). Once the researcher has selected a method of measuring pairwise profile similarity, the computational routine for clustering the subjects must be selected. Aldenderfer and Blashfield identify several families of clustering methods, each of which uses a different approach to creating groups: (1) hierarchical agglomerative; (2) hierarchical divisive; (3) factor analytic; (4) non-hierarchical.

HIERARCHICAL METHODS

These procedures are characterized by the construction of a hierarchy or tree-like structure. In some methods each point starts out as a unit (single-point) cluster. At the next level the two closest points are placed in a cluster. At the next level a third point joins the first two or else a second two-point cluster is formed, based on various criterion rules for assignment. In application, hierarchical clustering is useful in determining if points are substitutable rather than mutually exclusive.

(INSERT Figures 5-2, HERE)

(INSERT Figure 5-3 HERE)

The different assignment rules include:

Single Linkage [#N_2_ (2)]

The single linkage or minimum distance rule starts out by finding the two points with the minimum distance. These are placed in the first cluster. At the next stage a third point joins the already-formed cluster of two if the minimum distance to any of the members of the cluster is smaller than the distance between the two closest unclustered points. Otherwise, the two closest unclustered points are placed in a cluster. The process continues until all points end up in one cluster. The distance between two clusters is defined as the shortest distance from a point in the first cluster that is closest to a point in the second.

Complete Linkage

The complete linkage option starts out in just the same way by clustering the two points with the minimum distance. However, the criterion for joining points to clusters or clusters to clusters involves maximum (rather than minimum) distance. That is, a third point joins the already formed cluster of two if the maximum distance to any of the members of the cluster is smaller than the distance between the two closest unclustered points. In other words, the distance between two clusters is the longest distance from a point in the first cluster to a point in the second cluster.

Average Linkage

The average linkage option starts out in the same way as the other two. However, in this case the distance between two clusters is the average distance from points in the first cluster to points in the second cluster.

WARD’S METHOD

Ward’s Method starts out by finding two points with the minimum within groups sum of squares. Points continue to be joined to the first cluster or to other points depending on which combination minimizes the error sum of squares from the group centroid. This method is also known as a k-means approach. Closely related to Ward’s algorithm is the Howard-Harris algorithm [Harris, 1981]. The Howard-Harris algorithm is a hierarchical divisive method which uses the k-means method of assigning cases to the clusters. The k-means method assigns the case to the closest centroid. The approach may take either of the two forms described below:

Approach #1

  1. Initially the entire set of observations is considered as one set. The group is split based on the one variable which makes the greatest contribution to within-group sum of squares.
  2. Group centroids are re-computed and subject distances to all group centroids are computed. The subject that would best improve the objective function is re-assigned. This process is repeated until either a finite number of transfers are performed, no further improvement in within-groups sum of squares is found, or a local optimum is reached.
  3. The group with the largest within-groups sum of squares is selected for splitting. Steps 2 and 3 are then repeated until the desired number of clusters are identified.

Approach #2

  1. A m x m covariance matrix is formed and analyzed using principal components analysis. A factor score is computed for each of the n subjects on the first (and most important) dimension or factor.
  2. All subjects with factor scores that exceed the mean value of the factor are assigned into a new cluster.
  3. After splitting, each observation is re-evaluated against all clusters. If the objective function is improved by re-assigning a case to another cluster, the case making the greatest improvement is re-assigned. Optimization continues until
    • a finite number of transfers are performed, b) no further improvement in the objective function is found, or c) a local optimum is reached.
  4. The next factor is selected as the basis for splitting the next cluster. Steps 2, 3, and 4 are then repeated until the desired number of clusters are identified.

Hierarchical clustering algorithms may be identified as either hierarchical agglomerative or hierarchical divisive, meaning that they contract or expand the space between groups of points in the multivariate space. Divisive linkage methods tend to start new clusters rather than join points to existing clusters. Ward’s method and complete linkage rules are of the divisive variety and tend to create clusters of roughly equal size that are hyperspherical in form. The average linkage method neither expands nor contracts the original space, while the single linkage tends to agglomerate or contract the space between groups of points in multivariate space.

FACTOR ANALYTIC METHODS

These methods analyze an n x n correlation matrix of similarities between the n cases to find a dimensional representation of the points. Clusters are then developed based on the resulting factor loadings (the correlation between the subject and the underlying dimension). The use of factor analytic methods for clustering have been criticized because of the use of a linear model that is developed across cases rather than across variables. The linear model tends to moderate the correct identification of anything other than linear additive predictive groups.

NON HIERARCHICAL METHODS

Non Hierarchical Methods have in general been subject to limited use and testing, making their specific operational characteristics difficult to identify. In general, however, these methods start right from a proximity matrix and work in the following fashion (Aldenderfer and Blashfield, 1984):

  1. Begin with an initial split of the data into a specified number of clusters.
  2. Allocate each data point to the cluster with the nearest centroid.
  3. Compute the new centroid of the clusters after all points are assigned to clusters.
  4. Iterate steps 2 and 3 until no further changes occur.

Several general types of non-hierarchical clustering designs exist and can be characterized:

  • Sequential Threshold–in this case a cluster center is selected and all objects within a prespecified threshold value are grouped. Then a new cluster center is selected and the process is completed. Once points enter a cluster they are removed from further processing.
  • Parallel Threshold–this method is similar to the one immediately above except that several cluster centers are selected in advance and points within the threshold level are assigned to the nearest center; threshold levels can then be adjusted to admit fewer or more points to clusters.
  • Parallel Partitioning–this method is similar to the one immediately above except that once several cluster centers are chosen, the whole set of data is partitioned into disjoint sets based on nearest distance to cluster centers being within threshold distance of still other objects.

MATCHING-TYPE MEASURES

Quite often the analyst wishing to cluster profiles must contend with data that are only nominal scaled, in whole or in part. While dichotomous data, after suitable transformation, can often be expressed in terms of interpoint distances, the usual approach to nominal-scaled data uses attribute matching coefficients. Intuitively speaking, two profiles are viewed as similar to the extent to which they share common attributes. As an illustration of this approach, consider the two profiles appearing below:

ATTRIBUTE {| width=”44%” cellpadding=”5″ |- valign=”top” | width=”100%” | Object 1 2 3 4 5 6 |- valign=”top” | 1 1 0 0 1 1 0 |- valign=”top” | 2 0 1 0 1 0 1 |}Each of the above objects is characterized by possession or non-possession of each of six attributes, where a “1″ denotes possession and “0″ denotes non-possession. Suppose we just count up the total number of matches–either 1, 1 or 0, 0–and divide by the total number of attributes. A simple matching measure could then be stated as:

S12 = M/N = 2/6 = 1/3 where M denotes the number of attributes held in common (matching 1′s or 0′s) and N denotes the total number of attributes. We notice that this measure varies between zero and one. If weak matches (non-possession of an attribute) are to be de-emphasized, the above measures can be modified to:File:Img3.gif. In this case, Sij = 1/5. A variety of such matching type coefficients are described by Sneath and Sokal [1973] and Everett [1980]. Attributes need not be limited to dichotomies, however. In the case of unordered multichotomies, matching coefficients are often developed by means similar to the above by recoding the k-state variables into k-1 dummy (zero-one) variables. Naturally such coefficients will be sensitive to variation in the number of states in each multichotomy.

Finally, mention should be made of the case in which the variables consist of mixed scales–nominal, ordinal, and interval. Interval-scaled variables may be handled in terms of similarity coefficients by the simple device of computing the range of the variable Rk and finding:File:Img4.gif. This measure has been suggested by Gower [1967] as a device to handle both nominal- and interval-scaled data in a single similarity coefficient. Mixed scales which include ordinal-scaled variables present greater difficulties. If ordinal and interval scales occur, one can downgrade the interval-scaled data to ordinal scales and use nonmetric procedures. If all three scales–nominal, ordinal, and interval–appear, one is more or less forced to downgrade all data to nominal measures and use matching type coefficients. An alternative approach would be to compute “distances” for each pair of objects according to each scale type separately, standardize the measures to zero mean and unit standard deviation, and then compute some type of weighted association measure. Such approaches are quite ad hoc, however. While the above classes of clustering algorithms are not exhaustive of the field, most of the currently available routines can be typed as falling into one (or a combination) of the above categories. Criteria for grouping include such measures as average within-cluster distance and threshold cut-off values. The fact remains, however, that even the “optimizing” approaches generally achieve only conditional optima, since an unsettled question in this field is how many clusters to form in the first place. The recommendation of an algorithm is difficult at best. For classification purposes, clusters should be able to identify distinct separations between different clusters of items. Clusters should also be internally consistent. Because meeting these challenges is often a function of the type of data analyzed, selection of an optimal algorithm is also a function of the characteristics of the data. Overall, the k-means clustering technique appears to perform well (see Punj and Stewart 1983) when the initial starting configuration is non-random. In situations where a random starting configuration is required, the minimum variance type of algorithm often performs well. It is even suggested that clustering might best be approached using a combination of reduced space analysis and clustering techniques, so as to group points in the space obtained from principal components or nonmetric scaling techniques. This approach is particularly beneficial if the number of dimensions is small, allowing the researcher to augment the clustering results with visual inspection of the configuration.If the researcher is more concerned with structure than classification, overlapping clustering ignores the concept of distinct separations between clusters in an attempt to allow products/subjects to belong to more than one cluster.

DESCRIBING THE CLUSTERS

Once clusters are developed, the researcher still faces the task of describing the clusters. One measure which is used frequently is the centroid; that is, the average value of the objects contained in the cluster on each of the variables making up each object’s profile. If the data are interval scaled and clustering is performed in original variable space, this measure appears quite natural as a summary description. If the space consists of principal components’ dimensions obtained from nonmetric scaling methods, the axes usually are not capable of being described simply. Often in this case the researcher will want to go back to the original variables and compute average profile measures in these terms. If matching type coefficients are used, the cluster may be described by the group’s modal profile on each of the attributes; in other cases, arithmetic averages may be computed. In addition to central tendency, the researcher may compute some measure of the cluster’s variability, e.g., average interpoint distance of all members of the cluster from their centroid or average interpoint distance between all pairs of points within the cluster.

STATISTICAL SIGNIFICANCE

Despite attempts made to construct various tests of statistical significance of clusters, current statistical tests are little more than heuristics offering relatively indefensible procedures. The lack of appropriate tests stems from the difficulty of specifying realistic null hypotheses. First, it is not clear just what the universe of content is. Quite often the researcher arbitrarily selects objects and variables and is often interested in confining attention to only this sample. Second, the researcher is usually assuming that heterogeneity exists in the first place–otherwise, why bother to cluster? Moreover, the clusters are formed from the data and not on the basis of outside criteria. Thus, one would be placed in the uncomfortable statistical position of “testing” the significance between groups formed on the basis of the data itself. Third, the distributions of objects are largely unknown, and it would be dangerous to assume that they conformed to some tractable model like a multivariate normal distribution.It is indeed likely that different types of clusters may be present simultaneously in the data. It is a major difficulty to specify a-priori the type of clustering or homogeniety to be detected. These limitations in mind, Arnold (1979) proposed using a statistic originally suggested by Friedman and Rubin (1967). The statistic, given byFile:Img5.gif where |T| is the determinant of the total variance-covariance matrix and |W| is the determinant of the pooled within-groups covariance matrix. We continue to believe that, at least in the present state of cluster analysis, the objective of this class of techniques should be to formulate rather than test categorizations of data. After a classification has been developed and supported by theoretical research and subsequent reformulation of classes, other techniques like discriminant analysis might prove useful in the assignment of new members to groups identified on grounds which are not solely restricted to the original cluster analysis. While the above caveats are not to be taken lightly, clustering techniques are useful–in ways comparable to the objectives of factor analysis–as systematic procedures for the orderly preclassification of multivariate data. The results of using these approaches can be helpful and meaningful (after the fact) as will be illustrated next.

An Application of Reduced Space and Cluster Analysis

Thus far, our descriptions of reduced space and clustering methods have largely remained at the conceptual level. In this section of the section we describe their application to a realistically sized problem–one dealing with the similarities and differences among 90 1987 automobiles, trucks and utility vehicles whose prices range from $5,000 to $168,000. In this abridged version of the study, we illustrate the use of cluster analysis to 90 vehicles. Figure 5-5 identifies the 90 vehicles for which data was collected on 20 attributes. This 90 vehicle x 20 variable data matrix forms the basis for our analysis directed at grouping the vehicles according to similarity of attributes.

Application of the Howard Harris procedure yielded two different clustering solutions which, based on the within-group sum of squares for each group, appeared to be worth examining. Figure 5-6 shows the “Scree”-type diagram plotting sum of squares against number of clusters. The curve appears to flatten at 5 clusters and again at 12 clusters.The 5-cluster solution is shown in Figure 5-7 and the 12-cluster solution is shown in Figure 5-8. As might be surmised, the 5 cluster representation was inferior to the 12 cluster representation.

Vehicle Performance Characteristics and Listing

Vehicles

5 Cluster Solution

12 Cluster Solution

In Figure 5-6, we note that cluster membership is somewhat more evenly distributed than in Figure 5-7 (twelve) clusters. The groups are rather homogeneous, though now and again, a vehicle seems to be out of place.From Tables 5-6 and 5-7 one can get some idea of the current intermanufacturer competition. Market positioning strategies seem to be well developed, with several of the manufacturers having multiple vehicles within the same segments. This product positioning is even more apparent when one recognizes that these are the major model differences and that minor options/product distinctions are present for most vehicles.

SUMMARY OF STUDY

The foregoing results constituted only one of several possible facets of this study. Additional analytical steps may have involved: (a) the development of clusters based only on the nominal-scaled (features) data; (b) the development of clusters based only on the interval-scaled data; and (c) clustering (involving both features and measured data) on a combined time period basis. In terms of substantive results, we found that five “clusters” explained most of the similarities and differences among the vehicles models–VW Vanagon, smaller 4-5 passenger vehicles and wagons, exotic sports and large capacity passenger cars and utility vehicles, and popular sports cars and pickups. Of course, the clusters became more detailed as the number of clusters increased. The resulting clusters indicated which manufacturers competed with which other manufacturers in terms of similarity in the performance profiles of their vehicles. For purposes of this section, suffice it to say that clustering techniques can be used in marketing studies involving large-scale data banks. Moreover, the combination of reduced space (principal components) and cluster analysis can provide a useful dual treatment of the data. The reduced space phase may provide help in summarizing the original variables in terms of a smaller number of dimensions, e.g., speed or cargo capacity. The clustering phase permits one to group vehicles according to their coordinates in this reduced space.

Recent Developments in Clustering Technique

Our previous discussion of clustering analysis has tended to emphasize the tandem approach of dimensional and nominal (class-like) representation of data structures. In addition to using multidimensional scaling techniques for reduced space analysis, a number of other nonlinear approaches have been developed, including nonlinear factor analysis [McDonald, 1962], polynomial factor analysis [Carroll 1969], correspondence analysis [Carroll, Green and Schaffer, 1986]. Space does not permit anything but brief mention of this interesting work. We do consider in some detail, however, a combination qualitative-quantitative approach to an important problem in reduced space analysis–the interpretation of data structures.

NOMINAL VS. DIMENSIONAL STRUCTURES

As mentioned earlier, even a pure class structure–where class membership accounts for all of the information in the data–can be represented spatially. More commonly, however, we consider cluster analysis as a more appropriate technique for characterizing such data. On the other hand, other data structures are inherently dimensional, so that measures of proximity are assumed to be able to vary rather continuously throughout the whole matrix of proximities. Pure typal and pure dimensional structures represent only two extremes. Since all proximity matrices (that obey certain properties [Gower, 1966]) can be represented spatially, it would seem of interest to consider data structures in terms of the restrictions placed on the points as they are arranged in that space. This motivation underlies many of the most recent developments in cluster analysis.Torgerson [1965] was one of the first researchers to become interested in the problem of characterizing data as “mixtures” of discrete class and quantitative variables. Several varieties of such structures can be obtained:

  1. Data consisting of pure and unordered class structure. Dimensional representation of such data would consist of points at the n vertices of an n-1 dimensional simplex where interpoint distances are all equal. For example, three classes could be represented by an equilateral triangle in two-space, four classes by a regular tetrahedron in three-space, and so on.
  2. Data consisting of concentrated masses of points, corresponding to classes, where interclass distances are unequal, thus implying the existence of latent dimensions underlying class descriptions.
  3. Data consisting of hierarchical sets of attributes where some classes are nested within other classes, e.g., cola and non-cola drinks within the diet-drink class.
  4. Data consisting of dimensional variables nested within discrete classes, e.g., sweet to non-sweet cereals within the class of “processed” shape (as opposed to “natural” shape) cereals.
  5. Data consisting of mixtures of ideal (mutually exclusive) classes so that one may find, for example, points in the interior of an equilateral triangle whose vertices represent three unordered classes.
  6. Data consisting of pure dimensional structure in which, theoretically, all of the space can be filled up by points. Insert Figure 5-7. While the above categorizations are neither exclusive nor exhaustive, they are illustrative of the variety of data structures that could be obtained in the analysis of “objective” data or subjective (similarities) data of the sort described in the preceding sections. From the viewpoint of cluster analysis, some of the above structures could produce elongated, parallel clusters in which average intracluster distance need not be smaller than intercluster distances. Moreover, one could have structures in which the clusters curve or twist around one another along some manifold embedded in a higher dimensional space [Shepard and Carroll, 1966]. Figure 5-8 shows three types of data structures as related to the above categories [Torgerson, 1965]. The first panel illustrates the case of three unordered discrete classes. The second panel illustrates the case of discrete class structure where class descriptors are assumed to be orderable. The third panel shows the case of three discrete classes and an orthogonal variable which is quantitative. Points occur only along the solid lines of the prism. The fourth panel illustrates the case where objects are made up of mixtures of discrete classes plus an orthogonal quantitative dimension. In this case all objects lie on or within the boundaries of the curve prism while “pure” cases would lie at one of the three edges with location dependent upon the degree of the quantitative variable which each possesses.Research in cluster analysis and related techniques is proceeding in new directions for dealing with heretofore intractable data structures. The continued development and refinement of interactive display devices should further these efforts by enabling the researcher to “visualize” various characteristics of the data array as a guide to the selection of appropriate grouping methods.

OVERLAPPING CLUSTERING TECHNIQUES

The key element of all clustering techniques discussed so far is the mutually exclusive and exhaustive nature of the clusters developed. While in most cases, managers view segments as mutually exclusive and hierarchical in nature, cases do exist where segments are mutually exclusive. Indeed, consumers may well fit into several segments. Overlapping clustering is a new clustering model which relaxes the exclusivity constraint of most other hierarchical and non-hierarchical cluster models. As an example of a cluster analysis of brands of soft drinks, Tab may be perceived as fitting into clusters identifying diet drink, cola, and used by women, whereas Diet Pepsi would fit into only the first two benefit clusters. Brands might compete across product categories. V8 drink would compete against other vegetable/fruit drinks, as well as against soft drinks and even as a between meals snack. A cluster of toothpaste users might show that Aqua-Fresh toothpaste appeals to the fresh breath, decay prevention, and brighteners clusters, while Crest may appeal to only the decay prevention benefit cluster. Overlapping clustering simply allows for patterns of overlapping to be considered. Arabie [1977], Shepard and Arabie [1979], Arabie and Carroll [1980], Arabie, Carroll, DeSarbo and Wind [1981] outline methods for overlapping clustering, but point out that limitations do occur in practice. First, it is difficult to develop an algorithm that effectively considers all possible cluster overlap options, especially if the sample size is large. Second, most overlapping clustering algorithms produce too many clusters with excessive overlap. A high degree of overlap results in poor configuration recovery, or in other words, a great mathematical model that is difficult to visualize from the data.Shepard and Arabie [1979] provide a detailed explanation of their ADCLUS (for “additive clustering”) model. The ADCLUS model represents a set of m clusters which may or may not be overlapping. Each cluster is assigned a numerical weight, wk, where k=1,…,m. The similarity between any pair of points is predicted in the model as the sum of the weights of those clusters that contain the pair. Arabie and Carroll [1980] and Arabie, Carroll, DeSarbo and Wind [1981] further develop the ability to fit the ADCLUS by presenting the MAPCLUS (for MAthematical Programming CLUStering) algorithm. This implementation appears to meet the needs of clustering items in more than a single cluster. In addition, clusters may be added, deleted, or modified to produce constrained solutions [Carroll and Arabie, 1980], and estimate (in a regression sense) the importance of new sets of clusters in explaining variance in the data. The importance of overlapping clustering is self evident, particularly in applications where clusters are not mutually exclusive, but are overlapping. This reality reflects the existence of multi-attribute decision rules in decision making behavior, divergent product application or use scenarios, and even joint decisions made by multiple users within the same household.

Summary

This section has considered a companion objective of the scaling of similarities and preference data–the use of metric and nonmetric approaches in data reduction and taxonomy. We have pointed out that many of the multidimensional scaling programs can serve useful functions as types of nonmetric factoring procedures. Moreover, clustering procedures are often a helpful adjunct in data analysis when one desires to group objects (or variables) according to their relative similarity. We first discussed metric approaches to reduced space analysis, more specifically focusing on principal components. This was followed by brief descriptions of nonmetric analogues to factor analysis, including several of the algorithms originally discussed in the context of similarities and preference data. We then turned to a description of clustering methods and addressed the topics of association measures, grouping algorithms, cluster descriptions and statistical inference. This led to presentation of some pilot research utilizing cluster analysis, in examining the performance structure of the automobile market. We concluded the section with a description of the general problem of portraying data structures that consist of mixtures of categorical and dimensional variables and a discussion of the usefulness of overlapping clustering.

  1. We note that factor analysis can be used to cluster respondents and cluster analysis can be used to group variables. This is done by transposing the data matrix (i.e., using a variables by subjects data matrix rather than the more common subjects by variables data matrix).
  2. See Jackson [1983] for a step-by-step demonstration of the single, complete, and average linkage rules.
Concept and Application
PAUL E. GREEN, Wharton School, University of Pennsylvania

FRANK J. CARMONE, JR., Wright State University

SCOTT M. SMITH, Brigham Young University

Original Publication:Allyn and Bacon 1970, 1989

ISBN 0-205-11657-4

SECTION FIVE: DIMENSION REDUCING METHODS AND CLUSTER ANALYSIS 110

THE DATA MATRIX 110 REDUCED SPACE ANALYSIS 111

Metric Reduction–Principal Components Analysis 112

TABLE 5-1 113 Figure 5-1 113 Figure 5.2 113

Nonmetric Reduced Space Analysis 114 CLUSTER ANALYSIS 114

Basic Questions in Cluster Analysis 115

Choice of Proximity Measure 116

Distance-Type Measures 116

Other Euclidean Distance Measures 117

Hierarchical Methods 118

Figures 5-3 120

Figure 5-4 120

Single Linkage 120

Complete Linkage 120

Average Linkage 120

The Howard-Harris Clustering Method 120

Factor Analytic Methods 121

Matching-Type Measures 122

Describing the Clusters 124

AN APPLICATION OF REDUCED SPACE AND CLUSTER ANALYSIS 125

TABLE 5-5 126

Figure 5-7 127

Figure 5-8 128

Summary of Study 129

RECENT DEVELOPMENTS IN CLUSTERING TECHNIQUES 129

Nominal vs. Dimensional Structures 129

Figure 5-9 130

Overlapping Clustering Techniques 131

SUMMARY 132