Cluster analysis definition
Cluster analysis is a statistical method for processing data. It works by organising items into groups, or clusters, on the basis of how closely associated they are.
Cluster analysis, like reduced space analysis (factor analysis), is concerned with data matrices in which the variables have not been partitioned beforehand into criterion versus predictor subsets. The objective of cluster analysis is to find similar groups of subjects, where “similarity” between a pair of subjects is some global measure taken over the whole set of characteristics.
Cluster analysis is an unsupervised learning method, meaning the data carry no predefined labels or outcome variable – typically you don’t even know how many clusters exist before running the model. Unlike many other statistical methods, cluster analysis is typically used when there is no assumption made about the likely relationships within the data. It provides information about where associations and patterns in data exist, but not what those might be or what they mean.
In this article, we discuss various methods of clustering and the key role that distance plays as a measure of the proximity of pairs of points.
How is cluster analysis used?
The most common use of cluster analysis is classification. Subjects are separated into groups so that each subject is more similar to other subjects in its group than to subjects outside the group.
In a market research context, this might be used to identify categories like age groups, earnings brackets, urban, rural or suburban location.
In marketing, cluster analysis can be used for audience segmentation, so that different customer groups can be targeted with the most relevant messages.
Healthcare researchers might use cluster analysis to find out whether different geographical areas are linked with high or low levels of certain illnesses, so they can investigate possible local factors contributing to health problems.
Whatever the application, data cleaning is an essential preparatory step for successful cluster analysis. Clustering works at a data-set level where every point is assessed relative to the others, so the data must be as complete as possible.
Clustering is measured using intracluster and intercluster distance.
- Intracluster distance is the distance between data points inside the same cluster. If there is a strong clustering effect present, this should be small (more homogeneous).
- Intercluster distance is the distance between data points in different clusters. Where strong clustering exists, these should be large (more heterogeneous).
The linkage between clusters refers to how different or similar two clusters are to one another.
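As an illustration, both distances can be computed directly. The helper names below are hypothetical, and the toy data assumes two tight, well-separated groups:

```python
import numpy as np

def mean_intracluster_distance(points):
    """Average pairwise Euclidean distance within one cluster."""
    n = len(points)
    dists = [np.linalg.norm(points[i] - points[j])
             for i in range(n) for j in range(i + 1, n)]
    return sum(dists) / len(dists)

def centroid_intercluster_distance(a, b):
    """Distance between the centroids (mean points) of two clusters."""
    return np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))

# Two toy clusters: tight groups far apart
cluster_a = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]])
cluster_b = np.array([[5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

print(mean_intracluster_distance(cluster_a))                 # small
print(centroid_intercluster_distance(cluster_a, cluster_b))  # large
```

A small intracluster distance paired with a large intercluster distance, as here, is the signature of a strong clustering effect.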
Basic questions in cluster analysis
In an introduction to clustering procedures, it makes sense to focus on methods that assign each subject to only one class. Subjects within a class are usually assumed to be indistinguishable from one another.
We assume that the underlying structure of the data involves an unordered set of discrete classes. They’re all different, and none has more weight than another. In some cases, we may also view these classes as hierarchical in nature, with some classes divided into subclasses.
Clustering procedures can be viewed as “pre-classificatory” in the sense that the researcher has not used prior judgment to partition the subjects (rows of the data matrix). However, it is assumed that the objects are heterogeneous; that is, that “clusters” exist.
This presupposition of different groups is based on commonalities within the set of inputs into the algorithm, or clustering variables. This assumption is different from the one made in the case of discriminant analysis or automatic interaction detection, where the dependent variable is used to formally define groups of objects and the distinction is not made on the basis of profile resemblance in the data matrix itself.
Thus, given that no information on group definition is formally evaluated in advance, the central questions of cluster analysis are:
- What measure of inter-subject similarity is to be used and how is each variable to be “weighted” in the construction of such a summary measure?
- After inter-subject similarities are obtained, how are the classes to be formed?
- After the classes have been formed, what summary measures of each cluster are appropriate in a descriptive sense; that is, how are the clusters to be defined?
- Assuming that adequate descriptions of the clusters can be obtained, what inferences can be drawn regarding their statistical significance?
What about non-scalar data?
So far, we’ve talked about scalar data – items that differ from each other by degrees along a scale, such as numerical quantity. But what about items that are non-scalar and can only be sorted into categories (as with things like color, species or shape)?
This question is important for applications like survey data analysis, since you’re likely to be dealing with a mix of formats that include both categorical and scalar data.
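One common way to handle such a mix is a Gower-style dissimilarity, in which numeric fields contribute a range-scaled difference and categorical fields contribute a simple match/mismatch score. The function and field names below are hypothetical:

```python
def mixed_distance(a, b, numeric_keys, ranges):
    """Gower-style dissimilarity between two records (dicts):
    numeric fields give a range-scaled absolute difference in [0, 1];
    categorical fields score 0 on a match, 1 otherwise."""
    total = 0.0
    for key, x in a.items():
        y = b[key]
        if key in numeric_keys:
            total += abs(x - y) / ranges[key]   # scaled to [0, 1]
        else:
            total += 0.0 if x == y else 1.0
    return total / len(a)                        # average over fields

# Two invented survey respondents with mixed data types
resp_1 = {"age": 34, "region": "urban", "plan": "basic"}
resp_2 = {"age": 58, "region": "rural", "plan": "basic"}

# Age range assumed to be 62 years wide (e.g. 18-80) for scaling
d = mixed_distance(resp_1, resp_2, numeric_keys={"age"}, ranges={"age": 62})
print(round(d, 3))  # ≈ 0.462
```

Because the result is an average of per-field scores, every field contributes on the same 0-to-1 scale regardless of its type.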
Cluster analysis algorithms
Your choice of cluster analysis algorithm is important, particularly when you have mixed data. In major statistics packages you’ll find a range of preset algorithms ready to number-crunch your matrices. Here are two of the most suitable for cluster analysis.
- K-Means establishes the presence of clusters by finding their centroid points. A centroid point is the average of all the data points in the cluster. By iteratively assessing the Euclidean distance between each point in the dataset and each centroid, every point can be assigned to a cluster. The centroid points start out as random selections and are recalculated each time the process is repeated. K-means is commonly used in cluster analysis, but it has the limitation of being mainly suitable for scalar data.
- K-Medoids works in a similar way to k-means, but rather than using mean centroid points, which don’t correspond to any real points in the dataset, it establishes medoids, which are real, interpretable data points. K-medoids offers an advantage for survey data analysis, as it suits both categorical and scalar data: rather than requiring Euclidean distance between a medoid and its neighbours, the algorithm can work with any dissimilarity measure, covering a number of different categories or variables.
In both cases, k is the number of clusters, which is chosen in advance.
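To make the k-means procedure concrete, here is a minimal from-scratch sketch of its assign-and-update loop (Lloyd’s algorithm), using NumPy and invented toy data; in practice you would normally use a library implementation:

```python
import numpy as np

def k_means(points, k, iterations=10, seed=0):
    """Minimal k-means sketch: pick k random points as starting centroids,
    then alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated toy clusters
data = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.2],
                 [5.0, 5.0], [5.2, 5.1], [5.1, 5.2]])
labels, centroids = k_means(data, k=2)
print(labels)  # first three points share one label, last three the other
```

Swapping the centroid-update step for “choose the cluster member that minimises total dissimilarity to the others”, and the Euclidean distance for any dissimilarity measure, turns this same loop into k-medoids.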
Cluster analysis + factor analysis
When you’re dealing with a large number of variables, for example a lengthy or complex survey, it can be useful to simplify your data before performing cluster analysis so that it’s easier to work with. Using factors reduces the number of dimensions that you’re clustering on, and can result in clusters that are more reflective of the true patterns in the data.
Factor analysis is a technique for taking large numbers of variables and combining those that relate to the same underlying factor or concept, so that you end up with a smaller number of dimensions. For example, factor analysis might help you replace questions like “Did you receive good service?” “How confident were you in the agent you spoke to?” and “Did we resolve your query?” with a single factor – customer satisfaction.
This way you can reduce messiness and complexity in your data and arrive more quickly at a manageable number of clusters.
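As a rough sketch of this two-step workflow, the example below stands in for factor analysis with a single SVD-based principal component (a simplification), then separates respondents on the resulting score; the survey ratings are invented:

```python
import numpy as np

# Hypothetical survey: 6 respondents x 4 correlated 1-5 satisfaction ratings
ratings = np.array([[5, 5, 4, 5],
                    [4, 5, 5, 4],
                    [5, 4, 5, 5],
                    [1, 2, 1, 2],
                    [2, 1, 2, 1],
                    [1, 1, 2, 2]], dtype=float)

# Reduce the four questions to one dimension: centre the data, then
# project onto the direction of greatest variance (first SVD component)
centred = ratings - ratings.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)
scores = centred @ vt[0]          # one combined "satisfaction" score each

# Clustering on the single score is now trivial: split about the mean
labels = (scores > 0).astype(int)
print(labels)  # high-satisfaction respondents in one group, low in the other
```

Four correlated questions have collapsed into one dimension, so the subsequent clustering step has far less noise to contend with.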