
What is cluster analysis and when should you use it?

Cluster analysis can be a powerful data-mining tool for any organisation that needs to identify discrete groups of customers, sales transactions, or other types of behaviour or items. For example, insurance providers use cluster analysis to detect fraudulent claims, and banks use it for credit scoring.

Cluster analysis definition

Cluster analysis is a statistical method for processing data. It works by organising items into groups, or clusters, on the basis of how closely associated they are.

Cluster analysis, like reduced space analysis (factor analysis), is concerned with data matrices in which the variables have not been partitioned beforehand into criterion versus predictor subsets. The objective of cluster analysis is to find similar groups of subjects, where “similarity” between each pair of subjects means some global measure over the whole set of characteristics.

Cluster analysis is an unsupervised learning technique: the data carry no predefined group labels, and you often don’t know how many clusters exist before running the model. Unlike many other statistical methods, cluster analysis is typically used when no assumption is made about the likely relationships within the data. It provides information about where associations and patterns in data exist, but not what those might be or what they mean.

In this article, we discuss various methods of clustering and the key role that distance plays as a measure of the proximity of pairs of points.

How is cluster analysis used?

The most common use of cluster analysis is classification. Subjects are separated into groups so that each subject is more similar to other subjects in its group than to subjects outside the group.

In a market research context, this might be used to identify categories such as age groups, earnings brackets, or urban, rural and suburban locations.

In marketing, cluster analysis can be used for audience segmentation, so that different customer groups can be targeted with the most relevant messages.

Healthcare researchers might use cluster analysis to find out whether different geographical areas are linked with high or low levels of certain illnesses, so they can investigate possible local factors contributing to health problems.

Whatever the application, data cleaning is an essential preparatory step for successful cluster analysis. Clustering works at a data-set level where every point is assessed relative to the others, so the data must be as complete as possible.

The strength of clustering is measured using intracluster and intercluster distance.

Intracluster distance is the distance between the data points inside a cluster. If there is a strong clustering effect present, this should be small (more homogeneous).
Intercluster distance is the distance between data points in different clusters. Where strong clustering exists, this should be large (more heterogeneous).

The linkage between clusters refers to how different or similar two clusters are to one another.
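
As a rough illustration of the distances described above, the following sketch computes an intracluster distance and three common ways of defining linkage between two clusters. The function names, the choice of mean pairwise Euclidean distance, and the sample points are all assumptions made for this example; the article does not prescribe a particular formula.

```python
from itertools import combinations, product
from math import dist  # Euclidean distance between two points (Python 3.8+)


def intracluster_distance(cluster):
    """Mean pairwise distance between the points inside one cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:  # a single-point cluster has no internal spread
        return 0.0
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


def linkage(cluster_a, cluster_b, method="average"):
    """Distance between two clusters under a chosen linkage rule."""
    distances = [dist(a, b) for a, b in product(cluster_a, cluster_b)]
    if method == "single":    # closest pair of points across the clusters
        return min(distances)
    if method == "complete":  # farthest pair of points across the clusters
        return max(distances)
    if method == "average":   # mean over all cross-cluster pairs
        return sum(distances) / len(distances)
    raise ValueError(f"unknown linkage method: {method}")


# Two made-up, well-separated clusters in two dimensions.
cluster_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
cluster_b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

intra_a = intracluster_distance(cluster_a)
single = linkage(cluster_a, cluster_b, "single")
average = linkage(cluster_a, cluster_b, "average")
complete = linkage(cluster_a, cluster_b, "complete")

print("intracluster distance of A:", round(intra_a, 3))
print("single / average / complete linkage:",
      round(single, 3), round(average, 3), round(complete, 3))
```

With strong clustering, the intracluster figure should come out much smaller than any of the linkage figures, which is the pattern the made-up points here are chosen to show.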


Basic questions in cluster analysis

In an introduction to clustering procedures, it makes sense to focus on methods that assign each subject to only one class. Subjects within a class are usually assumed to be indistinguishable from one another.

We assume that the underlying structure of the data involves an unordered set of discrete classes. They’re all different, and none has more weight than another. In some cases, we may also view these classes as hierarchical in nature, with some classes divided into subclasses.

Clustering procedures can be viewed as “pre-classificatory” in the sense that the researcher has not used prior judgment to partition the subjects (rows of the data matrix). However, it is assumed that some of the objects are heterogeneous; that is, that “clusters” exist.

This presupposition of different groups is based on commonalities within the set of inputs into the algorithm, or clustering variables. This assumption is different from the one made in the case of discriminant analysis or automatic interaction detection, where the dependent variable is used to formally define groups of objects and the distinction is not made on the basis of profile resemblance in the data matrix itself.

Thus, given that no information on group definition is formally evaluated in advance, the imperative questions of cluster analysis will be:

What measure of inter-subject similarity is to be used and how is each variable to be “weighted” in the construction of such a summary measure?
After inter-subject similarities are obtained, how are the classes to be formed?
After the classes have been formed, what summary measures of each cluster are appropriate in a descriptive sense; that is, how are the clusters to be defined?
Assuming that adequate descriptions of the clusters can be obtained, what inferences can be drawn regarding their statistical significance?
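
The first question, how variables are weighted in a similarity measure, can be made concrete with a small sketch. It assumes one common approach, which the article does not specifically endorse: standardise each variable and then apply a weighted Euclidean distance. The subjects, the equal weights, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def standardise(rows):
    """Rescale each variable (column) to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stdevs = [pstdev(col) or 1.0 for col in columns]  # guard constant columns
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stdevs))
        for row in rows
    ]


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two subjects' profiles."""
    return sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


# Hypothetical subjects: (age in years, annual income).
subjects = [(25, 30_000), (27, 90_000), (60, 32_000)]
scaled = standardise(subjects)
weights = (1.0, 1.0)  # equal weighting is an assumption, not a recommendation

# On raw values, income dominates simply because its numbers are larger.
raw_01 = weighted_distance(subjects[0], subjects[1], weights)
raw_02 = weighted_distance(subjects[0], subjects[2], weights)

# After standardising, the age gap and the income gap count comparably.
std_01 = weighted_distance(scaled[0], scaled[1], weights)
std_02 = weighted_distance(scaled[0], scaled[2], weights)

print("raw distances:", round(raw_01, 1), round(raw_02, 1))
print("standardised distances:", round(std_01, 3), round(std_02, 3))
```

Without rescaling, the choice of units acts as a hidden weighting, so it is worth deciding on weights deliberately rather than letting the measurement scale decide.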

What about non-scalar data?

So far, we’ve talked about scalar data: things that differ from each other by degrees along a scale, such as a numerical quantity. But what about items that are non-scalar and can only be sorted into categories (as with things like colour, species or shape)?

This question is important for applications like survey data analysis, since you’re likely to be dealing with a mix of formats that include both categorical and scalar data.
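
One way to compare records that mix categorical and scalar answers is a Gower-style dissimilarity, sketched below. This is an illustrative choice rather than something the article specifies: scalar differences are rescaled by the variable's observed range, categorical variables contribute 0 for a match and 1 for a mismatch, and the result is averaged. The respondent records and the function name are hypothetical.

```python
def mixed_dissimilarity(a, b, ranges):
    """Gower-style dissimilarity between two records with mixed variable types.

    `ranges` maps each scalar variable to its observed range (max - min);
    any variable not listed there is treated as categorical.
    """
    total = 0.0
    for name in a:
        if name in ranges:
            # Scalar: absolute difference rescaled to the 0-1 interval.
            total += abs(a[name] - b[name]) / ranges[name]
        else:
            # Categorical: 0 if the categories match, 1 otherwise.
            total += 0.0 if a[name] == b[name] else 1.0
    return total / len(a)


# Hypothetical survey respondents with two scalar and two categorical answers.
respondents = [
    {"age": 25, "satisfaction": 4, "region": "urban", "colour": "red"},
    {"age": 45, "satisfaction": 4, "region": "urban", "colour": "blue"},
    {"age": 65, "satisfaction": 1, "region": "rural", "colour": "blue"},
]

scalar_variables = ("age", "satisfaction")
ranges = {
    name: max(r[name] for r in respondents) - min(r[name] for r in respondents)
    for name in scalar_variables
}

d01 = mixed_dissimilarity(respondents[0], respondents[1], ranges)
d02 = mixed_dissimilarity(respondents[0], respondents[2], ranges)
d12 = mixed_dissimilarity(respondents[1], respondents[2], ranges)
print(d01, d02, d12)
```

Every variable contributes a value between 0 and 1, so no single question dominates the comparison because of its format.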

Cluster analysis algorithms

Your choice of cluster analysis algorithm is important, particularly when you have mixed data. In major statistics packages you’ll find a range of preset algorithms ready to number-crunch your matrices. Here are two of the most suitable for cluster analysis.

The K-means algorithm establishes the presence of clusters by finding their centroid points. A centroid point is the average of all the data points in the cluster. By iteratively assessing the Euclidean distance between each point in the dataset and each centroid, every point can be assigned to the cluster with the nearest centroid. The centroid points are random to begin with and are recalculated on each iteration until the assignments stop changing. K-means is commonly used in cluster analysis, but it has a limitation in being mainly useful for scalar data.
K-medoids works in a similar way to k-means, but rather than using mean centroid points, which don’t correspond to any real points in the dataset, it establishes medoids, which are real, interpretable data points. K-medoids offers an advantage for survey data analysis because it is suitable for both categorical and scalar data: rather than relying on Euclidean distance to a mean, the algorithm only needs a measure of dissimilarity between pairs of data points, and that measure can combine a number of different categories or variables.

In both cases, k is the number of clusters.
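
The two algorithms can be sketched side by side in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used by any particular statistics package: the function names, the seeded random start, the simple alternating update for k-medoids, and the sample points are all chosen for the example.

```python
import random
from math import dist  # Euclidean distance (Python 3.8+)


def assign(points, centres, distance):
    """Index of the nearest centre for every point."""
    return [
        min(range(len(centres)), key=lambda c: distance(p, centres[c]))
        for p in points
    ]


def k_means(points, k, iterations=100, seed=0):
    """K-means: each centre is the coordinate-wise mean of its cluster."""
    centroids = random.Random(seed).sample(points, k)  # random starting centroids
    for _ in range(iterations):
        labels = assign(points, centroids, dist)
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centroids[c])  # keep the centre of an empty cluster
        if updated == centroids:  # converged: centroids stopped moving
            break
        centroids = updated
    return centroids, assign(points, centroids, dist)


def k_medoids(points, k, distance=dist, iterations=100, seed=0):
    """K-medoids: each centre is a real data point from its cluster."""
    medoids = random.Random(seed).sample(points, k)
    for _ in range(iterations):
        labels = assign(points, medoids, distance)
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                # The member with the smallest total distance to the others.
                updated.append(
                    min(members, key=lambda m: sum(distance(m, p) for p in members))
                )
            else:
                updated.append(medoids[c])
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(points, medoids, distance)


# Made-up data: two visibly separate groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

centroids, mean_labels = k_means(points, k=2)
medoids, medoid_labels = k_medoids(points, k=2)
print("k-means centroids:", centroids)
print("k-medoids medoids:", medoids)
```

Because `k_medoids` takes the distance function as a parameter, a dissimilarity measure designed for mixed categorical and scalar records could be passed in place of Euclidean distance, whereas `k_means` needs numeric coordinates to compute a mean.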

Cluster analysis + factor analysis

When you’re dealing with a large number of variables, for example a lengthy or complex survey, it can be useful to simplify your data before performing cluster analysis so that it’s easier to work with. Using factors reduces the number of dimensions that you’re clustering on, and can result in clusters that are more reflective of the true patterns in the data.

Factor analysis is a technique for taking large numbers of variables and combining those that relate to the same underlying factor or concept, so that you end up with a smaller number of dimensions. For example, factor analysis might help you replace questions like “Did you receive good service?” “How confident were you in the agent you spoke to?” and “Did we resolve your query?” with a single factor – customer satisfaction.

This way you can reduce messiness and complexity in your data and arrive more quickly at a manageable number of clusters.
