Survey Platform - Understanding Statistics | Qualtrics

# Understanding Statistics

## Introduction

Welcome to Qualtrics Statistics. This is an overview of basic statistics that may be useful to you as you create and analyze projects in Qualtrics. We are going to cover some basic statistical concepts, apply them to the platform, and discuss some further options outside of Qualtrics.

## Quantitative and Categorical Data

There are two types of data: quantitative and categorical.

Quantitative data is assessed on a numerical scale. Examples of quantitative data include age, height, or income.

Categorical data is assessed on a nominal scale. Examples of categorical data include gender, marital status, or occupation. Most data collected in a survey is categorical, where a count of the number of respondents that fall in a category is obtained.

### Measures of Center

There are three measures of center used for quantitative data: mean, median, and mode.

The mean, or average, is the best measure of center when data is roughly normally distributed or looks like a bell curve. The mean is found by summing all of the observations and dividing by the total number of observations.

$$Mean = \frac{\sum X}{n}$$ X = observations
n = number of observations

The median, or middle value, is a good measure of center when data appears to be skewed. If you line up all of your observation in order, the median is the middle value.

The mode is the value that occurs most frequently in your data. It is not as commonly used as either the mean or the median.

There are a few useful statistics to measure the spread of your data: standard deviation, variance, and range.

A standard deviation is the average distance of the observations from their mean. Like the mean, a standard deviation should be used with roughly normally distributed data.

The variance is simply the standard deviation squared.

The range is the difference between the largest and the smallest value.

## Statistics in Visualizations

A Statistics Table visualization in Qualtrics displays the minimum value, the maximum value, the mean, the variance, the standard deviation, and the total number of responses.

Since a value is coded for each response option in every question, Qualtrics will find these statistics whether the data is quantitative or categorical. It is up to you to decide if these statistics make sense in the context of your study.

For example, you could ask respondents what their favorite color is: red, yellow, blue, or green, coded as 1, 2, 3, and 4 respectively. Qualtrics will give you a mean, but it doesn’t make sense to have an average favorite color.

If you had respondents rate movies on a scale from 1 to 5 stars, a mean would be useful. Means like 2.98 stars or 4.32 stars make movies easy to compare.

Qualtrics offers a variety of charts, graphs, and tables. A Bar Chart shows the frequency of responses in each answer choice category.

A Pie Chart shows these frequencies as a percentage of the pie.

Both Bar Charts and Pie Charts make comparing the frequencies between categories very easy.

A Line Chart is a two-dimensional scatter plot for ordered observations. It is a good way to see trends over time.

A Gauge Chart compares a chosen metric (e.g., average, sum) to a scale. Depending on the where the metric falls, the scale changes color. Gauge Charts are helpful for quickly comparing a value’s expected performance versus its actual performance.

## Cross Tabulations

A great way to analyze categorical data is through a cross tabulation, also called a contingency table or a two-way table. A cross tab records the number of respondents that have the specific characteristics described in the cells of the table.

In this example, you can see the number of each item type that was purchased weekly, monthly, and yearly (e.g., 62 shirts are purchased weekly).

A cross tab consists of columns and rows, or banners and stubs, respectively, where each banner and stub pulls in frequency data from a question. Qualtrics will only allow you to select questions that are compatible for a cross tabulation (e.g., open ended text entry questions are not compatible with a cross tabulation). If you select multiple banners or stubs, they will appear consecutively or side by side. If you add a multilevel drill down to your cross tabulation, one variable will appear as a subcategory of another.

### Chi-Square Test Statistic

The chi-square test statistic tests for significant relationship between a stub and a banner.

In this case, we are testing to see if there is a significant relationship between frequency of visit and type of item purchased.

If you include multiple stubs and banners in your cross tab, Qualtrics will also produce multiple chi-square values, one for each banner and stub combination.

It is beneficial to know how a chi-square test statistic is calculated. First, you need to find the expected count for each cell, or the count you would expect the cell to have based on the row total, column total, and table total. To find an expected count, take the row total times the column total and divide the result by the overall table total.

$$Expected Count = \frac{RC}{T}$$ R = row total
C = column total
T = table total

In our above example, the expected count of weekly visitors that purchase shirts is found by multiplying 248 by 135 and dividing that number by 386.

Qtip: A condition of the chi-square test is that all expected counts must be greater than 5. Otherwise, the test will not be valid. Our expected counts for each cell are all greater than 5.

Once you have the expected count, perform the following calculation:

$$Chi-Square = \sum{\frac{(O – E)^2}{E}}$$ O = observed value
E = expected value

The chi-square test statistic is found by taking the observed value minus the expected value, squaring this difference, and dividing it by the expected value for each cell. These individual chi-square test components are then added together and the result is the chi square test statistic.

### P-Value

The chi-square test statistics, together with the degrees of freedom, are used to find a p-value. A p-value determines whether the association between the two variables is statistically significant. A low p-value means that the observed table relationship would occur with very low probability, so there’s a significant relationship between the two variables. A low p-value is generally considered to be a figure less than .05.

Our p-value is .62, which is not significant. Therefore, there is no relationship between frequency of visit and type of item purchased.

Further analysis for quantitative data, such as correlation and regression, can be taken to Excel or a statistical software package.

### Correlation

The correlation coefficient, r, describes the strength and direction for a roughly linear relationship between two quantitative variables. The value of r always lies within -1 and 1, where values closest to -1 and 1 represent a strong correlation and values close to zero are weak. The plus or minus sign indicates the positive or negative direction of the relationship. Correlation values between -.3 and .3 are considered rather low, while correlation values between .7 and 1 or -.7 and -1 are considered to be high.

A key point to remember is that correlation is not the same as causation. Just because two variables are highly correlated does not mean that one of these variables causes the other one to occur.

### Regression

Regression analysis can be used to make predictions for one variable based on one or more predictor variables. In regression, we’re interested in how much one variable changes with the changes of another variable. A line of best fit can be found for a scatter plot of data.

$$Least\,Squares\,Line: X = a + bY$$ where$$b = r\frac{SD_Y}{SD_X}$$ and $$a = Ȳ – bX̄$$
where
b = slope of the regression line
a = Y-axis intercept
$X̄$ = mean of x values
$Ȳ$ = mean of y values
$SD_Y$ = standard deviation of y
$SD_X$ = standard deviation of x
r = $\frac{(N\sum xy – \sum x\sum y)}{\sqrt{((N\sum x^2) – (\sum y)^2)}}$

This is known as the least squares regression line. Using this line equation, a value can be predicted for the dependent variable based on the predictor variables.