# Statistical Test Assumptions & Technical Details

### What's on This Page:

Stats iQ selects statistical tests with the goal of making statistical testing intuitive and error-free.

This page describes overarching themes of Stats iQ’s approach; the sections below describe specific decisions for specific tests.

## Basic Assumptions

Whenever possible, Stats iQ defaults to tests that make fewer assumptions. For example, independent samples t-tests can be calculated in several ways, depending on whether equal sample sizes or equal variances are assumed. Stats iQ runs the version of the test with the fewest assumptions.

In addition, Stats iQ intelligently mitigates violations of the assumptions of statistical tests. For example, t-tests on relatively small samples require normally distributed data to be accurate. Outliers or non-normal distributions create misleading results. Every datapoint of

[1, 2, 3, 3, 4, 4, 5, 5, 5, 6, 6, 7, 7, 8, 9, 10]

is lower than every datapoint in

[11, 12, 13, 13, 14, 14, 15, 15, 15, 16, 16, 17, 17, 18, 19, 2000]

but an independent samples t-test on those groups does not yield a statistically significant difference because the outlier 2000 violates t-test assumptions. Stats iQ notices the outlier and recommends a ranked t-test instead, which does yield a very clear difference between the groups.
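This behavior is easy to reproduce with standard tools. The sketch below (an illustration using scipy, not Stats iQ’s internal code) runs Welch’s t-test on the raw data from the example above, then on the rank-transformed data:

```python
import numpy as np
from scipy import stats

a = np.array([1, 2, 3, 3, 4, 4, 5, 5, 5, 6, 6, 7, 7, 8, 9, 10])
b = np.array([11, 12, 13, 13, 14, 14, 15, 15, 15, 16, 16, 17, 17, 18, 19, 2000])

# Welch's t-test on the raw data: the outlier 2000 inflates group b's
# variance, so the difference is not statistically significant.
t_raw, p_raw = stats.ttest_ind(a, b, equal_var=False)

# Rank-transform the pooled data (ties get the average rank), then run
# the same Welch's t-test on the ranks.
ranks = stats.rankdata(np.concatenate([a, b]))
t_rank, p_rank = stats.ttest_ind(ranks[:len(a)], ranks[len(a):], equal_var=False)

print(f"raw p = {p_raw:.3f}, ranked p = {p_rank:.2e}")
```

The raw test fails to reach significance, while the ranked test finds a very clear difference, because every rank in the second group is higher than every rank in the first.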

## Rank Transformations

Stats iQ frequently uses the rank transform method for running nonparametric tests when violations of parametric test assumptions are detected. Stats iQ’s rank transformation replaces values with their rank ordering—for example

[86, 95, 40] is transformed to [2, 3, 1]

—then runs the typical parametric test on the transformed data. Tied values are given the average rank of the tied values, so

[11, 35, 35, 52] becomes [1, 2.5, 2.5, 4].

Most commonly encountered in the difference between Pearson and Spearman correlations, rank-transformed tests are robust to non-normal distributions and outliers, and are conceptually simpler than the slightly more common dedicated nonparametric tests.
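This tie-handling rule (assigning tied values the average of the ranks they span) is the default “average” method in most statistics libraries. For instance, with scipy:

```python
from scipy.stats import rankdata

# Ranks replace each value with its position in sorted order.
r1 = rankdata([86, 95, 40])        # ranks 2, 3, 1
# Ties receive the average of the ranks they span (the default method).
r2 = rankdata([11, 35, 35, 52])    # ranks 1, 2.5, 2.5, 4
print(r1, r2)
```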

## ANOVA

When users select one categorical variable with three or more groups and one continuous or discrete variable, Stats iQ runs a one-way ANOVA (Welch’s F test) and a series of pairwise “post hoc” tests (Games-Howell tests). The one-way ANOVA tests for an overall relationship between the two variables, and the pairwise tests compare each possible pair of groups to see if one group tends to have higher values than the other.

### Assumptions of Welch’s F Test ANOVA

Stats iQ recommends an unranked Welch’s F test if several assumptions about the data hold:

- The sample size is greater than 10 times the number of groups in the calculation (groups with only one value are excluded), and therefore the Central Limit Theorem satisfies the requirement for normally distributed data.
- There are few or no outliers in the continuous/discrete data.

Unlike the slightly more common F test for *equal* variances, Welch’s F test does not assume that the variances of the groups being compared are equal. Assuming equal variances leads to less accurate results when variances are not in fact equal, while Welch’s F test gives very similar results when variances actually are equal (Tomarken and Serlin, 1986).
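The Welch F statistic can be sketched as follows. This is the standard textbook formulation (precision weights, a weighted grand mean, and Welch’s degrees-of-freedom correction), not Stats iQ’s actual implementation; the function name `welch_anova` is just for illustration:

```python
import numpy as np
from scipy.stats import f as f_dist

def welch_anova(*groups):
    """Welch's F test: one-way ANOVA without the equal-variance assumption."""
    k = len(groups)
    n = np.array([len(g) for g in groups], dtype=float)
    means = np.array([np.mean(g) for g in groups])
    variances = np.array([np.var(g, ddof=1) for g in groups])

    w = n / variances                       # precision weights
    grand_mean = np.sum(w * means) / np.sum(w)

    numerator = np.sum(w * (means - grand_mean) ** 2) / (k - 1)
    tmp = np.sum((1 - w / np.sum(w)) ** 2 / (n - 1))
    denominator = 1 + 2 * (k - 2) * tmp / (k ** 2 - 1)

    F = numerator / denominator
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * tmp)          # Welch's corrected denominator df
    return F, f_dist.sf(F, df1, df2)

F, p = welch_anova([1, 2, 3, 4, 5], [10, 11, 12, 13, 14], [20, 21, 22, 24, 23])
print(f"F = {F:.1f}, p = {p:.4g}")
```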

### Ranked ANOVA

When assumptions are violated, the unranked ANOVA may no longer be valid. In that case, Stats iQ recommends the *ranked* ANOVA (also called “ANOVA on ranks”); Stats iQ rank-transforms the data (replaces values with their rank ordering) and then runs the same ANOVA on that transformed data.

The ranked ANOVA is robust to outliers and non-normally distributed data. Rank transformation is a well-established method for protecting against assumption violation (a “nonparametric” method), and is most commonly seen in the difference between Pearson and Spearman correlation. Rank transformation followed by Welch’s F test is similar in effect to the Kruskal-Wallis Test (Zimmerman, 2012).

Note that Stats iQ’s ranked and unranked ANOVA effect sizes (Cohen’s f) are calculated using the F value from the F test for equal variances.

### Assumptions of Games-Howell Pairwise Test

Stats iQ runs Games-Howell tests regardless of the outcome of the ANOVA test (as per Zimmerman, 2010). Stats iQ shows unranked or ranked Games-Howell pairwise tests based on the same criteria as those used for ranked vs. unranked ANOVA; so if you see “Ranked ANOVA” in the advanced output, the pairwise tests will also be ranked.

The Games-Howell test is essentially a t-test for unequal variances that accounts for the heightened likelihood of finding statistically significant results by chance when running many pairwise tests. Unlike the slightly more common Tukey’s b test, the Games-Howell test does not assume that the variances of the groups being compared are equal. Assuming equal variances leads to less accurate results when variances are not in fact equal, while the Games-Howell test gives very similar results when variances actually are equal (Howell, 2012).

Note that while the unranked pairwise test tests for the equality of the *means* of the two groups, the ranked pairwise test does not explicitly test for differences between the groups’ means or medians. Rather, it tests for a general tendency of one group to have larger values than the other.

Additionally, while Stats iQ does not show results of pairwise tests for any group with fewer than 4 values, those groups are included in calculating the degrees of freedom for the other pairwise tests.
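The standard Games-Howell construction can be sketched as follows: a Welch-style t statistic for each pair, with its p-value taken from the studentized range distribution (which is what adjusts for the number of pairwise comparisons). This is a textbook sketch, not Stats iQ’s code; it relies on `scipy.stats.studentized_range` (available in scipy ≥ 1.7), and the function name `games_howell` is just for illustration:

```python
import itertools
import numpy as np
from scipy.stats import studentized_range

def games_howell(groups):
    """Games-Howell pairwise comparisons: Welch-style t statistics referred
    to the studentized range distribution across k groups."""
    k = len(groups)
    summaries = [(np.mean(g), np.var(g, ddof=1), len(g)) for g in groups]
    pvals = {}
    for i, j in itertools.combinations(range(k), 2):
        (mi, vi, ni), (mj, vj, nj) = summaries[i], summaries[j]
        se2 = vi / ni + vj / nj
        t = (mi - mj) / np.sqrt(se2)
        # Welch-Satterthwaite degrees of freedom for this pair
        df = se2 ** 2 / ((vi / ni) ** 2 / (ni - 1) + (vj / nj) ** 2 / (nj - 1))
        q = abs(t) * np.sqrt(2)             # convert t to a studentized range statistic
        pvals[(i, j)] = studentized_range.sf(q, k, df)
    return pvals

pvals = games_howell([[1, 2, 3, 4, 5], [10, 11, 12, 13, 14], [20, 21, 22, 23, 24]])
```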

### Additional ANOVA considerations

- With smaller sample sizes, data can still be visually inspected to determine whether it is in fact normally distributed; if it is, unranked ANOVA results are still valid even for small samples. In practice, this assessment can be difficult to make, so Stats iQ recommends ranked tests by default for small samples.
- With larger sample sizes, outliers are less likely to negatively affect results. Stats iQ uses Tukey’s “outer fence” to define outliers as points more than 3 times the interquartile range above the 75th or below the 25th percentile.
- Data like *Highest level of education completed* or *Finishing order in a marathon* are unambiguously ordinal. Though Likert scales (like a 1 to 7 scale where 1 is *Very dissatisfied* and 7 is *Very satisfied*) are technically ordinal, it is common practice in social sciences to treat them as though they are continuous (i.e., with an unranked test).
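Tukey’s outer fence is simple to compute. The sketch below (an illustration, not Stats iQ’s code) flags the outlier from the earlier t-test example:

```python
import numpy as np

def outer_fence_outliers(values):
    """Flag points beyond Tukey's outer fence: more than 3 IQRs above
    the 75th percentile or below the 25th percentile."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 3 * iqr, q3 + 3 * iqr
    return [v for v in values if v < lower or v > upper]

found = outer_fence_outliers(
    [11, 12, 13, 13, 14, 14, 15, 15, 15, 16, 16, 17, 17, 18, 19, 2000]
)
print(found)  # → [2000]
```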

## Stats iQ Contingency Tables

When users select two categorical variables, Stats iQ assesses whether those two variables are statistically related. Stats iQ runs Fisher’s exact test when possible, and otherwise runs Pearson’s chi-squared test (typically just called “chi-squared”).

### Chi-squared vs. Fisher’s Exact Test

Fisher’s exact test is unbiased whenever it can be run, but it is computationally difficult to run if the table is larger than 2 x 2 or the sample size is greater than 10,000 (even with modern computing). Chi-squared tests can have biased results when sample sizes are low (technically, when expected cell counts are below 5).

Fortunately, the two tests are complementary: Fisher’s exact test is typically easy to calculate when chi-squared tests are biased (small samples), and chi-squared tends to be unbiased when Fisher’s exact test is difficult to calculate (large samples). Because larger tables with small samples can still create issues (and Stats iQ cannot run a Fisher’s exact test in those cases), Stats iQ alerts users to potential complications.
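The small-sample case is easy to see with scipy (an illustration, not Stats iQ’s code). In the hypothetical 2 x 2 table below, several expected cell counts fall below 5, which is exactly the regime where the chi-squared approximation is biased but Fisher’s exact test is cheap to compute:

```python
import numpy as np
from scipy.stats import fisher_exact, chi2_contingency

table = np.array([[8, 2],
                  [1, 5]])

# Exact test: valid at any sample size for a 2 x 2 table.
odds_ratio, p_fisher = fisher_exact(table)

# Chi-squared approximation: expected counts here are below 5,
# so its p-value is less trustworthy on this table.
chi2, p_chi2, dof, expected = chi2_contingency(table)

print(f"Fisher p = {p_fisher:.3f}, chi-squared p = {p_chi2:.3f}")
print(f"smallest expected count = {expected.min():.2f}")
```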

### Adjusted Residuals

Like other statistical software, Stats iQ uses adjusted residuals to assess whether or not an individual cell is statistically significantly above or below expectations. Essentially the adjusted residual asks, “Does this cell have more values in it than I’d expect if there were no relationship between these two variables?”

If you have the data displayed such that each column sums to 100%, you can say “The proportion of Finance/Banking respondents who said they ‘Love their job’ is lower than typical, relative to respondents from other industries.”

Stats iQ shows up to 3 arrows, depending on the p-value calculated from the adjusted residual: one arrow if the p-value is less than alpha (1 – confidence level), two arrows if the p-value is less than alpha/5, and three arrows if the p-value is less than alpha/50. For example, if your confidence level was set to 95%:

- p-value <= .05: one arrow
- p-value <= .01: two arrows
- p-value <= .001: three arrows

The calculation of the adjusted residual, and its comparison to specific alpha levels, can be labelled a “z-test” or a “z-test for a sample percentage.” Literature more typically simply says that conclusions were based on adjusted residuals.
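The adjusted residual has a standard formula: the raw residual (observed minus expected) divided by its estimated standard error, which behaves like a z statistic. The sketch below (standard textbook formulation plus the arrow thresholds described above; not Stats iQ’s code, and the function names are illustrative) computes adjusted residuals and maps one to an arrow count:

```python
import numpy as np
from scipy.stats import norm

def adjusted_residuals(table):
    """Adjusted (standardized) residuals for a contingency table."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row * col / n
    # (observed - expected) scaled by its standard error: ~z under independence
    return (table - expected) / np.sqrt(expected * (1 - row / n) * (1 - col / n))

def arrows(residual, confidence=0.95):
    """Map an adjusted residual to 0-3 arrows using the alpha, alpha/5,
    alpha/50 thresholds described above."""
    alpha = 1 - confidence
    p = 2 * norm.sf(abs(residual))        # two-sided p-value from the z statistic
    return sum(p <= cutoff for cutoff in (alpha, alpha / 5, alpha / 50))

resid = adjusted_residuals([[30, 10],
                            [10, 30]])
print(resid[0, 0], arrows(resid[0, 0]))
```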

### Confidence Intervals

For all binomial confidence intervals, including contingency tables and in Category Describe bar charts, Stats iQ calculates the confidence interval using the Wilson Score Interval.
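The Wilson Score Interval has a closed form. The sketch below (the standard formula, not Stats iQ’s code) uses only the Python standard library:

```python
from statistics import NormalDist

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # e.g. ~1.96 for 95%
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half = (z / denom) * ((p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) ** 0.5)
    return center - half, center + half

# For example, 8 successes out of 10 trials:
lo, hi = wilson_interval(8, 10)
print(f"95% CI: ({lo:.3f}, {hi:.3f})")
```

Unlike the simpler Wald interval, the Wilson interval never extends outside [0, 1] and behaves well for proportions near 0 or 1.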

## Stats iQ Correlations

When users select two continuous or discrete variables, Stats iQ runs a correlation to assess whether those two groups are statistically related. Stats iQ defaults to calculating Pearson’s r, the most common type of correlation; if the assumptions of that test are not met, Stats iQ recommends a ranked version of the same test, calculating Spearman’s rho. Additionally, Stats iQ uses the Fisher Transformation to calculate confidence intervals for the correlation coefficient.
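The Fisher Transformation maps a correlation coefficient to a scale where it is approximately normal, so a symmetric interval can be built and then mapped back. The sketch below shows the standard formula (an illustration, not Stats iQ’s code):

```python
import math
from statistics import NormalDist

def correlation_ci(r, n, confidence=0.95):
    """Confidence interval for a correlation via the Fisher transformation."""
    z = math.atanh(r)                      # Fisher transform: approximately normal
    se = 1 / math.sqrt(n - 3)              # standard error on the transformed scale
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Build the interval on the z scale, then map back with tanh
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# For example, r = 0.6 observed in a sample of 50:
lo, hi = correlation_ci(0.6, 50)
print(f"95% CI: ({lo:.3f}, {hi:.3f})")
```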

### Assumptions of Pearson’s r

Stats iQ recommends Pearson’s r as a valid measure of correlation if certain assumptions about the data are met:

- There are no outliers in the continuous/discrete data.
- The relationship between the variables is linear (e.g., y = 2x, not y = x^2).

Stats iQ does not display a line of best fit when it detects a violation of these assumptions.

### Ranked Correlation (Spearman’s Rho)

When assumptions are violated, the Pearson’s r may no longer be a valid measure of correlation. In that case, Stats iQ recommends Spearman’s rho; Stats iQ rank-transforms the data (replaces values with their rank ordering) then runs the typical correlation. Rank transformation is a well-established method for protecting against assumption violation (a “nonparametric” method), and the rank transformation from Pearson to Spearman is the most common (Conover and Iman, 1981). Note that Spearman’s rho still assumes that the relationship between the variables is monotonic.
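The Pearson/Spearman contrast is easy to demonstrate on a monotonic but nonlinear relationship (an illustration using scipy, not Stats iQ’s code):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# A monotonic but nonlinear relationship: y = x**3
x = np.arange(1, 21)
y = x ** 3

r, _ = pearsonr(x, y)      # pulled below 1 by the curvature
rho, _ = spearmanr(x, y)   # exactly 1: the rank ordering is identical

print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Because Spearman’s rho depends only on the rank ordering, any strictly monotonic relationship yields rho = 1, while Pearson’s r reaches 1 only for a perfectly linear one.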

### Additional Considerations for Correlations

- With larger sample sizes, outliers are less likely to negatively affect results. Stats iQ uses Tukey’s “outer fence” to define outliers as points more than 3 times the interquartile range above the 75th or below the 25th percentile.
- Stats iQ identifies a relationship as nonlinear when Spearman’s rho > 1.1 * Pearson’s r and Spearman’s rho is statistically significant.
- Though Likert scales (like a 1 to 7 scale where 1 is Very dissatisfied and 7 is Very satisfied) are technically ordinal, it is common practice in social sciences to treat them as though they are continuous (i.e., using Pearson’s r).

## Stats iQ T-Tests

When users want to relate a binary variable to a continuous or discrete variable, Stats iQ runs a two-tailed t-test (all statistical testing in Qualtrics is two-tailed, where applicable) to assess whether either of the two groups tends to have higher values than the other for the continuous/discrete variable. Stats iQ defaults to the Welch’s t-test, also known as the t-test for unequal variances; if the assumptions of that test are not met, Stats iQ recommends a ranked version of the same test.

### Assumptions of Welch’s T-Test

Stats iQ recommends Welch’s t-test (hereafter “t-test”) if several assumptions about the data hold:

- The sample size of each group is above 15 (and therefore the Central Limit Theorem satisfies the requirement for normally distributed data).
- There are few or no outliers in the continuous/discrete data.

Unlike the slightly more common t-test for equal variances, Welch’s t-test does not assume that the variances of the two groups being compared are equal. Modern computing has made that assumption unnecessary. Furthermore, assuming equal variances leads to less accurate results when variances are not equal, and no more accurate results when variances actually are equal (Ruxton, 2006).

### Ranked T-Test

When assumptions are violated, the t-test may no longer be valid. In that case, Stats iQ recommends the ranked t-test; Stats iQ rank-transforms the data (replaces values with their rank ordering) and then runs the same Welch’s t-test on that transformed data. The ranked t-test is robust to outliers and non-normally distributed data. Rank transformation is a well-established method for protecting against assumption violation (a “nonparametric” method), and is most commonly seen in the difference between Pearson and Spearman correlation (Conover and Iman, 1981). Rank transformation followed by Welch’s t-test is similar in effect to the Mann-Whitney U Test, but somewhat more efficient (Ruxton, 2006; Zimmerman, 2012).

Note that while the t-test tests for the equality of the means of the two groups, the ranked t-test does not explicitly test for differences between the groups’ means or medians. Rather, it tests for a general tendency of one group to have larger values than the other.
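The close relationship between the ranked t-test and the Mann-Whitney U test can be checked directly (an illustration using scipy on made-up data, not Stats iQ’s code):

```python
import numpy as np
from scipy import stats

a = [3, 5, 2, 6, 4, 5, 3, 4]
b = [8, 9, 7, 10, 8, 9, 11, 7]

# Ranked t-test: rank the pooled data, then run Welch's t-test on the ranks.
ranks = stats.rankdata(np.concatenate([a, b]))
_, p_ranked = stats.ttest_ind(ranks[:len(a)], ranks[len(a):], equal_var=False)

# Mann-Whitney U on the same data, for comparison.
_, p_mw = stats.mannwhitneyu(a, b, alternative="two-sided")

print(f"ranked t-test p = {p_ranked:.4g}, Mann-Whitney p = {p_mw:.4g}")
```

On data like this, where every value in one group exceeds every value in the other, both tests report a clearly significant difference.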

### Other Considerations for T-Tests

- With sample sizes below 15, data can still be visually inspected to determine if it is normally distributed; if it is, unranked t-test results are still valid even for small samples. In practice, this assessment can be difficult to make, so Stats iQ recommends ranked t-tests by default for small samples.
- With larger sample sizes, outliers are less likely to negatively affect results. Stats iQ uses Tukey’s “outer fence” to define outliers as points more than 3 times the interquartile range above the 75th or below the 25th percentile.
- Data like “Highest level of education completed” or “Finishing order in a marathon” are unambiguously ordinal. Though Likert scales (like a 1 to 7 scale where 1 is Very dissatisfied and 7 is Very satisfied) are technically ordinal, it is common practice in social sciences to treat them as though they are continuous (i.e., with an unranked t-test).

## Regression

There are two main types of regression run in Stats iQ. If the output variable is a numbers variable, Stats iQ will run a linear regression. If the output variable is a categories variable, Stats iQ will run a logistic regression. The default output for a linear regression is a combination of Relative Importance (specifically, Johnson’s Relative Weights) and Ordinary Least Squares. When running an “Ordinary Least Squares” regression, Stats iQ uses the variation called “M-estimation,” which is a more modern technique that dampens the effect of outliers, leading to more accurate results.
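Stats iQ’s exact M-estimation procedure isn’t documented here. As an illustration of the general idea, the sketch below fits a simple linear regression with one common M-estimator (Huber’s, via iteratively reweighted least squares), which downweights points with large residuals; the function name and tuning constant are just illustrative:

```python
import numpy as np

def huber_line(x, y, c=1.345, iters=50):
    """Fit y ~ a + b*x with Huber M-estimation: iteratively reweighted
    least squares that downweights large-residual points."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]       # start from OLS
    for _ in range(iters):
        resid = y - X @ beta
        # Robust scale estimate from the median absolute deviation
        scale = max(1.4826 * np.median(np.abs(resid)), 1e-8)
        r = np.abs(resid) / scale
        w = np.where(r <= c, 1.0, c / r)              # Huber weights
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return beta                                        # (intercept, slope)

# Nine points exactly on y = 2x, plus one extreme outlier
x = np.arange(10.0)
y = 2 * x
y[-1] = 100.0

ols_slope = np.linalg.lstsq(np.column_stack([np.ones_like(x), x]), y, rcond=None)[0][1]
huber_slope = huber_line(x, y)[1]
print(f"OLS slope = {ols_slope:.2f}, Huber slope = {huber_slope:.2f}")
```

Ordinary least squares is dragged far from the true slope of 2 by the single outlier, while the M-estimator effectively ignores it.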

See more at Regression & Relative Importance.