Interpreting Residual Plots to Improve Your Regression

When you run a regression, Stats iQ automatically calculates and plots residuals to help you understand and improve your regression model. Read below to learn everything you need to know about interpreting residuals (including definitions and examples).

Observations, Predictions, and Residuals

To demonstrate how to interpret residuals, we’ll use a lemonade stand dataset, where each row records one day’s “Temperature” and “Revenue.”

Temperature (Celsius)	Revenue
28.2	$44
21.4	$23
32.9	$43
24.0	$30
etc.	etc.

The regression equation describing the relationship between “Temperature” and “Revenue” is:

Revenue = 2.7 * Temperature – 35

Let’s say one day at the lemonade stand it was 30.7 degrees and “Revenue” was $50. That 50 is your observed or actual output, the value that actually happened.

So if we insert 30.7 as our value for “Temperature”…

Revenue = 2.7 * 30.7 – 35
Revenue ≈ 48

…we get $48. That’s the predicted value for that day: the value for “Revenue” that the regression equation predicts based on the “Temperature.”

Your model isn’t always perfectly right, of course. In this case, the prediction is off by 2; that difference, the 2, is called the residual. The residual is the bit that’s left when you subtract the predicted value from the observed value.

Residual = Observed – Predicted

You can imagine that every row of data now has, in addition, a predicted value and a residual.

Temperature (Celsius)	Revenue (Observed)	Revenue (Predicted)	Residual (Observed – Predicted)
28.2	$44	$41	$3
21.4	$23	$23	$0
32.9	$43	$54	-$11
24.0	$30	$29	$1
etc.	etc.	etc.	etc.
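The predicted values and residuals in the table can be reproduced with a few lines of Python. This is a minimal sketch that assumes the rounded coefficients from the equation above (2.7 and –35); the function names are illustrative. Because those coefficients are rounded, a computed value can differ from the table by about a dollar.

```python
# Rounded coefficients quoted in the regression equation above.
SLOPE = 2.7
INTERCEPT = -35.0


def predict_revenue(temperature_c):
    """Predicted revenue for a given temperature in Celsius."""
    return SLOPE * temperature_c + INTERCEPT


def residual(observed, predicted):
    """Residual = Observed - Predicted."""
    return observed - predicted


# (temperature, observed revenue) rows from the table above
days = [(28.2, 44), (21.4, 23), (32.9, 43), (24.0, 30)]

for temperature, observed in days:
    predicted = predict_revenue(temperature)
    print(f"{temperature:5.1f}  observed={observed:3d}  "
          f"predicted={predicted:6.2f}  "
          f"residual={residual(observed, predicted):7.2f}")
```

A positive residual means the day brought in more than the equation predicted; a negative one means it brought in less.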

We’re going to use the observed, predicted, and residual values to assess and improve the model.

Understanding Accuracy with Observed vs. Predicted

In a simple model like this, with only two variables, you can get a sense of how accurate the model is just by relating “Temperature” to “Revenue.” Here’s the same regression run on two different lemonade stands, one where the model is very accurate, one where the model is not:

Graph of accurate versus inaccurate model predictions

It’s clear that for both lemonade stands, a higher “Temperature” is associated with higher “Revenue.” But at a given “Temperature,” you could forecast the “Revenue” of the left lemonade stand much more accurately than that of the right lemonade stand, which means the left stand’s model is much more accurate.

But most models have more than one explanatory variable, and it’s not practical to represent more variables in a chart like that. So instead, let’s plot the predicted values versus the observed values for these same datasets.

Graphs of Predicted versus Actual Values for accurate and inaccurate models

Again, the model for the chart on the left is very accurate; there’s a strong correlation between the model’s predictions and its actual results. The model for the chart on the right is the opposite; the model’s predictions aren’t very good at all.

Note that these charts look just like the “Temperature” vs. “Revenue” charts above them, but the x-axis is predicted “Revenue” instead of “Temperature.” That’s common when your regression equation only has one explanatory variable. More often, though, you’ll have multiple explanatory variables, and these charts will look quite different from a plot of any one explanatory variable vs. “Revenue.”
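One way to put a number on how tightly predictions track observations is the correlation between the two. The sketch below computes a Pearson correlation by hand on the four sample rows shown earlier; the helper name `pearson` is illustrative, and four rows are far too few to judge a real model, so treat the result only as a demonstration of the calculation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Sample rows from the lemonade stand table.
temperatures = [28.2, 21.4, 32.9, 24.0]
observed = [44, 23, 43, 30]

# Predictions from the rounded regression equation.
predicted = [2.7 * t - 35 for t in temperatures]

r = pearson(predicted, observed)
print(f"correlation between predicted and observed: {r:.2f}")
```

A value near 1 corresponds to the tight chart on the left; a value near 0 corresponds to a cloud with little visible relationship.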

Examining Predicted vs. Residual (“The Residual Plot”)

The most useful way to plot the residuals, though, is with your predicted values on the x-axis and your residuals on the y-axis.

(Stats iQ presents residuals as standardized residuals, which means every residual plot you look at with any model is on the same standardized y-axis.)
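How Stats iQ computes its standardized residuals is not spelled out here, so the scaling below is an assumption. This sketch divides each residual by the residual standard error, using n – p degrees of freedom where p is the number of fitted coefficients (2 for a slope and an intercept). Some statistics packages additionally adjust for each point's leverage, which this sketch ignores.

```python
import math


def standardize(residuals, n_coefficients=2):
    """Divide each residual by the residual standard error.

    Assumption: no leverage adjustment; n_coefficients counts the slope(s)
    plus the intercept.
    """
    dof = len(residuals) - n_coefficients
    standard_error = math.sqrt(sum(r ** 2 for r in residuals) / dof)
    return [r / standard_error for r in residuals]


# Residuals from the four sample rows in the table (illustration only;
# a real calculation would use every row in the dataset).
residuals = [3, 0, -11, 1]
z = standardize(residuals)
print([round(value, 2) for value in z])
```

After scaling, the residuals no longer carry dollar units, which is what lets plots from different models share the same y-axis.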

Graph of Predicted versus Actual values and Graph of Standardized Residuals

In the plot on the right, each point is one day, where the prediction made by the model is on the x-axis and the prediction’s error (the standardized residual) is on the y-axis. The distance from the line at 0 shows how far off the prediction was for that day.

Since…

Residual = Observed – Predicted

…positive values for the residual (on the y-axis) mean the prediction was too low, and negative values mean the prediction was too high; 0 means the guess was exactly correct.

Ideally your plot of the residuals looks like one of these:

Examples of ideal Standardized Residual Plots

That is,
(1) they’re pretty symmetrically distributed, tending to cluster towards the middle of the plot.
(2) they’re clustered around the lower single digits of the y-axis (e.g., 0.5 or 1.5, not 30 or 150).
(3) in general, there aren’t any clear patterns.
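Criteria (1) and (2) can be roughly approximated with a few summary numbers. The sketch below is a heuristic, not a Stats iQ feature: the function name, the cutoff of 3, and both example lists are made up for illustration. It cannot detect the patterns in criterion (3), which still require looking at the plot.

```python
def summarize_residuals(std_residuals, outlier_cutoff=3.0):
    """Rough summary of a list of standardized residuals.

    Assumption: outlier_cutoff=3.0 is an arbitrary illustrative threshold.
    """
    n = len(std_residuals)
    n_positive = sum(1 for r in std_residuals if r > 0)
    return {
        # Near 0.5 suggests points are balanced above and below 0.
        "share_positive": n_positive / n,
        # Should stay in the low single digits.
        "max_abs": max(abs(r) for r in std_residuals),
        # Points far from 0 that deserve a closer look.
        "n_outliers": sum(1 for r in std_residuals if abs(r) > outlier_cutoff),
    }


# Made-up standardized residuals for illustration.
healthy = [-1.2, 0.4, 0.9, -0.3, 1.1, -0.8, 0.2, -0.5]
unbalanced = [-0.4, -0.5, -0.3, -0.6, -0.4, 4.2, -0.5, -0.3]

print(summarize_residuals(healthy))
print(summarize_residuals(unbalanced))
```

In the second list most values sit slightly below 0 and one sits far above it, which is the kind of imbalance the summary is meant to flag.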

Here are some residual plots that don’t meet those requirements:

Examples of undesirable Standardized Residual Plots

These plots aren’t evenly distributed vertically, or they have an outlier, or they have a clear shape to them.

If you can detect a clear pattern or trend in your residuals, then your model has room for improvement.

In a second we’ll break down why and what to do about it.

Normal Q-Q Residual Plot

Click Show Normal Q-Q residual plot to display a Q-Q plot assessing data skew and model fit. This chart displays the standardized residuals on the y-axis and the theoretical quantiles on the x-axis.

Normal Q-Q plot for assessing model fit, available for linear regressions in Stats iQ

Points that align closely with the dotted line indicate that the residuals are approximately normally distributed. If the points deviate drastically from the line, you could consider adjusting your model by adding or removing variables in the regression model.
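The points on a Q-Q plot pair each sorted standardized residual with the value a standard normal distribution would be expected to produce at the same rank. The sketch below shows one way to compute those pairs; the plotting position (i – 0.5) / n is a common convention and an assumption here, since the exact formula Stats iQ uses is not stated.

```python
from statistics import NormalDist


def qq_points(std_residuals):
    """Return (theoretical quantile, observed residual) pairs for a Q-Q plot."""
    n = len(std_residuals)
    ordered = sorted(std_residuals)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for i, value in enumerate(ordered, start=1):
        # Assumed plotting position for the i-th smallest residual.
        probability = (i - 0.5) / n
        points.append((standard_normal.inv_cdf(probability), value))
    return points


# Made-up standardized residuals for illustration.
for theoretical, observed in qq_points([0.5, -1.0, 1.2, -0.2, 0.1]):
    print(f"theoretical={theoretical:6.2f}  observed={observed:6.2f}")
```

If the residuals are roughly normal, the observed values rise in step with the theoretical quantiles, which is what falling along the dotted line means.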

How much does it matter if my model isn’t perfect?

How concerned should you be if your model isn’t perfect, if your residuals look a bit unhealthy? It’s up to you.

If you’re publishing your thesis in particle physics, you probably want to make sure your model is as accurate as humanly possible. If you’re trying to run a quick and dirty analysis of your nephew’s lemonade stand, a less-than-perfect model might be good enough to answer whatever questions you have (e.g., whether “Temperature” appears to affect “Revenue”).

Most of the time a decent model is better than none at all. So take your model, try to improve it, and then decide whether the accuracy is good enough to be useful for your purposes.

Example Residual Plots and Their Diagnoses

If you’re not sure what a residual is, take five minutes to read the above, then come back here.

Below is a gallery of unhealthy residual plots. Your residual plot may look like one specific type from below, or some combination.

If yours looks like one of the below, read that entry to understand what’s happening and learn how to fix it.

(Throughout we’ll use a lemonade stand’s “Revenue” vs. that day’s “Temperature” as an example dataset.)

Y-axis Unbalanced


Problem

Imagine that for whatever reason, your lemonade stand typically has low revenue, but every once in a while you get very high-revenue days, such that “Revenue” looks like this…

Skewed histogram of Revenue for Lemonade Stand example

…instead of something more symmetrical and bell-shaped like this:

Symmetrical histogram of Revenue for Lemonade Stand Example

So “Temperature” vs. “Revenue” might look like this, with most of the data bunched at the bottom…

Temperature versus Revenue for skewed Lemonade data

The black line represents the model equation, the model’s prediction of the relationship between “Temperature” and “Revenue.” Look above at each prediction made by the black line for a given “Temperature” (e.g., at “Temperature” 30, “Revenue” is predicted to be about 20). You can see that the majority of dots are below the line (that is, the prediction was too high), but a few dots are very far above the line (that is, the prediction was far too low).

Translating that same data to the diagnostic plots, most of the equation’s predictions are a bit too high, and then some would be way too low.

Predicted versus actual and Residual plots for Lemonade Example

Implications

This almost always means your model can be made significantly more accurate. Most of the time you’ll find that the model was directionally correct but pretty inaccurate relative to an improved version. It’s not uncommon to fix an issue like this and consequently see the model’s r-squared jump from 0.2 to 0.5 (on a 0 to 1 scale).

How to Fix

The solution to this is almost always to transform your data, typically your response variable.
It’s also possible that your model lacks a variable.
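A logarithm is one common transformation for a right-skewed response, though the text above does not name a specific one, so treat that choice as an assumption. The sketch below fits a straight line to made-up data in which revenue grows multiplicatively with temperature, then compares R-squared before and after taking the log of revenue. The data, the noise factors, and the helper names are all invented for illustration.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys):
    """Share of the variance in ys explained by a straight-line fit on xs."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up data: revenue grows multiplicatively with temperature.
temperature = [20, 22, 24, 26, 28, 30, 32, 34]
noise = [1.1, 0.9, 1.05, 0.95, 1.2, 0.85, 1.0, 1.1]
revenue = [math.exp(0.25 * t - 4) * k for t, k in zip(temperature, noise)]

r2_raw = r_squared(temperature, revenue)
r2_log = r_squared(temperature, [math.log(y) for y in revenue])
print(f"R-squared, raw revenue: {r2_raw:.2f}")
print(f"R-squared, log revenue: {r2_log:.2f}")
```

Note that a model fit on log revenue predicts log revenue, so its predictions need to be converted back with an exponential before comparing them with observed dollars.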