Pollsters Missed the 2016 Election; Don’t Make the Same Mistakes
The polling industry is reeling after the 2016 election.
While the absolute size of the errors was not unreasonable for most polls (between 3 and 4 percentage points), those errors were not randomly distributed. Nearly all of the polls missed in the same direction, favoring Hillary Clinton over Donald Trump. When errors point in one direction rather than being randomly distributed, the data contain a bias.
So what went wrong? While it is too early to tell conclusively, it seems that certain segments of the voting population were less willing to participate in pre-election polls, and these segments voted overwhelmingly for Donald Trump. This problem is known in survey methodology as “biased nonresponse,” and it occurs whenever the people who don’t respond to our surveys are systematically different from the people who do. While it may be tempting to believe that biased nonresponse is specific to political polling, it threatens the validity of nearly all survey research. Yet very few researchers take steps to assess or address it.
How biased nonresponse hurts your data
Most researchers know that they will come to incorrect conclusions if only happy customers or unhappy employees participate in their surveys. Both of these cases are simplistic examples of biased nonresponse. When this happens, the results are not generalizable to the population that the sample of respondents was drawn from.
Not all instances of nonresponse bias are obvious, however. Consider a scenario in which Qualtrics (we could use any company here, but we’ll keep it close to home) conducts a survey of our customers and achieves a respectable response rate of 50%. Unknown to the research lead, customers who use the Insight Platform more frequently are twice as likely to participate in the survey as customers who use the software less frequently, even though the two groups may be equally large in our customer population.
If our researcher were unaware of this mismatch between the sample and the survey population, she would likely analyze the data without any adjustments and take the results to organizational stakeholders, such as our marketing or engineering departments. Because the undetected nonresponse bias is related to an outcome that we care about (software usage frequency), Qualtrics would likely make bad decisions based on such data.
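To make the size of the problem concrete, here is a minimal sketch of this scenario in Python. The equal group sizes, the 50% overall response rate, and the two-to-one participation ratio follow the scenario; the satisfaction scores (8 for frequent users, 5 for infrequent users) and all variable and function names are hypothetical, chosen only for illustration.

```python
from fractions import Fraction

# Population shares: the scenario assumes both groups are equally large.
population_share = {"frequent": Fraction(1, 2), "infrequent": Fraction(1, 2)}

# Response rates: frequent users respond twice as often, and the overall
# response rate works out to 50% (2/3 * 1/2 + 1/3 * 1/2 = 1/2).
response_rate = {"frequent": Fraction(2, 3), "infrequent": Fraction(1, 3)}

# HYPOTHETICAL mean satisfaction score in each group (not from the article).
mean_score = {"frequent": Fraction(8), "infrequent": Fraction(5)}


def respondent_share(population_share, response_rate):
    """Share of each group among the people who actually responded."""
    responding = {g: population_share[g] * response_rate[g] for g in population_share}
    total = sum(responding.values())
    return {g: responding[g] / total for g in responding}


def weighted_mean(shares, values):
    """Mean of group-level values, weighted by each group's share."""
    return sum(shares[g] * values[g] for g in shares)


sample_share = respondent_share(population_share, response_rate)
true_mean = weighted_mean(population_share, mean_score)
observed_mean = weighted_mean(sample_share, mean_score)

print("Share of frequent users among respondents:", sample_share["frequent"])
print("True population mean:", float(true_mean))
print("Unadjusted survey mean:", float(observed_mean))
```

Under these assumed scores, frequent users make up two thirds of the respondents instead of half, so the unadjusted survey mean comes out higher than the true population mean.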
How do you know if your research is being affected by biased nonresponse?
While most experienced researchers know that this kind of scenario can occur, very few take the steps necessary to determine whether nonresponse bias is influencing their survey results. Checking for it adds a few steps to the research process, but it is not that difficult and is worth the effort, especially when researching a known population (e.g., a list of customers or employees).
Follow these basic steps to identify nonresponse bias:
- Step 1: Identify auxiliary variables that exist for all members of the population (e.g., customers or employees) prior to survey design.
- Step 2: Design and field your survey to collect the survey data.
In the example above, Qualtrics would want to include questions to identify key drivers of software usage frequency.
- Step 3: Merge the datasets that contain the auxiliary variables and the survey data.
In the Qualtrics example above, where our response rate was 50%, the merged dataset will contain twice as many rows (cases) as the survey dataset alone. Customers who participated in the survey will have their survey data added, while those who did not will remain in the dataset with only their auxiliary variables.
- Step 4: Create a variable that indicates whether a customer did or did not participate in the survey.
In most cases, the simplest approach is to create a new variable (column) that is 0 if the customer did not participate in the survey and 1 if the customer did participate.
- Step 5: Calculate basic descriptive statistics (mean, median, variance, etc.) for the auxiliary variables for those customers who participated in the survey, and compare them against the same statistics for those who did not participate.
In the case of the Qualtrics example, we would pay special attention to the auxiliary variable about frequency of Qualtrics software use. Because the hypothetical scenario indicated that there is a nonresponse bias, our researcher would quickly discover that there are substantial differences in the frequency of software use between those customers that responded to the survey and those that did not.
In the case above, the critical variable for Qualtrics is the frequency of software usage, information that our researcher can include in the sampling list. Our researcher might also identify other areas where nonresponse bias may be a problem: things like customer size and annual spend could also be helpful to assess.
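Steps 3 through 5 can be sketched in a few lines of Python using only the standard library. The six customers, their monthly login counts, their satisfaction scores, and every field and function name below are hypothetical. They are constructed to mirror the scenario: half of the customers respond, and frequent users respond twice as often as infrequent ones.

```python
from statistics import mean

# HYPOTHETICAL auxiliary data: one row per customer in the full population.
customers = [
    {"customer_id": 1, "logins_per_month": 30},
    {"customer_id": 2, "logins_per_month": 26},
    {"customer_id": 3, "logins_per_month": 22},
    {"customer_id": 4, "logins_per_month": 4},
    {"customer_id": 5, "logins_per_month": 3},
    {"customer_id": 6, "logins_per_month": 2},
]

# HYPOTHETICAL survey data: one row per customer who responded.
survey = [
    {"customer_id": 1, "satisfaction": 9},
    {"customer_id": 2, "satisfaction": 8},
    {"customer_id": 4, "satisfaction": 5},
]


def merge_with_indicator(customers, survey):
    """Step 3 and Step 4: keep every customer, attach survey answers where
    they exist, and add a 0/1 'responded' indicator."""
    answers = {row["customer_id"]: row for row in survey}
    merged = []
    for customer in customers:
        answer = answers.get(customer["customer_id"])
        row = dict(customer)
        row["satisfaction"] = answer["satisfaction"] if answer else None
        row["responded"] = 1 if answer else 0
        merged.append(row)
    return merged


def compare_groups(merged, variable):
    """Step 5: mean of an auxiliary variable for respondents vs. nonrespondents."""
    result = {}
    for flag, label in ((1, "respondents"), (0, "nonrespondents")):
        values = [row[variable] for row in merged if row["responded"] == flag]
        result[label] = mean(values) if values else None
    return result


merged = merge_with_indicator(customers, survey)
comparison = compare_groups(merged, "logins_per_month")

print("Rows in merged data:", len(merged), "| rows in survey data:", len(survey))
print("Mean logins per month:", comparison)
```

With this made-up data, respondents log in far more often on average than nonrespondents, which is the kind of gap that signals nonresponse bias on usage frequency. The same comparison can be repeated for other auxiliary variables, and with medians or variances in place of the mean.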
You’ve identified nonresponse bias in your data. Now what?
Just because a nonresponse bias exists in the data does not mean that the data are useless. In fact, it is entirely possible to find a nonresponse bias that is completely uncorrelated with outcomes that matter. For example, in the 2016 election, had the only nonresponse bias been that tall people were much more likely to participate in the pre-election polls than short people, it is unlikely that there would have been any impact on the accuracy of polling predictions.
Even when nonresponse bias is present, it can often be adjusted for using weighting. Recall that in our hypothetical scenario, frequent users were twice as likely to participate in our customer survey as infrequent users. To adjust for this bias, we could put a weight of less than one on the frequent users who participated in the survey and a weight of more than one on the infrequent users who participated, with the weights formulated such that they average 1 across respondents. A more sophisticated weighting scheme could account for multiple factors.
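One simple way to build such weights, sketched below, is to divide each group's share of the population by its share of the respondents. This is an illustrative single-factor adjustment, not necessarily the scheme a given project should use. The respondent counts (200 frequent, 100 infrequent) and the function name are hypothetical, and the equal population shares are an assumption taken from the scenario.

```python
from fractions import Fraction


def nonresponse_weights(population_share, respondent_counts):
    """Weight for each group = population share / share among respondents."""
    total = sum(respondent_counts.values())
    weights = {}
    for group, pop_share in population_share.items():
        count = respondent_counts.get(group, 0)
        if count == 0:
            # No respondents in this group, so no weight can be computed.
            raise ValueError(f"no respondents in group {group!r}; cannot weight")
        weights[group] = pop_share / Fraction(count, total)
    return weights


# Assumed population shares: both groups equally large.
population_share = {"frequent": Fraction(1, 2), "infrequent": Fraction(1, 2)}

# HYPOTHETICAL respondent counts: frequent users responded twice as often.
respondent_counts = {"frequent": 200, "infrequent": 100}

weights = nonresponse_weights(population_share, respondent_counts)

# Average weight across all individual respondents.
average_weight = sum(
    weights[g] * respondent_counts[g] for g in weights
) / sum(respondent_counts.values())

for group, weight in weights.items():
    print(group, float(weight))
print("Average weight across respondents:", float(average_weight))
```

Here each frequent user counts for three quarters of a response and each infrequent user for one and a half, and the weights average exactly 1 across respondents. The function raises an error when a group has no respondents at all, because no weight can be computed for a group that is missing from the data.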
By making these adjustments, you can be much more confident that the conclusions you reach about your sample of respondents will also generalize to the broader population that your sample comes from. Without looking for evidence of nonresponse bias, it is difficult to be certain that your results are actually representative unless you are using probability-based sampling from a known population.
Important caveat: You can only adjust for nonresponse bias that you can measure. For example, in the scenario above, if Qualtrics did not have auxiliary data about how frequently our customers use the software, we would not have been able to assess whether there was nonresponse bias, much less adjust for it with weighting. Similarly, if none of the infrequent users participated in the survey at all, it would be impossible to make a weighting adjustment. In this case, knowing about the nonresponse bias would tell you that the data are unusable.
Qualtrics can help
If you are concerned about the potential effects of nonresponse bias in your data, we can direct you to resources to help you better understand how you can address this problem. In some cases, we offer consulting services to aid in implementing this kind of assessment and adjustment.