First things first – What is data cleaning?
Cleaning data means getting rid of any anomalous, incorrectly filled or otherwise “odd” results that could skew your analysis.
Some examples include:
- Straight-lining, where the respondent has selected the same response position – often the first option – for every question, regardless of what is being asked.
- Christmas-trees, where answers have been selected to create a visual pattern or picture – resembling a Christmas tree or some other deliberate design – rather than in response to the survey questions.
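Straight-lining is easy to check for programmatically. Here's a minimal sketch, assuming each response is stored as a list of selected option indices (the function name and the five-question threshold are illustrative, not from any particular survey tool):

```python
def is_straight_liner(answers, min_questions=5):
    """Flag a response where every answer is the same option index.

    `answers` is a list of selected option indices, one per question.
    Only flag when there are enough questions for the pattern to be
    meaningful (the threshold of 5 is an illustrative choice).
    """
    return len(answers) >= min_questions and len(set(answers)) == 1
```

A respondent who picked the first option for all eight questions would be flagged, while a varied response – or a very short survey where repetition proves little – would not.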
Not all problematic results are deliberate – you may also find duplicate responses from people who accidentally filled in a survey twice, or who didn't realise that their first submission had gone through.
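Duplicates can be removed by keeping only the first response per respondent. A minimal sketch, assuming responses are dicts and an identifying field such as an email address is available (the field name is illustrative):

```python
def drop_duplicates(responses, key_fields=("email",)):
    """Keep only the first response per respondent.

    `key_fields` names whichever fields identify a respondent
    (an email address is assumed here purely for illustration).
    """
    seen = set()
    unique = []
    for response in responses:
        key = tuple(response.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(response)
    return unique
```

Keeping the *first* submission is a design choice: an accidental resubmission usually repeats or abandons the original, so the earliest complete attempt is typically the one to trust.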
When you clean your survey data, you’re eliminating these ‘noisy’ responses that don’t add value and can confuse your end results. Think of it like weeding your garden to give your best plants more room to grow.
How to find the ‘dirt’ when data cleaning
There are a few methods experienced survey designers use to spot the results that should be weeded out. These involve examining the survey's metadata or visualising the data to uncover patterns.
Find the fastest respondents
Time data can show where respondents have whizzed through a survey selecting answers without properly reading and considering the questions. Setting a ‘speed limit’ for your responses can help eliminate thoughtless or random answers.
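One way to set that speed limit is relative to the typical completion time. Below is a minimal sketch, assuming each response carries a `duration_seconds` field; the 0.3 fraction of the median is an illustrative rule of thumb to calibrate against your own survey, not a standard:

```python
from statistics import median

def drop_speeders(responses, fraction=0.3):
    """Drop responses completed faster than `fraction` of the median
    completion time.

    Assumes each response dict has a `duration_seconds` field; the
    default fraction is an illustrative starting point, not a rule.
    """
    cutoff = fraction * median(r["duration_seconds"] for r in responses)
    return [r for r in responses if r["duration_seconds"] >= cutoff]
```

Anchoring the cutoff to the median (rather than a fixed number of seconds) means the same code adapts to long and short questionnaires alike.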
Turn numeric data into graphics
For issues like Christmas tree or straight-lining respondents, it can be easier to spot problems if your data appears as a chart or graph rather than a table of numbers.
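Even a rough text-based chart makes these patterns jump out. A minimal sketch using only the standard library – one dominant bar across a question set is a hint of straight-lining:

```python
from collections import Counter

def answer_histogram(answers, width=20):
    """Render a simple text bar chart of how often each option was
    chosen, so a suspiciously dominant answer stands out at a glance."""
    counts = Counter(answers)
    peak = max(counts.values())
    lines = []
    for option in sorted(counts):
        bar = "#" * round(width * counts[option] / peak)
        lines.append(f"{option}: {bar} ({counts[option]})")
    return "\n".join(lines)
```

In practice you would likely reach for a plotting library, but the principle is the same: the shape of the distribution, not the raw numbers, is what reveals the problem.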
Review open-ended questions
Where your survey design asks participants to answer in their own words, you can spot problem data by looking for open fields filled in with nonsense text. This can indicate that the survey was completed by a bot rather than a human, or that the respondent wasn't engaged with the questions.
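Simple heuristics catch a lot of this junk text. A minimal sketch – the length threshold and the vowel check are illustrative heuristics, and genuine answers should still be reviewed by a human before deletion:

```python
def looks_like_nonsense(text, min_len=3):
    """Heuristic flags for junk open-ended answers: too short,
    no vowels (keyboard mashing like 'sdfgh'), or one character
    repeated over and over. Thresholds are illustrative."""
    stripped = text.strip().lower()
    if len(stripped) < min_len:
        return True
    letters = [c for c in stripped if c.isalpha()]
    if letters and not any(c in "aeiou" for c in letters):
        return True  # e.g. "sdfgh" - consonant-only keyboard mash
    if len(set(stripped)) == 1:
        return True  # e.g. "aaaaaaa"
    return False
```

Flagged answers are best queued for review rather than deleted outright, since a terse but genuine reply can trip a crude heuristic.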
Edge cases to consider when cleaning data
Sometimes deciding whether to exclude certain survey responses from your final data set isn’t clear-cut. In these situations, you’ll need to make a choice depending on the volume of data you have and your overall goals for the survey.
Outliers
These are answers that sit numerically miles away from the rest of your data, or seem implausible from a common-sense point of view – for example, selecting a number above 16 for "how many hours a day do you spend watching TV?". It could be the result of user error or a misunderstanding of the question. Or, in some cases, it could be an unusual but accurate reply.
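A common way to surface such values is Tukey's interquartile-range rule: flag anything far outside the middle 50% of the data. A minimal sketch – and because an outlier may be an unusual but accurate reply, this flags values for review rather than deleting them:

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Flag values more than k interquartile ranges outside the
    middle 50% of the data (Tukey's rule; k=1.5 is the usual
    convention). Returns candidates to review, not to auto-delete."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]
```

For bounded questions, a plain sanity check is simpler still – hours of TV per day cannot exceed 24, so anything above that is certainly an error rather than an outlier.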
Inconsistent answers
If a respondent’s answers seem inconsistent or don’t add up to a coherent picture, they may have answered without reading carefully. For example, in one question they might tell you they’re vegetarian, and in another tick ‘bacon’ as a favourite food.
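Known contradictions like this can be written down as explicit rules and checked automatically. A minimal sketch, assuming each response is a dict – the field and value names are illustrative, not from any real survey:

```python
def find_contradictions(response, rules):
    """Flag answer pairs that can't both be true.

    Each rule is (field_a, value_a, field_b, value_b) and fires when
    both values appear together in the same response. Field names
    below are purely illustrative.
    """
    return [
        (field_a, field_b)
        for field_a, value_a, field_b, value_b in rules
        if response.get(field_a) == value_a
        and response.get(field_b) == value_b
    ]

# The bacon-eating vegetarian from the example above:
RULES = [("diet", "vegetarian", "favourite_food", "bacon")]
```

A response that trips one of these rules hasn't necessarily lied – but it has earned a closer look before it is counted.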
Incomplete responses
If a respondent has skipped whole sections of your survey, you need to consider whether the rest of their answers should be included. As well as potentially skewing the results numerically, this could point to someone answering at random or not paying sufficient attention to the questions.
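One practical approach is to score each response for completeness and set a cutoff. A minimal sketch – the 80% threshold in the usage note is an illustrative choice to tune against your data volume and goals, not a standard:

```python
def completeness(response, required_fields):
    """Fraction of required questions this respondent answered
    (None or an empty string counts as unanswered)."""
    answered = sum(
        1 for field in required_fields
        if response.get(field) not in (None, "")
    )
    return answered / len(required_fields)
```

You might then keep only respondents scoring, say, 0.8 or above – a stricter cutoff when data is plentiful, a looser one when every response counts.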