Verifying Accuracy in Statistical Tests
Statistical Analysis in XM
Data analysis is an integral piece of the XM platform that enables Qualtrics customers to discover the deepest and most relevant insights from their experience (X) and operational (O) data. These customers rely on our offerings to decide which actions to take, which means that incorrect analyses can multiply into suboptimal or even detrimental outcomes. Verifiably correct implementations of the statistical tests that power our insights are therefore essential to our business. On our team, ensuring correctness for these statistical tests has presented some unique challenges and has required adaptations to our normal cycle of testing and verification.
Engineers on the Statwing team at Qualtrics develop products that empower any user to easily perform statistical analyses and extract actionable insights. Our mission is to deliver a best-in-class solution for statistical analysis that works seamlessly with the rest of the XM platform so that even non-expert data analysts can understand and extract valuable information at the click of a button.
One of the main ways in which Statwing achieves its mission is by prioritizing quality. With an emphasis on thorough code reviews and high thresholds for test coverage, we aim to develop our feature-rich products, Stats iQ and Crosstabs, rapidly without compromising on quality for our users.
Our customers continue to use our platform because they trust we have the right answers to their questions about their business. We’ve learned that verifying the correctness of an algorithm requires more than our normal unit and integration testing workflow. As developers tied deeply into the context of our own specific implementations of statistical formulas, it can often be hard to objectively define a set of expectations as we test. Additionally, for more complicated analyses, theoretical boundaries and edge cases become harder to identify without the right expertise.
To close that gap, we draw on published, peer-reviewed research for verified analysis inputs and outputs, and we build test fixtures that directly target our statistical formulas. These validation tests act as an independently verified set of expectations, filling in the gaps for what we, as developers, can’t anticipate on our own.
Here are some of our best practices for writing static validation tests:
1. Keep statistical formula implementations in their own isolated functions so that each formula can easily be tested with inputs and outputs from static test fixtures. Test the output of the formula directly, not an interpretation of the output that could still be right even if the implementation is a little bit wrong.
As a simplified example, below is a formula for a z-test of independent proportions, which calculates a significance (z-)score based on two proportions:
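One common form, assumed here because the exact variant used in the products is not specified, is the pooled two-proportion z-test. The symbols are this sketch's own: $x_1, x_2$ are the success counts and $n_1, n_2$ the sample sizes of the two groups.

$$
z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}},
\qquad \hat{p}_i = \frac{x_i}{n_i},
\qquad \hat{p} = \frac{x_1 + x_2}{n_1 + n_2}
$$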
In Stats iQ and Crosstabs, we often convert numeric significance scores to plain-English categories (‘Not Significant’, ‘Significant’, ‘Very Significant’), which can cover up inaccuracies in our implementations that only slightly alter the output z-score.
Rather than embedding the formula implementation in the larger function that interprets significance tiers, we want to separate out the z-score calculation so that it can be independently validated:
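A minimal Python sketch of the two shapes, for contrast. Everything here is illustrative rather than the products' actual code: the function names, the pooled form of the test, and the tier cutoffs (1.96 and 2.576, the usual two-sided 5% and 1% critical values) are all assumptions.

```python
import math

# Hypothetical tier cutoffs; the thresholds the products use are not given here.
SIGNIFICANT_Z = 1.96
VERY_SIGNIFICANT_Z = 2.576


def significance_tier_coupled(successes_1, n_1, successes_2, n_2):
    """Coupled shape: the formula is buried in the tier logic, so a test
    can only observe the category, never the z-score itself."""
    p_1, p_2 = successes_1 / n_1, successes_2 / n_2
    pooled = (successes_1 + successes_2) / (n_1 + n_2)
    z = (p_1 - p_2) / math.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    if abs(z) >= VERY_SIGNIFICANT_Z:
        return "Very Significant"
    if abs(z) >= SIGNIFICANT_Z:
        return "Significant"
    return "Not Significant"


def z_score_independent_proportions(successes_1, n_1, successes_2, n_2):
    """Separated shape, part 1: only the pooled two-proportion z-score.
    Degenerate inputs (pooled proportion of 0 or 1) are not handled."""
    p_1, p_2 = successes_1 / n_1, successes_2 / n_2
    pooled = (successes_1 + successes_2) / (n_1 + n_2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    return (p_1 - p_2) / standard_error


def significance_tier(z):
    """Separated shape, part 2: map an already computed z-score to a tier."""
    if abs(z) >= VERY_SIGNIFICANT_Z:
        return "Very Significant"
    if abs(z) >= SIGNIFICANT_Z:
        return "Significant"
    return "Not Significant"
```

With the separated shape, a caller composes the two steps, for example `significance_tier(z_score_independent_proportions(50, 100, 30, 100))`, while tests can target each step on its own.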
Keeping the calculation separate lets us test the formula with stricter expectations on the numeric output, avoiding the risk of incorrect z-score outputs falling in the correct tier and passing our verification.
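To make that risk concrete, here is a hedged Python sketch in which a subtly different formula (an unpooled standard error, standing in for "a little bit wrong") lands in the same tier as the pooled formula but fails a strict numeric check. The function names, tier cutoffs, and the test case are assumptions; the expected value is hand-computed for illustration and does not come from a published source.

```python
import math


def z_score_pooled(successes_1, n_1, successes_2, n_2):
    """Pooled two-proportion z-score (the form assumed correct in this sketch)."""
    p_1, p_2 = successes_1 / n_1, successes_2 / n_2
    pooled = (successes_1 + successes_2) / (n_1 + n_2)
    return (p_1 - p_2) / math.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))


def z_score_unpooled(successes_1, n_1, successes_2, n_2):
    """Stand-in for a subtle bug: uses the unpooled standard error instead."""
    p_1, p_2 = successes_1 / n_1, successes_2 / n_2
    return (p_1 - p_2) / math.sqrt(p_1 * (1 - p_1) / n_1 + p_2 * (1 - p_2) / n_2)


def significance_tier(z):
    """Hypothetical cutoffs: 1.96 and 2.576 (two-sided 5% and 1%)."""
    if abs(z) >= 2.576:
        return "Very Significant"
    if abs(z) >= 1.96:
        return "Significant"
    return "Not Significant"


# Illustrative case: 50/100 vs 30/100 successes.
# Pooled proportion = 0.4, so z = 0.2 / sqrt(0.4 * 0.6 * 0.02) = 5 / sqrt(3).
CASE = (50, 100, 30, 100)
EXPECTED_Z = 5 / math.sqrt(3)
EXPECTED_TIER = "Very Significant"


def passes_tier_check(z_function):
    """Loose check: only the plain-English category is compared."""
    return significance_tier(z_function(*CASE)) == EXPECTED_TIER


def passes_numeric_check(z_function):
    """Strict check: the z-score itself must match the expected value."""
    return math.isclose(z_function(*CASE), EXPECTED_Z, rel_tol=1e-9)
```

Here `passes_tier_check` accepts both functions, because both z-scores exceed the top cutoff, while `passes_numeric_check` accepts only `z_score_pooled`.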
2. Implement the formula in a secondary language, like R, and compare the outputs of your app’s function and the R function on the same test inputs to catch any obvious discrepancies between the two implementations.
3. Don’t write test fixtures from your own knowledge. When adding static test fixtures, use test cases from outside of your knowledge and code base, ideally ones that have been published and/or are well-known in the field.
4. Don’t gather your own test cases if you’re the one implementing the formula in code. Tag your Product Manager or other Subject Matter Expert to choose test cases so you can avoid potential bias in test selection. Include these test cases in the initial requirements specification to avoid delays later in the development process.
5. Keep test fixtures small. Each test should specify inputs and outputs, all of which should be given descriptive names.
6. Cite your sources in your test fixtures. Include comments describing where the tests came from, how they were validated, and descriptions of the inputs and outputs being tested.
7. Provide links to documentation of the formulas you implemented. These links should be commented in your code for reference by future engineers and called out in your merge request so code reviewers can help verify your implementation looks correct.
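Several of these practices can be sketched together in one small Python fixture file. This is an illustration under stated assumptions, not our actual test suite: the field names and functions are hypothetical, the `PLACEHOLDER` strings mark where a real citation and documentation link would go, and the single case is hand-computed rather than taken from a published source. The cross-check also swaps in a different technique from the one recommended above: instead of a second implementation in R, it stays in Python and uses an algebraically different route, relying on the fact that the Pearson chi-square statistic for the 2x2 table (without continuity correction) equals the square of the pooled z-score.

```python
import math

# Small fixtures with descriptive names, a recorded source, and a formula link.
FIXTURES = [
    {
        "description": "50/100 vs 30/100 successes, pooled z-test",
        "source": "PLACEHOLDER: citation for the published worked example",
        "formula_reference": "PLACEHOLDER: link to the formula documentation",
        "validated_by": "PLACEHOLDER: how the expected value was verified",
        "inputs": {"successes_1": 50, "n_1": 100, "successes_2": 30, "n_2": 100},
        # Hand-computed for illustration: 0.2 / sqrt(0.0048) = 5 / sqrt(3).
        "expected_z_score": 5 / math.sqrt(3),
    },
]


def z_score_independent_proportions(successes_1, n_1, successes_2, n_2):
    """Pooled two-proportion z-score (the function under test)."""
    p_1, p_2 = successes_1 / n_1, successes_2 / n_2
    pooled = (successes_1 + successes_2) / (n_1 + n_2)
    return (p_1 - p_2) / math.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))


def chi_square_2x2(successes_1, n_1, successes_2, n_2):
    """Independent route: Pearson chi-square for the 2x2 table of
    successes and failures, without continuity correction."""
    a, b = successes_1, n_1 - successes_1
    c, d = successes_2, n_2 - successes_2
    total = n_1 + n_2
    return total * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def run_fixtures():
    """Check every fixture numerically and against the independent route.
    Returns the number of fixtures checked."""
    for fixture in FIXTURES:
        z = z_score_independent_proportions(**fixture["inputs"])
        assert math.isclose(z, fixture["expected_z_score"], rel_tol=1e-9), (
            fixture["description"]
        )
        chi_square = chi_square_2x2(**fixture["inputs"])
        assert math.isclose(z * z, chi_square, rel_tol=1e-9), fixture["description"]
    return len(FIXTURES)
```

Calling `run_fixtures()` raises an `AssertionError` naming the failing fixture if either check disagrees, which keeps a failure traceable to its recorded source.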
Writing code is hard, but writing it correctly is significantly harder. What may seem like a sound implementation will not always produce correct results, so we take these precautionary steps, even though they lengthen verification, to provide the most functional and powerful statistical tool in experience management.