Maintaining Availability in a $2.5B Company
How We Manage Availability
Qualtrics operates and develops products in a service-oriented architecture. We went through the same evolution that many large companies go through: build a monolithic application, then break it apart into manageable chunks. But how do you maintain availability and performance across that much complexity? The short answer is visibility. You need to know what is going on in your product, from the physical layer through the customer experience.
We build transparency into our product so that our engineers can monitor, maintain, and ensure the best possible customer experience for our market researchers. Our engineers work in a development operations (or DevOps) model where developers are responsible for the quality, uptime, and performance of the product. Our DevOps team builds frameworks and tools that help our engineers quickly configure key monitors that not only watch whether a system or service is healthy, but also directly monitor the customer experience.
So, we all know that we need monitoring and alerting. But what exactly does that mean? Can’t you just throw up a quick ping test in Nagios to make sure your server is alive? That’s a start, but we take it a step further. The key to ensuring the best customer experience is to understand the customer’s journey from beginning to end. It means not just knowing when a problem happens, but spotting the warning signs that one is coming. It means understanding the extent of the impact on our customer experience.
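That "warning signs" idea can be made concrete with a leading-indicator check: instead of alerting only when a service is down, alert when its error rate is trending toward trouble. Here is a minimal sketch of the concept (the class name, window size, and thresholds are illustrative, not our actual tooling):

```python
from collections import deque

class ErrorRateTrend:
    """Track a sliding window of request outcomes and flag a rising
    error rate before it becomes a full outage."""

    def __init__(self, window=100, warn_at=0.05, page_at=0.20):
        self.outcomes = deque(maxlen=window)  # True = request failed
        self.warn_at = warn_at    # notify the team early
        self.page_at = page_at    # wake up the on-call engineer

    def record(self, failed):
        self.outcomes.append(failed)

    def status(self):
        if not self.outcomes:
            return "ok"
        rate = sum(self.outcomes) / len(self.outcomes)
        if rate >= self.page_at:
            return "critical"
        if rate >= self.warn_at:
            return "warning"
        return "ok"

monitor = ErrorRateTrend(window=50)
for _ in range(45):
    monitor.record(False)   # healthy traffic
for _ in range(5):
    monitor.record(True)    # 10% of recent requests failing
print(monitor.status())     # "warning": degraded, but not yet down
```

The point is the two-tier threshold: a "warning" fires while the customer experience is merely degraded, giving engineers a head start before the "critical" page.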
Defining the Stack
Let’s break this down. Any typical server/client architecture has three major components: the edge devices, the application, and the platform infrastructure.
We call it ‘the edge’ because these are devices at the very edge of our network. This means your laptop or mobile device at home and the cable modem that your device is connected to. As engineers, the edge is outside our scope of control. It’s the wild. Once it’s out there, all we can do is observe. To monitor the edge, we rely on a combination of synthetic (or simulated) user-monitoring tools such as Catchpoint, and application performance monitoring tools such as NewRelic. Catchpoint gives us the ability to simulate the user experience and alert if that experience is degraded in any way. NewRelic gives us visibility into the application as it runs on a user’s browser. This is called Real User Monitoring (or RUM).
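Catchpoint does far more than this, but the core of a synthetic probe is simple: pretend to be a user, time the request end to end, and judge the experience rather than just the uptime. A toy version of the idea (the URL, thresholds, and function names are our own illustration):

```python
import time
import urllib.request
import urllib.error

def classify(status_code, latency_s, slow_after=2.0):
    """Turn one synthetic probe into a monitoring verdict."""
    if status_code is None or status_code >= 500:
        return "down"
    if latency_s > slow_after:
        return "degraded"  # the page loads, but the experience is poor
    return "healthy"

def probe(url, timeout=5.0):
    """One synthetic 'user' request, timed end to end."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except (urllib.error.URLError, TimeoutError):
        status = None  # unreachable counts the same as a server error
    return classify(status, time.monotonic() - start)
```

Note the "degraded" verdict: a 200 response that takes too long is still a bad customer experience, which is exactly the nuance a bare ping test misses.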
The application is what we manage. A respondent will access a survey on our system and, in doing so, will touch various code bases, application servers, databases, and content delivery systems. This is the core of what Qualtrics does. As such, it’s important to have a significant amount of data and visibility into what our application is doing on our hardware. We’re most likely to catch a problem within the boundaries of our product first. Some examples of tools we use here are MonYog and Sensu (an alternative to Nagios). MonYog gives us visibility into database health, which is a core piece of our product. Sensu is a host-based monitoring application that can be configured to monitor everything from system health metrics such as load and memory utilization to application monitoring such as looking for an active Apache web server process.
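Sensu checks follow the Nagios plugin convention: a small script whose exit code is the verdict (0 = OK, 1 = WARNING, 2 = CRITICAL). A sketch of the "is Apache running" style of check, assuming a Linux host with pgrep (real Sensu checks also emit output the server records):

```python
import subprocess

def check_process(name):
    """Return an (exit_code, message) pair in the Nagios plugin
    convention: 0 = OK, 1 = WARNING, 2 = CRITICAL."""
    # pgrep -x exits 0 if at least one process has exactly this name
    result = subprocess.run(["pgrep", "-x", name], capture_output=True)
    if result.returncode == 0:
        return 0, f"OK: {name} is running"
    return 2, f"CRITICAL: {name} is not running"

# In a real check script you would finish with:
#   code, msg = check_process("apache2")
#   print(msg)
#   sys.exit(code)
```

Because the contract is just an exit code and a line of output, the same check works unchanged under Sensu, Nagios, or a plain cron job.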
Finally, the platform is what our product lives on: hardware, network, server racks, cooling, routers and switches, and the data lines by which the content is delivered. Internally, we call it oxygen. You can’t see it, but you need it to live, and if you don’t have it, you’re in critical condition. We use a combination of LibreNMS and Nagios to look for network errors, connectivity issues, traffic trends, and so on.
Tying It All Together
In a mature organization, you need not just monitoring and alerting tools like those listed above, but also solid processes for reacting to problems quickly. This is typically called an incident management process.
The process always starts with an event. Something breaks. Someone trips on a cable. A tornado wipes out a data center. Barnaby chews on the network cable for the office (...don’t ask). This is where your monitoring tools, like the ones listed above, kick in. They see something go wrong and they start alerting. Alerts go to engineering teams and on-call engineers in the form of automated emails and phone calls. The on-call engineer can then begin troubleshooting the problem. If the problem is big enough, we trigger an incident. At this point, the owner of the problem corrals any and all resources onto a call to work the problem until it’s fixed. Think red alert when the USS Enterprise is getting attacked. Once the problem has been fixed, the team goes into a root cause analysis with a focus on preventing future failures and understanding the impact to our customers. We never want to see the same failure happen twice, so we use the opportunity to harden our product.
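The routing logic in that flow (email for minor issues, phone call plus an incident for major customer-facing ones) can be sketched as a simple decision function. The severity levels and routing rules below are illustrative, not our actual escalation policy:

```python
def route_alert(severity, customer_facing):
    """Decide how an alert reaches engineers and whether to
    open an incident. Purely illustrative routing rules."""
    if severity == "critical" and customer_facing:
        # Major customer impact: page the on-call and rally everyone
        return {"notify": ["phone", "email"], "open_incident": True}
    if severity == "critical":
        # Serious but internal: page the on-call, no war room yet
        return {"notify": ["phone", "email"], "open_incident": False}
    # Minor issues wait for someone to read their email
    return {"notify": ["email"], "open_incident": False}
```

The useful property is that the escalation decision is explicit and testable, rather than living in an engineer's head at 3 a.m.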
In coming posts we’ll dive into the nitty-gritty, like how to set up meaningful monitors, how to create alerts that are actionable, how to manage an incident, and how to conduct a productive root cause analysis. If you are interested in learning about how our Qualtrics engineers developed these tools and best practices, stay tuned for more!