Running a Scalability & Resiliency Program
October 16, 2019 // 35min read
When working at an internet scale company, one of the most challenging problems we solve as engineers is making sure our services are flexible and performant enough to stand the test of scale. Too often, scalability and resiliency are considered only after an incident occurs and customers have already felt the pain. These types of problems can also become painfully difficult to fix because the team has to slow down feature innovation, which also hurts the business.
At Qualtrics, we found that running a team program that focuses on scalability and resiliency can proactively avoid customer outages. You can save your engineers the most work by integrating such a program early into the development lifecycle of new products, but even large, well-established products benefit from a proactive approach to scale. As an engineer on the Brand Experience (BX) team at Qualtrics, I recently had the opportunity to develop a program for my team as we worked toward releasing our first dedicated Brand Tracker product.
How do you run a program?
Problems like scalability and resiliency can command a lot of time from engineering teams just to maintain the status quo in our services. It’s often not reasonable to expect that engineers will drop everything to fix scale issues, and often teams will see diminishing returns on scale investments. The objective is then not to fix everything, but rather to study the state of the system and embed a scale-aware mindset into the engineering process for continuous improvement.
I started building BX’s program by defining what good scalability and resiliency look like. At Qualtrics, we have a team dedicated to scale testing and tools that helps other teams define their programs by publishing a grading system for teams to rate themselves on. Through the grading process, we’re able to define what constitutes an “MVP” for scalability and resiliency, and set criteria for what a more mature system would look like. The grading system doesn’t need to be complicated.
For example, here’s a modified version of the table we use to grade resiliency.
Resiliency grades are put in terms of development speed - higher grades mean that the team is in a position to release quality software at a faster rate. In this way, we can make these kinds of improvements objective. Attention to scalability doesn’t compete for engineering bandwidth; it actually translates to engineering velocity. Our org required our services to achieve the “Crawl” grade before we could release Brand Tracker.
Start with Resiliency
Once I established criteria for a resiliency and scalability goal, I first chose to address resiliency. Since Brand Tracker was a newer product, we were most likely to run into issues with our key dependencies being unavailable.
The first step of resiliency hardening is to be fully aware of the architecture your team has built and its implications. For example, our Brand Experience project service is a central component that stores Brand Tracker project instances and performs operations on them on behalf of the user. In order to perform its duties, it has several dependencies that we can divide into two categories.
The critical dependencies of an application are the ones that the application cannot function without. If these dependencies are down, your app cannot recover from failures, and the user will perceive the impact.
The non-critical dependencies of an application are the ones that the application can still function without. Perhaps only a subset of that service’s functionality cannot recover or the app can gracefully degrade functionality such that the user may perceive partial or no impact.
Complicated systems will have a large number of services, but you should perform this dependency enumeration exercise on each individual component. Then, create a list of functionalities that the component provides. Functionalities typically map 1-1 to API requests, but they can sometimes be workflows involving several actions. If your service is a backend service behind several service layers, consider direct downstream dependencies as your users and describe what functionality you provide for them. Then, create a Failure Matrix between the dependencies and the functionalities.
| Functionality | Session authentication dependency | Dashboard Service | Survey Service |
|---|---|---|---|
| List User’s Projects | Critical - Sessions fail to authenticate | No Impact | No Impact |
| Create a Dashboard | Critical - Sessions fail to authenticate | Critical - Cannot create Dashboard | No Impact |
| Collaborate Survey | Critical - Sessions fail to authenticate | No Impact | Critical - Cannot collaborate survey |
Since our Dashboard Service and Survey Service are not responsible for all our functionality, we can mark these as non-critical. In general, fewer dependencies will lead to better service resiliency. However, characterizing your dependencies in this way allows your team to discover the nature of your architecture and potentially notice areas for improvement.
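As a rough sketch, the Failure Matrix can also be kept as data so the classification is derived rather than maintained by hand. The dependency key `auth` is a placeholder for whichever service authenticates sessions, and the rule used here (a dependency is critical only when every functionality fails without it) is one reading of the reasoning above, not a Qualtrics standard.

```python
from enum import Enum


class Impact(Enum):
    NONE = "No Impact"
    CRITICAL = "Critical"


# Rows are functionalities, columns are dependencies.
# "auth" is a placeholder name; the post does not name that dependency.
FAILURE_MATRIX = {
    "List User's Projects": {
        "auth": Impact.CRITICAL,
        "dashboard-service": Impact.NONE,
        "survey-service": Impact.NONE,
    },
    "Create a Dashboard": {
        "auth": Impact.CRITICAL,
        "dashboard-service": Impact.CRITICAL,
        "survey-service": Impact.NONE,
    },
    "Collaborate Survey": {
        "auth": Impact.CRITICAL,
        "dashboard-service": Impact.NONE,
        "survey-service": Impact.CRITICAL,
    },
}


def classify_dependencies(matrix):
    """Mark a dependency critical only if every functionality fails without it."""
    dependencies = {dep for row in matrix.values() for dep in row}
    result = {}
    for dep in sorted(dependencies):
        breaks_everything = all(
            row.get(dep, Impact.NONE) is Impact.CRITICAL for row in matrix.values()
        )
        result[dep] = "critical" if breaks_everything else "non-critical"
    return result


def affected_functionalities(matrix, dependency):
    """List the functionalities a user loses when one dependency is down."""
    return [
        name
        for name, row in matrix.items()
        if row.get(dependency, Impact.NONE) is Impact.CRITICAL
    ]
```

Calling `affected_functionalities(FAILURE_MATRIX, "survey-service")` gives the blast radius of a single outage, which is a useful input when planning a resiliency test.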
I repeated the “Failure Matrix” exercise for a variety of other conditions:
- A dependency is very, very slow to respond, causing connections to be long lived
- A dependency service cluster has a few bad nodes, causing some requests to fail but others to succeed
- A dependency comes back online from a period of unresponsiveness
Considering a variety of situations allows you to further qualify the nature of each dependency. If you’re not sure what a service will do in certain situations, this is a good opportunity to investigate the code, talk to other developers on the team or do some quick manual testing. Make a hypothesis that you can test later.
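One way to make a hypothesis about a slow dependency testable is to write down the behavior you expect as code. The sketch below is an illustration only, not how our services are implemented: it assumes a non-critical dependency can be wrapped in a timeout and replaced by a fallback value, and `slow_dependency` is a hypothetical stand-in for a real downstream call.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(dependency_call, timeout_seconds, fallback=None):
    """Run dependency_call, giving up after timeout_seconds.

    Returns (value, degraded). When the call times out or raises,
    the fallback is returned and degraded is True.
    """
    future = _pool.submit(dependency_call)
    try:
        return future.result(timeout=timeout_seconds), False
    except FutureTimeout:
        # Best effort only: a call that is already running is not interrupted.
        future.cancel()
        return fallback, True
    except Exception:
        return fallback, True


def slow_dependency():
    """Hypothetical dependency that takes half a second to answer."""
    time.sleep(0.5)
    return ["dashboard-1"]


def healthy_dependency():
    """Hypothetical dependency that answers immediately."""
    return ["dashboard-1"]
```

Here the hypothesis "listing still works when the dependency is slow" becomes a check that `call_with_timeout(slow_dependency, 0.05, fallback=[])` returns the fallback with `degraded` set, which can later be compared against what the real service does.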
Improve with Scalability
After defining key dependencies in our system, the next step was to investigate our architecture from the scalability perspective. Like resiliency, building our scalability picture is all about studying how our system should behave. In addition, we’re going to set standards on how it should behave.
I started with the functionalities I listed for each service in my resiliency thoughtwork. I grouped these items into rough categories of performance. If a request was UI-blocking and the user could immediately perceive latency, I labeled those functionalities as “low latency”. For other workflows that had higher latency tolerances, I labeled those as “moderate” and “high” latency items.
| Low Latency Tolerance | Moderate Latency Tolerance | High Latency Tolerance |
|---|---|---|
| Create a Dashboard - user waits until dashboard is created to continue working | List a user’s Projects - user can tolerate occasional latency since this is a more “passive” action | Collaborate a Survey - sharing access to a survey doesn’t need to immediately reflect, since the recipient usually accesses the survey separately |
It might be your instinct to put everything into the low or moderate latency columns, but having high latency workflows in the system is not a bad thing. When high latency is acceptable, the user is more tolerant of delays, which gives your service a buffer when dealing with load. You might also need more latency categories for your system - create these using your best judgement but keep the number of categories small.
For each functionality in each latency category, we’re going to fill out this simple formula:
When doing ____, if more than X% of requests take longer than Y milliseconds for more than Z minutes we will consider the performance to be in the (warning|critical) zone.
Warning thresholds mark when a latency becomes high enough that customers may perceive that the product feels slow.
Critical thresholds mark when a latency becomes high enough that customers may feel the product has become unusable.
In order to do this we need to define a few things:
- What latency number counts as a “Warning” threshold?
- What latency number counts as a “Critical” threshold?
- How long can performance be degraded before customers notice?
- How many customers need to be affected to control for outliers?
This work is perhaps the hardest part of building a latency program. New services often don’t have enough organic production traffic to deduce acceptable latency metrics. Mature services may have traffic, but it can be difficult to imagine what the target should be if the system already suffers from latency issues.
At Qualtrics, we have adopted a significance standard of 5% and usually tune the degradation period based on how volatile a service’s performance can be. Remember that nothing is written in stone; you can always tune the formula to your system. Given this, I created another matrix relating my latency categories to thresholds, which helped me make a rough first guess. This gets our latency picture started so that we can tune it using insights from weekly operational reviews.
Sometimes there are a few specific functionalities that don’t fit neatly within a latency category. It’s ok to define custom thresholds for these - when getting started, however, I found it more approachable to stick to a few categories.
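The threshold formula can be sketched in code to make the roles of X, Y and Z concrete. This is a minimal illustration under stated assumptions: latencies arrive as one list per minute, "for more than Z minutes" is read as more than Z consecutive breaching minutes, and any concrete latency or duration values you plug in are placeholders for your own system (only the 5% figure comes from the text above).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    percent: float     # X: share of requests allowed to be slow, in percent
    latency_ms: float  # Y: latency above which a request counts as slow
    minutes: int       # Z: how long the breach must last


def minute_breaches(latencies_ms, threshold):
    """True when more than X% of one minute's requests exceed Y ms."""
    if not latencies_ms:
        return False
    slow = sum(1 for value in latencies_ms if value > threshold.latency_ms)
    return 100.0 * slow / len(latencies_ms) > threshold.percent


def longest_breach(per_minute_latencies, threshold):
    """Length of the longest run of consecutive breaching minutes."""
    longest = current = 0
    for minute in per_minute_latencies:
        current = current + 1 if minute_breaches(minute, threshold) else 0
        longest = max(longest, current)
    return longest


def zone(per_minute_latencies, warning, critical):
    """Return 'critical', 'warning' or 'ok' for a window of minutes."""
    if longest_breach(per_minute_latencies, critical) > critical.minutes:
        return "critical"
    if longest_breach(per_minute_latencies, warning) > warning.minutes:
        return "warning"
    return "ok"
```

For example, with a hypothetical warning threshold of `Threshold(5.0, 500, 2)`, three consecutive minutes in which 10% of requests take 800 ms would land in the warning zone, while two such minutes would not.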
Seeing your system clearly
After building an analysis of the system, we need to add visibility so that we can evaluate the claims and hypotheses we made. Mature systems may already have well-tuned instrumentation, but for new architectures like our Brand Tracker product, all of our new services required some instrumentation.
At Qualtrics we use the StatsD protocol to collect metrics on events like request counts and downstream call latency. We send these via a Telegraf agent and aggregate them in a Prometheus instance. We also have a system called “Alert on Metrics” where we can configure Prometheus queries to run at regular intervals and send engineers alerts when a metric exceeds the configured threshold. Finally, we use Sumo Logic to aggregate logs from our service and build queries and reports on the collected output.
The exercise of hooking these systems into our architecture and Ops processes may deserve its own blog post, but here I’d like to explore a question I asked myself when I started - What metrics do I need?
Some general tips:
- Instrument latency measurements for all the major functionalities you identified in the scalability and resiliency analysis. Measurements should be at the transaction level. For example, in Brand Tracker these included time to load a list of the user’s projects or the time required to import a dashboard.
- Instrument the outbound transactions a service makes with its critical dependencies. Both latency measurements and failed request counts are interesting.
- Instrument error counts on major functionalities, as well as total request counts.
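The tips above can be sketched as a small StatsD emitter that records a request count, an error count and a latency timer per transaction. This is an illustration using only the Python standard library, not our production client: the metric prefix `bx`, the metric names, and the agent address (`127.0.0.1:8125`, the conventional StatsD port) are assumptions.

```python
import socket
import time
from contextlib import contextmanager


def format_metric(name, value, metric_type):
    """Build one StatsD line, e.g. 'bx.list_projects.latency:12|ms'."""
    return f"{name}:{value}|{metric_type}"


class StatsdClient:
    def __init__(self, host="127.0.0.1", port=8125, prefix="bx"):
        self._address = (host, port)
        self._prefix = prefix
        self._socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _send(self, name, value, metric_type):
        line = format_metric(f"{self._prefix}.{name}", value, metric_type)
        try:
            self._socket.sendto(line.encode("ascii"), self._address)
        except OSError:
            pass  # metrics must never break the request path

    def increment(self, name):
        """Counter: add one to the named metric."""
        self._send(name, 1, "c")

    def timing(self, name, milliseconds):
        """Timer: record a duration in milliseconds."""
        self._send(name, int(milliseconds), "ms")

    @contextmanager
    def transaction(self, name):
        """Emit request count, error count and latency for one transaction."""
        self.increment(f"{name}.requests")
        started = time.monotonic()
        try:
            yield
        except Exception:
            self.increment(f"{name}.errors")
            raise
        finally:
            elapsed_ms = (time.monotonic() - started) * 1000
            self.timing(f"{name}.latency", elapsed_ms)
```

Wrapping a handler body in `with client.transaction("list_projects"):` keeps the measurement at the transaction level, and the same wrapper can be placed around outbound calls to critical dependencies.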
Another piece of advice is to not have metrics for everything. This was an important tip that I found useful while building metrics coverage for the Brand Tracker project. Instrumenting an entire product can be a daunting task, and for new products like Brand Tracker, it’s hard to say in advance which areas of the service need additional visibility. That’s why it’s important to follow a “cone” approach.
When designing your dashboards and aggregating instrumented metrics, try to build reports around a few “life-blood” metrics. These might be the functionalities that are most central to the service or that rely on the most dependencies. For each of these, create volume counts along with timing and error counts. For example, this is a panel on one of our Sumo Logic dashboards for our GetProject API.
Here, I’ve timesliced the metrics by the same 30 minute time bucket and graphed volume, TP50 and TP95 of the requests over the same period. This way we can frame latency observations in the context of request volume and see how many users are affected by latency - the second graph shows how the majority of requests perform while the third graph shows us the worst 5% of our users’ experiences.
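To show what such a panel computes, here is a small sketch that groups requests into 30 minute buckets and reports volume, TP50 and TP95 for each. It is not the Sumo Logic query behind the dashboard; the input shape (pairs of timestamp in seconds and latency in milliseconds) and the nearest-rank percentile method are assumptions made for the example.

```python
import math
from collections import defaultdict

BUCKET_SECONDS = 30 * 60


def percentile(sorted_values, percent):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(percent * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(requests):
    """Group (timestamp_seconds, latency_ms) pairs into 30 minute buckets.

    Returns {bucket_start: {"volume": n, "tp50": ..., "tp95": ...}}.
    """
    buckets = defaultdict(list)
    for timestamp, latency_ms in requests:
        bucket_start = int(timestamp // BUCKET_SECONDS) * BUCKET_SECONDS
        buckets[bucket_start].append(latency_ms)

    summary = {}
    for bucket_start in sorted(buckets):
        latencies = sorted(buckets[bucket_start])
        summary[bucket_start] = {
            "volume": len(latencies),
            "tp50": percentile(latencies, 50),
            "tp95": percentile(latencies, 95),
        }
    return summary
```

Keeping volume next to the percentiles matters: a high TP95 in a bucket with a handful of requests tells a very different story from the same TP95 at peak traffic.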
Our dashboard is made up of a handful of metrics. These metrics are telltale signs of performance in our service and are the ones most noticeable to customers. When we notice a poor experience in our top level metrics, we can dive deeper by inspecting the related and child requests that roll up into that metric to pinpoint the source of latency and errors. As we continue to operate our service, we might build additional dashboards that allow us to “zoom in” on a datapoint quickly. In this way we can manage the complexity of dashboards while retaining easy access to the details.
Testing validates assumptions
After going through the exercise of studying our services, tracking their dependencies, and benchmarking their performance, it was time to test our assumptions and see the outcomes through our metrics.
At Qualtrics we have a Scale test tools team. They are like a special task force charged with driving scale and resiliency quality across all our engineering teams. They explore and package tools for developers to use in order to instrument and test their systems, while leaving the actual quality execution to developers.
Our strategy for testing our services was to set up a “Gameday” where we could carry out a series of exercises on our product and services to see how our expectations hold up and uncover any issues with our service resiliency and scalability. Holding a gameday provides several benefits:
- We needed to use a shared development environment to run the tests. This can be disruptive to other engineers so it’s best to limit this kind of activity to a scheduled time and alert stakeholders who may be affected.
- For fairly new teams that have not yet matured to build scale and resiliency testing into their development lifecycle, it helps to knock out a bunch of testing all together, reducing the overhead of setting up and running each test.
- I treated the session as a learning opportunity to teach the team about the program I had been building. As a team we can learn what actually happens when our service encounters a problem and how we can identify it, improving the operational competency of the entire team.
To create a consistent flow of metrics to track our service performance, I set up basic scale tests with Artillery, a load testing tool maintained by our scale test tools team. The same tests can be used to stress test our core workflows, and having this test suite is a valuable first step to integrating stress testing into the release process.
To run resiliency testing, Qualtrics partners with a service called Gremlin that provides chaos engineering tools. Using Gremlin, we can carry out a variety of attacks on our nodes that simulate chaos in a datacenter. During our gameday, we were able to train the team in using this tool so that our team could independently run resiliency tests.
The gameday helped us clearly see gaps in our service architecture and operational response time. We were able to see that some of the alerts had been misconfigured and did not fire when expected. This was an extremely valuable insight to discover and resolve in a testing environment before we released the product to real customers.
Beyond the first test
Once we completed our first gameday, we ended up with a number of improvements that we could start making to services. Since software systems are ever evolving and business requirements are ever changing, it’s important to consider how to move forward with your scalability and resiliency program without stifling feature innovation.
As I mentioned at the beginning of this post, the objective of your program is to study the state of the system and embed a scale-aware mindset into the engineering process for continuous improvement. Building a highly scalable and resilient architecture is a marathon, not a sprint.
For our team, I created a plan for how our system should mature up the grading scale. Each of the criteria items for the higher-level grades was mapped to potential action items that our team could undertake. I then set high level objectives and a rough timeline for our team to achieve higher grades.
A good way to balance feature velocity with scale and resiliency investment is to tie that investment to your metrics. Your testing will reveal the limits of the system - it’s a good idea to track those metrics and set milestones for completing scale and resiliency investments to support traffic thresholds. For example, you may set a requirement for the team to achieve the next level of maturity before your total request volume reaches a certain threshold. This allows your team to be proactive about addressing scale issues while giving you breathing room to balance investment with feature work.
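Tying maturity milestones to traffic can be as simple as a lookup table. In the sketch below, every number is a placeholder and the grade names "Walk" and "Run" are hypothetical (only "Crawl" is named in this post); the point is the shape of the check, not the values.

```python
# (daily request volume, grade required before reaching that volume).
# All thresholds are placeholders; "Walk" and "Run" are hypothetical grade names.
MILESTONES = [
    (100_000, "Crawl"),
    (1_000_000, "Walk"),
    (10_000_000, "Run"),
]


def next_milestone(daily_requests, milestones=MILESTONES):
    """Return (grade, remaining_requests) for the next threshold not yet reached."""
    for threshold, grade in sorted(milestones):
        if daily_requests < threshold:
            return grade, threshold - daily_requests
    return None


def overdue_grades(daily_requests, achieved, milestones=MILESTONES):
    """Grades whose traffic threshold has been reached without being achieved."""
    return [
        grade
        for threshold, grade in sorted(milestones)
        if daily_requests >= threshold and grade not in achieved
    ]
```

A check like `overdue_grades(current_volume, achieved)` could run as part of a regular operational review, so a missed milestone shows up before it turns into an incident.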
After running this program, it was easy for our team to feel confident as we approached our General Availability deadline. In the end, the best kind of release is the uneventful kind. However, if we do encounter issues, we have built up a framework of tools and awareness that we can leverage to provide the best possible experience to our customers without losing sleep, and we have a path forward to get even better over time.