Minimizing Operational Cost of Inherited Services

Qualtrics is growing rapidly, both in our customer base and in the amount of data we process daily. To keep up with demand, we have been growing our engineering workforce at a similar rate. With so much growth, it is inevitable that new teams will be formed and asked to take ownership of existing product and platform features. How do new teams get ready to handle the operational demands of these inherited features? At Qualtrics, we’ve tackled this problem with a checklist of items an inherited service should have, so that the team can handle any customer issue in the best way possible.

While we won’t go into exhaustive detail about what is in this checklist, we will cover a few of the most important top-level areas.

Service Checklist

  • Metrics
  • Testing
  • Deployment
  • War Games
  • Runbook

Metrics

First and foremost, a service must have proper logging and alerting set up so engineers are notified in the event of an availability blip. Logging should cover all external calls to the service, as well as calls to the dependencies the service relies on. Logs can carry both informational and error messages, but at a minimum we recommend tracking when errors occur, to aid service debugging. To make logs easier to query, we use Sumo Logic to filter log messages down to what we need. From there, we can create graphs and tables to illustrate customer usage of the service.
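As a minimal sketch of this kind of logging, a decorator can record every call to a dependency and log failures with a stack trace. The logger name, the `log_call` helper, and the `fetch_survey` function are hypothetical names chosen for illustration, not part of any Qualtrics codebase.

```python
import functools
import logging

logger = logging.getLogger("inherited_service")  # hypothetical logger name


def log_call(dependency):
    """Log every call made to `dependency`, and log failures at ERROR level."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            logger.info("calling %s via %s", dependency, func.__name__)
            try:
                return func(*args, **kwargs)
            except Exception:
                # logger.exception logs at ERROR level and includes the traceback.
                logger.exception("call to %s via %s failed", dependency, func.__name__)
                raise
        return wrapper
    return decorator


@log_call("survey-db")
def fetch_survey(survey_id):
    # Hypothetical dependency call; a negative id simulates a failure.
    if survey_id < 0:
        raise ValueError("invalid survey id")
    return {"id": survey_id}
```

Because the error is re-raised after logging, callers still see the failure; the log line only adds the context needed when debugging later.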

Another category of metrics that should be tracked is the health of the server on which the service resides. We need these healthchecks for situations where network or memory issues arise and the service’s logs on customer usage do not fully explain why customers are experiencing degraded performance. We use statsd, KairosDB, and Grafana to create a data lake of metrics from all our servers, which we then filter into dashboards that represent service availability. We have found Grafana to be especially useful for internal teams to monitor individual call usage and discover bottleneck functions for improvement. These dashboards are commonly displayed on team TVs as a visual way to monitor service health.
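For illustration, here is a rough sketch of emitting a metric in the plain-text StatsD wire format (`name:value|type`) over UDP. The metric names, and the host and port defaults, are assumptions; a real setup would more likely use an existing StatsD client library.

```python
import socket


def statsd_line(name, value, metric_type):
    """Format one metric in the StatsD wire format: c=counter, g=gauge, ms=timer."""
    if metric_type not in ("c", "g", "ms"):
        raise ValueError("unsupported metric type: %r" % (metric_type,))
    return "%s:%s|%s" % (name, value, metric_type)


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget one metric over UDP; host and port are assumed defaults."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()
```

A healthcheck loop could then call something like `send_metric(statsd_line("myservice.memory_mb", 512, "g"))` at a fixed interval, where `myservice.memory_mb` is a made-up metric name.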

While both logging and healthcheck metrics yield nice visual representations of service health, the real benefit of tracking these metrics is the ability to add alerts for when values rise above or fall below a certain threshold. For example, when 4xx and 5xx errors increase dramatically, our on-call engineers are alerted immediately to handle the issue.
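A threshold alert of this kind can be sketched in a few lines. The 5% default threshold and the function names below are assumptions for illustration; in practice the rule would live in the alerting tool rather than in application code.

```python
def error_rate(status_codes):
    """Return the fraction of responses with a 4xx or 5xx status code."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 400 <= code < 600)
    return errors / len(status_codes)


def should_alert(status_codes, threshold=0.05):
    """Return True when the error rate exceeds the (assumed) threshold."""
    return error_rate(status_codes) > threshold
```

Alerting on a rate over a recent window, rather than on a raw error count, keeps the rule meaningful as traffic grows.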

Testing

To ensure that new changes from engineers not yet familiar with the codebase will not break existing functionality, a robust test framework must be in place. While we encourage new code to follow the test-driven development model, good testing may not exist for legacy codebases. As a result, it is crucial to check that an inherited service has the proper unit and integration tests implemented.

Unit tests ensure that utilities keep the input and output behavior the original owner intended. They should be well isolated, so that changes to other files within the codebase do not affect a particular test.
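As a small example of an isolated unit test, the sketch below pins down the input and output of a single utility using Python's built-in `unittest` module. The `normalize_email` utility is hypothetical and stands in for whatever helper the inherited service exposes.

```python
import unittest


def normalize_email(raw):
    """Hypothetical utility: trim whitespace and lowercase an address."""
    return raw.strip().lower()


class NormalizeEmailTest(unittest.TestCase):
    def test_strips_and_lowercases(self):
        self.assertEqual(normalize_email("  User@Example.COM "), "user@example.com")

    def test_clean_input_is_unchanged(self):
        self.assertEqual(normalize_email("user@example.com"), "user@example.com")
```

Saved as a module, these tests can be run with `python -m unittest`. They touch no files, network, or shared state, so unrelated changes elsewhere in the codebase cannot break them.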

Integration tests ensure that a service’s external connections, such as a database or other internal services, are implemented properly. The kinds of integration tests you see depend heavily on the application of the service being inherited. If the service is user facing, we run our typical MEAN stack against a Selenium end-to-end framework. More information about how to implement this for your own app is listed in this prior blog post. On the other hand, if the service is internal facing, integration test implementation is more team-dependent. Your job as an inheriting team member is to determine what that implementation is and add to it as functionality grows.

Note that when dealing with a limited timeline, the priority should always be to have more unit tests than integration tests in your app, new or legacy. This approach is known as the test pyramid. Unit tests offer a faster feedback loop since smaller portions of the codebase are tested in isolation. Integration tests usually require connecting with and/or mocking external dependencies that may be flaky, or that assume conditions that do not exactly match your production environment.
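One way to push more tests toward the base of the pyramid is to inject the external dependency and replace it with a stand-in during the test. The sketch below uses `unittest.mock` from the standard library. The `response_count` function, the client's `get_responses` method, and the `SV_123` id are hypothetical.

```python
from unittest import mock


def response_count(client, survey_id):
    """Count responses returned by an injected client (hypothetical interface)."""
    return len(client.get_responses(survey_id))


def test_response_count_with_fake_client():
    # The fake client stands in for a real service, so no network is involved.
    fake_client = mock.Mock()
    fake_client.get_responses.return_value = [{"id": 1}, {"id": 2}]
    assert response_count(fake_client, "SV_123") == 2
    fake_client.get_responses.assert_called_once_with("SV_123")
```

The counting logic is now covered by a fast unit test, and only the real client's wiring still needs an integration test.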

Another type of testing not yet mentioned is load/performance testing. Especially with services that have an existing customer base, inheriting teams often are unaware of the stresses that the current users are placing on the service. An upcoming post will explain how we have decided to handle this scenario, but note that this unknown can be easily resolved if the proper metrics, mentioned in the section above, are implemented to alert engineers when things start going wrong.

Deployment

The most important part of the service transition is to have the deployment steps listed out. As described in our build isolation post, we use Jenkins jobs to test our dockerized implementation before deployment. Changes are usually listed in a CHANGELOG.md file within the repository itself, so other engineers can match the type of change with the version of the Docker container deployed on production machines. A future post will explain the reasoning behind this in more detail, but this level of diligence around change management is especially important when things go wrong, since it allows rollback to the previous version. Debugging becomes simpler because we can focus on the changes since the last stable release.

It is important to note that deployments will have an impact on the metrics recorded for the service. Since metrics can be filtered to particular servers, the intentional downtime of a server while a new version is being deployed will show up in them. To handle this we use two methods:

  1. Stashing alerts
  2. Drain and Flip pipeline

For alerting, we use Sensu to let us know when our Docker containers are down. To silence them for the duration of our deployment, we run the following commands:

This silences the alerting for 100 seconds while the deployment is occurring and manually removes the stash when completed. If your service takes longer than 100 seconds to deploy, feel free to update that number to whatever suits the service best. Also note that the manual removal of the stash is necessary, since failing to do so can result in teams not catching real alerts.
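The stash-then-remove flow can be sketched as follows. It assumes the stashes endpoints of the Sensu Core API (`POST /stashes` with `path`, `content`, and `expire` fields, and `DELETE /stashes/<path>`), together with the `silence/<client>` path convention. The API address, client name, and helper names are all hypothetical, and this is not necessarily the exact set of commands we run.

```python
import json
import urllib.request

SENSU_API = "http://localhost:4567"  # assumed API address


def build_stash_request(client_name, expire_seconds=100, api=SENSU_API):
    """Build a POST /stashes request that silences alerts for one client."""
    payload = {
        "path": "silence/%s" % client_name,
        "content": {"reason": "deployment in progress"},
        "expire": expire_seconds,  # the stash expires on its own after this long
    }
    return urllib.request.Request(
        api + "/stashes",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_unstash_request(client_name, api=SENSU_API):
    """Build the DELETE request that removes the stash after deployment."""
    return urllib.request.Request(
        "%s/stashes/silence/%s" % (api, client_name), method="DELETE"
    )


def deploy_with_silence(client_name, deploy, send=urllib.request.urlopen):
    """Silence alerts, run `deploy`, and always remove the stash afterwards."""
    send(build_stash_request(client_name))
    try:
        deploy()
    finally:
        # Remove the stash even if the deployment fails, so real alerts resume.
        send(build_unstash_request(client_name))
```

Putting the removal in a `finally` block reflects the point above: a stash that lingers after the deployment can hide real alerts.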

The Drain and Flip pipeline, also known as Blue-Green Deployment, is a way we deploy services such that customers experience zero downtime. We deploy each instance of the service independently so that failed deployments are isolated to a particular instance that is unreachable by customers. Instances are first isolated and allowed to complete any pending requests. Then the instance is updated to the newer version of the service before rejoining the availability pool.
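The isolate, update, and rejoin sequence can be modeled roughly as below. The instance dictionaries and the `deploy` callback are hypothetical stand-ins for a real load balancer and deployment job, and waiting for pending requests to drain is reduced to a comment.

```python
def drain_and_flip(instances, new_version, deploy):
    """Update instances one at a time, keeping the others in the pool.

    `instances` is a list of dicts with "name", "version", and "in_pool" keys.
    `deploy(instance, version)` returns True on success. Both are hypothetical.
    """
    for instance in instances:
        # Isolate the instance so it stops receiving new traffic.
        instance["in_pool"] = False
        # A real pipeline would wait here for pending requests to complete.
        if not deploy(instance, new_version):
            # Leave the failed instance out of the pool and stop the rollout.
            return False
        instance["version"] = new_version
        # Rejoin the availability pool on the new version.
        instance["in_pool"] = True
    return True
```

Stopping at the first failure keeps a bad build confined to a single instance that customers cannot reach, while the remaining instances keep serving the previous version.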

War Games

The above methods have all been about how to better prepare the codebase for the upcoming transition between teams, but they neglect the people involved in this move! Engineers experienced in the moving codebase should play an active part in training the “new” engineers on the existing processes. To do this, we use the same fun approach we use to ramp new engineers onto our on-call process: war games. In these games, the experienced engineers create typical bugs and service blips for the newer engineers to solve within a time limit. This simulates the on-call scenario within a controlled environment, so engineers can figure out how a service runs without customers experiencing downtime. The receiving team learns about the codebase in a safe manner, and both sides gain confidence that future real incidents can be handled.

Runbook

Since the engineers from the old team cannot be relied on indefinitely, we turn to the last (and arguably most critical) part of the service transition: documentation on how the service operates. We call this form of documentation a runbook. A runbook takes all the concepts described above and puts them in a single location, usually as a RUNBOOK.md file within the repository itself or in a wiki page that is shared within the team. At a bare minimum this documentation will include:

  • Common service bugs, drawn from the tribal knowledge experienced engineers used to create the war game scenarios
  • Steps and links to deployment jobs across all data centers, including deployment cycle frequency (continuous vs scheduled)
  • Instructions for running the test frameworks and, if necessary, testing coverage reports
  • Metrics for the service and how to find them

Conclusion

Inheriting services is never an easy task. However, armed with the right tools, this operational transition can be made easier by implementing the steps above. These steps have certainly helped our services transition seamlessly between teams, and we hope they will help yours too.

Coreen Yuen
Software Engineer at Qualtrics
Coreen joined Qualtrics in July 2015. Coreen is passionate about making sense out of big data through visualizations and improving operational performance across applications. She has been in various roles over the years and most recently is the lead for the Product Experience team creating tools for product prioritization and insights.

Coreen graduated from the University of Washington with a Bachelor's Degree in Computer Science. Born and raised in Seattle, she enjoys swimming and reading in her free time.
