
Operations

Is your team’s new engineer ready to take on-call? Use wargames for training

Jeffrey Starr // February 3, 2017 // 7min read

At Qualtrics, our engineering organization is expanding rapidly. My team, Text Analytics, has grown from four to eight members in less than six months, and more are coming. As team members come in, part of the on-boarding process is preparing them to take on-call. The on-call engineer is the first responder for incidents -- responsible for either resolving the issue or escalating it if they need help. Culturally, we’ve decided that teams are responsible for the systems they build, which promotes ownership and gives teams a reason to fix maintainability issues.

To be ready for on-call rotation, a team member must:

be familiar with our processes,
have certain accounts, permissions, and tools, and
be competent to diagnose and correct problems in our tech stack.

Although some of our new members who have transferred from other teams are well versed in the company’s on-call procedures, each team’s tech stack is different, so all engineers go through a learning process. We’ve been using wargames as part of this process to train new employees and stress test our processes and runbooks.

For the latest wargame, we established two teams: a red team and a blue team. The blue team was made up of three new team members who were charged with diagnosing and correcting problems as they arose. The red team consisted of our team lead, who volunteered to act as the gremlin, causing issues. I acted as judge and coordinator. I scheduled a conference room for a half-hour, but ultimately the games lasted longer than an hour because we were having a lot of fun.

The red team prepared six issues beforehand. The issues were selected from common or instructional incidents that had occurred over the past six months. The list was:

Issue and Resolution	Caused By
Service dies. Blue team restarts service.	`kill service`
Check disk alert; disk is low on space. Red team “hid” the file so blue team had to use tools (e.g. `ncdu`) to find it and delete it (after verifying it was safe to delete).	`fallocate -l 6g file`
A downstream service dies. Red team killed a downstream service whose health is reported in a higher-level service’s health. Restarting the higher-level service is insufficient for restoring health. Blue restarts the downstream service and the upstream service, restoring connectivity.	`kill service`
High CPU load. Blue team finds the offending process and kills it. (This was difficult to simulate; automated processes killed fork bombs and handled some runaway queries automatically. Go tech ops for hardening our base systems!)	`dd if=/dev/zero of=/dev/null`
High count of 404 errors. This was to simulate an inconsistent issue (some servers had the file, some did not) that required understanding how files were served in our architecture. Blue team resyncs static assets.	`rm file`
High count of authentication errors, without alarms. This required the Blue team to trace authentication errors between services and, once they found the root issue (a dead service), determine why alarms hadn’t fired, driving understanding of how monitoring and alerts travel among our systems.	`stash alert kill service`
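The disk-space scenario can also be worked without an interactive tool like `ncdu`. The following is a minimal sketch, assuming a POSIX shell with `find`, `du`, and `sort` available. The function name `largest_files` and its arguments are illustrative and were not part of the wargame itself.

```shell
# List the N largest regular files under a directory, biggest first.
# Hypothetical helper: the name and arguments are illustrative.
largest_files() (
    dir=$1
    count=${2:-5}
    # find does not skip dotfiles, so "hidden" files are included.
    # du -k reports allocated space in 1024-byte units; sort -rn puts
    # the biggest files at the top.
    find "$dir" -type f -exec du -k {} + 2>/dev/null | sort -rn | head -n "$count"
)
```

For example, `largest_files /var 10` would print the ten largest files under `/var`. As in the exercise, anything it surfaces should be verified as safe to delete before it is removed.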

During the wargame, the blue team rotated on-call, but collectively dug into each issue as it arose. To maintain some idea of what the current on-call engineer was doing, we shared their desktop onto the conference room’s display.

The first few scenarios went smoothly. In these scenarios (e.g. service dead), the alerts aligned with the corrective action, and the blue team was familiar with the core commands to restore health from their regular development activities. They had a more difficult time solving a downstream or “deep” health check problem. First, the alert looked very similar to a regular health check alert (the “deep” is buried at the end of a long string). Second, once they checked the logs, they knew it was a downstream service, but didn’t know which one. (Clearer logs became a story for our backlog.)
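One way to make a “deep” health check easier to act on is to have it name the failing dependency. The sketch below is an illustration under stated assumptions, not our actual health check: the function name `deep_health`, the `name=command` argument format, and the output wording are all hypothetical.

```shell
# Run each named dependency check and report which ones fail.
# Arguments are "name=command" pairs; both the format and the function
# name are hypothetical, chosen only for this sketch.
deep_health() (
    status=0
    for pair in "$@"; do
        name=${pair%%=*}     # text before the first "="
        check=${pair#*=}     # text after the first "="
        if sh -c "$check" >/dev/null 2>&1; then
            echo "ok   $name"
        else
            echo "FAIL $name (downstream)"
            status=1
        fi
    done
    # Non-zero if any dependency failed, so callers can alert on it.
    return "$status"
)
```

A call might look like `deep_health "auth=curl -fsS http://localhost:8081/health" "search=curl -fsS http://localhost:8082/health"`, where the service names and URLs are placeholders. Each dependency gets its own line, so the on-call engineer sees which downstream service to restart instead of inferring it from logs.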

By the time we got to the complex scenarios, the team had the service documentation, runbooks, and architecture diagrams up, so they could quickly run through a number of diagnostics and explore multiple options in parallel. However, since the scenarios started focusing on second-order effects or problems rooted more in infrastructure than in our own code, the blue team was still challenged. The training moved away from understanding processes and our architecture and became focused on troubleshooting tools and system reasoning.

Lessons Learned

We learned a few things right away: one engineer’s phone was dead and another had alerts set up incorrectly. Although we annotate most of our alerts with links to a runbook with instructions on how to correct issues, we had failed to tell the new engineers that these runbooks existed -- a hole in our training.

Also, sharing desktops did not work well; the activity was too fluid and detailed to track on a single display. We will probably simulate a call leader scenario next time, with the team adopting leadership and scribe roles to better simulate larger incidents.

Within their local development environments, our newly hired engineers were experienced troubleshooters, but it turned out this knowledge did not generalize well to production. For example, restarting services in production is a different process than within docker-machine. Files are served from a local directory in development but are served from linked data containers in production. Their difficulties showed holes in our documentation and training.

Because all three engineers were working to diagnose and correct the issue at the same time, activity was hectic and communication broke down. Since the on-call engineer can correct most issues by themselves, this isn’t normally an issue, but it does indicate we may need to spend more time establishing structured processes for handling communication and coordination in the event of a major incident. In fact, later experience led us to develop a wargame specifically around these incident roles.

Finally, the blue team realized (but did not take advantage of the fact) that the actions of the red team could be tracked by the command line history. If we have a more nefarious group of hires, we may need to find a way to cheat the auditing systems.
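To illustrate how little effort that tracking would take, here is a minimal sketch that scans a shell history file for the kinds of commands the red team used to inject faults (`kill`, `rm`, `fallocate`, `dd`). The function name `fault_commands` and the default history path are assumptions; a production system may record commands through a separate audit mechanism instead.

```shell
# Print history lines (with line numbers) that invoke kill, rm,
# fallocate, or dd. Hypothetical helper; the default history file
# path is an assumption.
fault_commands() {
    histfile=${1:-"$HOME/.bash_history"}
    # Match the command name at the start of a line or after a
    # separator, followed by whitespace or end of line.
    grep -nE '(^|[;&|[:space:]])(kill|rm|fallocate|dd)([[:space:]]|$)' "$histfile"
}
```

Running `fault_commands` with no arguments reads the current user's bash history; passing a path reads another file.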

Closing Thoughts

With a few collective hours of work, we were able to provide training and a stress test for our on-call engineers. It was far more enjoyable and engaging than a training session and revealed problems that would only appear under stress. Wargaming can be a powerful technique for on-boarding and maintaining a team’s ability to respond to issues as they arise. W.O.P.R. was wrong; the winning move is to play.


