Streamlining Chaos to Lead with Intent
May 29, 2019 // 30min read
Since the advent of cloud computing in 2006, the delivery of software solutions to customers has shifted from a multi-year journey to a daily activity. Tech companies, large and small, are moving faster and delivering more frequently because the bottleneck of shipping physical storage devices has been eliminated. To operate effectively, they have adopted the Agile methodology, which has become foundational and common sense to everyone in the tech industry.
The cloud made it significantly easier for customers to have access to software, but it also introduced a new challenge for software companies. Not only do they need to build the functionality at a high quality, but they also need to manage the health of that software and the physical infrastructure it runs on at a significantly higher scale.
If you are a software engineering manager working at a cloud company, I think you can relate to the importance of continuous individual and team growth to keep up with the growing pace of customer demand. Have you had to constantly respond to new “types” of work that unexpectedly kept coming up? Does it sometimes feel chaotic? Have you figured out how to manage change while consistently and predictably delivering on all your team objectives in a quarter or semester?
In this post, I will walk you through a process that I started on my team at Qualtrics. It has helped me embrace and streamline the chaos, so I can lead with intent, based on planned goals, and enable my team to deliver with minimal distractions.
Buckets for Balance
Qualtrics is a high-growth cloud company. With growth in usage exceeding 40% year over year, and a transition in the type of customers from academic to corporate to enterprise to big enterprise across existing and new industry verticals, the only constant is change. The ability to scale and constantly aspire to exceed customer expectations means a lot of “new” types of work to manage, prioritize, and act upon.
In our organization, to stay balanced and focused, we categorize our engineering priorities into a few high level buckets. The methodology varies slightly between teams, but the intent is the same: delight and retain current customers by preventing or resolving gaps, and introduce new features to expand usage and gain new customers.
On my team, the high level buckets look like this:
|x%||Keep the Lights On (KTLO)||This is the bucket for responding to chaos and new types of urgent work, which is usually tied to an SLA.|
|y%||Continuous Improvement (CI)||Upgrade the lights (UTL): Make incremental and non-reactive progress in operational excellence. Reduce or prevent technical debt. Quality: Close quality gaps in test pass rate, unit tests, integration tests, E2E tests, release pipeline, tools, etc. Metrics, experimentation to validate a hypothesis, UX improvements.|
|z%||Product or Platform Features (F)||Initiatives we externally commit to.|
At Qualtrics, we commit to goals on a quarterly basis, so I’ll be describing the process with respect to that timeframe, but it should also work for semester time periods.
The percentage of time spent per high-level bucket may vary from quarter to quarter. At the beginning of the quarter, based on product/system health metrics and team feedback, we map out, week by week, how we want to spend our time. For example, say we have a team of 5 and want to spend extra time on operational and quality improvements. We decide on a two-week rotation for each team member: the first week is on call (KTLO), followed by a week of continuous improvement (CI), where engineers choose how they want to contribute (e.g., quality or operational excellence).
|Week 1||Week 2||Week 3||Week 4||Week 5||Week 6||Week 7||Week 8||Week 9||Week 10||Week 11||Week 12|
Taking vacation time into consideration, we end up with the following percentages of time spent per bucket.
|On call and SLA-bound tasks (KTLO)||20.00%|
|Continuous Improvement (CI)||18.33%|
|Product or Platform Features/Initiatives (F)||53.33%|
|Vacation Time (V)||8.33%|
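The percentages above can be reproduced with a short calculation. The sketch below is one way to do it in Python. It assumes a 12-week quarter, a simple round-robin rotation, and one vacation week per engineer taken out of feature time. The engineer names and function names are hypothetical, and the actual schedule may have been laid out differently.

```python
from collections import Counter

TEAM = ["A", "B", "C", "D", "E"]  # hypothetical engineer names
WEEKS = 12                         # one quarter, matching Week 1..Week 12
VACATION_WEEKS_EACH = 1            # assumption: one week off per engineer


def build_rotation(team, weeks):
    """Return {week: {engineer: bucket}} for the on-call/CI rotation."""
    schedule = {}
    for week in range(1, weeks + 1):
        on_call = team[(week - 1) % len(team)]
        # Whoever was on call last week spends this week on CI.
        # Week 1 has no previous on-call engineer, so nobody is on CI yet.
        on_ci = team[(week - 2) % len(team)] if week > 1 else None
        schedule[week] = {
            eng: "KTLO" if eng == on_call else "CI" if eng == on_ci else "F"
            for eng in team
        }
    return schedule


def bucket_percentages(schedule, team, vacation_weeks_each):
    """Count person-weeks per bucket and convert them to percentages."""
    counts = Counter(b for week in schedule.values() for b in week.values())
    # Assumption: vacation weeks come out of feature (F) time.
    vacation = vacation_weeks_each * len(team)
    counts["F"] -= vacation
    counts["V"] = vacation
    total = sum(counts.values())
    return {bucket: round(100 * n / total, 2) for bucket, n in counts.items()}


shares = bucket_percentages(build_rotation(TEAM, WEEKS), TEAM, VACATION_WEEKS_EACH)
print(shares)
```

Under these assumptions the 60 person-weeks split into 12 for KTLO, 11 for CI, 32 for features, and 5 for vacation, which matches the table.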
In a later quarter, let's say our team grew from 5 to 8 team members. We may choose to do only one week of KTLO per engineer but spend one week in the quarter, as a whole team, focused on CI.
|Week 1||Week 2||Week 3||Week 4||Week 5||Week 6||Week 7||Week 8||Week 9||Week 10||Week 11||Week 12|
As a result, we would end up with the following time spent.
|On call and SLA-bound tasks (KTLO)||12.50%|
|Continuous Improvement (CI)||7.29%|
|Product or Platform Features/Initiatives (F)||73.96%|
|Vacation Time (V)||6.25%|
If our overall system health is great, we may choose to combine KTLO and CI in one bucket.
Why have buckets?
Our selection of buckets maps back to the following criteria:
- Important: Has clear measurable value and impact, and must be done sooner or later.
- Urgent: Something that has to be done soon, to prevent the unexpected and keep everyone happy.
- Known (expected): A result that the company aims to achieve. It is usually planned, has a target date, and is expected by stakeholders to be delivered on time.
|Bucket||Important||Urgent||Known (expected)|
|On call and SLA-bound tasks (KTLO)||Yes||Yes||No (chaos)|
|Continuous Improvement (CI)||Yes||No||Mixed|
|Product or Platform Features/Initiatives (F)||Yes||Depends on due date||Yes|
|Things to say "No" to||No||Yes||No (chaos)|
Needless to say, the urgent but not important bucket can be ignored and does not need to be tracked. The Phoenix Project teaches us that you win when you protect the organization from putting meaningless work into the IT system. Having these buckets helps ensure that unexpected work doesn't eclipse time spent on feature development.
As mentioned earlier, chaos usually falls into one of the following three categories:
- Keep the lights on (KTLO). Urgent and important work, tied to SLA.
- Continuous improvement (CI). Important work that is not urgent.
- Unimportant work that can be deflected, deprioritized, or ignored.
So why should we streamline chaos?
When there are a lot of important and urgent issues coming your way, how can you make sure your team is working on the most important and urgent issues first?
The answer is clear. You focus on prioritization and get involved on a daily basis to constantly guide the team to work on the most important and urgent issues first.
But that takes time! What if there is a way to automate prioritizing these issues that keep flowing through the system?
Create Prioritized Lanes of Work for KTLO
If we classify issues into types of work, and set relative priorities for these types of work, then all we need to do is make sure that issues are classified, and the prioritization will happen automatically. When priorities change in your organization or based on input from your team, your discussion and energy will be spent on prioritizing a class of issues rather than one issue at a time.
Here is a simplified classification of the KTLO issues/work that come our way at Qualtrics, in order of priority.
|Incidents, critical features blocked, critical security issues||Drop everything and remediate the issue.|
|Customer reported issues that ran out of SLA||Issues that have gotten out of SLA.|
|Customer reported issues at risk of running out of SLA||Issues that are “x” days away from getting out of SLA.|
|Non critical features blocked, high priority security issues||Customer pain and safety are very important to resolve. SLA < “y” days|
|RCAs out of SLA||RCAs that have gotten out of SLA|
|Root Cause Analysis (RCA)||Take the time to reflect on incidents by putting together a timeline for an incident, assessing customer impact, answering the “5 Whys” to find the root cause, and identifying action items.|
|RCA Action Items||Take action to prevent incidents from happening again. SLA < “z” days to close all blocking action items|
By funneling each classification into a swimlane in a Kanban board, we now have predefined expectations of what the priorities are (top → bottom). This will help individual contributors save time and jump right into the next work item without having to worry about embracing all the chaos, or having to wait on a team lead to make the call. In fact, this will educate the team and empower everyone to make the most impact on reducing customer pain.
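To make the idea concrete, here is a minimal sketch of how lane order alone can decide what gets picked up next. The lane identifiers, the `Issue` fields, and the tie-breaker (fewest days left before the SLA runs out) are assumptions for illustration, not a description of any particular Kanban tool.

```python
from dataclasses import dataclass

# Lanes in priority order (top of the board first), following the table above.
# The identifiers are hypothetical short names for each row.
LANES = [
    "incident",              # incidents, critical features blocked, critical security
    "customer_out_of_sla",   # customer reported issues that ran out of SLA
    "customer_sla_at_risk",  # customer reported issues close to running out of SLA
    "noncritical_blocked",   # non critical features blocked, high priority security
    "rca_out_of_sla",        # RCAs out of SLA
    "rca",                   # root cause analysis
    "rca_action_item",       # RCA action items
]
LANE_RANK = {lane: rank for rank, lane in enumerate(LANES)}


@dataclass
class Issue:
    key: str
    lane: str
    days_to_sla: int  # assumed field: days left before the SLA is missed


def next_up(issues):
    """Order issues by lane priority, then by least SLA time remaining."""
    return sorted(issues, key=lambda i: (LANE_RANK[i.lane], i.days_to_sla))


backlog = [
    Issue("T-3", "rca", 10),
    Issue("T-1", "customer_sla_at_risk", 2),
    Issue("T-2", "incident", 0),
    Issue("T-4", "customer_sla_at_risk", 1),
]
ordered_keys = [issue.key for issue in next_up(backlog)]
print(ordered_keys)
```

When priorities change, only the order of `LANES` changes; the individual issues keep their classification and the ordering follows automatically.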
Another useful feature of Kanban boards is the ability to limit how many items can sit in the WIP (Work In Progress) column. The limit reminds engineers when they’re picking up too much work and should first finish their current tasks.
This matters because wait time to do something depends on resource utilization.
Wait Time = % of time the resource is busy / % of time the resource is idle
If a team is 90% busy, then wait time is 90% / 10% = 9 units of time (9 hours, if the unit is one hour). If a team is 50% busy, then wait time is 50% / 50% = 1 unit of time (1 hour).
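The formula is easy to explore numerically. The short sketch below evaluates it for a few utilization levels; the function name `wait_time` and the sample values are illustrative choices, and the result is in the same arbitrary time units as the examples above.

```python
def wait_time(busy_fraction: float) -> float:
    """Wait time = fraction of time busy / fraction of time idle."""
    if not 0 <= busy_fraction < 1:
        raise ValueError("busy_fraction must be in [0, 1); at 100% busy the wait is unbounded")
    return busy_fraction / (1 - busy_fraction)


# Print the wait time for increasing levels of utilization.
for busy in (0.50, 0.80, 0.90, 0.95, 0.99):
    print(f"{busy:.0%} busy -> {wait_time(busy):.1f} units of wait")
```

The growth is far from linear: going from 50% to 90% busy multiplies the wait by nine, and going from 90% to 99% multiplies it by eleven again. That is the argument for WIP limits and for leaving slack in the on-call rotation.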
With the above process in place, we have been able to move from constant randomization and occasionally missing SLAs to being focused on initiatives and rarely missing SLAs.
Lanes of Work for Continuous Improvement
We leverage a separate board for all other unplanned work items that are not tied to an SLA. When the on-call week is not heavy with KTLO work, engineers can go into the next week of the rotation and pick things that they want to improve. For example, they may choose to fill a unit test gap, refactor some code, add metrics, reduce alert noise, etc.
However, at times when KTLO overflows, the CI time should be used for KTLO.
Leading with Intent
At Qualtrics, we leverage OKRs (Objectives and key results) every quarter to set and align goals across teams. An objective is a clearly defined goal, and key results are specific measures used to track the achievement of that goal. Objectives are finalized by managers based on collaboration with PMs and team members. Team members take ownership of specific objectives or key results. OKRs are very effective tools to help team members understand their goals for the quarter, see how they’re doing against them, and also see how they align with the rest of the company. Read Measure What Matters to learn more about OKRs.
As we set OKRs for the quarter, we leverage our roadmap, which is a list of initiatives that can be completed in one or multiple quarters.
So if initiatives can span more than a quarter, how do we track them?
We leverage a Kanban board to track “active” initiatives. Those are initiatives that we’re actively working on in the quarter. On a regular basis, leadership reviews how we’re doing on these initiatives and if we’re on track to deliver on time. This helps raise risks early on so that we can mitigate them.
So how do we know when an initiative should be completed?
Before committing a date for a specific initiative, an engineer investigates and puts together a design that gets reviewed. After the design phase, we divide the initiative into tasks, including operations and test automation.
By leveraging the team schedule that we created previously, we can determine the number of resources to use to deliver the initiative for early adoption. We buffer about 2 weeks for stabilization and release, then commit to the early adoption date. We do our best to match team members with the areas they like to work on. In fact, this process of matching starts a month ahead of the quarter to make sure we are ready to get moving fast at the beginning of each quarter.
Each initiative has tasks, and one or more team members have key results to complete objectives associated with that initiative during that quarter.
How do we execute on those actual tasks in our day to day?
For that, we leverage Scrum to execute on the tasks for a set of initiatives. We have sprints of two weeks. Before sprint planning, project leads (owners of objectives) groom their initiatives and put the next most important tasks at the top of the backlog. During sprint planning, the team gets together and kicks off the sprint. We finish the sprint with a team retrospective followed by a sprint report that showcases sprint goals, highlights, lowlights, lessons learned, and shoutouts.
By keeping unexpected work from flowing into the sprint, we guarantee we’ll spend the time planned to work on team initiatives. If we’ve done scoping right, we hit our target date, and that happens most of the time.
Finally, if you want to focus on only one takeaway, and if you have SLAs to meet on certain types of work that are usually not planned, I would highly recommend that you leverage swimlanes in a Kanban board. This will automatically categorize and prioritize new work, so you as a manager don’t have to be the bottleneck for prioritizing every type of work that flows into the system. Instead, the transparency this process brings will empower individual contributors to know what is important and why. With that chaos managed, you gain confidence that your team will be able to execute on initiatives as planned and without unexpected disruptions.
In summary, here is a refresher on each bucket, what agile method to use, and who on the team it affects.
|Keep the Lights On (KTLO)||Kanban||Primary on call engineer (first week of rotation). Secondary on call engineer (second week of rotation) if there is an overflow of KTLO work. Leadership to quickly detect if there’s anything at risk.|
|Continuous Improvement (CI)||Kanban||Secondary on call engineer. Primary on call engineer if KTLO is low.|
|Features and initiatives (F)||Kanban||Leadership and stakeholders to track the status of initiatives.|
|Features and initiatives (F)||Scrum||Whole team to execute on stories aligned with team goals and initiatives.|