Our Careers: Scalability != Efficiency
The first paper I ever published was at SC2001 and I’ve been working with scalability issues in distributed systems and organizations of all sizes ever since. Over all those years, the one thing that keeps coming back is what my professor told me about scalable systems in my first parallel and distributed systems class in grad-school. Before we get to that quote, I want to set the stage with some context and a technical subtlety that often gets overlooked: In real life systems, we often can not have both efficiency and scalability.
Understanding why these two are at odds will be crucial to our careers, even if our engineering domain doesn’t require any knowledge of scalability.
Recap of the Basics
Let’s start with clarifying three well known concepts:
This is some flavor of: “Doing more with less.” As my code iterates across the horizontal axis, I can shave off fat here and there to get better performance. As software engineers, there’s a part of us that secretly loves doing this.
Efficiency could be graphed with the horizontal axis being code iterations (and of course, I’m always trimming the fat, right?) and the vertical is “how much work my service can get done given constant resources” - so this starts out steeper as I made the big differences up front, but tapers off as there becomes less fat to trim.
My favorite definition is: “scalability == more money fixes my problem.” Often this looks like “if I buy more servers and that fixes the increased demand, then I’ve got a sufficiently scalable solution.” Of course we know that in real life that’s often easier said than done.
One common graph for this is speedup: given n nodes (the x-axis), how much faster am I than 1 node? By default, we assume the ideal speedup curve is the perfect identity line (the red dotted line in the image): for n cores I can get a speedup of n. Termed “embarrassingly parallel,” there are some problems in practice that can achieve this.
The image includes a green line, one even better than perfect. This is superlinear speedup, which actually happens for some special cases in real life. This is due to an increase of n changing some fundamental characteristics of the machine, “changing its nature.” One example is the changing ratio of cache size to problem size, which leads to speedup better than n.
The purple line is bad. The overhead for adding more machines is more costly than the benefit and in the extreme it gets to the point of doing worse than a single machine would do. This is the case for many interesting real-world problems.
In practical engineering, we’re realistically shooting for the blue line.
The “Hockey Stick”
Classic hockey stick chart: starts slow, then elbows and starts to climb steeply. If this is revenue over time, then investors get giddy. Most everyone secretly wants their startup to do this at some point. Sometimes this is demands on my server over time, and it can be a little scary.
Comm-Check (‘cause we’re about to jump)
My professor in that grad-school course drilled this into us:
“For every non-embarrassingly parallel problem, your scalability bottleneck will always eventually be communication.”
It has astounded me how often this is true in practice.
What that means is that the only thing you can do to address scalability in the long term is improve communication. This includes how internal information is shared and accessed by your nodes (databases, caches, message queues, etc.) and how external information is distributed (crawlers, web services, services listening to your mobile apps, etc.).
Contrast that with efficiency: The list of things we know and love for improving efficiency is huge and significantly varied, but we need to observe something subtle here: most often, communication is not on that list. Communication is the only thing that helps scalability, while it rarely helps efficiency.
When deciding between efficiency and scalability, the right answer changes as you grow across that hockey stick. Given enough increase in demand, focusing on scalability will be the only way to stay on top.
All the Right Curves
Remember that speedup graph for scalability, above? That purple curve (the frowny shaped one that dips down far as it gets to the right) has a name: Brooke’s Law. Yes; I just totally context swapped.
We all know this, and hopefully we haven’t all experienced it: adding more people to an already late project (that’s not embarrassingly parallel) makes it take longer. Why? Communication.
At Qualtrics, I work as a single node in a graph of heterogeneous worker nodes - each person has different cache sizes, different skillsets, different implementations to different APIs. As a whole, we’re working to accomplish something. And just like the challenge of scalability in distributed systems:
The throttle on the scalability of an engineering org will always eventually be communication.
Communication will be our bottleneck.
Ask any modern startup investor what signs they look for in a business, and they will almost surely include some flavor of these two at the top of their list:
- A scalable business model
- A leadership team that has good working communication with each other.
So here’s the wizened advice: look at your career, and those careers around you, and those careers of engineers you respect all over the map, and observe this fact: Everyone whose career is on the right hand side of that hockey stick is optimizing their communication - that’s the only way to stay alive once you’ve reached that part of the curve.
When I started at Google in the Pittsburgh, PA office, there were about 30 people in that office. When I left, about 6 years later, there were about 300. I saw first hand how scalability related issues had some very real impacts on the culture, our productivity, and the business as a whole.
Because of that experience, when I was looking to come to Qualtrics, I did my homework on the leadership of the engineering organization, specifically in regards to their past experience with quickly growing organizations. Their answers were part of why I choose to come here.
What should you look for in an organization that indicates its scalability? Here are a few starter ideas:
- People and teams need to be able to be as autonomous as possible, meaning:
- They have low latency and high bandwidth access to the information they need: information to do their work, information to make decisions, information regarding priorities, etc.
- They need to know how to manage their communication and visibility of their work.
- They need to be trusted (and that implies: they need to be trustworthy and dependable).
- There needs to be a minimal amount of hops needed in order to find the right person to talk to about something.
- The work-to-be-done, and process around it, needs to be organized such that people and teams can run CPU-hot, so to speak. This includes reductions in context switching, interrupts, yanking chains, and the amount of time that people are I/O bound.
The list goes on, but I hope that conveys the gist.
The Superlinear Engineering Organization
Remember the superlinear part of that speedup graph? This is the magic that differentiates rockstar organizations from the rest: change the nature of the machine. Here are a few starter ideas:
- Bring diversity of thought and background onto your team.
- Foster a culture of “we are greater than the sum of our individuals.”
- Crave, and treasure, “Multipliers” - people who make those around them do better.
No engineering organization is perfect, and the scalability of an org changes frequently. I’m happy to say that here at Qualtrics we’re doing better than I’ve seen anywhere else. I see superlinear improvements here at times - because we’re bringing in people in that change the fundamental nature of our machine. We found that we were hindering autonomy by not providing more visibility into prioritizations. We’re finding practical (and successful) ways to address tribal knowledge.
Taking steps to change these has pushed our curves up. I see our leadership intentionally architecting the communication paths for scalability instead of efficiency. This is part of why we’re successfully growing as fast as we are. Yes it takes a team, but across the various places I’ve worked, I’ve seen orgs struggle to grow, and I’ve seen them succeed, I’ve seen people get left behind, and I’ve seen people pushing the right side of that hockey stick. Communication will be the bottleneck, at all levels of your machine.