Designing a Scalable Data Platform
October 14, 2016 // 20 min read
Qualtrics is growing rapidly both in terms of our number of clients and their volume of data. With more data we can provide more powerful insights, but we face technical challenges because of the sheer volume of information we must process and analyze. With that in mind, we look towards increasing the scalability and dependability of our back-end systems.
This post is the first of several outlining Qualtrics’ approach to building out the next generation of our data platform, a multi-year journey. Over the next few posts on this subject, we’ll touch on topics including building drop-in components, using test-driven development, choosing scalable technologies, and building a platform that encourages innovation and accelerates development.
I want to tee up those posts by describing some of the common issues we’re looking to solve in data processing and back-end system development. In the sections below, I’ll break down the following topics along with some of our initial findings on each:
- Engineering for fault tolerance and product acceleration
- Reducing technological sprawl and knowing when to adopt a new technology
- Building systems engineered for changing markets
A few of these items will also be covered in later posts, as noted in the breakdowns below.
Even the best designed things can break
No amount of planning can prevent the push-button-ocalypse, where someone presses the wrong button and kicks over the ant’s nest that was the driving force behind your data processing platform. Thoughtful engineering and design help a lot, but the chaos monkey is an inevitable part of operating software and distributed systems.
Data platforms should be designed with fault tolerance in mind, but even the best-laid plans can’t be perfect; chase perfection and you’ll spend your entire life running it into the ground. There is a careful balance between perfection and good enough for now, but where is it, and is it set in stone?
This problem comes in many forms, and Qualtrics is no stranger to it. With a rapidly growing client base, a lot of new systems get built, and sometimes they don’t interact perfectly at first. Such is life in a scrappy, start-up-style culture, but as we grow we strive to build systems that allow rapid, worry-free prototyping while also keeping customer data safe and the customer experience strong.
Qualtrics’ clients depend on our services to maintain their SLAs, and they pay for that availability. We can’t have a system go down or become unavailable, but we also need to test new systems and incrementally add functionality before it’s complete, so we need a process that lets us experiment without breaking normal operation.
Dark launching is the practice of launching or forking a production-style workload without noticeably affecting your production load. There are many ways to skin that potato, but the end goal is to create an environment that encourages others to use a framework that mimics your production systems. By doing that, you do your best to guarantee that anything built will work with minimal impact, while also enabling you to prototype or test a new system against real production data. With our current data platform project, we’ve inserted a system between our authoritative persistence layer and our downstream data-loading processes, so that we can build outwards from an insertion point while making incremental technological progress.
The technical challenges include finding a reliable, highly available data bus, inserting a traffic fork into the upstream process, and designing a plan for scaling the traffic to a full production load. All of these need to be tested and implemented with care so that we don’t affect our customers as we develop the shadow launch. We also need to burn in each technology and system as it scales so that we know how far it will scale.
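The key property of a traffic fork like the one described above is that the shadow path can never fail the production request. A minimal sketch of that shape, with entirely hypothetical class and sink names:

```python
import random


class TrafficFork:
    """Duplicate production writes onto a best-effort shadow path.

    The production sink is authoritative; the shadow sink receives a copy
    of each record, but its failures are counted, never propagated.
    """

    def __init__(self, production_sink, shadow_sink, shadow_fraction=1.0):
        self.production_sink = production_sink
        self.shadow_sink = shadow_sink
        self.shadow_fraction = shadow_fraction  # dial shadow traffic up gradually
        self.shadow_errors = 0

    def write(self, record):
        # The production write happens first; its result is what we return.
        result = self.production_sink.write(record)
        # The shadow write is sampled and best-effort.
        if random.random() < self.shadow_fraction:
            try:
                self.shadow_sink.write(record)
            except Exception:
                self.shadow_errors += 1
        return result
```

The `shadow_fraction` knob is one way to ramp from a trickle of forked traffic up to a full production copy as confidence grows.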
Just like hardware undergoes “burn-in” testing to find immediate flaws in the circuitry, software should be tested for flaws in a production-like environment. Burn-in is a common practice, but one that is frequently overlooked; it usually either manifests as a glorified unit test or is dismissed as a cumbersome, expensive load-testing exercise. From our perspective, burn-in means we test what we can allot resources for, document what data we have, and set limits in place to notify us when our assertions are violated. The goal is to provide common-sense boundaries and enough runway to scale or replace a component that is reaching the limits of your previous test. In our data platform, we need all new systems to support the load we expect much further down the line, but we don’t need to test their theoretical limits to know that a choice is justified and solves real problems.
A message bus, a highly available queueing system, is a great example of a system that can be tested far beyond where you’ll ever need it, and our data platform team is currently burning in a test system with production-style traffic. When we developed the burn-in plan for our dark launch, we tried to put some common-sense limits on both the hardware we used and what we were testing, which I’ve described briefly below.
We are taking our current traffic patterns and adjusting the volume and frequency of that load to many times what we expect in the near future, with the intent of establishing a set of known limits even if they aren’t the absolute limits of the system. If we know the system requires a certain amount of resources and we test it to ten or twenty times the expected load without issue, we set a critical threshold at the maximum we tested and a warning below that, giving us ample time to make changes or scale before it becomes difficult. We also don’t have to take the system to its absolute limits or test with far more hardware than we deem necessary, saving us both time and money.
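The threshold scheme above can be sketched in a few lines. The numbers and function names here are illustrative, not our actual alerting configuration: the critical line sits at the maximum load we burned in at, and a warning fires earlier to leave runway for scaling.

```python
def burn_in_thresholds(tested_max, warning_fraction=0.8):
    """Derive alert thresholds from the highest load we tested without issue."""
    return {
        "warning": tested_max * warning_fraction,  # time to plan scaling
        "critical": tested_max,                    # past anything we ever tested
    }


def check_load(current, thresholds):
    """Classify the current load against the burn-in thresholds."""
    if current >= thresholds["critical"]:
        return "critical"
    if current >= thresholds["warning"]:
        return "warning"
    return "ok"
```

So a system burned in at 10x expected load stays green through normal growth, warns while there is still headroom, and only goes critical once traffic exceeds tested territory.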
That philosophy is being applied to many of the projects on our data platform, from data stores to processing frameworks. We test things well beyond our use case, but with common-sense limits on scale, knowing we have notifications in place for when we approach the limits we’ve recorded. This leaves us with lean services that are well understood and predictable, and it also reduces the time to prototype or build out a new system because we don’t spend excruciatingly long periods testing to absurd limits.
Prototyping into production
Once we’ve built services on the shadow-launching bus and tested their limits, how long should it take to put them into production? A lot of that depends on how the system is built, how production-like the software is, and what experience you want your customers to have.
At Qualtrics, we want software that accelerates development and increases reliability. That means we build software with testing and predictability in mind, which is easier said than done. Our data platform is developing new components that are intentionally decoupled. If each component takes the minimal number of steps to complete a transformation or computation, testing is usually straightforward. Furthermore, in a shadow-launching environment, load testing can be quickly scaled to a production load, and components can be verified alongside their production counterparts on separate branches, ruling out many issues before a component is pushed in front of customers. The next challenge is selecting technologies that work well with component-based architectures.
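Verifying a shadow component alongside its production counterpart can be as simple as feeding both branches the same records and diffing the outputs. A hedged sketch with hypothetical names:

```python
def verify_shadow(inputs, production_fn, shadow_fn):
    """Run both branches over the same records and collect any disagreements.

    Returns a list of (record, production_output, shadow_output) tuples;
    an empty list means the shadow branch matched production on this sample.
    """
    mismatches = []
    for record in inputs:
        expected = production_fn(record)
        actual = shadow_fn(record)
        if expected != actual:
            mismatches.append((record, expected, actual))
    return mismatches
```

Run over a forked slice of production traffic, an empty mismatch list is strong evidence a component is safe to promote; a non-empty one pinpoints exactly which records diverge.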
Technology can be shiny
Qualtrics is primarily a data company, and no one likes slow data. But when you have mounds of it, and those mounds are quickly becoming mountains, you need the right tool to find what you want in the raw data. The landscape is littered with shiny technologies, but not every material requires a new tool and not every tool needs to handle every material perfectly, so how do we differentiate and pick the right ones?
The best answer to that question is to get data and analyze it. Qualtrics believes that data is paramount, and when we research a new technology, whether it’s a team suggestion or a shiny North Star in the distance, we prefer to pick things that are backed by data.
One recent choice we made was which event bus our team should use out of the many available. There are so many messaging technologies that picking one depends more on knowing your own goals than on what each technology provides: each one does some one thing better than everything else, and that one thing has to be the cornerstone of what you’re using it for.
We chose Apache Kafka (http://kafka.apache.org/) because it provides a durable publisher/subscriber bus that we can leverage for many of our products, which is a necessity for our newer designs. Qualtrics has a lot of products, systems that manage those products, and ideas for future products. The ability to manage all of those systems as separate entities benefits our teams greatly, and Kafka itself gives us a few things we really need, like predictability and replay. We also wanted to make our systems as technologically independent as possible: if we have to change technologies or providers, we don’t want to completely rebuild our systems, which is best achieved by building simple components and avoiding proprietary solutions where possible. The next challenge is actually adapting when you need to pivot or make a change.
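The two properties we leaned on, durable publish/subscribe and replay, are worth making concrete. The toy in-memory log below is a teaching sketch of those semantics, not how Kafka is implemented: records are appended durably, each consumer tracks its own offset independently, and rewinding an offset replays history.

```python
class Log:
    """A toy append-only log with per-consumer offsets (illustrative only)."""

    def __init__(self):
        self.records = []   # the durable, append-only topic
        self.offsets = {}   # each consumer's read position

    def publish(self, record):
        self.records.append(record)

    def poll(self, consumer):
        """Return unread records for this consumer and advance its offset."""
        start = self.offsets.get(consumer, 0)
        batch = self.records[start:]
        self.offsets[consumer] = len(self.records)
        return batch

    def rewind(self, consumer, offset=0):
        """Replay: move a consumer back so it reprocesses old records."""
        self.offsets[consumer] = offset
```

Because offsets belong to consumers rather than the log, many independent systems can read the same stream at their own pace, and any of them can reprocess history after a bug fix, which is exactly the decoupling we wanted between our products.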
Industries and products change with the wind
Customer obsession is a key to success at Qualtrics, but predicting where all of your customers will go is complex and normally requires acrobatics. This is not because every individual customer is constantly shifting, but because the amalgamation of your customers across many industries has different needs, with each industry in its own state of flux. This produces a unique challenge for some companies: the need to build data models and systems that support general-purpose solutions but can also support fine-grained ones. We’re tackling that problem right now, but we’ve also done our darndest to make it easier on ourselves.
Data models are the cornerstone of any data platform architecture, but they can be a weak point if they aren’t done correctly. Qualtrics came to a pivotal decision point where we needed to decide how we were going to process and store our data. Raw incoming data is useful for clients, but only once it’s stored in a flattened or normalized form. Our client base varies drastically in the volume and complexity of its research, which means data volumes, access frequency, and the complexity of analysis can all vary just as much from client to client, so we needed a form that enabled and empowered that type of consumption model while also working well with the data we receive.
The end product was a flattened data schema that lets us apply an initial transformation to any type of raw data and produce a typed downstream format that is general purpose for all of our analytics. We call these fieldsets, and they are the primary storage format for all of our analytics. The new system our data platform is envisioning uses this flattened data model as we look towards expanding our data storage to external data, complex analytics, and merging other types of data. All of these should operate on the same flattened set, and the open source and data processing communities lean heavily towards large flattened data sets for their operations.
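To make the idea of flattening concrete, here is a minimal sketch of the kind of transformation that turns arbitrarily nested raw data into a single level of fields. The dotted-path naming convention is an assumption for illustration, not Qualtrics’ actual fieldset schema:

```python
def flatten(raw, prefix=""):
    """Flatten nested dicts into one level of dotted-path fields.

    {"answers": {"q1": "yes"}} becomes {"answers.q1": "yes"}, so every
    value lives at a stable, addressable field name.
    """
    flat = {}
    for key, value in raw.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=path + "."))
        else:
            flat[path] = value
    return flat
```

Once every record, regardless of its original shape, is a flat map of typed fields, downstream analytics can address any field by name without knowing where it came from.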
The goal is to push this model into a service-oriented architecture: something that allows us to plug in and adapt services that use pieces or groups of data from the flattened data set. It also allows us to treat incoming data uniformly and merge disparate data sources into data sets that can be compared and contrasted with generalized tools. This opens the door to the larger world of real-time and asynchronous processing, data science and machine learning, and scalable cloud resources, most of which hinge upon or are heavily influenced by sparse flattened data sets and columnar representations like Apache Parquet (https://parquet.apache.org/).
We’ve made a lot of progress on the topics I’ve discussed here, and we’re pretty excited about where we’re going. Once we get a few more of these systems connected and processing traffic, we’ll have a better picture of the final outcome, and we look forward to sharing some of those results in future posts.