Lessons learned from a large-scale data migration
Qualtrics helps customers collect and analyze data so they can make optimal business decisions. A core component of this is giving customers the ability to create, deploy, manage, analyze, and follow-up on surveys. These surveys help companies do things like gather market research data, gauge employee engagement, and find out why customers leave an online store without purchasing the items in their cart, to name a few.
Surveys by themselves are relatively small in terms of storage space: think the equivalent of a single Word document with a list of questions. But when you’ve got airlines sending surveys to all their millions of passengers each year, the responses quickly add up. A few million customers answering 10-20 questions each means you have to think more seriously about the best way to store and handle a deluge of survey response data. As databases grow, solutions that worked at a smaller scale no longer function efficiently and it’s necessary to find new storage methods. While it’s great to find a new storage solution, changing often means you have to migrate your entire data set from the old store to the new one.
I’m an engineer on the team responsible for managing the storage of our survey response data, and we recently had to do one of these large-scale data migrations. There were good reasons to migrate and we ended up in a better place, but it was a long process that involved dealing with a large number of problems. Some of these problems could have been avoided if we knew then what we know now.
Like the wizened grandparent telling cautionary tales, this post describes some lessons learned the hard way. I share these in the hope that I can assist others working on similar projects and perhaps guide them away from making the same mistakes.
At its inception back in 2002, Qualtrics relied on MySQL to store all its data. MySQL is a great technology for a fledgling company because it’s relatively flexible and allows for easy querying based on essentially any field in any table. However, MySQL isn’t very scalable. When a MySQL database grows past a certain size, queries become exponentially slower.
There are a few different ways to solve this problem without switching to a different database technology. One of the most well-known methods is called sharding, a process that splits your large database into multiple smaller ones - each having the same schema but only a portion of the data. This is an approach we’ve historically used at Qualtrics and it’s worked well for quite some time. It’s nice to not have to switch all the data to another database, but sharding introduces an extra layer of complexity. Once the sharding is complete you always have to look up which shard to query before actually querying the data you want.
As Qualtrics has grown, gained more customers and stored more and more survey responses, we’ve had to make more and more shards, each one introducing more complexity. Eventually, the operational load of managing all these different shards, and the lookup tables that index into them, reached a frenzied tempo. We soon realized that continuing to stick to MySQL would give us more pain than it was worth. We wanted to move to a different, more scalable solution.
On top of its scalability and speed, DynamoDB is managed by AWS. While AWS has to do a similar sharding process to what we had done before, it hides all the operational details from its users and presents us a simplified interface. This means we can focus on application code, rather than having to worry about needing to create a new database shard every few months.
In order to use this new response store however, we first needed to go through the formidable process of migrating our entire data set from MySQL to DynamoDB.
Challenges of migrating from MySQL to DynamoDB
Migrating between two instances of the same database (i.e. from one MySQL database to another) is quite easy. After you’ve set up the schema in the destination database to match the schema in the source database, all the data can simply be copied over. Most common databases even have tools specifically for this purpose, such as MySQL’s mysqldump or Couchbase’s XDCR.
However, when your destination and source databases are different enough that the schemas don’t exactly match, the process requires an extra translation step between reading from the source and writing to the destination. For us, this was a one-off migration, but in general these types of data transformations are known as ETL (extract, transform, load) processes.
In our case, the Data Sources are various MySQL databases, the ETL process is our migrator, and the Data Warehouse is DynamoDB.
In theory, this translation step is pretty straightforward -- read each survey response from MySQL, reformat it slightly, and feed it into DynamoDB. Sounds simple enough, right? However, the devil is in the details, and there were certainly plenty of devils in the details of this migration. We had issues with wildly varying record sizes, non-unique GUIDs, and timezones, to name a few.
In addition to the challenges we had with the data format itself, our initial design of the migrator service itself turned out to be overly simplistic in general. This resulted in a lot of issues for us to deal with later. Below are some of the lessons we learned throughout this long process.
Lesson #1: Test performance early & parallelize the migrator
Our first iteration of the migrator performed our 3 steps (read, translate, and write). The simplest approach in my mind from the beginning was to do these steps serially -- that is, a single thread would read a small batch of data from MySQL, do a little processing, and then store the result in DynamoDB. The processing was pretty simple, and involved a bunch of little things like converting datetimes to the right timezone, consolidating a few columns into a single JSON blob, and so forth.
This code was indeed simple, but it was incredibly slow. The latency reading from the source database and writing to the destination database was low enough on its own, but it really added up between batches.
For example, we were limited (for database load reasons) to reading around 50 records at a time from the source MySQL databases. If this took one second (which unfortunately became commonplace for larger datasets), and it took us another 250ms to process the data and post it to DynamoDB, we’d only be migrating at a rate of ~40 items per second.
That doesn’t sound too slow without context, but for one of our (medium-sized) data sets with “only” 1 million items, that translated to about 7 hours, the majority of which is spent waiting for network calls to return. If, like in our case, the total data set consists of a few billion of these items, the full migration would’ve taken over a year -- and that’s just the processing time! Taking into account finding & fixing bugs, performing re-migrations, and dealing with performance degradations associated with larger datasets, that time would balloon even further.
However, this latency problem goes away if we parallelize these tasks. By running a bunch of jobs in parallel, we were able to achieve a much higher minimum throughput level of around 300 items per second, or about 11 weeks for 2 billion items. Although that’s still a long processing time, that’s much more tenable.
In hindsight, though, I wish we’d tested a small batch and done the math. This would have given us an idea of the time required and helped us think of parallelization earlier. Even though we eventually made it work, there were several design decisions made early on about the migrator’s architecture that made it harder to introduce concurrency later. Throughout the rest of the project, the parallel processing never felt quite natural -- it always felt more or less tacked on as an afterthought.
Lesson #2: Interview multiple tool options before starting
In keeping with the theme of a simplistic architectural approach, we opted to use a familiar programming language for ease of adoption when designing the migrator tool. In doing this, we overvalued the importance of familiarity, and undervalued the benefits other languages had to offer.
In this case, we went with Java (using the Play framework) for the migrator, both of which our team had recently used on another project. For those unfamiliar with Play, it’s a lightweight server framework that is good for creating things like webapps and API’s, but really wasn’t designed for creating asynchronous, lightning-fast, super-scalable job workers. Java as a programming language is also disadvantageous for migrations because:
- Parallelization in Java requires more boilerplate code than other languages, and
- Java doesn’t use UTF-8 as its default character set, so whenever string IO is performed with customer data, you have to be careful to specifically convert to and from UTF-8.
Looking back, it seems like Go would’ve been a much better choice for the project. Not only is it specifically designed for easily concurrent programming, it’s simple and powerful enough to be able to provide a rudimentary HTTP server without installing a whole framework.
As a last word on this lesson, I want to say that it’s not necessarily important to stress a lot about the differences between programming languages, because it’s certainly still possible to overthink on this point. It’s just that in our case, we didn’t really think about it at all. We thought the project would be easy and quick so we chose the first programming language that came to mind. But as usual in any coding project, the right approach involves some mix of planning and jumping in.
Lesson #3: Migrate directly from the database instead of an API
Some of the MySQL data we needed to migrate was actually hidden behind an API -- that is, we made an API call to read the data out, rather than talking directly to the databases. In some ways it seemed like a good idea, and in fact it was easier at first than talking to the database since the format was so standardized.
However, we quickly found that whenever the migrator ran into a special case (which was unexpectedly often), we usually had to make a change to the API as well as to our migrator. Coordinating changes between two codebases is hard enough on its own, but adding the fact that the service whose API we were using had a long, tightly controlled release cycle and a pretty labor-intensive process for making any hotfix meant that any change resulted in a delay of anywhere from a few days to two weeks.
In our case, we were already sufficiently invested in pulling from the API such that this was still quicker than building a whole new solution pulling straight from the database. But if we had to do the whole thing over again, it would’ve been significantly less work if we just pulled from the databases directly.
As with many software projects, the initial complexity estimate for our migration turned out to be way off. Jumping right in with a naive plan ended up costing us a good deal of effort down the road, and sitting down to better plan out the scale of the project would have helped it go more smoothly.
More importantly, though, we learned from our mistakes. Now that we’ve worked through a large project like this, we’re able to remember the lessons we learned and take steps to avoid falling into the same potholes in the future. I hope this article will help you do the same.