Writing a Framework for Custom ETL Automations (pt. 1 of 2)
Dennis Callanan // September 24, 2019 // 12min read
Every customer has their own programs, technologies, business goals, and requirements to fulfill. Meeting these requirements often means making our systems work in harmony with theirs through Extract, Transform, Load (ETL) automations, where the output of one system feeds into the input of another. A CRM is one example of a system that exchanges data with the Qualtrics platform. Writing these custom integrations on a per-client basis can be challenging and expensive. However, good engineering is all about identifying and abstracting the high-level similarities while offering flexibility for the low-level custom elements of a workflow.
In this post (and the next) we’re going to talk about the philosophy and design of the framework we developed internally for optimizing and streamlining the creation of custom ETL automations for clients.
Given the frequency of client requests, Engineering Services (EMEA) didn’t want to keep reinventing the wheel every time custom work was required (engineers hate that!). We wanted a robust solution that was streamlined, could be extended with custom functionality, could be well maintained, and could enforce good standards. We wanted it to run reliably and predictably, making life easier for different teams: engineers, partners, solution architects, and on-call engineers.
For part 1, I’m going to talk about the many decisions that were made in order to achieve this goal. Specifically, I’ll focus on architecture (AWS / Terraform), design (release engineering, project structure, and design patterns), development philosophy, and automation.
An ETL automation, in the more specific context of Qualtrics Engineering Services, is:
An automation which takes input from a client, processes data through the Qualtrics systems and provides an output to the client, such that the client’s system and data model can integrate seamlessly with ours
Below is one such example.
Many automations follow this kind of workflow (such as response imports, metadata imports, etc.), but most of them require non-trivial custom engineering at some point. This custom engineering may involve parsing and validating the input file, enriching the data, mapping the data (from an old vendor to Qualtrics), conditionally scheduling distributions, conditionally sampling contact data, and more. These types of automations, with custom code sprinkled throughout, were the target for this framework: we wanted to focus on custom automations that could not simply be defined through basic configuration files.
Ultimately, we wanted to use this framework to automate the generation of boilerplate projects (full source code repositories) so as to guide the developers into good standards and practices: encompassing infrastructure as code, binary/config packaging and release engineering.
So why impose certain standards for project structure and source code methodologies? First, we want to ensure consistency across projects; consistency leads to simpler maintenance, less ramp-up for handovers, faster comprehension for on-call engineers, and stronger vision and direction to guide projects to completion. Equally as important as consistency is speed: we wanted to cut design and development time from the induction of these projects.
And why a framework? Given the conceptual similarity of many ETL automations, it’s important to define these workflows carefully and abstractly with default implementation for the most common use cases. This way engineers do not have to repeat any tasks, and duplicate code isn’t scattered across project repositories. But we want to give developers the freedom to extend this default functionality, if needed. This has many benefits:
- It gives engineers the ability to engineer the parts of the solutions specific to their problem without having to reinvent the wheel
- It allows partners to easily develop solutions for clients, while enforcing good practices through a streamlined process
- It abstracts most of the common implementations which interact with Qualtrics APIs. This can be maintained to a high standard by Qualtrics Engineers (and if hypothetically API interfaces change, we need only reflect this update in one place).
- This streamlined process (through the use of the framework and project generator) will allow us to reduce design time for projects as well as reduce scoping time for solution architects, as the areas of custom development are more clearly determined by the state of the framework. This in turn leads to confidence and quick response time for client ETL requests.
Why did we choose to build this framework on top of AWS? In Qualtrics Engineering Services we use a standalone infrastructure model which provisions new AWS cloud environments on a per-client basis. This has a few benefits:
- It allows us to isolate infrastructure clearly for each independent client project
- It is easy to observe usage of computational resources and safely determine a pricing model
- It is horizontally scalable, secure, and requires less maintenance
- It allows developers to easily spin up infrastructural components which lend themselves well to such automations (e.g. pipes of queues and microservices for ETL processes and retry logic).
Now, let’s take a look at the high-level design and architecture of this framework. To do this we’ll investigate the common workflows of ETL automations in more depth. An ETL Automation generally consists of the following stages:
- Pre-processing of data (e.g. parsing and validating from source)
- Processing of data (e.g. posting or retrieving data from API endpoints within the Qualtrics system)
- Post-processing of data (e.g. constructing responses and returning these to the client in some format)
The pre-processing stage may involve pulling the source data from a client’s SFTP Server (or S3 Bucket), or consuming a client’s POST request to an API endpoint. This stage subsequently parses, validates and formats this data, pushing it to the processing stage of the workflow. The post-processing stage essentially performs the same steps in the reverse direction (as an export to the client rather than an import from the client).
In between, we have the processing stage, which uses the formatted data to perform tasks which interact with Qualtrics systems. An example of a step in this stage would be to import contacts into a mailing list. This takes contacts as an input, and outputs a mailing list ID. The next processing step would then take this mailing list ID as an input, schedule a distribution to that mailing list, and output the distribution ID.
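As a rough sketch, the chaining described above might look like the following. The function names and payload shapes here are invented for illustration; in the real framework each step wraps a Qualtrics API call.

```python
# Hypothetical sketch of two chained processing steps. Each step consumes the
# previous step's output, so a failed step can be retried in isolation without
# re-running the steps before it.

def import_contacts(contacts: list[dict]) -> dict:
    """Step 1: import contacts, output a mailing list ID."""
    mailing_list_id = "ML_" + str(len(contacts))  # stand-in for an API response
    return {"mailingListId": mailing_list_id}

def schedule_distribution(step_input: dict) -> dict:
    """Step 2: schedule a distribution to the mailing list from step 1."""
    return {"distributionId": "EMD_demo", "mailingListId": step_input["mailingListId"]}

contacts = [{"email": "a@example.com"}, {"email": "b@example.com"}]
result = schedule_distribution(import_contacts(contacts))
```

Because each step's input is exactly the previous step's output, the boundary between steps doubles as a natural retry boundary.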
It’s good practice to associate a single Qualtrics API call with a single processing step. Why is this? If one step fails using the output of a previous step (which was successful), we want to retry only the step that failed; combined with retry logic, this keeps the overall workflow idempotent.
To make this concept more concrete, we will represent each step discussed above as an independent and well defined service in AWS. Each step consists of a microservice (Lambda) which processes the request input, and outputs the result to the next step (i.e. the next service).
Lambdas are very powerful for this use case.
- They are serverless and so require no maintenance, meaning developers do not need to worry about the services they consume in their projects
- They are easy to deploy and develop (through SLS)
- They are highly scalable and have retry logic, while having a simple relationship with queues (we’ll get there later)
- They map conceptually onto the “pipe” workflow that ETL automations follow
- Tasks can be delegated to a swarm of asynchronous lambdas for batch processing
- Simple configuration can allow for synchronous consumption of requests from queues (for hitting non-concurrent APIs), or unlimited reserved concurrencies (for asynchronous APIs)
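To make the Lambda-per-step idea concrete, here is a minimal handler shape for a step consuming a batch of SQS messages. The record-processing logic is a placeholder; a real step would call a Qualtrics API for each message.

```python
import json

# Minimal sketch of a Lambda handler wired to an SQS event source. SQS events
# deliver a "Records" list, where each record's "body" holds one message.

def handler(event, context=None):
    """Process a batch of SQS messages; each body is one unit of work."""
    processed = []
    for record in event.get("Records", []):
        payload = json.loads(record["body"])
        processed.append(payload)  # real step: call a Qualtrics API here
    return {"batchSize": len(processed)}
```

The same handler shape works whether the queue delivers one message at a time (for non-concurrent APIs) or large batches fanned out across many concurrent invocations.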
There are a few general pitfalls with Lambda however:
- The serverless nature means our automations should ideally not depend on databases, as up to 1000 asynchronous lambdas could fire off. Caching etc is not a realistic option. Lucky for us, we are just hitting our Qualtrics APIs (which take on all the heavy load) - so this is not a concern.
- Lambda functions may retry unintentionally. Therefore, each Lambda function should be idempotent. Once again, lucky for us, our APIs are built to handle this. For example, our Response Import API allows the user to pass an idempotency key to prevent importing identical responses in the case of an unintended retry. See Response Import docs.
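The idempotency-key behavior can be simulated in a few lines. This is only a conceptual model; the real safeguard lives server-side in the Qualtrics Response Import API, and the names below are invented for illustration.

```python
# Illustrative simulation of idempotency-key semantics: repeating a request
# with the same key returns the original result instead of creating a
# duplicate import.

_seen_keys: dict[str, str] = {}

def import_response(payload: dict, idempotency_key: str) -> str:
    """Return the existing import ID if this key was already processed."""
    if idempotency_key in _seen_keys:
        return _seen_keys[idempotency_key]
    import_id = f"IMP_{len(_seen_keys) + 1}"  # stand-in for a real import ID
    _seen_keys[idempotency_key] = import_id
    return import_id
```

With this property on the API side, an unintended Lambda retry simply replays the same request and gets the same result back.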
In an ideal world we could join up microservices like this:
The problem with the above example, though, is that it doesn’t account for failed steps. For example, how does our ETL behave if the “Import Contacts” step fails? The Qualtrics API on which the Contact Importer step depends may have high availability, but there is always the possibility of failure. Even worse, these automations often run on similar cron schedules hitting the same API endpoints, which only increases the likelihood of failure. We can incorporate retry logic for each Lambda, but that doesn’t really scale laterally because of the aforementioned cron problem. So what do we do? In this case, we introduce queues (SQS), because queues can:
- Manage the batch sizes as data flows through the steps of the workflow
- Delegate batches to multiple asynchronous microservices
- Use DLQs (Dead-Letter Queues) for simpler alerting and management of failed automations by on-call engineers, without killing the entire automation
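The retry-then-DLQ behavior can be sketched conceptually like this. In practice an SQS redrive policy handles the routing, not application code, and the names here are illustrative.

```python
# Conceptual model of SQS redrive behavior: a message is attempted up to
# max_receives times; if every attempt fails, it is parked on the DLQ for
# on-call inspection and the rest of the pipeline keeps flowing.

def process_with_retries(message, step, dlq, max_receives=3):
    """Attempt a step; route the message to the DLQ once retries are exhausted."""
    for attempt in range(1, max_receives + 1):
        try:
            return step(message)
        except Exception:
            if attempt == max_receives:
                dlq.append(message)  # failed message parked, automation continues
                return None

dead_letters: list = []

def always_fails(message):
    raise RuntimeError("downstream API unavailable")

result = process_with_retries({"id": 7}, always_fails, dead_letters)
```

The key design point is that a poisoned message is isolated rather than blocking the queue: the automation degrades per-message instead of failing wholesale.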
In summary, our high-level visualization of services in relation to our example workflow looks something like this:
As we can see, each service is represented by a queue and a microservice. This encompasses a single block / function: a reusable component which we can develop and maintain independently and inject into any project or workflow.
In this way, a developer can decide whether to incorporate certain services in their workflow. Do they need a data validator? Do they need a distribution scheduler? Do they need a response importer? Do they need to schedule multiple distributions at different times? Etc.
Finally, we want to make these services extensible. To do this, each microservice (Lambda) imports an abstract library which is maintained internally by Qualtrics. The classes / modules available through the library contain methods with default functionality, documented and exposed to the developer, such that they can be overridden to suit the developer’s requirements. These protected methods should have well-defined interfaces and conceptually solid use cases to guide developers properly. We’ll talk about this more in the implementation section.
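The extensibility model might take a shape like the following. The class and method names are hypothetical stand-ins for the library's real API: a base step ships sensible defaults, and a project overrides only the hooks it needs.

```python
# Hypothetical sketch of the shared library's override model. A project
# subclasses a step and replaces individual hooks, keeping the rest of the
# default behavior intact.

class ContactImportStep:
    """Default contact-import step; override hooks to customize."""

    def validate(self, record: dict) -> bool:
        # Default rule: require an email field. Override for client-specific checks.
        return "email" in record

    def transform(self, record: dict) -> dict:
        # Default: pass the record through unchanged.
        return record

    def run(self, records: list[dict]) -> list[dict]:
        return [self.transform(r) for r in records if self.validate(r)]

class LowercasingImportStep(ContactImportStep):
    """A project-specific override of the default transform hook."""

    def transform(self, record: dict) -> dict:
        return {key: value.lower() for key, value in record.items()}
```

Because `run` is defined once in the base class, Qualtrics engineers can evolve the common workflow (or adapt to API changes) in one place while client projects keep only their custom hooks.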
Overall, this service-oriented architecture is the foundation of the framework.
In this post we talked about the design of a framework for developing custom ETL automations, and the reasoning behind it. In part 2 we'll dive into the nitty-gritty implementation details, in particular Terraform and Serverless.