Writing a Framework for Custom ETL Automations (pt. 2 of 2)

This is part 2 of 2 on creating a framework for developing custom Extract, Transform, Load (ETL) automations for Qualtrics clients. In the previous post we talked about the design decisions around our framework and the reasoning behind those decisions. We finished by introducing our service-oriented architecture. If you haven’t yet read the first part, I recommend doing so before reading this one.

This section will be more complex as we dive deeper into the implementation, the obstacles we faced, and how we converted our great ideas into a functioning framework. If you’re technically inclined and curious about how to build maintainable projects on Amazon Web Services (AWS) infrastructure (or other cloud platforms) then stick around.

Service-oriented architecture is the core of our framework and the implementation of these services is the best place to start.

IMPLEMENTATION:

How do we go about defining the infrastructure and functionality of a single service that we can plug into any project? 

  1. Infrastructure 

Infrastructure needs to be provisioned as easily as possible. Infrastructure-as-code (IAC) comes to the rescue here. It really is the solution to autonomous provisioning of infrastructure, saving time and automating tasks (the primary philosophy behind writing any code in the first place). AWS offers CloudFormation as an IAC solution, where the stack is configured in YAML or JSON templates. But we decided to go with Terraform. Not only is it platform independent to avoid lock-in (in theory anyway), but Terraform files (written in HCL) have two other great benefits:

    • Declarative configuration: Terraform handles the order in which the infrastructural components are provisioned and destroyed based on their dependencies, abstracting much of the pain of IAC away from the developer
    • Modules: Terraform modules allow us to build and define an isolated group of infrastructure with complex dependencies, which takes simple input as configuration and outputs the necessary infrastructure IDs. These infrastructural black boxes are perfect for our service-oriented architecture

Terraform modules use 3 files (variables.tf, main.tf and output.tf) for taking an input (project name, etc.), provisioning the resources, and outputting the IDs (respectively). More information on Terraform modules can be found in the Terraform documentation.

The outputs of these modules are very useful. If we create a module representing the infrastructure of a service such as Contact Importer (which, as we know, includes a microservice and a queue), we can output the necessary IDs so that the framework responsible for deploying code to the provisioned infrastructure can directly reference these components through these IDs.

To store these IDs in a clean JSON file we can make use of the jq command-line tool:

terraform output -json | jq 'with_entries(.value |= .value)' > infra.json
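For readers less familiar with jq, here is a minimal Python sketch of what that filter does: it keeps each output name and replaces the metadata object with its bare value. The sample output names and values are invented for illustration.

```python
import json


def flatten_terraform_outputs(raw: str) -> dict:
    """Map each output name to its bare value, dropping type/sensitive metadata."""
    outputs = json.loads(raw)
    return {name: meta["value"] for name, meta in outputs.items()}


# Sample shaped like `terraform output -json`; names and values are made up.
sample = json.dumps({
    "queue_url": {"sensitive": False, "type": "string", "value": "https://sqs.example/queue"},
    "infra_arn": {"sensitive": False, "type": "string", "value": "arn:aws:iam::123456789012:role/example"},
})

flat = flatten_terraform_outputs(sample)
print(json.dumps(flat, indent=2))
```

The flattened shape is what makes single-key lookups from other tools straightforward.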

So now in our project Makefile we can run terraform apply followed by terraform output (above) on all the modules the project needs (one for each service). And these modules can be maintained by a specific team. 

We can see now that a Terraform module is the infrastructure behind a given service.

Note: we can pass the stage like so: terraform apply -var "stage=${STAGE}" and reference it as an input variable in the modules to create independent resources for each stage (test, dev, prod).

Finally, we don’t want to use local storage for the infrastructure state, so we set up a backend S3 bucket for this purpose. Every time we provision a new AWS environment for a client we include a standardized S3 bucket for it. This allows us to have multiple developers or servers (e.g. CI/CD) interacting with the same infrastructure. We use the -backend-config option of terraform init for backend state configuration rather than configuring it inside the module, so we can handle dynamic environments more easily.
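As a rough sketch of that last point, a small helper can assemble the terraform init arguments per environment. This assumes the Terraform configuration declares an empty S3 backend block (partial configuration); the bucket name, the state key layout and the function name are hypothetical, not taken from our actual setup.

```python
import shlex
from typing import List


def terraform_init_args(bucket: str, project: str, stage: str, region: str) -> List[str]:
    """Build a `terraform init` command that supplies S3 backend settings at run time."""
    # Hypothetical key layout: one state file per project and stage.
    key = f"{project}/{stage}/terraform.tfstate"
    return [
        "terraform", "init",
        f"-backend-config=bucket={bucket}",
        f"-backend-config=key={key}",
        f"-backend-config=region={region}",
    ]


args = terraform_init_args("example-tf-state", "acme-etl", "dev", "eu-west-1")
print(shlex.join(args))
```

Keeping these values out of the module means the same module can be initialised against a different bucket for each client environment.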

Now we can simply reference infra.json (where the json output was redirected to) from any frameworks we use for deploying our source code (such as the serverless application framework) which we will discuss next.

 

  2. Source code deployment

The serverless application framework (SLS) is responsible for deploying code to the Lambdas which it references through the output of terraform, once the infrastructure is provisioned.

Each service contains a single serverless.yml file which is mainly responsible for provisioning and configuring lambdas for deployment (but also may provision certain infrastructural components such as API certificates). For any global or service specific infrastructure provisioned by terraform, sls can reference these like so:

role: ${file(infra.json):infra_arn}
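To make the lookup concrete, here is a small Python sketch that mimics how such a reference resolves: read the named JSON file and return one top-level key. It is only an illustration of the idea, not how the serverless framework is implemented, and the function name and sample ARN are made up.

```python
import json
import re
import tempfile
from pathlib import Path

# Matches references of the form ${file(<path>):<key>} with a flat key.
_FILE_REF = re.compile(r"^\$\{file\((?P<path>[^)]+)\):(?P<key>\w+)\}$")


def resolve_file_reference(expression: str, base_dir: Path):
    """Return the value a ${file(path):key} style reference points at."""
    match = _FILE_REF.match(expression.strip())
    if match is None:
        raise ValueError(f"not a file reference: {expression!r}")
    data = json.loads((base_dir / match["path"]).read_text())
    if match["key"] not in data:
        raise KeyError(f"{match['key']} is missing from {match['path']}")
    return data[match["key"]]


# Demo with a throwaway infra.json holding an invented ARN.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "infra.json").write_text(
        json.dumps({"infra_arn": "arn:aws:iam::123456789012:role/example"})
    )
    resolved = resolve_file_reference("${file(infra.json):infra_arn}", base)

print(resolved)
```

A missing key usually means the terraform output step has not been run for that stage yet, which is why the sketch fails loudly instead of returning a default.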

Now you can see why the jq bash command came in handy.

Examples of global infrastructure that every project contains (independent of the services it uses) are SNS (simple notification service) for handling alert-on-metrics and being able to send alerts to our internal systems, and a secret manager, for storing application secrets that Lambdas can consume.

To sum up what we’ve accomplished here: we started off with a conceptual idea of a service in the workflow of an ETL automation. Now we can define a full service through terraform and SLS which any AWS project can consume.

Each of these services exists in its own git repository. Each repository contains 2 important folders. One is the terraform module discussed above. The other is a client folder which any project can clone into its own repo to use the given service. The client folder contains the serverless file and lambda function for deployment purposes. It also contains a terraform file, which does nothing other than reference the terraform module (by git repo URL) and provide basic input parameters / variables. Some of these it pulls from the project config (such as project_name); others it consumes from the terraform commands (such as stage).

Now when a developer or partner wants to spin up a new project they simply look up which services they need and pull them from their respective git repositories into the project. They now have services which they can individually create in their project through 2 simple commands: terraform apply and sls deploy. An automation script can easily perform all this work too (which is what we did).
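A minimal sketch of what such an automation script might look like is below. It only builds and prints the commands; the directory names, the service names other than Contact Importer, and the helper names are assumptions for illustration rather than our actual script.

```python
import shlex
import subprocess
from typing import List, Tuple

Step = Tuple[str, List[str]]  # (working directory, command)


def deployment_steps(services: List[str], stage: str) -> List[Step]:
    """Provision infrastructure once, then deploy each service in order."""
    steps: List[Step] = [
        # "infrastructure" is a hypothetical folder name for the terraform files.
        ("infrastructure", ["terraform", "apply", "-auto-approve", "-var", f"stage={stage}"]),
    ]
    for name in services:
        steps.append((f"services/{name}", ["sls", "deploy", "--stage", stage]))
    return steps


def run(steps: List[Step]) -> None:
    """Execute each step in its directory, stopping at the first failure."""
    for cwd, command in steps:
        subprocess.run(command, cwd=cwd, check=True)


steps = deployment_steps(["file_fetcher", "contact_importer"], "dev")
for cwd, command in steps:
    print(f"(cd {cwd} && {shlex.join(command)})")
```

Printing the plan first gives a cheap dry run; calling run(steps) instead would execute it for real.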

 

We haven’t talked yet about the actual code performing the processes which are core to each service. With reference to the project structure above, all the code would sit in the lambda function in the client directory. Although it’s nice to have a consistent language across different services, it’s technically not a requirement. For now, we stick to Python for one important reason: we maintain a library written in Python which abstracts most of the functionality we will use in these services.

This library is an abstraction on top of a Qualtrics SDK, which is in itself an abstract client to the Qualtrics API. In the future, using tools like Swagger on top of an OAS spec, we can generate API clients in any language and offer even more flexibility to developers.

This library also contains functionality for performing common tasks in AWS, such as interacting with Queues, pulling credentials from a secret manager, etc. Lastly it exposes well-defined functions which can be overridden by developers for performing custom automations.

These keep the Lambdas very small and concise. Ideally, each service shouldn’t have to exceed 20 lines of code. Below are some examples.

[Figure: example Lambda functions, with the custom code highlighted in blue]

In the above diagram, the code highlighted in blue is the only custom code necessary for a developer to write in order to add custom functionality. The rest is basic boilerplate code for each function, and 90% of the code is abstracted into the libraries you can see imported.
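As a rough illustration of that pattern, here is a self-contained sketch. The class and method names are hypothetical stand-ins and not the API of our internal library; the load step only echoes its input where the real service would call the Qualtrics API.

```python
import json
from typing import List, Optional


class ContactImporter:
    """Hypothetical stand-in for a service base class from the shared library."""

    def parse(self, event: dict) -> List[dict]:
        # SQS-triggered Lambdas receive messages under "Records", each with a "body".
        return [json.loads(record["body"]) for record in event.get("Records", [])]

    def transform(self, contact: dict) -> dict:
        """Hook intended to be overridden; the default passes contacts through."""
        return contact

    def load(self, contacts: List[dict]) -> dict:
        # Placeholder for the real import call.
        return {"imported": len(contacts), "contacts": contacts}

    def handle(self, event: dict, context: Optional[object] = None) -> dict:
        return self.load([self.transform(c) for c in self.parse(event)])


class LowercaseEmailImporter(ContactImporter):
    # The only custom code a project would need to write.
    def transform(self, contact: dict) -> dict:
        return {**contact, "email": contact["email"].lower()}


def handler(event, context):
    """Lambda entry point: boilerplate that delegates to the service class."""
    return LowercaseEmailImporter().handle(event, context)


event = {"Records": [{"body": json.dumps({"email": "Ada@Example.COM"})}]}
result = handler(event, None)
print(result)
```

Only the overridden transform method is project-specific; everything else would live in the library and be shared across projects.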

This library is deployed to a Nexus repository manager which can easily be consumed by all services sitting in AWS. This library is maintained to a high standard with >90% test coverage.

Earlier, we mentioned that a terraform module is the infrastructure behind a given service. Well, the lambda function is the functionality behind a given service. Allowing the infrastructure and functionality to work in harmony in isolated repositories / components, which can be injected into any AWS project, is what makes this framework so powerful.

The final piece of this framework is the generation of a new boilerplate project, kind of like npx create-react-app my-app for React.

In our case we use a global project configuration JSON for configuration such as project_name, region, custom_domain and services. Then we trigger the automation for generating a new project:

  • This will use a template to create a brand new project, with a boilerplate Readme, Runbook, Makefile, and CI/CD configuration
  • These elements should be fairly consistent across each project. For example, our service-oriented architecture lays down much of the Runbook for on-call engineers, particularly in relation to dead-letter queues, and how to re-run failed automations.
  • The services config defines a list of services (in a specified order, so we can link the outputs of one microservice to the inputs of the next queue)
  • The automation pulls those services from the /client directories of their respective git repos, placing the terraform file in a global infrastructure folder in the project repo, and the rest in a new /services/<service_name> directory (see serverless services)
  • Using serverless services and terraform modules in harmony is perfect for building out clean monorepos like this for client automations
  • The Makefile is prepopulated with common commands like make test (which runs all the default tests of each service), make deploy (which runs sls deploy on each service) and make infra (which runs terraform apply on the infrastructure and outputs the IDs). It also abstracts the management of environment variables for stages (test, dev, prod).
  • These commands are then used in a default CI/CD file (e.g. GitLab CI, Jenkins, etc.) for creating a test and build pipeline to enforce good standards for each project
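The generation steps above can be sketched roughly as follows. The config keys come from the description above, while the service names (other than Contact Importer), the folder names and the function name are assumptions made for the example.

```python
import json
from typing import Dict


def plan_project(config: dict) -> Dict[str, object]:
    """Work out where each service's files land and how services chain together."""
    services = config["services"]
    layout = {
        name: {
            # Terraform file goes to the shared infrastructure folder.
            "terraform_file": f"infrastructure/{name}.tf",
            # Serverless file and Lambda code go to the service's own folder.
            "service_dir": f"services/{name}",
        }
        for name in services
    }
    # Order matters: each service's output feeds the next service's queue.
    links = list(zip(services, services[1:]))
    return {"project_name": config["project_name"], "layout": layout, "links": links}


config = {
    "project_name": "acme-etl",
    "region": "eu-west-1",
    "custom_domain": "etl.example.com",
    "services": ["file_fetcher", "transformer", "contact_importer"],
}

plan = plan_project(config)
print(json.dumps(plan, indent=2))
```

Because the links are derived purely from the order of the services list, reordering the list in the config is enough to rewire the workflow.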

 

EVOLUTION OF THIS FRAMEWORK: 

As long as we follow some standards in the creation of new services, this framework evolves efficiently and is easy to maintain:

  • Following a standard for directory structure and config file names used by the services (e.g. infra.json, the /client and /tf_module folders, etc.)
  • Ensuring each new service has tests for the default functionality. If the default functionality is changed by a developer in their project by overriding an internal method, then the tests should fail and, as a result, the build pipeline of the developer’s project should also fail. This forces developers to add new tests if they add any custom engineering to steps in the ETL workflow.
  • If a brand new ETL automation step which lends itself to this architecture has the potential to be reused in other projects, it should be added as a new service. This also ensures that developers maintain an intimate understanding of the implementation of this framework, so that they don’t become too reliant on the automations.
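The second standard can be illustrated with a small sketch: a default test shipped with the service passes against the stock behaviour and fails once a project overrides it. The class names and the pass-through default are hypothetical, chosen only to show the mechanism.

```python
import unittest


class Transformer:
    """Hypothetical service class whose default transform is a pass-through."""

    def transform(self, record: dict) -> dict:
        return record


class UppercaseTransformer(Transformer):
    """A project-level override that changes the default behaviour."""

    def transform(self, record: dict) -> dict:
        return {**record, "email": record["email"].upper()}


def default_tests_pass(service_class) -> bool:
    """Run the service's default test against the given class."""

    class DefaultBehaviourTest(unittest.TestCase):
        def test_transform_is_passthrough(self):
            record = {"email": "Ada@Example.com"}
            self.assertEqual(service_class().transform(record), record)

    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DefaultBehaviourTest)
    result = unittest.TestResult()
    suite.run(result)
    return result.wasSuccessful()


stock_ok = default_tests_pass(Transformer)
custom_ok = default_tests_pass(UppercaseTransformer)
print(stock_ok, custom_ok)
```

The failing default test is the signal that the developer now owns that behaviour and needs to replace the test with one covering the custom logic.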

That was quite a lot to digest in a blog post.  To summarize the main technical takeaways from this project:

For serverless applications (and service-oriented architecture), use infrastructure as code and SLS in harmony. Join the two forces and join them well. Enforce good standards and abstract unnecessary elements from developers (but don’t abstract too much!). Expose what the developers may need to change and hide what the developers always need, but no more than that. Do not bloat a framework with unnecessary abstractions. Every engineer will at some point fall victim to the law of leaky abstractions and we want to minimize that. Antoine de Saint-Exupéry once said, “Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.”

Working on this project at Qualtrics not only confirmed my engineering instincts about abstraction and extensibility of code, but also gave me a great sense of satisfaction: I got to see a framework develop naturally, while the team spent fewer hours repeating mundane tasks and more hours engineering high quality solutions. This directly converts to savings on dev time and on time spent responding to client requests. This is essential for the business to maintain customer satisfaction (we are in the experience economy after all) and to save costs in general. And that in itself is quite a pleasant experience.

Dennis Callanan
Dennis is a Software Engineer in Qualtrics as part of the Engineering Services team. He primarily develops automations and integrations for an array of clients using various tools and services. He studied in Maynooth University and in North Carolina, and enjoys playing music, hurling, and binge-watching movies.
