Maintaining Availability in a $2.5B Company
How We Manage Availability
Qualtrics operates and develops products in a service-oriented architecture. We went through the same evolution that many large companies go through: build a monolithic application, then break it apart into manageable chunks. But how do you maintain availability and performance across that much complexity? The short answer is visibility. You need to know what is going on in your product, from the physical layer through the customer experience.
We build transparency into our product so that our engineers can monitor, maintain, and ensure the best possible customer experience for our market researchers. Our engineers work in a development operations (or DevOps) model where developers are responsible for the quality, uptime, and performance of the product. Our DevOps team builds frameworks and tools that help our engineers quickly configure key monitors that not only watch whether a system or service is healthy, but also directly monitor the customer experience.
So, we all know that we need monitoring and alerting. But what exactly does that mean? Can’t you just throw up a quick ping test in Nagios to make sure your server is alive? That’s a start, but we take it a step further. The key to ensuring the best customer experience is to understand the customer’s journey from beginning to end. It means not just knowing when a problem happens, but spotting the warning signs that one is coming. It means understanding the extent of the impact on our customer experience.
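That "warning signs" idea can be made concrete with a leading-indicator check: instead of alerting only when a service is down, alert when its error rate is trending toward trouble. Here is a minimal sketch of the concept (the class name, window size, and thresholds are illustrative, not our actual tooling):

```python
from collections import deque

class ErrorRateTrend:
    """Track a sliding window of request outcomes and flag a rising
    error rate before it becomes a full outage."""

    def __init__(self, window=100, warn_at=0.05, page_at=0.20):
        self.outcomes = deque(maxlen=window)  # True = request failed
        self.warn_at = warn_at    # notify the team early
        self.page_at = page_at    # wake up the on-call engineer

    def record(self, failed):
        self.outcomes.append(failed)

    def status(self):
        if not self.outcomes:
            return "ok"
        rate = sum(self.outcomes) / len(self.outcomes)
        if rate >= self.page_at:
            return "critical"
        if rate >= self.warn_at:
            return "warning"
        return "ok"

monitor = ErrorRateTrend(window=50)
for _ in range(45):
    monitor.record(False)   # healthy traffic
for _ in range(5):
    monitor.record(True)    # 10% of recent requests failing
print(monitor.status())     # "warning": degraded, but not yet down
```

The point is the two-tier threshold: a "warning" fires while the customer experience is merely degraded, giving engineers a head start before the "critical" page.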
Defining the Stack
Let’s break this down. Any typical server/client architecture has three major components: the edge devices, the application, and the platform infrastructure.
We call it ‘the edge’ because these are devices at the very edge of our network. This means your laptop or mobile device at home and the cable modem that your device is connected to. As engineers, the edge is outside our scope of control. It’s the wild. Once it’s out there, all we can do is observe. To monitor the edge, we rely on a combination of synthetic (or simulated) user-monitoring tools such as Catchpoint, and application performance monitoring tools such as NewRelic. Catchpoint gives us the ability to simulate the user experience and alert if that experience is degraded in any way. NewRelic gives us visibility into the application as it runs on a user’s browser. This is called Real User Monitoring (or RUM).
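Catchpoint does far more than this, but the core of a synthetic probe is simple: pretend to be a user, time the request end to end, and judge the experience rather than just the uptime. A toy version of the idea (the URL, thresholds, and function names are our own illustration):

```python
import time
import urllib.request
import urllib.error

def classify(status_code, latency_s, slow_after=2.0):
    """Turn one synthetic probe into a monitoring verdict."""
    if status_code is None or status_code >= 500:
        return "down"
    if latency_s > slow_after:
        return "degraded"  # the page loads, but the experience is poor
    return "healthy"

def probe(url, timeout=5.0):
    """One synthetic 'user' request, timed end to end."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except (urllib.error.URLError, TimeoutError):
        status = None  # unreachable counts the same as a server error
    return classify(status, time.monotonic() - start)
```

Note the "degraded" verdict: a 200 response that takes too long is still a bad customer experience, which is exactly the nuance a bare ping test misses.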
The application is what we manage. A respondent will access a survey on our system and, in doing so, will touch various code bases, application servers, databases, and content delivery systems. This is the core of what Qualtrics does. As such, it’s important to have a significant amount of data and visibility into what our application is doing on our hardware. We’re most likely to catch a problem within the boundaries of our product first. Some examples of tools we use here are MonYog and Sensu (an alternative to Nagios). MonYog gives us visibility into database health, which is a core piece of our product. Sensu is a host-based monitoring application that can be configured to monitor everything from system health metrics such as load and memory utilization to application monitoring such as looking for an active Apache web server process.
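Sensu checks follow the Nagios plugin convention: a small script whose exit code is the verdict (0 = OK, 1 = WARNING, 2 = CRITICAL). A sketch of the "is Apache running" style of check, assuming a Linux host with pgrep (real Sensu checks also emit output the server records):

```python
import subprocess

def check_process(name):
    """Return an (exit_code, message) pair in the Nagios plugin
    convention: 0 = OK, 1 = WARNING, 2 = CRITICAL."""
    # pgrep -x exits 0 if at least one process has exactly this name
    result = subprocess.run(["pgrep", "-x", name], capture_output=True)
    if result.returncode == 0:
        return 0, f"OK: {name} is running"
    return 2, f"CRITICAL: {name} is not running"

# In a real check script you would finish with:
#   code, msg = check_process("apache2")
#   print(msg)
#   sys.exit(code)
```

Because the contract is just an exit code and a line of output, the same check works unchanged under Sensu, Nagios, or a plain cron job.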
Finally, the platform is what our product lives on: hardware, network, server racks, cooling, routers and switches, and the data lines by which the content is delivered. Internally, we call it oxygen. You can’t see it, but you need it to live, and if you don’t have it, you’re in critical condition. We use a combination of LibreNMS and Nagios to look for network errors, connectivity issues, traffic trends, and so on.
Tying It All Together
In a mature organization, you need not just monitoring and alerting tools like those listed above, but also solid processes for reacting to problems quickly. This is typically called an incident management process.
The process always starts with an event. Something breaks. Someone trips on a cable. A tornado wipes out a data center. Barnaby chews on the network cable for the office (...don’t ask). This is where your monitoring tools, like the ones listed above, kick in. They see something go wrong and they start alerting. Alerts go to engineering teams and on-call engineers in the form of automated emails and phone calls. The on-call engineer can then begin troubleshooting the problem. If the problem is big enough, we trigger an incident. At this point, the owner of the problem corrals any and all resources onto a call to work the problem until it’s fixed. Think red alert when the USS Enterprise is getting attacked. Once the problem has been fixed, the team goes into a root cause analysis with a focus on preventing future failures and understanding the impact to our customers. We never want to see the same failure happen twice, so we use the opportunity to harden our product.
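The routing logic in that flow (email for minor issues, phone call plus an incident for major customer-facing ones) can be sketched as a simple decision function. The severity levels and routing rules below are illustrative, not our actual escalation policy:

```python
def route_alert(severity, customer_facing):
    """Decide how an alert reaches engineers and whether to
    open an incident. Purely illustrative routing rules."""
    if severity == "critical" and customer_facing:
        # Major customer impact: page the on-call and rally everyone
        return {"notify": ["phone", "email"], "open_incident": True}
    if severity == "critical":
        # Serious but internal: page the on-call, no war room yet
        return {"notify": ["phone", "email"], "open_incident": False}
    # Minor issues wait for someone to read their email
    return {"notify": ["email"], "open_incident": False}
```

The useful property is that the escalation decision is explicit and testable, rather than living in an engineer's head at 3 a.m.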
In coming posts we’ll dive into the nitty-gritty, like how to set up meaningful monitors, how to create alerts that are actionable, how to manage an incident, and how to conduct a productive root cause analysis. If you are interested in learning about how our Qualtrics engineers developed these tools and best practices, stay tuned for more!