Writing Emails with Machine Learning

November 25, 2016 // 17min read


Although I am currently a software engineer, I started at Qualtrics helping clients as a support representative. After a year of answering hundreds of phone calls and writing thousands of emails, I noticed that I was writing the same emails over and over. I wished that we had some system that could write all of the repetitive emails for me. It would save me a lot of time and the company a lot of money.

After pivoting my career and learning a new skill set, I was tasked with solving this exact problem. Based on my background in support, we decided to create a recommender system that uses machine learning to suggest short snippets (which we call “blurbs”) to insert into a support rep’s reply. A recommender system would help a new support rep quickly understand which blurbs are best suited to the reply they are writing, and so increase their productivity.

How do you start to build something like this? My team at Qualtrics had little background in machine learning, so we did a lot of research to build our knowledge in the field. One key insight arose from reviewing the data collected from our support tickets. We found that we had several years’ worth of data on the topics assigned to each support ticket. Based on this discovery, we decided it would be best to predict the ticket topic and then suggest blurbs related to that topic. Having a labeled data set greatly simplified our approach. In particular, it meant we could treat this as a supervised learning problem instead of an unsupervised one.

Supervised Machine Learning Template

It took a lot of research and a lot of iteration to build this system, but everything we did in our recommender system boiled down to one of three stages of model generation: feature selection, model selection, and model performance feedback.


Please keep in mind that there is a lot that goes on in each step, but essentially these are the three steps in supervised machine learning:

  • Feature Selection/Generation is the process of taking your raw input, let’s say an email, and transforming it into a series of inputs (called features) that a machine learning model can understand and digest.
  • Model Selection/Generation is the process of selecting a machine learning model that can take your features and give the correct classification output.
  • Model Performance Feedback is the set of key indicators that measure how well your features and your models are working. Ideally these indicators should help you select new features and, ultimately, guide model generation.

Feature Selection

The feature selection/generation process answers the question: how do you represent your raw input in such a way that you can hand it to a model for prediction? Whenever I perform this step I feel like a miner extracting ore from a rock, wondering: “How do I extract as much ore as possible while maximizing purity and minimizing waste?” It’s a hard problem to solve, and our current feature extraction techniques are still very primitive.

One of the most common feature extraction techniques for text is called Bag of Words. Essentially you take your email, blow it up into one-word chunks, and then dump all of the words into a bag. Your model will then look in the bag and ask “What words do I have?” to make a prediction. Nothing fancy, but it’s something a model can understand and it’s surprisingly accurate. With the addition of some enhancements like stop word removal (dropping very common words such as “the”), stemming (reducing words to their root form), and n-grams (keeping short runs of adjacent words), you can significantly improve the accuracy of your bag of words. One day when we have more time we may consider other methods like word embedding or part of speech tagging; however, when we designed our system we were unfamiliar with these more advanced methods and decided to forgo them for now.
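To make the idea concrete, here is a minimal bag-of-words sketch in plain Python. It is an illustration rather than the code we run in production: the tiny stop word list, the `tokenize` and `bag_of_words` helpers, and the sample email are all made up for this example, and stemming is left out to keep it short.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption, not a real production list).
STOP_WORDS = {"a", "an", "and", "the", "to", "my", "is", "i", "of"}


def tokenize(text):
    """Lowercase the text and split it into word-like chunks."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bag_of_words(text, ngram_max=2):
    """Count single words plus n-grams up to ngram_max, after dropping stop words."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    bag = Counter(tokens)
    for n in range(2, ngram_max + 1):
        for i in range(len(tokens) - n + 1):
            bag[" ".join(tokens[i:i + n])] += 1
    return bag


email = "I cannot log in to my account. The log in page is broken."
features = bag_of_words(email)
print(features.most_common(5))
```

Each key in `features` (a word or a two-word phrase) becomes one input to the model, and its count is the value. Note that this simple version builds n-grams after stop words are removed and across sentence boundaries, which a more careful implementation might avoid.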

Model Selection

Once we had settled on our approach to feature selection, we turned to model selection. To give some context, most classification models are designed and tested on a small range of classes such as positive/negative, digit identification (0-9), or something similar. Our problem was a lot more complex because our ticket topics formed a list of over a thousand different topics arranged in a hierarchy.

Since we had so many ticket topics we took a step back and considered a few options.

Option 1: One Model to Rule them All

We discarded the idea of creating one overarching machine learning model because it was too unmanageable. The model is held in memory, and it quickly consumed a significant amount of RAM on the computer we were using. Because the model was so large, we decided to explore alternatives.

Option 2: One Model per Ticket Topic

In this idea, we considered creating a series of classifiers for all 1,000 ticket topics and taking the best one. This idea wasn’t feasible because it was too slow. From our tests, some of our models would take around one second to complete. If all 1,000+ ticket topic models took 1 second each, simple math tells us it would take about 17 minutes to predict a ticket topic for a single ticket. So a series of classifiers wasn’t an option.

Option 3: One Model per Node in Ticket Topic Hierarchy

Our last idea used the hierarchy of the ticket topics to generate our models. In this design, each node of the hierarchy has a model that predicts which branch to follow. The estimated run time for this scenario was much easier to handle. If my hierarchy of 1,000 ticket topics is 7 levels deep and each model takes 1 second to complete, then the runtime for one prediction is at most around 7 seconds. Given the 17-minute runtime of Option 2, and the ability to regenerate individual nodes instead of the mega-model of Option 1, we moved forward with Option 3.
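The back-of-the-envelope comparison between Option 2 and Option 3 can be written out as a few lines of Python. This is only a rough estimate under the same assumptions as the text (about 1 second per model, 1,000 topics, a hierarchy 7 levels deep); the function names are invented for this sketch.

```python
def flat_runtime_seconds(num_topics: int, seconds_per_model: float = 1.0) -> float:
    """Option 2: run one classifier per ticket topic and keep the best one."""
    return num_topics * seconds_per_model


def hierarchical_runtime_seconds(depth: int, seconds_per_model: float = 1.0) -> float:
    """Option 3: run one classifier per level while walking down the hierarchy."""
    return depth * seconds_per_model


flat = flat_runtime_seconds(1000)
hierarchical = hierarchical_runtime_seconds(7)
print(f"Option 2: {flat:.0f} s (about {flat / 60:.1f} minutes)")
print(f"Option 3: {hierarchical:.0f} s at most")
```

The key point is that the flat approach grows with the number of topics, while the hierarchical approach grows only with the depth of the tree.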

Measuring Model Performance

There are several different ways to measure model performance. The most basic is accuracy (i.e. what fraction of predictions are correct?). This measure sounds simple enough, but on close examination it can be badly misleading. For example, if you are trying to detect a cancer that occurs in 10 out of 1,000 people and your model guesses negative for all 1,000, you have an accuracy of 99% (990 correct out of 1,000) despite the fact that it has failed to detect any cancer.

In other words, the cost of a false negative is high, so accuracy is not a good measure here. In this case, you probably want to use recall. Basically, recall is the fraction of the 10 people who have cancer that your model correctly identifies. If your model recognizes 7 out of 10, then you have a recall of 0.7.

Now let’s imagine that our model predicts that all 1,000 people have cancer. Our recall is great, but now everyone thinks they have cancer. To address this problem, we decided to use something called an F1-score, which balances recall with precision (the fraction of positive predictions that are actually correct) like so:

F1 = 2 × (precision × recall) / (precision + recall)
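The cancer example can be worked through in a short Python sketch. The `scores` helper and the three made-up prediction lists below are assumptions for illustration only; the F1-score is computed as the harmonic mean of precision and recall, and metrics with a zero denominator are reported as 0.0 by convention here.

```python
def scores(actual, predicted):
    """Return accuracy, precision, recall and F1 for two lists of booleans."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # true positives
    fp = sum(1 for a, p in pairs if not a and p)      # false positives
    fn = sum(1 for a, p in pairs if a and not p)      # false negatives
    tn = sum(1 for a, p in pairs if not a and not p)  # true negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# 10 people out of 1,000 actually have the disease.
actual = [True] * 10 + [False] * 990

all_negative = scores(actual, [False] * 1000)              # never predicts cancer
all_positive = scores(actual, [True] * 1000)               # always predicts cancer
finds_seven = scores(actual, [True] * 7 + [False] * 993)   # catches 7 of the 10

for name, result in [("all negative", all_negative),
                     ("all positive", all_positive),
                     ("finds 7 of 10", finds_seven)]:
    print(name, {k: round(v, 3) for k, v in result.items()})
```

The "all negative" model scores high on accuracy but zero on recall and F1, while the "all positive" model has perfect recall but a very low F1 because its precision is only 0.01. Only the model that finds real cases without flooding everyone with false alarms gets a high F1-score.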


Putting It All Together

So what does our system look like? Each time a new email arrives, we send it to our model. The model navigates down the hierarchy until it either has insufficient information to continue or reaches a leaf. At that point, we tag the email and move on. This entire process takes just under a second per email. When the support rep opens the email, they see the suggested ticket topic and the highest-ranked blurbs tagged with that topic. If the ticket topic is incorrect, they can correct it and see a new list of suggested blurbs for the corrected topic.
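The descent through the hierarchy can be sketched as a simple loop that stops at a leaf or when the best branch is not convincing enough. Everything named here is hypothetical: `TopicNode`, `keyword_model`, `predict_topic`, the `min_confidence` threshold of 0.6, and the toy topics. The keyword counter merely stands in for a real trained classifier at each node.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class TopicNode:
    name: str
    # Per-node model: maps email text to {child name: score between 0 and 1}.
    model: Optional[Callable[[str], Dict[str, float]]] = None
    children: Dict[str, "TopicNode"] = field(default_factory=dict)


def keyword_model(keywords: Dict[str, List[str]]) -> Callable[[str], Dict[str, float]]:
    """Toy stand-in for a classifier: score each child by its share of keyword hits."""
    def predict(text: str) -> Dict[str, float]:
        words = text.lower().split()
        hits = {child: sum(kw in words for kw in kws) for child, kws in keywords.items()}
        total = sum(hits.values())
        if total == 0:
            return {child: 0.0 for child in keywords}
        return {child: count / total for child, count in hits.items()}
    return predict


def predict_topic(root: TopicNode, text: str, min_confidence: float = 0.6) -> List[str]:
    """Walk down the hierarchy; stop at a leaf or when the best branch is too uncertain."""
    path: List[str] = []
    node = root
    while node.children and node.model is not None:
        branch_scores = node.model(text)
        best_child, best_score = max(branch_scores.items(), key=lambda kv: kv[1])
        if best_score < min_confidence:
            break  # insufficient information to continue
        node = node.children[best_child]
        path.append(node.name)
    return path


survey = TopicNode(
    "Survey",
    model=keyword_model({"Distribution": ["distribution", "emails", "sending"],
                         "Logic": ["logic", "branch"]}),
    children={"Distribution": TopicNode("Distribution"), "Logic": TopicNode("Logic")},
)
root = TopicNode(
    "root",
    model=keyword_model({"Survey": ["survey", "question"],
                         "Account": ["login", "password"]}),
    children={"Survey": survey, "Account": TopicNode("Account")},
)

full_path = predict_topic(root, "my survey emails are not sending to my distribution list")
partial_path = predict_topic(root, "I have a question about my survey")
print(full_path, partial_path)
```

In this toy run the first email reaches a leaf, while the second stops one level down because the next model has nothing to go on. Tagging the email with the deepest topic reached still gives the support rep a useful, if less specific, suggestion.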

In addition, each week we refresh the ticket topic hierarchy and take the last ~10,000 emails with a ticket topic and train our model hierarchy. That way our model learns new topics and incorporates feedback from mistagged tickets. Model regeneration only takes about 30 minutes for the entire hierarchy.

We focus mainly on the F1-score of the first level to measure our performance. When we first released this system, the F1-score started around 0.4 but has steadily increased. At the time of writing, the F1-score has reached a high of 0.59. Here’s a graph to show how much this model has progressed:


In the future we hope to improve on the model we have created by incorporating some of the advanced techniques mentioned above (like word embedding) and by adjusting the design of our model. Currently, we only explore the best branch at each node, which caps the accuracy of our model at the accuracy of the first level in the hierarchy: a wrong turn at the top can never be corrected further down. By exploring the top N branches at each node we can mitigate these early mistakes. We hope to write future posts on this recommender system and explore certain details of the model in depth.
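One way the top N idea could work is a small beam search over the hierarchy. This is a sketch of a possible design, not something we have built: the `BRANCH_SCORES` table holds made-up scores that per-node models might return for a single email, `top_n_paths` is a hypothetical helper, and scoring a path by multiplying its branch scores is an assumption.

```python
from typing import Dict, List, Tuple

# Made-up branch scores for ONE email: parent topic -> {child topic: score}.
BRANCH_SCORES: Dict[str, Dict[str, float]] = {
    "root": {"Survey": 0.55, "Account": 0.45},
    "Survey": {"Logic": 0.6, "Distribution": 0.4},
    "Account": {"Login": 0.95, "Billing": 0.05},
}


def top_n_paths(scores: Dict[str, Dict[str, float]],
                root: str = "root",
                n: int = 2) -> List[Tuple[List[str], float]]:
    """Keep the n best partial paths at each level; return finished paths, best first."""
    beam: List[Tuple[List[str], float]] = [([root], 1.0)]
    finished: List[Tuple[List[str], float]] = []
    while beam:
        candidates: List[Tuple[List[str], float]] = []
        for path, score in beam:
            children = scores.get(path[-1])
            if not children:
                # Reached a leaf: record the path without the artificial root.
                finished.append((path[1:], score))
                continue
            for child, branch_score in children.items():
                candidates.append((path + [child], score * branch_score))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:n]
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished


greedy = top_n_paths(BRANCH_SCORES, n=1)  # today's behavior: best branch only
beam2 = top_n_paths(BRANCH_SCORES, n=2)   # explore the top 2 branches per level
print("best branch only:", greedy[0])
print("top 2 branches:  ", beam2[0])
```

With these numbers, following only the best branch commits to "Survey" at the first level, while keeping two candidates lets the very confident "Account" then "Login" path win overall. The trade-off is that up to N models run at each level, so runtime grows roughly N times.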
