Teaching Machines to Read Emails: Feature Selection
In my previous post, I laid out the design for the ticket topic prediction model used at Qualtrics. This system sorts incoming support emails into topics, which are then used to route emails internally and suggest potential replies. Because there is so much detail involved in developing such a system, I want to elaborate here on how feature selection works in this system.
Why is Feature Selection Important?
As amazing as computers are, they are dumb when it comes to understanding text. It might as well be a different language, and in a sense it is: humans read text, computers read binary. So for a computer to read text and extract meaning from it, we need to translate our text into something consumable by our model. That “something” is called a feature set: a list of inputs (usually represented as numbers) that a model can understand. Feature selection is simply the process of extracting meaning and reducing noise and interference from your raw text documents. Just like refining gasoline, the process produces different grades of quality.
The first step in data mining is cleaning your data, also called text processing. Text processing isn’t glamorous, but it is probably the most critical step in machine learning, because bad data in will only give you bad data out. The work involved is tedious, yet the concepts are simple. Here are the text processing steps we followed to clean our data.
- Remove spammy or “bad” documents such as out-of-office and marketing emails
- Remove short emails and other emails with “insufficient information”
- Lowercase all words
- Remove numbers
- Remove punctuation
- Remove stopwords like “the”, “it”, and “that”
- Stem words
- Condense returns, newline characters, and other spaces
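The cleaning steps above can be sketched in a few lines of Python. The stopword list and the suffix-stripping stemmer below are illustrative stand-ins, not what we actually used; a real pipeline would use a full stopword list and a proper stemming algorithm such as Porter stemming.

```python
import re

# Tiny illustrative stopword list -- a real one has a few hundred entries.
STOPWORDS = {"the", "it", "that", "a", "an", "and", "of", "to", "is"}

def crude_stem(word):
    # Naive suffix stripping; stand-in for a real stemmer (e.g. Porter).
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def clean(text):
    text = text.lower()                   # lowercase all words
    text = re.sub(r"\d+", " ", text)      # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)  # remove punctuation
    tokens = text.split()                 # condenses newlines and extra spaces
    tokens = [t for t in tokens if t not in STOPWORDS]  # drop stopwords
    return [crude_stem(t) for t in tokens]
```

For example, `clean("The surveys failed!\nError 404 occurred.")` returns `["survey", "fail", "error", "occurr"]` — numbers, punctuation, and stopwords are gone, and the remaining words are reduced to (crude) stems.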
Filtering your documents is a great first step because it removes “weak learners” from your data set, which allows you to build the model from less data and speeds up training. It can also increase classification accuracy by reducing the noise from over-represented documents (e.g., out-of-office emails). Forgetting to clean your data can leave a lot of redundant, garbage features that mask your desired features and ultimately reduce model performance.
Feature Generation Methods
A feature is a single input, and a group of features is a feature set. There are a few ways to generate features from text, but the most common is n-grams. An n-gram feature is a word or phrase, and its value can be binary (i.e., does the word/phrase exist in this document?), a term frequency (i.e., how often does the n-gram appear?), or something more advanced like tf-idf.
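As a sketch of those first two representations, here is what binary and term-frequency features over a fixed vocabulary might look like in Python (the vocabulary and tokens here are made up for illustration):

```python
from collections import Counter

def binary_features(tokens, vocab):
    # Binary option: does each vocabulary term appear in this document?
    present = set(tokens)
    return {term: term in present for term in vocab}

def tf_features(tokens, vocab):
    # Term frequency: how often does each vocabulary term appear?
    counts = Counter(tokens)
    return {term: counts[term] for term in vocab}

tokens = ["survey", "fail", "survey", "response"]  # a cleaned document
vocab = ["survey", "fail", "export"]               # hypothetical feature set
```

Here `binary_features(tokens, vocab)` gives `{"survey": True, "fail": True, "export": False}` while `tf_features(tokens, vocab)` gives `{"survey": 2, "fail": 1, "export": 0}`; tf-idf would further down-weight terms that appear in most documents.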
Additionally, there are other methods such as word embeddings, part-of-speech taggers, or analyses like latent semantic analysis. However, these methods require more time to implement and may not provide significant gains over n-grams for some problems. Using n-grams is simple and, with some adjustments, can be just as effective.
N-grams are a great improvement over single-word tokenization because they capture relationships between multiple words. For example, the unigram “car” and the bigram (two-word token) “car crash” carry different meanings. For our model, we decided to generate n-grams and add them to our feature set. After testing different sizes of n-grams, we found that generating bigrams and trigrams provided significant gains in model performance, whereas n-grams above size three did not provide substantial additional benefit.
However, there’s a catch with this expanded feature set. Because n-grams naturally have a greater unique token count than single word tokens, adding n-grams to a feature set can inflate it significantly. For example, take the text from the Declaration of Independence. There are more unique bigrams and trigrams than there are unique words.
Furthermore, if your model generation has a polynomial runtime in the number of features, the overall time to generate the model and process a single document can balloon as the feature set grows. This is another reason we capped our n-grams at trigrams.
Optimizing the Feature Set
Even while limiting our n-gram selection to bigrams and trigrams, our model generation and classification time was longer than we liked. Because of this, we decided to optimize for information gain across each of the levels in our hierarchical ticket topics and filter out terms with low information gain. We chose information gain over a chi-squared test because information gain is often used for decision trees, and our hierarchical model design is similar to a decision tree. Here’s an example of the labels and the ticket topics we are attempting to assign to each email.
In layman's terms, optimizing for information gain means selecting features that identify a particular node. For example, the phrase, “take a survey,” is probably more useful than, “could you help,” in identifying a particular node.
In the first level of the hierarchy above there are nine nodes. Information gain takes each node and determines which words minimize entropy, the measure of disorder in the corpus. If you’re interested in how entropy is calculated, the information gain formula is below.
IG(T, a) = H(T) − H(T|a)

where IG(T, a) is the information gain from splitting on attribute a, H(T) is the entropy before the split, and H(T|a) is the entropy after the split.
If the presence of a word decreases entropy significantly, then it’s useful and we should keep it in our feature set. Otherwise, we should get rid of it. In our model, we took each branch on each level, calculated the information gain for each word, and selected the words that minimized entropy for that node. This helped tremendously, reducing our feature set from thousands of features to hundreds.
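As a rough sketch (not our production code), computing information gain for a single word might look like this, where each document is a set of tokens and each label is its ticket topic; the documents and labels below are made up to echo the “take a survey” vs. “could you help” example:

```python
from collections import Counter
from math import log2

def entropy(labels):
    # H(T): Shannon entropy of the class-label distribution.
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(docs, labels, word):
    # IG(T, a) = H(T) - H(T|a), splitting documents on the word's presence.
    with_word = [lab for doc, lab in zip(docs, labels) if word in doc]
    without = [lab for doc, lab in zip(docs, labels) if word not in doc]
    n = len(labels)
    h_cond = sum(len(part) / n * entropy(part)
                 for part in (with_word, without) if part)
    return entropy(labels) - h_cond

docs = [{"take", "survey"}, {"take", "survey", "help"},
        {"could", "help"}, {"billing", "help"}]
labels = ["survey", "survey", "support", "billing"]

# information_gain(docs, labels, "survey") -> 1.0
# "help" appears across several topics, so its gain is much smaller.
```

Here “survey” splits the documents cleanly by topic and earns a high information gain, while “help” appears in every topic and earns a low one, so we would keep the former and drop the latter.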
In addition, to mitigate the exclusion of important words not captured with information gain, we roll up the words from each child node into its parent node when selecting our feature set. So the feature set for the first level with GS|1 - GS|9 includes all the words in the levels below like GS|1|A, GS|3|C, and GS|8|B.
By rolling up the words selected from information gain, we’re able to use the entire feature set for the first level. However, the next level only includes the words specific to that branch on that level. This roll-up process minimizes the processing requirements and thus increases the speed of classification. It also trimmed the overall size of the model from gigabytes to megabytes.
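A minimal sketch of that roll-up, using a hypothetical tree keyed by node labels in the GS|… style from the hierarchy above (the selected words per node are made up):

```python
def rolled_up_features(tree, node):
    # A node's feature set is the union of its own selected words
    # and the words selected everywhere below it.
    features = set(tree[node]["selected"])
    for child in tree[node]["children"]:
        features |= rolled_up_features(tree, child)
    return features

tree = {
    "root":   {"selected": {"help"},       "children": ["GS|1", "GS|3"]},
    "GS|1":   {"selected": {"survey"},     "children": ["GS|1|A"]},
    "GS|1|A": {"selected": {"distribute"}, "children": []},
    "GS|3":   {"selected": {"billing"},    "children": []},
}
```

With this tree, `rolled_up_features(tree, "root")` is `{"help", "survey", "distribute", "billing"}`, while the GS|1 branch only carries `{"survey", "distribute"}` — each level below the first works with the smaller, branch-specific set.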
So what were our performance gains from this feature selection process? Our f1-score (see original post) was below .10 without any of the initial text processing or the addition of n-grams. After these changes, our f1-score increased to around .30. Then we pruned the hierarchy and removed bad nodes with little to no data, which increased our score to .35. Furthermore, we now run an adjusted adaptive boosting process each week to improve the model’s accuracy, and our score has since increased to .58. In terms of latency, optimizing the feature set with information gain cut the time to sort a new email from seconds to milliseconds (sorting an email includes text processing and traversing the model hierarchy).
Final Thoughts on Feature Selection
For this project, feature selection was just as important to the performance of our model as the model itself. Feature selection is simply the process of extracting meaning and reducing noise and interference from your raw text documents. Like the different grades of gas for a car, the quality of your features can dramatically change the performance of any model. We initially spent a lot of time trying to tweak our model with minimal success; it was only after refining the feature selection process that we saw significant gains. In machine learning, the model gets the spotlight, yet a quality feature selection process is just as important for performance.