Backend Development

Teaching Machines to Read Emails: Feature Selection

Zach McDonnell // January 20, 2017 // 9min read

In my previous post, I laid out the design for the ticket topic prediction model used at Qualtrics. This system sorts incoming support emails into different topics, which are then used to route emails internally and suggest potential replies. Due to the amount of detail involved in developing this system, I want to elaborate on how feature selection works in this system.

Why is Feature Selection Important?

As amazing as computers are, they are dumb when it comes to understanding text. It might as well be a different language—and if you think about it, it is! Humans read text, computers read binary. So for a computer to read text and extract meaning from it, we need to translate our text into something consumable by our model. That “something” is called a feature set: a list of inputs (usually represented as numbers) that a model can understand. Feature Selection is simply the process of extracting meaning and reducing noise and interference from your raw text documents. Just like a refinery process for gasoline, there are different grades of quality produced.

Text Processing

The first step in data mining is cleaning your data, also called text processing. Text processing isn’t glamorous but it is probably the most critical step in machine learning because bad data in will only give you bad data out. The work involved is tedious yet the concepts are simple. Here are the text processing steps that we followed to clean our data.

Filter Documents

Remove spammy or “bad” documents like out-of-office emails and marketing emails
Remove short emails and other emails with “insufficient information”

Standardize Words

Remove formatting like HTML, CSS, and JavaScript
Lowercase all words
Remove numbers
Remove punctuation
Remove stopwords like “the”, “it”, and “that”
Stem words
Condense returns, newline characters, and other spaces

Filtering your documents is a great first step because it removes “weak learners” from your data set, which lets you build the model with less data and therefore faster. It can also increase classification accuracy by reducing the noise from overrepresented documents (e.g. out-of-office emails). Forgetting to clean your data can lead to a lot of redundant and garbage features that will mask your desired features and ultimately reduce model performance.
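The filtering and standardization steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code we ran in production: the stopword list, the spam markers, the `MIN_TOKENS` threshold, and the crude `naive_stem` function are all assumptions made for the example, and a real pipeline would use a proper stemmer and a much larger stopword list.

```python
import html
import re

# Tiny illustrative stopword list (assumption); real lists are much longer.
STOPWORDS = {"the", "it", "that", "a", "an", "is", "to", "and", "of", "in"}

# Hypothetical markers for documents that should be dropped entirely.
SPAM_MARKERS = ("out of office", "unsubscribe")

# Assumed threshold for "insufficient information".
MIN_TOKENS = 3


def naive_stem(word):
    """Crude suffix stripper, a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def standardize(text):
    """Apply the standardization steps in order and return a list of tokens."""
    # Drop script/style blocks first, then any remaining tags.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", text)
    text = re.sub(r"<[^>]+>", " ", text)
    text = html.unescape(text)
    text = text.lower()
    text = re.sub(r"\d+", " ", text)      # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)  # remove punctuation
    tokens = text.split()                 # splitting also condenses whitespace
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


def keep_document(raw_text):
    """Return False for spammy documents and for documents that are too short."""
    lowered = raw_text.lower()
    if any(marker in lowered for marker in SPAM_MARKERS):
        return False
    return len(standardize(raw_text)) >= MIN_TOKENS


tokens = standardize("<p>The survey is NOT sending 3 emails!</p>")
```

In this sketch the order matters: markup is stripped before punctuation so that tag brackets do not leave stray tokens behind, and stopwords are removed before stemming so the stopword list can contain plain, unstemmed words.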

Feature Generation Methods

A feature is a single input, and a group of features is a feature set. To generate features from text there are a few ways to go, but the most common is n-grams. Features from n-grams are words or phrases whose value can be binary (i.e. does the word/phrase exist in this document?), a term frequency (i.e. how often does the n-gram appear?), or something more advanced like tf-idf.
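To make the three kinds of feature values concrete, here is a small sketch that computes each one for a toy corpus. The documents and function names are invented for illustration, and the tf-idf shown is the plain unsmoothed variant (raw count times log of inverse document frequency); libraries typically use smoothed or normalized variants, so treat this as one possible definition rather than the one our system used.

```python
import math
from collections import Counter

# Toy corpus of already-tokenized documents (invented for this example).
docs = [
    ["survey", "not", "send"],
    ["survey", "link", "broken", "link"],
    ["reset", "password"],
]

vocabulary = sorted({term for doc in docs for term in doc})


def binary_features(tokens, vocabulary):
    """1 if the term occurs in the document, else 0."""
    present = set(tokens)
    return {term: int(term in present) for term in vocabulary}


def term_frequency(tokens, vocabulary):
    """Raw count of each term in the document."""
    counts = Counter(tokens)
    return {term: counts[term] for term in vocabulary}


def tf_idf(tokens, vocabulary, corpus):
    """Raw count weighted by log(N / document frequency), unsmoothed."""
    counts = Counter(tokens)
    n_docs = len(corpus)
    scores = {}
    for term in vocabulary:
        doc_freq = sum(1 for doc in corpus if term in doc)
        idf = math.log(n_docs / doc_freq) if doc_freq else 0.0
        scores[term] = counts[term] * idf
    return scores


binary = binary_features(docs[1], vocabulary)
counts = term_frequency(docs[1], vocabulary)
weighted = tf_idf(docs[1], vocabulary, docs)
```

For the second document, “link” scores higher than “survey” under tf-idf because it appears twice and occurs in only one document, whereas “survey” is shared with another document.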

Additionally, there are other methods such as word embeddings, part of speech taggers, or analyses like latent semantic analysis. However, these methods require more time to implement and may not provide significant gains over n-grams with some problems. Using n-grams is simple and, with some adjustments, can be just as effective.

Generating N-grams

N-grams are a great improvement over single-word tokenization because they capture relationships between multiple words. For example, the unigram “car” and the bigram (two-word token) “car crash” carry different meanings. For our model, we decided to generate n-grams and add them to our feature set. After testing different sizes of n-grams, we found that generating bigrams and trigrams provided significant gains in our model performance, whereas n-grams above size three did not provide substantial additional benefit.

However, there’s a catch with this expanded feature set. Because n-grams naturally have a greater unique token count than single word tokens, adding n-grams to a feature set can inflate it significantly. For example, take the text from the Declaration of Independence. There are more unique bigrams and trigrams than there are unique words.
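The sketch below shows one simple way to generate n-grams from a token list and to count how many unique tokens each size produces. The function names and the sample sentence are made up for the example; the point is only to show how the unique count grows with n even on a very short text.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(tokens, max_n=3):
    """Combine unigrams, bigrams, ... up to max_n into one feature list."""
    features = []
    for n in range(1, max_n + 1):
        features.extend(ngrams(tokens, n))
    return features


def unique_counts(tokens, max_n=3):
    """Map each n-gram size to its number of distinct n-grams."""
    return {n: len(set(ngrams(tokens, n))) for n in range(1, max_n + 1)}


tokens = "the car hit the car and the car stopped".split()
sizes = unique_counts(tokens)
```

Here the nine-word sentence has 5 unique words, 6 unique bigrams, and 7 unique trigrams. Repeated words collapse into one unigram, but the same word in different contexts produces different bigrams and trigrams, which is why the feature set inflates.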

Furthermore, if your model generation has a polynomial runtime, then the overall time to generate the model and process a single document can rise steeply as the feature set grows. This is another reason we limited our max n-gram to trigrams.

Optimizing the Feature Set

Even while limiting our n-gram selection to bigrams and trigrams, our model generation and run times were longer than we liked. Because of this, we decided to optimize for information gain across each of the levels in our hierarchical ticket topics and filter out terms with low information gain. We chose information gain over a chi-squared test because information gain is often used for decision trees and our hierarchical model design is similar to a decision tree. Here’s an example of the labels and the ticket topics we are attempting to assign to each email.

In layman's terms, optimizing for information gain means selecting features that identify a particular node. For example, the phrase, “take a survey,” is probably more useful than, “could you help,” in identifying a particular node.

In the first level of the hierarchy above there are nine nodes. Information gain takes each node and determines which words minimize entropy, the measure of disorder in the corpus. If you’re interested in knowing how entropy is calculated check out this webpage.

IG(T, a) = H(T) - H(T|a), where IG = information gain, H(T) = entropy before division, and H(T|a) = entropy after division on attribute a

If the presence of a word decreases entropy significantly, then it’s useful and we should keep it in our feature set. Otherwise, we should get rid of it. In our model, we took each branch on each level and calculated the information gain for each word and selected the words that minimized entropy for that node. This helped tremendously by reducing our feature set from thousands to hundreds.
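As a rough sketch of the calculation, the code below computes entropy over the labels and the information gain of splitting the documents on whether a term is present. The toy documents, labels, and the `min_gain` cutoff are assumptions for illustration; the thresholds we actually used are not shown here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(docs, labels, term):
    """H(T) - H(T|a), where a is the presence or absence of `term`."""
    with_term = [lab for doc, lab in zip(docs, labels) if term in doc]
    without_term = [lab for doc, lab in zip(docs, labels) if term not in doc]
    total = len(labels)
    conditional = (
        (len(with_term) / total) * entropy(with_term)
        + (len(without_term) / total) * entropy(without_term)
    )
    return entropy(labels) - conditional


def select_features(docs, labels, min_gain):
    """Keep terms whose information gain reaches the (assumed) cutoff."""
    vocabulary = {term for doc in docs for term in doc}
    return sorted(
        term for term in vocabulary
        if information_gain(docs, labels, term) >= min_gain
    )


# Toy data: each document is the set of n-grams it contains.
docs = [
    {"take a survey", "could you help"},
    {"take a survey"},
    {"reset password", "could you help"},
    {"reset password"},
]
labels = ["survey", "survey", "account", "account"]

kept = select_features(docs, labels, min_gain=0.5)
```

On this toy data, “take a survey” splits the two labels perfectly and has an information gain of 1 bit, while “could you help” appears equally under both labels and has a gain of 0, so only the former (along with “reset password”) survives the filter.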

In addition, to mitigate the exclusion of important words not captured with information gain, we roll up the words from each child node into its parent node when selecting our feature set. So the feature set for the first level with GS|1 - GS|9 includes all the words in the levels below like GS|1|A, GS|3|C, and GS|8|B.

By rolling up the words selected from information gain, we’re able to use the entire feature set for the first level. However, the next level only includes the words specific to that branch on that level. This roll-up process minimizes the processing requirements and thus increases the speed of classification. It also trimmed the overall size of the model from gigabytes to megabytes.
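One way to picture the roll-up is as a union of each node’s selected terms into every ancestor. The sketch below assumes node labels are pipe-delimited paths like GS|1|A and that `selected` maps each node to the terms information gain picked for it; the terms themselves and the function names are invented for the example and are not taken from our model.

```python
def ancestors(node):
    """Return the ancestors of a label like 'GS|1|A', nearest first."""
    parts = node.split("|")
    return ["|".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]


def roll_up(selected):
    """Add every node's selected terms to all of its ancestors."""
    rolled = {node: set(terms) for node, terms in selected.items()}
    for node, terms in selected.items():
        for ancestor in ancestors(node):
            rolled.setdefault(ancestor, set()).update(terms)
    return rolled


# Hypothetical per-node selections (terms invented for illustration).
selected = {
    "GS|1": {"take a survey"},
    "GS|1|A": {"survey link"},
    "GS|3": {"reset password"},
    "GS|3|C": {"login error"},
}

rolled = roll_up(selected)
```

After the roll-up, the entry for GS (the first-level decision) holds all four terms, the entry for GS|1 holds only the terms from its own branch, and a leaf such as GS|3|C keeps just its own terms. Each level therefore only evaluates the features relevant to the branch it is deciding within.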

Performance Gains

So what were our performance gains from this feature selection process? Our f1-score (see original post) was below 0.1 without any of the initial text processing or the addition of n-grams. After these changes, our f1-score increased to around 0.30. Then we pruned the hierarchy and removed bad nodes with little to no data, which increased our score to 0.35. Furthermore, we now run an adjusted form of adaptive boosting each week to improve the model accuracy, and our score has increased to as high as 0.58. In terms of latency, optimizing the feature set with information gain translated into a drop from seconds to milliseconds to sort a new email (sorting an email includes text processing and traversing the model hierarchy).

Final Thoughts on Feature Selection

For this project, feature selection was just as important to our performance as the model itself. Feature selection is simply the process of extracting meaning and reducing noise and interference from your raw text documents. Similar to gas for a car, feature selection provides your model with different grades of quality, which can dramatically fuel the performance of any model. We initially spent a lot of time trying to tweak our model with minimal success, and it was only after refining the feature selection process that we saw significant performance gains. In machine learning, the model gets the spotlight, yet a quality feature selection process is just as important for performance.

Topics Automation Data Feature Selection Machine learning Preprocessing Support Text Processing

Zach McDonnell

September 30, 2019
