Building a Customizable Platform for Sentiment Analysis
On May 12th we launched a brand new Sentiment Analysis tool to the world as part of the Qualtrics Experience Management Platform. We know Sentiment is important for understanding unstructured text, which is a rich repository of hidden insights. Sentiment can be a powerful tool for getting a quick pulse on how people perceive your brand or diving deep to find specific problems about your product.
The task seems straightforward enough: given a piece of text, determine whether it has positive or negative sentiment. For years this task has been a subject of interest to Natural Language Processing researchers, and businesses have been leveraging a myriad of open-source Sentiment tools generated from that research. So why is Sentiment still considered an unsolved problem? Why do modern sentiment algorithms still get it wrong sometimes?
First, the concept of “sentiment” is still somewhat ill-defined. It turns out that even humans can’t quite agree on what the true “sentiment” of text should be. We conducted a study with 1000 verbatims and found that 35% of the time, a group of five Qualtrics employees couldn’t come to a consensus on the “correct” sentiment.
Second, the actual sentiment of text often depends on the context in which it is given. As humans, this comes naturally to us through our wealth of experiences and common sense. Algorithms face a much steeper learning curve without that innate knowledge. For example, we recognize instantly that “unpredictable” can be a positive aspect of a movie, but a negative aspect of an internet connection. The exact same text when given in different contexts has different sentiment, so we know that there’s a contextual component to getting it “right”.
But it’s not just domain-specific context that matters. It turns out that the exact same text could have different sentiment depending on what question was asked.
Third, there’s an unsolved problem in the field of Natural Language Processing called “coreference resolution”. Most languages have some method of specifying “coreferences” (sometimes called “anaphora”), where one word acts as a reference to a previous concept. In English, the words “it”, “that”, “this”, “she”, and “they” are typically good examples of anaphora because they don’t inherently indicate a specific thing or person, but are instead used to reference one. Coreference resolution is the act of disambiguating these references, that is, determining which concept is being referred to, and humans are exceptionally good at it.
In this example, we know that “it” refers to the pie, because it would be absurd for movies to taste great. Unfortunately, computers don’t have that context out of the box.
Perhaps most difficult of all, automated approaches to sentiment tend to lack the ability to infer information that is not stated explicitly. Consider the case where the language used is all positive, but only because the speaker is referring to how things used to be:
Here the user implies that they are unhappy with the current state of affairs, that it used to be great but now is unsatisfactory, but they are not using inherently negative language to say so. How simple it is for us humans to understand, and yet how difficult for programs.
Make or Buy?
Our goal at Qualtrics is to build “Iron Man, not Ultron” and “comprehensible magic”. We make complex data analysis intuitive, but we believe that the users need to remain in control to do so. The same philosophy applies to our Sentiment Analysis - it needs to be fully explainable.
There are also two main flavors of Sentiment: Lexicon (knowledge-based) methods and Statistical methods. Lexicon-based sentiment leverages a large lexicon of sentiment-laden words, each paired with a sentiment score. For example, “love” might have a score of +2.8 while “hate” might have a score of -3.4. In order to assign a label, the algorithm looks for the presence of those keywords and sums their scores. Some Lexicon-based approaches also take into account grammatical structures such as negation words (“not bad” should be positive, etc.). On the other hand, state-of-the-art statistical approaches to text classification, such as Recurrent Neural Networks (including LSTMs), require a large set of labeled training data (movie reviews, etc.) but automatically learn which words or features are important in the text.
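The lexicon-based approach described above can be sketched in a few lines of Python. This is a minimal illustration rather than the Qualtrics implementation: the scores for “love” and “hate” come from the example above, while the other scores, the negation list, the sign-flipping rule, and the neutral threshold are all assumptions made for the sketch.

```python
import re

# Toy lexicon. "love" and "hate" use the scores quoted in the text;
# the remaining entries are invented for illustration.
LEXICON = {"love": 2.8, "hate": -3.4, "bad": -2.5, "great": 3.1}
NEGATIONS = {"not", "never", "no"}  # assumed negation words


def score_text(text, lexicon=LEXICON, threshold=0.5):
    """Sum word scores, flipping the sign of a word that directly follows a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0.0
    for i, token in enumerate(tokens):
        if token not in lexicon:
            continue
        value = lexicon[token]
        if i > 0 and tokens[i - 1] in NEGATIONS:
            value = -value  # simplistic negation handling: "not bad" becomes positive
        total += value
    if total > threshold:
        return total, "positive"
    if total < -threshold:
        return total, "negative"
    return total, "neutral"
```

Calling score_text("not bad") labels the text positive because the negative score of “bad” is flipped, while text with no lexicon words falls inside the threshold band and is labeled neutral.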
With our initial rollout of Sentiment Analysis, explainability was paramount. That’s why we decided (initially) to roll our own Sentiment Analysis solution instead of using an external sentiment API, and also why we decided to go with a Lexical approach. There are certainly plenty of options to choose from, but by building our own lexical algorithm, we retained the ability to explain to customers why a particular label was positive or negative.
Obtaining a Lexicon
For starters, we began with an existing, tried-and-true sentiment lexicon for English called VADER (Valence Aware Dictionary and sEntiment Reasoner), which you can try out yourself. VADER’s sentiment lexicon is open-source (MIT License) and each word was scored by aggregating the assignments made by 10 human evaluators. Being somewhat geared towards short-text social data, it also includes a wide range of emoticons.
Starting with an existing lexicon gave us a foundation on which to build, which was also highly explainable to our clients. It’s easy to demonstrate why a verbatim was assigned a certain sentiment. More than that, it opens up the possibility of clients easily customizing their own lexicon if they have special sentiment words unique to their business.
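To make the explainability and customization points concrete, here is a small sketch. The function name, the dictionary-overlay design, and the client score for “spotty” are hypothetical and chosen only for illustration; the idea is that every label can be traced back to the words that produced it, and that a client lexicon can simply override or extend the base one.

```python
import re

BASE_LEXICON = {"love": 2.8, "hate": -3.4}  # scores from the example in the text


def explain(text, base=BASE_LEXICON, custom=None):
    """Return (total, contributions), where contributions lists each matched word and its score."""
    # Client-specific entries override or extend the base lexicon.
    lexicon = {**base, **(custom or {})}
    tokens = re.findall(r"[a-z']+", text.lower())
    contributions = [(t, lexicon[t]) for t in tokens if t in lexicon]
    total = sum(score for _, score in contributions)
    return total, contributions


# A client adds a domain-specific word with an assumed score.
total, contributions = explain(
    "I love the app but the wifi is spotty", custom={"spotty": -1.5}
)
```

The contributions list is the explanation: it shows exactly which words pushed the score up or down, and by how much.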
Improving the Lexicon
Of course, an off-the-shelf lexicon won’t match up perfectly with our clients’ domains. In order to tune this lexicon to be more appropriate for survey-like text (question-answering, rather than reviews or Twitter statements), we used a simple statistical technique to bootstrap and boost the basic lexicon.
The tricky part about modifying a sentiment lexicon is that with each subtle change (adding a word, removing a word, changing the score of a word), accuracy goes up on some types of text, but down on others. In order to evaluate whether or not our changes were helping across the board we gathered 5 diverse datasets. These included two academic sentiment datasets with hand-labels applied by researchers, as well as three of our own internal surveys in order to fine-tune the lexicon for more survey-like data.
We know that Sentiment is hard, and there’s no one metric for success which would capture all of the above complexities. So what kind of approximate metrics can we use? How do we benchmark the quality of an algorithm?
In academic literature, Sentiment is usually posed as a classification task. In order to evaluate a new version of the lexicon, we compare the Precision and Recall of the lexicons against the “ground truth” of the 5 datasets. We don’t use Accuracy as a metric because it can be misleading, particularly on imbalanced datasets (where there are disproportionately more negative reviews than positive reviews, for example).
To put some intuition behind the usefulness of Precision and Recall, consider an engineering alerting system like VictorOps or PagerDuty. If the goal is to have one alert for every incident with your website, Recall could be thought of as “What percent of incidents triggered alerts?” Precision could be thought of as “What percent of alerts were actually incidents?” Low-Precision, High-Recall would represent a very noisy system which alerts for all true incidents, but also generates a ton of noise, alerting for a whole bunch of false positives. Conversely, a system with High-Precision, Low-Recall would never give any false alarms, but would miss a lot of real incidents.
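The definitions above, and the reason Accuracy can mislead on imbalanced data, can be illustrated with a short sketch. The labels and the toy data are invented for the example.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, treating `label` as the 'positive' class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Imbalanced toy data: 8 negative responses and 2 positive ones.
y_true = ["neg"] * 8 + ["pos"] * 2
# A lazy classifier that always answers "neg".
y_pred = ["neg"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
pos_precision, pos_recall = precision_recall(y_true, y_pred, "pos")
```

Here the lazy classifier reaches 80% accuracy while its precision and recall on the positive class are both zero, which is exactly the failure that Accuracy hides.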
The task was to figure out which words should be added to the lexicon and which word-scores needed adjustment in order to improve precision and recall globally across all of our test datasets. We can’t afford to greatly improve sentiment for some clients if it means degrading sentiment quality for others. To do this, we identified words which had a high “classification bias”, which means that they were consistently present in documents that were supposed to be more positive or more negative.
Each time a word occurs in a document, that occurrence contributes to the word’s bias. In our preliminary tests with clients, we found that it’s much worse to get the polarity completely wrong (Positive instead of Negative, or vice versa) than it is to mistake a sentiment-laden response as Neutral. To capture this preference, we impose a large bias penalty whenever the polarity is mistaken and a small penalty whenever the algorithm misclassifies Neutral.
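The text does not spell out the exact formula, so the following is only one plausible reading of the idea, with loudly assumed details: the penalty weights (1.0 and 0.25), the signed-error encoding, and averaging per occurrence are all hypothetical choices made for this sketch.

```python
from collections import defaultdict

# Assumed penalty weights; the text only says "large" and "small".
POLARITY_PENALTY = 1.0   # positive predicted for negative truth, or vice versa
NEUTRAL_PENALTY = 0.25   # the mistake involves the neutral label
SIGN = {"negative": -1, "neutral": 0, "positive": 1}


def word_bias(documents):
    """documents: iterable of (tokens, true_label, predicted_label).

    Returns {word: mean signed penalty per occurrence}. A negative value hints
    that the word's score should move down, a positive value that it should move up.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for tokens, truth, predicted in documents:
        error = SIGN[truth] - SIGN[predicted]  # one of -2, -1, 0, 1, 2
        if error == 0:
            penalty = 0.0
        elif abs(error) == 2:
            penalty = POLARITY_PENALTY
        else:
            penalty = NEUTRAL_PENALTY
        direction = 1 if error > 0 else -1
        for word in tokens:
            counts[word] += 1
            totals[word] += direction * penalty
    return {word: totals[word] / counts[word] for word in counts}


docs = [
    (["thanks", "slow"], "negative", "positive"),   # polarity flipped: large penalty
    (["thanks", "great"], "positive", "positive"),  # correct: no penalty
    (["nor", "helpful"], "negative", "neutral"),    # neutral confusion: small penalty
]
bias = word_bias(docs)
```

Words with a consistently large bias in one direction become candidates for a new lexicon entry or a score adjustment.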
After applying this simple method to all the words across all of our datasets, we discovered some interesting insights:
Some words that weren’t currently in our lexicon should be added and should be negative. “Nor” is a great example. “Nor” is not inherently a negative word; it doesn’t carry any substantial meaning in isolation. But over all datasets, “nor” was consistently used instead of “or” in highly critical or negative text. So we added “nor” to our lexicon as a negative word. Other discovered words were more obvious (they were clearly sentiment-laden, but they just weren’t in the original VADER lexicon), for example: “prompt”, “spotty”, and “intermittent”.
Another interesting discovery was that some existing words needed significant adjustment. “Thanks” in particular was a strong example. In the context of surveys, many respondents tend to go on long rants or negative diatribes, yet end their comments with “thank you” or “thanks”. In isolation, “thanks” truly should be a positive word, but we found that it was being considered too positive. Softening the positive score for “thanks” and other similarly biased words actually drove up precision and recall. Likewise, “treat” is not inherently negative, and in isolation it may seem entirely positive (a tasty “treat”). But in customer feedback corpora, it was more commonly used to talk about a bad experience (“is this how you treat your customers?”).
By modifying the original VADER lexicon with added words and adjusted word-scores, we were able to significantly boost precision and recall over the whole set of corpora.
We measured overall improvement of the system via F1-score: a combination (harmonic mean) of precision and recall. Looking at the components of the F1-score, the bias-corrected lexicon actually did a bit worse on negative precision, but saw massive gains in negative recall. Precision and recall are intrinsically linked (it’s quite difficult to improve one without damaging the other). Because we had previously seen that the original VADER lexicon had very poor negative recall on survey-style data, we were willing to accept this tradeoff for the benefit of catching more truly negative responses.
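For reference, the F1-score and the kind of tradeoff described above can be sketched as follows. The precision and recall values are hypothetical numbers picked to illustrate the shape of the tradeoff; they are not results from our evaluation.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical numbers: trade a little precision for a lot of recall.
before = f1_score(precision=0.80, recall=0.30)
after = f1_score(precision=0.70, recall=0.60)
```

With these made-up values, F1 rises from roughly 0.44 to roughly 0.65 even though precision drops, because the harmonic mean is dominated by the weaker of the two components.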
Sentiment is a tough problem to crack because it’s still an ill-defined problem and requires human-level context to truly get it right. All things considered, we’re quite happy with our first pass at Sentiment Analysis as a feature. This custom-tuned lexicon is simple to implement, the quality is comparable to many competitors, and the results are highly explainable to our clients. Of course, no model is perfect and there will be times where the algorithm gets it wrong. For those cases, we provided the ability to manually correct the sentiment whenever you disagree with the assigned label. Every correction made gives us critical information with which to improve the algorithm and make it even better for our next iteration.