Intern Project: Creating a Global Search using Solr

The Internal Systems team at Qualtrics wants to provide our clients, fellow employees, with an easy way to search through data. This data is in many different formats and has various use cases.

For example:

  • A sales representative wants to look up an employee to see if they are out-of-office.
  • A finance lead wants to search for a client’s current renewal status.
  • An engineering intern wants to find the summary of a bug closed over a year ago.

All of this data exists on our internal website, but it is spread out over different pages and modules. Indeed, each example above required a different search. We lacked a one-stop shop for finding internal information.

My Project

Given many categories of data, each one a separate table in our database, I needed to create a single search bar that would search through all of them and display relevant results. This “Global Search” had to be quick, accurate, and easy to use.

As an intern, I found the task daunting, but my team had already developed many crucial tools that helped with the project. For example, we had a fast autocomplete search bar with the backend implemented through Apache Solr, an open-source search engine. However, users could only find results for a single category, and the most relevant results were not necessarily at the top. Nevertheless, this was a strong starting point for a robust Global Search.

Challenges

I needed to solve several problems to satisfy all the Global Search requirements.

In particular, I had to figure out how to:

  • Aggregate the data categories into a single data source.
  • Process human search terms.
  • Boost relevancy.

And of course the overarching challenge – how to do all of this with Apache Solr.

Solr

Apache Solr, aka Solr, is an extremely powerful open-source search platform built on Apache Lucene. At a high level, it takes data from the specified data source, maps it to premade field types, and indexes everything. Then, you can hit the API and efficiently query the data.

What I love about Solr is that it works great out-of-the-box – simply set a data source and use default fields – and also allows for incredible customizations. Indeed, when working with Solr I never encountered a problem that did not have some complex (but working) custom solution.

Since my team had already implemented some basic search functionality with Solr, I decided to continue with the open-source technology for Global Search. I did not regret that decision – Solr provided all the tools I needed to finish the project.

Aggregating Data

Within the Solr instance, each data category is indexed and stored in a different collection, and each collection gets data from a MySQL table. For example, employee data is in the “employeeCollection” in Solr, with the data import pulling from the “Employee” table.

Given this design, there were three options to aggregate categories such as Employee, Client, and Wiki.

  1. Combine them in the database in some aggregate view.
  2. Combine them in Solr in some aggregate collection.
  3. Aggregate them on the frontend when displaying results.

In order to keep the powerful functionality provided by Solr, it made sense to go with option two. Unfortunately, Solr does not natively support combining collections, so we had to find a way to do this ourselves.

Solution: I stumbled upon this Stack Overflow post, which outlined the basis of aggregating data within Solr. It is a scrappy solution that worked well for our purposes. In short, it is necessary to create an empty unification core, globalCollection, and specify the other collections as a comma-separated list in the shards query parameter:

 

shards : myurl/solr/employeeCollection,myurl/solr/clientCollection,myurl/solr/wikiCollection

 

This distributes any query to globalCollection to all the shards, and aggregates results. So by creating a unification core, the user can get data from all the categories by searching only one source.
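As a rough illustration of the request described above, the following Python sketch builds a select URL against the unification core. The host "myurl", the helper name build_global_search_url, and the rows value are assumptions made for the example; only the shards parameter and the collection names come from the setup above.

```python
from urllib.parse import urlencode

# Shard locations from the example above; "myurl" stands in for the real host.
SHARDS = [
    "myurl/solr/employeeCollection",
    "myurl/solr/clientCollection",
    "myurl/solr/wikiCollection",
]


def build_global_search_url(base_url, query, shards=SHARDS, rows=10):
    """Return a select URL for globalCollection that fans out to every shard."""
    params = {
        "q": query,
        "rows": rows,
        # Solr expects the shards as a single comma-separated string.
        "shards": ",".join(shards),
    }
    return f"{base_url}/globalCollection/select?{urlencode(params)}"


url = build_global_search_url("http://myurl/solr", "ivan")
print(url)
```

The shards list could equally be set once as a default on the request handler, so that callers only ever send the search term.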

Processing Human Search Terms

Humans are pretty hard to understand, especially for a computer. Ideally, a user would type out the exact search term, with proper capitalization and special symbols, and only get results that match it perfectly. However, thanks to how good modern search engines are, people have high expectations for a computer to understand what they mean. So when using the search term “ivan’s inc”, the expectation is to be able to match the result with “IvanZait Inc”.

Solution: Solr has several built-in and configurable tools that allow for this kind of transformation. Indeed, most modern search engines will have some tools with the same functions.

 

White Space Tokenizer: This simply splits up the search terms and result data by white spaces. After applying it to the example above we would get:

Search term: “ivan’s”, “inc”

Result term: “IvanZait”, “Inc”

 

Lower Case Filter: This transforms everything to lowercase, an obvious but necessary tool.

Search term: “ivan’s”, “inc”

Result term: “ivanzait”, “inc”

 

English Possessive Filter: This removes any possessive endings from words.

Search term: “ivan”, “inc”

Result term: “ivanzait”, “inc”

 

Edge N-Gram Filter: This creates many different terms of increasing length from the result term.

Search term: “ivan”, “inc”

Result term: “i”, “iv”, “iva”, “ivan”, “ivanz”, “ivanza”, “ivanzai”, “ivanzait”, “i”, “in”, “inc”

 

The Solr configuration for the above filters and tokenizers:

        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50"/>

 

After applying all these transformations, every search term finally matches one of the result terms, so “IvanZait Inc” would appear in the results for the given search.

Search term: “ivan”, “inc”

Result term: “i”, “iv”, “iva”, “ivan”, “ivanz”, “ivanza”, “ivanzai”, “ivanzait”, “i”, “in”, “inc”

 

These are universal tools in dealing with human text input, and are especially important for creating a robust search.
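To make the chain concrete, here is a small Python imitation of the four steps. It is not Solr code, only a sketch of the same transformations, and it assumes (as the walkthrough above implies) that edge n-grams are generated for the indexed result terms but not for the search terms. All function names are made up for the example.

```python
def whitespace_tokenize(text):
    """Split on runs of white space."""
    return text.split()


def lowercase(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]


def strip_possessive(tokens):
    """Drop a trailing 's, accepting a straight or curly apostrophe."""
    out = []
    for t in tokens:
        if len(t) >= 2 and t[-2] in ("'", "\u2019") and t[-1] in ("s", "S"):
            t = t[:-2]
        out.append(t)
    return out


def edge_ngrams(tokens, min_gram=1, max_gram=50):
    """Emit every prefix of each token between min_gram and max_gram characters."""
    grams = []
    for t in tokens:
        for n in range(min_gram, min(len(t), max_gram) + 1):
            grams.append(t[:n])
    return grams


def analyze_query(text):
    """Search-side chain: tokenize, lowercase, strip possessives."""
    return strip_possessive(lowercase(whitespace_tokenize(text)))


def analyze_indexed(text):
    """Index-side chain: the same steps followed by edge n-grams."""
    return edge_ngrams(analyze_query(text))


query_terms = analyze_query("ivan's inc")
indexed_terms = analyze_indexed("IvanZait Inc")
# A result matches when every search term is among its indexed terms.
matches = all(term in indexed_terms for term in query_terms)
print(query_terms, matches)
```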

Boosting Relevancy

Sadly, Solr is not initially tuned to understand what is relevant and what is not. For example, searching for “Ivan” might return the Wiki result where “Ivan” is mentioned in a footnote above the Employee result for “Ivan Zaitsev”. After I tested searches on the default Solr configuration, I realized that relevancy was completely broken.

Solution: First, create several new fields common among all the collections: priority_high, priority_med, and priority_low. Then, copy appropriate existing fields to each of the new fields. For example, the employee name should be copied to priority_high, while the employee job title and office location can be copied to priority_med or even priority_low.

 

The full mapping for employeeCollection might look something like this:

priority_high : employee_name, id

priority_med : employee_email, department_name, type

priority_low : department_title, employee_phone, location
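The effect of this mapping can be imitated in a few lines of Python. In Solr the copying happens at index time through the schema; the sketch below only mimics the outcome on a single record. The sample employee record and the helper name add_priority_fields are invented for illustration.

```python
# Mapping from the example above: priority field -> source fields copied into it.
PRIORITY_MAPPING = {
    "priority_high": ["employee_name", "id"],
    "priority_med": ["employee_email", "department_name", "type"],
    "priority_low": ["department_title", "employee_phone", "location"],
}


def add_priority_fields(doc, mapping=PRIORITY_MAPPING):
    """Return a copy of doc with multi-valued priority fields filled in."""
    enriched = dict(doc)
    for target, sources in mapping.items():
        # Source fields missing from the record are skipped.
        enriched[target] = [str(doc[s]) for s in sources if s in doc]
    return enriched


# Hypothetical record; every value here is made up.
employee = {
    "id": 42,
    "employee_name": "Ivan Zaitsev",
    "employee_email": "ivan@example.com",
    "location": "Office A",
}
enriched = add_priority_fields(employee)
print(enriched["priority_high"])
```

Because every collection exposes the same three priority fields, a single query can weight them without knowing which category a record came from.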

 

Second, add custom boosting to the priority fields. For example:

priority_high boost : 7

priority_med boost : 3

priority_low boost : 0.5

 

In Solr Terms:

defType : edismax

qf : priority_high^7 priority_med^3 priority_low^0.5  

 

With the eDisMax query parser enabled through the defType parameter, the boosts listed in qf are multiplied into the “relevancy score” of each matching field. For instance, if a search matches a field in priority_high, then that score is multiplied by 7, which pushes the result toward the top of all search results.
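The arithmetic behind these field boosts can be sketched as a toy model in Python. It assumes the usual disjunction-max behavior (the best boosted field score wins, with an optional tie factor adding a share of the others) and uses invented raw scores; Solr's real scoring involves more than this.

```python
# Field boosts from the qf parameter above.
QF_BOOSTS = {"priority_high": 7.0, "priority_med": 3.0, "priority_low": 0.5}


def dismax_score(field_scores, boosts=QF_BOOSTS, tie=0.0):
    """Toy disjunction-max score: best boosted field plus tie times the rest."""
    boosted = [boosts[f] * s for f, s in field_scores.items() if f in boosts]
    if not boosted:
        return 0.0
    best = max(boosted)
    return best + tie * (sum(boosted) - best)


# Invented raw scores: the wiki footnote matches more strongly on its own,
# but only in a low-priority field.
employee_score = dismax_score({"priority_high": 1.0})
wiki_score = dismax_score({"priority_low": 4.0})
print(employee_score, wiki_score)
```

In this toy case the employee record outranks the wiki footnote even though its raw match score is lower, which is the behavior the boosts are meant to produce.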

Even when we applied custom boosting, certain rarely-used categories were showing up above the rest. Since I had historical data on category search frequency, I also manually boosted each collection so the most popular categories show up at the top of results.

 

For example:

employeeCollection boost: 1.6

clientCollection boost: 1.1

wikiCollection boost: 0.7

 

In Solr Terms:

defType : edismax

boost: product(
    if(termfreq(collection,'Employee'),1.6,1),
    if(termfreq(collection,'Client'),1.1,1),
    if(termfreq(collection,'Wiki'),0.7,1))
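The boost expression above can be generated from a table of per-collection factors instead of being written by hand. The Python sketch below does that and also evaluates the same product locally to show what each category ends up multiplied by. The helper names are assumptions, and the local evaluation only mirrors the if/termfreq logic for a record that belongs to exactly one collection.

```python
# Per-collection factors from the example above.
COLLECTION_BOOSTS = {"Employee": 1.6, "Client": 1.1, "Wiki": 0.7}


def build_boost_function(boosts=COLLECTION_BOOSTS, field="collection"):
    """Build the product(if(termfreq(...),factor,1),...) expression as a string."""
    clauses = [
        f"if(termfreq({field},'{name}'),{factor},1)"
        for name, factor in boosts.items()
    ]
    return "product(" + ",".join(clauses) + ")"


def collection_boost(collection, boosts=COLLECTION_BOOSTS):
    """Evaluate the same product for a record that belongs to one collection."""
    factor = 1.0
    for name, value in boosts.items():
        # Each clause contributes its factor only when the collection matches.
        factor *= value if collection == name else 1.0
    return factor


boost_expression = build_boost_function()
print(boost_expression)
print(collection_boost("Employee"), collection_boost("Wiki"))
```

Keeping the factors in one table makes it easier to retune them as search-frequency data changes.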

 

With these major changes (and several dozen minor tweaks), the top result was nearly always relevant.

Further Solr Exploration: Machine Learning

Recently, Solr has started to support some basic machine learning through the Learning To Rank contrib module. At its core, Learning To Rank allows the developer to define an array of features for each data entry, and then apply a trained model to get the most relevant results.

As of now, getting training data and actually training the model must be done offline (outside of the Solr instance). Some engineers at Bloomberg LP describe how they used Learning To Rank to improve search relevancy in the Bloomberg Terminal.

For some detail on how to gather implicit training data, I would suggest the following research article on optimizing search engines.
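The idea of features plus a trained model can be illustrated with a deliberately small Python sketch. It assumes a plain linear model; the feature names, weights, and candidate records are all invented, and in practice the weights would come from the offline training step rather than being typed in.

```python
def linear_model_score(features, weights):
    """Weighted sum of feature values; features without a weight count as zero."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def rerank(candidates, weights):
    """Sort candidates by model score, highest first."""
    return sorted(
        candidates,
        key=lambda c: linear_model_score(c["features"], weights),
        reverse=True,
    )


# Hypothetical weights standing in for the output of offline training.
WEIGHTS = {"original_score": 1.0, "is_employee": 2.0, "title_match": 3.0}

# Hypothetical candidates, listed in the order the first-pass search returned them.
candidates = [
    {"id": "wiki-footnote",
     "features": {"original_score": 2.5, "is_employee": 0.0, "title_match": 0.0}},
    {"id": "employee-ivan",
     "features": {"original_score": 1.5, "is_employee": 1.0, "title_match": 1.0}},
]

reranked_ids = [c["id"] for c in rerank(candidates, WEIGHTS)]
print(reranked_ids)
```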

Final Thoughts

In today’s online workflows, search is an absolutely crucial aspect for maneuvering around billions of bytes of data. The Google search bar has abstracted out many of the intricacies involved. For instance, spelling corrections, natural language interpretation, typing predictions, PageRank, context matching, and other processes happen during each search. Yet all the user experiences is a slight delay and then a page of tailored results.

Before I started on this project I had no idea how an effective search engine was created – I assumed it just magically worked. Now, after delving deep into Apache Solr, I am beginning to understand all of the tools and components that compose an effective search engine. I managed to harness this open source solution and create a search which combines many different data categories. Indeed, I believe this functionality will become more and more relevant as data becomes ubiquitous and distributed.

Overall, this summer internship at Qualtrics has been an amazing learning experience. I was surprised to be given such a high-value project and grateful for the trust my team had in me. And with this blog post, I’m excited to be able to say I have added to the literature of search.

Ivan Zaitsev
Ivan is a returning Software Engineering Intern at Qualtrics. For both years, he has been on the Internal Systems team building out products for fellow Qualtrics employees. Back at school he will finally finish his undergraduate studies at Cornell University and get his bachelor's degree in Computer Science. On the side, Ivan likes to dance hip hop and walk around cities aimlessly.
