Jiaqi Wu

Building an academic search engine

17 Jan 2020

In a text mining course, we were tasked with building an academic search engine for journals and competing with the class to obtain the most relevant search results given a specific query. Most of us had no idea where to start.

But, it turns out that we can decompose this problem into a simple supervised learning problem, and use a series of published “learning-to-rank” algorithms to help.

Feel free to follow along with our GitHub repository.


We need three things:

Both supervised and unsupervised algorithms exist for tackling this task. However, for supervised methods, we must have annotated query relevance scores. We need queries and documents (papers) for both methods.

Queries are straight forward–these are the search terms that a user would enter to find a paper. Here’s an example of what the queries for an academic search engine might look like:

{'0': ['information', 'retrieval'],
 '1': ['autoencoder'],
 '10': ['category', 'theory'],
 '11': ['smart', 'thermostat'],
 '12': ['spatial', 'multi', 'agent', 'systems'],
 '13': ['3', 'axis', 'gantry'],
 '14': ['cryptography'],
 '15': ['parametric', 'max', 'flow'],
 '16': ['graph', 'drawing'],
 '17': ['duality'],
 '18': ['fdir'],
 '19': ['personality', 'trait'],
 '2': ['sentiment', 'analysis'],
 '3': ['audio', 'visual', 'fusion'],
 '4': ['object', 'detection'],
 '5': ['gfdm'],
 '6': ['information', 'geometry'],
 '7': ['augmented', 'reality'],
 '8': ['prolog'],
 '9': ['wifi', 'malware']}

Example 1: queries for an academic search engine.

In terms of documents, we ideally want to extract the metadata that best defines a paper. Processing the entire manuscript for each paper in our database would be too computationally expensive. Instead, we concatenated the information where we would most expect user search terms to show up: * Title * Authors * Abstract * Keywords * Venue of publication

In this post, we will refer to “documents” as the concatenation of the metadata from a paper. There are lots of resources with free APIs for gathering papers and their metadata from scholarly journals.

Finally, we have the relevance score. Given a search query and a document, the relevance score measures the similarity between the query and document. If we want take a supervised learning approach, then these scores are required to train our classifier. In the example below, relevance scores are assigned a value from 0 to 4, with 4 being the highest degree of relevance.

query_id	doc_id	relevance
0       2277       2
0       373f       4
0       72d3       3
0       39f6       1
0       5ca4       0
0       013c       2
0       2c03       3
0       12d1       0

Example 2: query-document relevance scores.

Relevance scores are collected from users either actively or passively (manual ranking vs. retrieving number of clicks per document from an existing search engine). Either way, these “ground truth” relevance scores are typically limited or difficult to obtain, as they can be collected only through large search engines such as Google, or through labor-intensive annotations.

Our experiment contains a database of 8541 papers academic papers and 80 queries, as well as 3782 query-document relevance scores (the top ~30 query-document pairings for each query).

Unsupervised ranking algorithms

Okapi BM25 is a simple baseline retrieval function to rank a set of documents for a query. Given a query, BM25 uses a bag-of-words approach to rank documents based on query terms that appear in each document. Documents that frequently mention the query terms will be ranked higher than documents that do not mention the query terms at all. Other unsupervised approaches include TF-IDF and language models.

BM25 Ranking Function

More on the BM25 ranking function

Where _f(qi, D)_ is _qi_'s term frequency in the document _D_, _\|D\|_ is the length of document _D_ in words, and avgdl is the average document length in the databse. _k1_ and _b_ are free parameters, chosen in absence of advanced optimization. _IDF(qi)_ is the inverse document frequency weight of the query term _qi_.

In our project, we used metapy to index and rank our documents according to BM25. Metapy also supports language models, as well as other text mining functions that they cover in their documentation. We computed the BM25 score between every document in the database and the query “information retrieval”; the documents with the highest scores are returned below:

query_id	doc_id	bm25_score
0       391b        9.27404975891
0       ea9a        8.99035835266
0       5ca4        8.92638778687
0       72d3        8.78375816345
0       0a85        8.7498960495
0       31e1        8.73628807068
0       9228        8.73235225677
0       d618        8.68366146088

Example 3: BM25 scores for the query "information retrieval".

Compared to the absolute relevance scores from above, we see the matching doc_id 72d3. This document recieved a score of 8.78 for our query, and was given a relevance of 3 out of 4 by users. However, we’ll notice that a lot of the other “ground truth” relevant documents do not show up in the top results from BM25. Incoporating additional scoring methods could improve our results slightly; nonetheless, while unsupervised algorithms like BM25 do meet the baseline for a search engine, there are certainly limitations.

One major limitation of unsupervised ranking algorithms is that queries with a vocabulary gap are not likely to yield satisfactory results. For instance, if the query is “machine learning for images”, a paper about “computer vision” may never intersect with the query term. Similarly, the retail example is a query is “pregnancy clothes,” vs. a listing for “maternity dress”. This is where supervised ranking comes in handy–by interacting with real user data, we can help the classifier learn synonymous terms within our data, and bridge the vocabulary gaps.

Supervised ranking algorithms

You can follow along in our jupyter notebook.

The first step in supervised ranking is to train our classifier. We will be using the well-studied LambdaMART, which builds upon LambdaRANK by adding gradient boosted decision trees.

Our features will be a combination of the relevance scores collected from users (from example 2) as well additional unsupervised attributes calculated in the previous steps (BM25 score, absolute discounting, Jelinek-Mercer, Dirichlet prior, and additional metadata such as number of citations or references).

We used ~70% of the data for training, reserving 15% for testing, and 15% for validation. We tweaked various LambdaMART hyperparameters for the best results (i.e., learning rate, evaluation metric, number of estimators, max leaf nodes, etc). Specific workflow can be recovered from our jupyter notebook.

Evaluation metrics

We evaluate our results using Normalized Discounted Cumulative Gain (NDCG). This is a pretty simple metric, built on the basis that highly relevant documents appearing lower in a result list should be penalized with a grade logarithmically proportional to the position of the result. An NDCG@k score evaluates the top k search results. We can see a good example from a Stanford CS276 course below, given that 3 is the highest possible value:

>>> r = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
>>> ndcg_at_k(r, 1)
>>> r = [2, 1, 2, 0]
>>> ndcg_at_k(r, 4)

Example 4: NDCG calculations.

We computed the NDCG scores for two version of our supervised LambdaMART classifier: one with two features (BM25 + absolute discounting), and one with four features (BM25, absolute discounting, number of key citations, and number of cited by metadata). We also list the NDCG scores from the unsupervised BM25-only ranking for comparison.

Okapi BM25
NDCG@5:	0.4460
NDCG@10:	0.4461
NDCG@20:	0.4615
NDCG@50:	0.5503
NDCG@100:	0.6610

LambdaMART with two features 
NDCG@5:	0.5964
NDCG@10:	0.5326
NDCG@20:	0.5326
NDCG@50:	0.5979
NDCG@100:	0.6730

Test NDCG@20: 0.2955

LambdaMART with four features 
NDCG@5:	0.6645
NDCG@10:	0.6635
NDCG@20:	0.6882
NDCG@50:	0.7458
NDCG@100:	0.7579

Test NDCG@20: 0.5569

Example 5: NDCG results of our experiments. This shows that our classifier returns a lot of the top results in the last 50 spots, because the NDCG jumps significantly from NDCG@50 to NDCG@100.

Overall, we see that using four features with LambdaMART obtains the highest NDCG@100 score, which is expected. We were surprised that the NDCG@100 from the BM25 unsupervised learning experiment actually barely trails the supervised score.

Building a search engine has many components, and this post attempts to summarize the main points. For details on replicating our results and workflow, please check out our jupyter notebook and GitHub repo!

Here’s a list of resources that helped me:

Tweet me @thejiaqiwu if you like this post.