Semantic Search for Similar Articles

Problem:

Finding semantic relationships between pieces of text has always been a challenging problem.

General Solution:

We now have solutions such as BERT Sentence Transformers and Google’s Universal Sentence Encoder (USE) that capture relationships between texts far better than plain tf–idf.

With these new developments in modeling semantic relations, we have a better way of finding related documents for semantic search. To demonstrate this, we are going to use a Trump news dataset from Kaggle.

You can view the full code here, in the Jupyter notebook.

From the dataset, we are trying to find the most related news articles to the title:
‘U.S. states, hospitals plead for help as Trump approves coronavirus aid bill’

Here is a sample of the text:
“U.S. doctors and nurses on the front lines of the coronavirus outbreak pleaded for more equipment to treat a wave of new patients expected to swamp capacity on Friday, shortly before President Donald Trump invoked emergency powers compelling the production of medical equipment. In a later address, Trump further pledged the United States would produce 100,000 ventilators, sending any unneeded units to Italy and other countries with high caseloads. The United States coronavirus-related death toll now ranks sixth worldwide, with over 1,500 fatalities recorded, according to a tally from Johns Hopkins University. Worldwide, confirmed cases of the virus have risen above 590,000, with more than 26,800 deaths.”

The steps to find relationships between texts are:

  1. Computers love numbers, so we transform the text into a matrix representation of the corpus. How we transform the text is key, because some embeddings capture meaning much better than others.
  2. Find relations using a similarity metric such as cosine similarity. In practice, we found that multiplying the embedding matrix by its transpose was much more scalable, with very similar results (see the sketch after this list).
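To make step 2 concrete, here is a minimal sketch in NumPy and scikit-learn (the helper names are ours, not from the notebook): with the embeddings stacked into one matrix, a single multiplication by the transpose scores every pair of articles at once.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def pairwise_relations(embeddings: np.ndarray, use_cosine: bool = False) -> np.ndarray:
    """All pairwise similarities for an (n_docs, dim) embedding matrix."""
    if use_cosine:
        # Normalizes each row to unit length before taking dot products.
        return cosine_similarity(embeddings)
    # The "transpose trick": raw dot products between every pair of rows,
    # computed in a single matrix multiplication.
    return embeddings @ embeddings.T

def most_similar(embeddings: np.ndarray, query_index: int, top_k: int = 5) -> np.ndarray:
    """Indices of the top_k articles most related to the query article."""
    scores = pairwise_relations(embeddings)[query_index]
    scores[query_index] = -np.inf  # exclude the query article itself
    return np.argsort(scores)[::-1][:top_k]
```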

TF-IDF

The typical way to turn text into a matrix is tf–idf. It is fast, simple, and scalable. But because similarity is based on term frequencies rather than meaning, the relationships it finds between news articles are weaker.
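A minimal sketch of the tf–idf approach with scikit-learn (the texts list is a placeholder for the Kaggle articles):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["first article text ...", "second article text ..."]  # placeholder corpus

vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(texts)  # sparse (n_docs, n_terms) matrix

# TfidfVectorizer L2-normalizes each row by default, so multiplying by the
# transpose yields the cosine-similarity matrix directly.
similarities = (tfidf_matrix @ tfidf_matrix.T).toarray()
```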

After transforming the text into vectors and multiplying by the transpose to find relations, here are the results:

[Figure: tf–idf similarity results]

Given the returned titles, there is some relation to the original news article, but the results are not always the most meaningful. The coronavirus connection is there, but the closest matches are about Canada and the rest of the world rather than the U.S., which is not as useful.

BERT

With the emergence of deep learning techniques, we can find stronger relationships between texts than ever before. Let’s see how the relations look with BERT, a relatively new way to represent text. Unlike earlier language representation models, BERT learns deep bidirectional representations, conditioning on both left and right context. Since the model is pre-trained, we can use it out of the box.
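A minimal sketch using the sentence-transformers library (the checkpoint name is one common pre-trained Sentence-BERT model, not necessarily the one in our notebook):

```python
from sentence_transformers import SentenceTransformer

texts = ["first article text ...", "second article text ..."]  # placeholder corpus

# Load a pre-trained Sentence-BERT model and embed every article.
model = SentenceTransformer("bert-base-nli-mean-tokens")
embeddings = model.encode(texts)          # (n_docs, 768) NumPy array
similarities = embeddings @ embeddings.T  # dot products via the transpose
```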

Here are the results:

[Figure: BERT similarity results]

From the results, we are getting stronger relations and a much wider variety of news article titles related to coronavirus in the U.S. The model surfaces more genuinely related articles.

Google USE

Google USE appears to have the best semantic relations, since it was trained on a semantic similarity dataset. Here are the results:

[Figure: Google USE similarity results]

[Figure: semantic similarity diagram. Picture from: Universal Sentence Encoder]

There is not much public information on how the underlying deep learning model works, but it was trained specifically on finding semantic similarity between texts.
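A minimal sketch of loading the encoder from TensorFlow Hub (v4 is one published version of the module; the notebook may pin a different one):

```python
import numpy as np
import tensorflow_hub as hub

texts = ["first article text ...", "second article text ..."]  # placeholder corpus

# Download and load the pre-trained Universal Sentence Encoder.
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embeddings = np.asarray(embed(texts))     # (n_docs, 512) matrix
similarities = embeddings @ embeddings.T  # dot products via the transpose
```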

Given the different ways to embed text, we believe Google USE is the best choice, followed by BERT, because both take advantage of modern deep learning to find relationships between texts like never before. The only downside is that these models are quite memory intensive, but that is a trade-off you may decide to make: accuracy in exchange for speed and memory.

If you are interested in learning how we can implement these techniques in your next business solution, feel free to reach out to us at [email protected]