Movie Overviews in ElasticSearch

Problem:

Quickly searching large collections of documents and finding the relevant information in them can be challenging for today’s organizations. Storing those documents so that the relevant ones can be retrieved quickly is a challenge as well.

General Solution:

To solve this, we built a search engine that lets our client quickly store and find political documents useful for policy research. It is built on ElasticSearch, a powerful tool for full-text search. Because ElasticSearch uses an inverted index (a lookup from each term to the documents that contain it), it can search through large numbers of documents quickly and return the results of ElasticSearch queries.
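To make the inverted index idea concrete, here is a toy sketch in plain Python. It is not how ElasticSearch is implemented (the real engine adds text analysis, relevance scoring, and on-disk structures), and the function names build_inverted_index and search are made up for this illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Set


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each lowercased term to the set of document IDs containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return the IDs of documents that contain every term in the query."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

Because a lookup only touches the entries for the query terms, the cost depends on how many documents contain those terms rather than on the size of the whole collection.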

For this example, we are going to use the movies dataset from Kaggle to demonstrate these capabilities. That way, you can apply the same approach to a project of your own.

You can view the full code in the Jupyter Notebook here:

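First, let’s import the Python packages needed to run the code in a Jupyter Notebook:

[Image: Elastic Search]

Running ! netstat -a -n | grep tcp | grep 9200 in a Jupyter Notebook cell checks that something is listening on port 9200, the default ElasticSearch HTTP port, so you know there is a local instance to connect to.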
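The same check can be sketched in Python with only the standard library. The helper name port_is_open is hypothetical, and the host and port defaults assume a local ElasticSearch instance on its default port.

```python
import socket


def port_is_open(host: str = "localhost", port: int = 9200, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

If port_is_open() returns True, the Python client can then be pointed at the same address, for example Elasticsearch("http://localhost:9200") with the elasticsearch package (assuming a recent client version and no authentication on the local instance).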

Loading Data

There are many columns in this dataset, but for the sake of the demo, we are only focusing on the movie’s title, overview, release date, and vote average.

[Image: Loading Data]
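As a rough sketch of this step, the snippet below keeps only the four columns of interest. It uses the standard-library csv module rather than Pandas so that it runs without extra packages. The column names title, overview, release_date, and vote_average are assumptions; check them against the header of your CSV file.

```python
import csv
from typing import Dict, Iterable, List, TextIO

# Assumed column names; adjust to match the dataset's header row.
COLUMNS = ("title", "overview", "release_date", "vote_average")


def load_movies(csv_file: TextIO, columns: Iterable[str] = COLUMNS) -> List[Dict[str, str]]:
    """Read an open CSV file and keep only the requested columns per row."""
    wanted = list(columns)
    reader = csv.DictReader(csv_file)
    return [{name: row[name] for name in wanted} for row in reader]
```

With Pandas, the equivalent is to pass the same column names to the usecols argument of read_csv.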

Mapping

To start using ElasticSearch, we need to create a mapping that describes the information we want to store: the field names and their data types. ElasticSearch can detect the data types for you without a mapping, but there are some very specific things we would like to do with the dates, so we have to set that part up manually. In particular, we want the “release date” to be treated as a date.

[Image: Mapping]
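A mapping along these lines might look like the sketch below. The field names and the yyyy-MM-dd date format are assumptions based on the four columns used in this demo, not a copy of the original notebook.

```python
import json

# Field names are assumptions based on the four columns used in this demo.
MOVIE_MAPPINGS = {
    "properties": {
        "title": {"type": "text"},
        "overview": {"type": "text"},
        # Declared explicitly so that range queries and date decay work on it.
        "release_date": {"type": "date", "format": "yyyy-MM-dd"},
        "vote_average": {"type": "float"},
    }
}

# The mapping is plain JSON, so it can be printed or saved for inspection.
print(json.dumps(MOVIE_MAPPINGS, indent=2))
```

Declaring release_date as a date is the part that matters here; the other three types are close to what ElasticSearch would infer on its own.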

Now, let’s create an index for storing the information to look up later.

[Image: index name]
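Creating the index could be sketched as follows. The helper create_index is hypothetical. It assumes a recent (8.x) Python client, where the mapping is passed through the mappings keyword; older clients take it inside body instead.

```python
def create_index(es, name, mappings):
    """Create the index if it does not exist yet; return True if it was created.

    `es` is expected to be an elasticsearch.Elasticsearch client instance.
    """
    if es.indices.exists(index=name):
        return False
    es.indices.create(index=name, mappings=mappings)
    return True
```

A call might look like create_index(es, "movies", mappings), where "movies" is a placeholder index name and mappings is a dictionary of field definitions. Checking for the index first makes the notebook cell safe to re-run.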

Using a for loop in Python, we load each row of the Pandas DataFrame into ElasticSearch. We are only using a selected subset of columns for this demonstration.

Note: It is important to cast the DataFrame columns as lists before running the loop.

[Image: Dataframe columns]
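The loading loop can be sketched like this. The function index_movies and its argument names are hypothetical, the row position is used as the document ID as an assumption, and the document keyword assumes a recent Python client (older versions use body).

```python
def index_movies(es, index_name, titles, overviews, release_dates, vote_averages):
    """Index one document per row; return the number of documents sent.

    Each argument after `index_name` is a plain Python list, one per column.
    """
    count = 0
    rows = zip(titles, overviews, release_dates, vote_averages)
    for doc_id, (title, overview, release_date, vote_average) in enumerate(rows):
        document = {
            "title": title,
            "overview": overview,
            # An empty string is not a valid date, so send null instead.
            "release_date": release_date or None,
            "vote_average": vote_average,
        }
        es.index(index=index_name, id=doc_id, document=document)
        count += 1
    return count
```

With a DataFrame named df, the lists would come from calls such as df["title"].tolist(). For large datasets, the bulk helper in elasticsearch.helpers is usually faster than one request per row.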

Date Range

With the data loaded into ElasticSearch, we can filter on a date range, as shown below. For this example, we would like to find movies that were released in 2016:

[Image: Movie Overviews in ElasticSearch]
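A date range filter of this kind can be expressed as a range query. The sketch below builds the query as a plain dictionary; the helper name released_in_year and the field name release_date are assumptions.

```python
def released_in_year(year: int, field: str = "release_date") -> dict:
    """Build a range query matching dates from January 1 to December 31."""
    return {
        "range": {
            field: {
                "gte": f"{year}-01-01",
                "lte": f"{year}-12-31",
            }
        }
    }


query_2016 = released_in_year(2016)
```

With a recent Python client, the query could be run with something like es.search(index="movies", query=query_2016, size=5), where "movies" is a placeholder index name and size limits the response to five hits.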

Here are the top 5 results:

[Image: Deadpool]

Adding a Decay Function:

We can also add a decay function so that older movies are weighted less in the ranking.
Depending on the scale, offset, and decay values you choose, you will get slightly different results, as shown:

[Images: Adding a decay function, images 1 to 4]
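To show how scale, offset, and decay interact, the sketch below computes the Gaussian decay multiplier from the formula in the ElasticSearch function score documentation and builds a matching function_score query. The helper names, the release_date field, and the default values are assumptions, and the original notebook may have used a different decay shape (linear or exp).

```python
import math


def gauss_decay(distance_days, scale_days, offset_days=0.0, decay=0.5):
    """Gaussian decay multiplier for a document `distance_days` from the origin.

    Returns 1.0 within `offset_days` of the origin and exactly `decay`
    at a distance of `offset_days + scale_days`.
    """
    sigma_sq = -(scale_days ** 2) / (2.0 * math.log(decay))
    effective = max(0.0, abs(distance_days) - offset_days)
    return math.exp(-(effective ** 2) / (2.0 * sigma_sq))


def recency_query(base_query, origin="now", scale="365d", offset="0d", decay=0.5):
    """Wrap `base_query` so documents dated far from `origin` score lower."""
    return {
        "function_score": {
            "query": base_query,
            "functions": [
                {
                    "gauss": {
                        "release_date": {
                            "origin": origin,
                            "scale": scale,
                            "offset": offset,
                            "decay": decay,
                        }
                    }
                }
            ],
        }
    }
```

A larger scale makes the score fall off more slowly, a larger offset widens the window in which no penalty applies, and a smaller decay makes the penalty at the scale distance harsher.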

Finding Similar Movies

We can find similar movies with the “more like this” function in ElasticSearch. But first, let’s pull up some information about a movie using its index number. Using index 10, we pulled up the movie “Blade Runner”.

[Image: finding similar movies image 1]

Using the same index number, we use the “more like this” function to find other similar movies based on their movie overview:

[Image: Movie Overviews in ElasticSearch image 2]
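A more_like_this query that starts from an already indexed document can be sketched as below. The helper name similar_to_document is hypothetical, and the min_term_freq and max_query_terms values are illustrative choices rather than the settings used in the original notebook.

```python
def similar_to_document(index_name, doc_id, fields=("overview",)):
    """Build a more_like_this query seeded with an indexed document."""
    return {
        "more_like_this": {
            "fields": list(fields),
            # Reference the seed document by index and ID instead of pasting its text.
            "like": [{"_index": index_name, "_id": str(doc_id)}],
            "min_term_freq": 1,
            "max_query_terms": 12,
        }
    }
```

For the example in the text, this would be called as similar_to_document("movies", 10), with "movies" standing in for whatever index name was used, and the result passed as the query of a search request.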

Here are the top 5 most similar movies:

[Image: Movie Overviews in ElasticSearch image 3]

You can also use the same function to look up a particular word in the overview:

[Image: field overview]

Results:

[Image: field overview image 2]
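When the input is a single word rather than a stored document, the query takes free text. One detail worth knowing: min_term_freq defaults to 2, so a word that appears only once in the input text is ignored unless the value is lowered to 1. The sketch below assumes the overview field; the helper name is hypothetical.

```python
def overviews_like_text(text, fields=("overview",)):
    """Build a more_like_this query seeded with free text."""
    return {
        "more_like_this": {
            "fields": list(fields),
            "like": text,
            # A single search word occurs once in the input, so allow frequency 1.
            "min_term_freq": 1,
            # Also keep terms that appear in only a few documents.
            "min_doc_freq": 1,
        }
    }
```

For a plain keyword lookup, a match query on the same field is a common alternative.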

Additional Things:

We can add fuzziness, which gives some leeway for misspelled words. We can also add highlighting, which italicizes the search term in the results. As you can see, ‘warz’ is spelled wrong, but with fuzziness the query still matches words that differ from it by one letter.

[Image: additional things]
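Both options can be combined in one search request, sketched below as a plain dictionary. The helper name fuzzy_overview_search and the overview field are assumptions. A fuzziness of 1 allows one character to be changed, added, or removed.

```python
def fuzzy_overview_search(text, fuzziness=1):
    """Build a search body with a fuzzy match query and highlighting."""
    return {
        "query": {
            "match": {
                "overview": {
                    "query": text,
                    "fuzziness": fuzziness,
                }
            }
        },
        # By default, matched terms are wrapped in <em> tags, which render as italics.
        "highlight": {"fields": {"overview": {}}},
    }


body = fuzzy_overview_search("warz")
```

With a recent Python client, the two parts can be passed as es.search(index="movies", query=body["query"], highlight=body["highlight"]), where "movies" is a placeholder index name. Each hit that matched then carries a highlight entry with the marked-up fragments.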

This is just a small sample of what you can do with ElasticSearch using movie overviews.

More typical use cases would be:

  • Legal Documents
  • Immigration Paperwork
  • Government PDFs
  • Scientific Research Papers
  • Corporate Information

If you would like to learn more about how we can use ElasticSearch for your business and even run this on GCP, please contact us for more information.

If you are interested in learning how we can implement these techniques into your next business solution, feel free to reach out to us at [email protected]