Searching through documents quickly and finding the relevant information in them can be challenging for today’s organizations. Storing those documents so that the relevant ones can be retrieved quickly is a challenge of its own.
To solve this, we built a search engine for our client that stores political documents and quickly finds the useful ones for policy research. It is built on ElasticSearch, a powerful tool for full-text search. Because ElasticSearch uses an inverted index, which maps each term to the documents that contain it, it can search through large numbers of documents quickly without scanning every one of them.
For this example, we are going to use the movies dataset from Kaggle to demonstrate these capabilities, so that you can apply the same approach to a project of your own.
You can view the full code in the Jupyter Notebook here:
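First, let’s import the Python packages needed to run the code in the Jupyter Notebook:
Running ! netstat -a -n | grep tcp | grep 9200 in a Jupyter Notebook cell checks whether anything is listening on port 9200 of localhost, the default ElasticSearch port, so you can confirm the server is up before connecting to it.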
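If netstat is not available on your machine, the same check can be sketched in plain Python. This is a minimal sketch, not part of the original notebook: the helper name is_port_open is made up for illustration, and it only tests that something accepts TCP connections on the port, not that the listener is actually ElasticSearch.

```python
import socket


def is_port_open(host: str = "localhost", port: int = 9200, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper for illustration; 9200 is the default
    ElasticSearch HTTP port.
    """
    try:
        # create_connection raises OSError if nothing is listening
        # or the attempt times out.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling is_port_open() with no arguments checks localhost:9200 and returns False if the server has not been started yet.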
There are many columns in this dataset, but for the sake of the demo, we are only focusing on the movie’s title, overview, release date, and vote average.
To start using ElasticSearch, we need to create a mapping that describes the information we want to store. In it, we name our fields and give each one a data type. ElasticSearch can detect the data types for you without a mapping, but there are some very specific things we would like to do with the release date that we have to establish manually. We want to treat the “release date” as a date field so that we can run date-based queries on it.
Now, let’s create an index to store the information so that we can look it up later.
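As a rough illustration, a mapping for the four columns could look like the sketch below. The field names (title, overview, release_date, vote_average) and the yyyy-MM-dd date format are assumptions about how the Kaggle columns are named and formatted, and the typeless mapping layout assumes ElasticSearch 7 or later.

```python
# Sketch of an explicit mapping; field names and the date format
# are assumptions about the dataset, not taken from the notebook.
MOVIE_MAPPING = {
    "mappings": {
        "properties": {
            # Full-text fields are analyzed and stored in the inverted index.
            "title": {"type": "text"},
            "overview": {"type": "text"},
            # Declared explicitly so range and decay queries can use it.
            "release_date": {"type": "date", "format": "yyyy-MM-dd"},
            "vote_average": {"type": "float"},
        }
    }
}
```

Declaring release_date as a date is the part that matters here; the other fields would likely be detected correctly even without the mapping.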
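A minimal sketch of the index creation step is shown below. It assumes es is an already connected client from the elasticsearch Python package, that the index is called movies (a name chosen for illustration), and that the client accepts the body argument; newer client versions prefer passing mappings directly.

```python
def create_movies_index(es, index_name="movies", mapping=None):
    """Create the index unless it already exists.

    `es` is assumed to be an elasticsearch.Elasticsearch client;
    `index_name` and this helper itself are hypothetical.
    Returns True if the index was created, False if it was already there.
    """
    if es.indices.exists(index=index_name):
        return False
    # `body=` carries the mapping; newer clients may prefer `mappings=`.
    es.indices.create(index=index_name, body=mapping or {})
    return True
```

Checking for the index first makes the notebook cell safe to re-run without raising an "already exists" error.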
Using a for loop in Python, we load each row of the Pandas DataFrame into ElasticSearch. We are only using a select few columns for this demonstration.
Note: It is important to cast the DataFrame columns to lists before running the loop.
With the information loaded into ElasticSearch, we can search it by date range. For this example, we would like to find movies that were released in 2016:
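The loading loop could be sketched roughly as follows. It assumes es is a connected client, that the four columns have already been converted to plain Python lists, and that the row position is used as the document ID; the helper name and the movies index name are made up for illustration.

```python
def index_movies(es, titles, overviews, release_dates, vote_averages,
                 index_name="movies"):
    """Index one document per row and return how many were sent.

    All arguments after `es` are plain lists of equal length
    (for example, DataFrame columns converted with .tolist()).
    """
    rows = zip(titles, overviews, release_dates, vote_averages)
    count = 0
    for doc_id, (title, overview, release_date, vote_average) in enumerate(rows):
        document = {
            "title": title,
            "overview": overview,
            "release_date": release_date,
            "vote_average": vote_average,
        }
        # `body=` works on older clients; newer ones may prefer `document=`.
        es.index(index=index_name, id=doc_id, body=document)
        count += 1
    return count
```

With a DataFrame named df (an assumed name), the call would look like index_movies(es, df["title"].tolist(), df["overview"].tolist(), df["release_date"].tolist(), df["vote_average"].tolist()). For larger datasets, the bulk helper in elasticsearch.helpers is usually a faster option than one request per row.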
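One way to express that search is a range query on the date field. The sketch below only builds the request body; the field name release_date and the helper name are assumptions.

```python
def build_year_range_query(year: int, size: int = 5) -> dict:
    """Build a search body matching documents released within `year`."""
    return {
        "size": size,  # number of hits to return
        "query": {
            "range": {
                "release_date": {
                    "gte": f"{year}-01-01",  # first day of the year, inclusive
                    "lte": f"{year}-12-31",  # last day of the year, inclusive
                }
            }
        },
    }
```

The body can then be passed to the client, for example es.search(index="movies", body=build_year_range_query(2016)), assuming a client version that accepts body.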
Here are the top 5 results:
Adding a Decay Function
We can also add a decay function, which lowers a movie’s score the further its release date is from a chosen origin date, which means older movie titles will be weighted less.
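A decay can be attached through a function_score query. The sketch below uses the gauss decay on an assumed release_date field; the default values for origin, scale, offset, and decay are illustrative guesses, not the settings used in the original notebook.

```python
def build_decay_query(origin="now", scale="365d", offset="0d",
                      decay=0.5, size=5):
    """Build a function_score body that down-weights movies whose
    release date is far from `origin` (parameter values are examples)."""
    return {
        "size": size,
        "query": {
            "function_score": {
                # Base query; match_all scores every document equally.
                "query": {"match_all": {}},
                "functions": [
                    {
                        "gauss": {
                            "release_date": {
                                "origin": origin,
                                "scale": scale,
                                "offset": offset,
                                "decay": decay,
                            }
                        }
                    }
                ],
                # Multiply the base score by the decay factor.
                "boost_mode": "multiply",
            }
        },
    }
```

In practice the match_all part would usually be swapped for a real text query, so that the decay re-ranks relevant results instead of ordering purely by date.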
Depending on the scale, offset, and decay, you will receive slightly different results as shown:
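To get a feel for how those three parameters interact, the Gaussian decay described in the ElasticSearch documentation can be reproduced in a few lines of Python. This is a sketch for intuition only: distances are plain numbers (think days from the origin), and the function name is made up.

```python
import math


def gauss_decay(distance, scale, offset=0.0, decay=0.5):
    """Score multiplier for a value `distance` away from the origin.

    Within `offset` of the origin the multiplier stays at 1.0; at a
    distance of `offset + scale` it has dropped to exactly `decay`.
    """
    # Variance chosen so that the curve equals `decay` at `scale`.
    sigma_sq = -(scale ** 2) / (2.0 * math.log(decay))
    effective = max(0.0, abs(distance) - offset)
    return math.exp(-(effective ** 2) / (2.0 * sigma_sq))
```

For example, gauss_decay(365, scale=365) gives 0.5 with the default decay, while widening the scale to 730 leaves a movie at the same distance with a noticeably higher multiplier.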
Finding Similar Movies
We can find similar movies with the “more like this” function in ElasticSearch. But first, let’s pull up some information about a movie using its index number. Using index 10, we pull up the movie “Blade Runner”.
Using the same index number, we use the “more like this” function to find other similar movies based on their movie overview:
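A request of that kind might be sketched as below, using the more_like_this query with a reference to an already indexed document. The index name movies, the overview field, and the tuning values min_term_freq and max_query_terms are assumptions chosen for illustration.

```python
def build_more_like_this_query(doc_id, index_name="movies", size=5):
    """Build a body that finds documents whose overview resembles
    the overview of the document with ID `doc_id`."""
    return {
        "size": size,
        "query": {
            "more_like_this": {
                "fields": ["overview"],  # compare on the overview text only
                "like": [{"_index": index_name, "_id": str(doc_id)}],
                "min_term_freq": 1,      # short texts rarely repeat terms
                "max_query_terms": 12,
            }
        },
    }
```

Passing build_more_like_this_query(10) to es.search would ask for the five overviews closest to the one stored under ID 10.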
Here are the top 5 most similar movies:
You can also use the same function to look up a particular word in the overview:
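The more_like_this query also accepts free text in place of a document reference, so a word lookup can be sketched the same way. The helper name and the lowered min_doc_freq are assumptions; the search word is left as a parameter because the original word is not given here.

```python
def build_like_text_query(text, size=5):
    """Build a more_like_this body that matches overviews against
    free text (a single word or a longer phrase)."""
    return {
        "size": size,
        "query": {
            "more_like_this": {
                "fields": ["overview"],
                "like": text,
                "min_term_freq": 1,  # a single word appears only once
                "min_doc_freq": 1,   # do not ignore rare words
            }
        },
    }
```

For a plain single-word lookup, a simple match query on the overview field would work just as well.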
We can add fuzziness, which gives some leeway for a misspelled word. We can also add a highlight, which italicizes the word we are searching for in the results. As you can see, ‘warz’ is spelled wrong, but with a fuzziness of one the query still matches the correctly spelled word, which differs by a single letter.
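Both options can be combined in one request. The sketch below assumes a match query on the overview field with a fuzziness of 1, plus a highlight section; by default ElasticSearch wraps highlighted terms in em tags, which browsers render in italics. The helper names are made up for illustration.

```python
def build_fuzzy_highlight_query(word, fuzziness=1, size=5):
    """Build a body that tolerates `fuzziness` character edits in
    `word` and asks for highlighted overview fragments."""
    return {
        "size": size,
        "query": {
            "match": {
                "overview": {"query": word, "fuzziness": fuzziness}
            }
        },
        # Empty settings: use the default highlighter and <em> tags.
        "highlight": {"fields": {"overview": {}}},
    }


def extract_highlights(response):
    """Collect highlighted overview fragments from a search response."""
    fragments = []
    for hit in response["hits"]["hits"]:
        fragments.extend(hit.get("highlight", {}).get("overview", []))
    return fragments
```

Calling extract_highlights on the result of es.search(index="movies", body=build_fuzzy_highlight_query("warz")) would return the overview snippets with the matched word marked up.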
This is just a small sample of what you can do with ElasticSearch using movie overviews.
More typical use cases would include:
- Legal Documents
- Immigration Paperwork
- Government PDFs
- Scientific Research Papers
- Corporate Information
If you would like to learn more about how we can use ElasticSearch for your business and even run this on GCP, please contact us for more information.