NLP meets the NBA Finals Game 3: Topic Clustering

The Raptors are back on top after the NBA Finals Game 3, and just as we’ve done for the last couple of games, we want to know what people are saying about it. We’ve looked at Twitter’s reactions to Game 1 and Game 2 through the lens of sentiment analysis, and today we’re adding a new tool from the Natural Language Processing (NLP) toolkit: topic clustering.

Topic clustering seeks to group documents based on meaningful elements of content. Variants of the technique can be used for everything from classifying email spam to separating legal documents into relevant subcategories. In this case, we’re going to see if we can identify some patterns in tweets about the teams, players and coaches involved in the NBA Finals.

A Digression to Discuss Word Embeddings

But first, let’s talk a little about Word Embeddings, a cornerstone of many modern NLP techniques.

In order to find mathematical patterns in text, we first need to convert that text into numbers. The modern way of doing this is to find “word embeddings” for each word in a large vocabulary. Word embeddings are lists of numbers that correspond to each word, such that words that are used similarly have similar numbers, and words that are not used similarly have very different numbers.

If we were inventing our own embeddings for a very specific domain like balls used in sports, we could make up some meaningful attributes like tendency to be thrown, tendency to be kicked, and foulness. A baseball might have an embedding of [0.9, 0.2, 0.8] because we often encounter text about baseballs being thrown, rarely hear about them being kicked, and pretty frequently hear about them being foul (although really that usage of “foul ball” would be associated with the word “ball” since no one says “That’s a foul baseball”, but let’s assume we were always referring to each ball by its full name). A soccer ball might have an embedding of [0.4, 0.9, 0.5] because it’s not thrown as often, is kicked very often, and it’s still associated with fouls. But wait. We say a soccer ball goes “out of bounds,” not that it’s “foul.” Nevertheless, because “fouls” happen a lot in soccer, we have to include some association with the word “foul” in our embedding of “soccer ball.” And of course while “baseball” is one word, “soccer ball” is two, so if we are attaching meaning to individual words, how do we handle that?
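To make that toy example a little more concrete, here’s a minimal sketch of comparing those made-up ball embeddings with cosine similarity (the “basketball” entry and all of the values are invented for illustration):

```python
import numpy as np

# Toy embeddings: [tendency to be thrown, tendency to be kicked, "foulness"].
# All values (and the basketball entry) are made up for illustration.
balls = {
    "baseball":    np.array([0.9, 0.2, 0.8]),
    "soccer ball": np.array([0.4, 0.9, 0.5]),
    "basketball":  np.array([0.8, 0.1, 0.3]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; closer to 1.0 means 'used more similarly'."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(balls["baseball"], balls["basketball"]))   # relatively high
print(cosine_similarity(balls["baseball"], balls["soccer ball"]))  # lower
```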

We can already see how the messiness of language makes coming up with good word embeddings a daunting task, even when we’re confined to the most specific domain I could think of for this example.

Fortunately, a lot of very smart people have been working on this problem since the 1960s, and in addition to helping us group words that are used similarly, they’ve demonstrated some interesting properties, like capitals (“Paris” or “London”) having consistent relationships with their countries (“France” and “England”). There’s also some arithmetic that has been demonstrated, like “King” – “Man” + “Woman” = “Queen”. There has also been some very interesting work on attempting to understand and separate gender bias from word embeddings, which is critical if we want to be able to build systems that don’t carry past human bias into the future.
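If you’d like to see that arithmetic for yourself, here’s a rough sketch using gensim’s pretrained GloVe vectors (the specific model name, “glove-wiki-gigaword-50”, is my choice for a quick demo and not something used in this analysis):

```python
import gensim.downloader as api

# Download a small set of pretrained GloVe word vectors; any pretrained
# word-embedding model containing these words would work for the demonstration.
vectors = api.load("glove-wiki-gigaword-50")

# "king" - "man" + "woman" should land near "queen".
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Capitals relate to their countries in a consistent direction:
# "paris" - "france" + "england" should land near "london".
print(vectors.most_similar(positive=["paris", "england"], negative=["france"], topn=3))
```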

We’ve seen some amazing breakthroughs in the last few years. In 2013, a team from Google led by Tomas Mikolov released embeddings created by a system they called Word2Vec. Since then, we’ve seen GloVe and ELMo embeddings each ascend to the NLP throne, only to be replaced. The current leading embeddings come from a system called BERT, also released by Google. Whereas our toy example above reduced each word to a list of three numbers, BERT embeddings have 768 dimensions.
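To give a sense of what those 768 numbers look like in practice, here’s a minimal sketch of pulling them out of BERT with the Hugging Face transformers library (assuming the standard “bert-base-uncased” model, which isn’t necessarily the exact variant used for this post):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Tokenize one tweet-sized piece of text; rare words may get split into word pieces.
inputs = tokenizer("The Raptors are back on top after Game 3", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional vector per token (including the special [CLS] and [SEP] tokens).
print(outputs.last_hidden_state.shape)  # (1, number_of_tokens, 768)
```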

We’ve already gotten far enough away from our topic of the NBA Finals, but if you’re interested in learning more about the inner workings of BERT, I highly recommend heading over to Jay Alammar’s blog to learn more.

Back to the NBA!

Now, how are we using that to learn more about what people are saying about the game?

We have two challenges left:

  1. We need to combine the lists of numbers BERT gives us for each word (or piece of a word, since it is advanced enough to draw meaning from components of words like ‘ball’ in ‘basketball’) into a single list of numbers representing the whole tweet, since in this example we are interested in comparing tweets, not individual words.
  2. We need to find a way to plot that 768-dimensional representation in 2 dimensions for display on our flat screens.

For the first challenge, we’re going to remove emojis and ‘stopwords’ (commonly used words that don’t provide a lot of meaning), then average the remaining words’ 768-dimensional vectors (or lists of numbers) position by position. That leaves us with a single 768-dimensional vector that sits at the centroid of the points represented by the words we considered meaningful. It’s not the only way to combine these vectors, and more meaningful ways to embed longer passages of text are emerging, but it serves us for this example.
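Here’s a rough sketch of that averaging step, assuming we already have a vector for each token in a tweet (the NLTK stopword list is my stand-in; the emoji removal and tokenization details are omitted):

```python
import numpy as np
from nltk.corpus import stopwords  # requires a one-time nltk.download("stopwords")

STOPWORDS = set(stopwords.words("english"))

def tweet_vector(tokens, token_vectors):
    """Average the 768-dimensional vectors of a tweet's meaningful tokens.

    tokens        : the tweet's lower-cased tokens, emojis already removed
    token_vectors : dict mapping each token to its 768-dimensional vector
                    (from BERT or any other embedding model)
    """
    kept = [token_vectors[t] for t in tokens
            if t in token_vectors and t not in STOPWORDS]
    if not kept:
        # Nothing meaningful survived the filtering; fall back to a zero vector.
        return np.zeros(768)
    # Position-by-position average: the centroid of the remaining token vectors.
    return np.mean(kept, axis=0)
```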

For challenge #2, we turn to another technique called t-Distributed Stochastic Neighbor Embedding or TSNE for short. TSNE is designed to reduce dimensionality for exactly this purpose: plotting high-dimensional representations in a smaller number of dimensions. Instead of simply cutting off the last 766 dimensions and looking at essentially a two-dimensional shadow, TSNE targets a smaller number of dimensions (typically 2) and seeks to preserve the distance relationships between data points as it projects them onto this lower-dimensional canvas. More simply, if points were far away in 768-dimensional space, TSNE wants them to be far away in 2-dimensional space, and if they were close to one another in 768-dimensional space, TSNE tries to keep that proximity.
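In code, that projection might look something like this sketch with scikit-learn’s TSNE (the random placeholder data and the perplexity value are illustrative assumptions, not parameters from this analysis):

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder for the real (n_tweets, 768) array of averaged tweet embeddings.
tweet_vectors = np.random.rand(500, 768)

# Squeeze 768 dimensions down to 2 while trying to preserve neighborhoods.
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
coords_2d = tsne.fit_transform(tweet_vectors)

print(coords_2d.shape)  # (500, 2): one (x, y) point per tweet
```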

Plot 1: 2D Topic Clustering

I’ve clustered these tweets into 30 clusters using an algorithm called K-Means. The clustering was done in 768-dimensional space, and as the points are projected down to 2 dimensions, we can see that TSNE has done a pretty good job of keeping like (similarly colored) data points together.
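For reference, the clustering step might look something like this sketch, continuing from the TSNE example above (30 clusters as described; the plotting details are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Cluster in the full 768-dimensional space, not in the projected 2D space.
kmeans = KMeans(n_clusters=30, random_state=42)
cluster_ids = kmeans.fit_predict(tweet_vectors)  # tweet_vectors from the sketch above

# Color the 2D TSNE projection by the 768-dimensional cluster assignments.
plt.scatter(coords_2d[:, 0], coords_2d[:, 1], c=cluster_ids, cmap="tab20", s=10)
plt.title("Tweet embeddings clustered with K-Means, projected with TSNE")
plt.show()
```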

You should be able to mouse over these plots to see the tweet that each data point represents. See if you can find similarities between the tweets of similar colors that might have caused those tweets to be grouped together. Also notice the limitations: some groupings are meaningful and some aren’t. Note that in this case we didn’t do additional training to fine-tune the BERT embeddings, but in a real client setting we would most likely want to do that.

Plot 2: 1D Topic Clustering (X-Axis) by 1D Sentiment Analysis (Y-Axis)

With this plot, I wanted to bring sentiment analysis back into the mix. This plot asks even more of TSNE by instructing it to reduce the 768-dimensional representations down to just ONE dimension to represent topic, subject matter, or word use (x-axis), while on the y-axis we’re plotting by sentiment, with more positive tweets at the top, and more negative tweets at the bottom. Enjoy, and see what you can find!
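For the curious, here’s a rough sketch of how a plot like this could be assembled, continuing from the earlier sketches (the VADER scorer stands in for whatever sentiment model was used in the Game 1 and Game 2 posts, and “tweets” is assumed to be the list of raw tweet texts):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# x-axis: squeeze the 768-dimensional tweet vectors down to a single dimension.
topic_1d = TSNE(n_components=1, random_state=42).fit_transform(tweet_vectors).ravel()

# y-axis: one sentiment score per tweet ("tweets" holds the raw tweet texts).
analyzer = SentimentIntensityAnalyzer()
sentiment = [analyzer.polarity_scores(text)["compound"] for text in tweets]

plt.scatter(topic_1d, sentiment, c=cluster_ids, cmap="tab20", s=10)
plt.xlabel("1D TSNE projection (topic / word use)")
plt.ylabel("Sentiment (more positive toward the top)")
plt.show()
```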