With the 2019 hurricane season underway, we were curious whether we could build an accurate model that quickly classifies tweets as relevant or not relevant, so that disasters can be understood in real time.
SpringML’s Model to Predict Disasters Using Social Media
Previous reports have shown that Twitter can predict disasters as well as government agencies including FEMA and the CDC. For this use case, we used the original dataset featured on Kaggle and created two models:
- A Traditional Machine Learning Model
- A Recurrent Neural Network Model in TensorFlow 2.0
Traditional Machine Learning Model Process:
Clean the dataset
We used regular expressions (regex) to clean the dataset. We also removed punctuation and stopwords and applied stemming before vectorizing the tweets.
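As a rough illustration of these cleaning steps, here is a minimal sketch using only the Python standard library. The stopword list, the regex patterns, and the `crude_stem` helper are all assumptions made for the example; in particular, `crude_stem` is a simple stand-in for a real stemmer and is not the one used for the model in this post.

```python
import re
import string

# Tiny illustrative stopword list (an assumption; a real pipeline would
# likely use a fuller list from an NLP library).
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

URL_RE = re.compile(r"https?://\S+")      # links
MENTION_RE = re.compile(r"@\w+")          # @user mentions
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def crude_stem(word):
    """Strip a few common suffixes. A stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_tweet(text):
    """Lowercase, drop URLs/mentions/punctuation/stopwords, then stem."""
    text = URL_RE.sub(" ", text.lower())
    text = MENTION_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(crude_stem(tok) for tok in tokens)


print(clean_tweet("Wildfires are spreading in the hills! http://t.co/abc @CALFIRE"))
```

Removing punctuation also strips the `#` from hashtags, so a hashtag is kept as an ordinary token.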
Vectorize and split dataset
Computers love numbers, but not text, so the next step was to transform the tweets into a numeric matrix representation. We then split the dataset into 80 percent for training and 20 percent for testing.
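A minimal sketch of this step with scikit-learn might look like the following. The post does not say which vectorizer was used, so `TfidfVectorizer` is an assumption, and the tweets and labels are made-up toy data. The sketch splits first and fits the vectorizer on the training text only, which is a common precaution and not something stated in this post.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy, made-up tweets (already cleaned) and labels: 1 = relevant, 0 = not.
tweets = [
    "wildfire spread near town", "flood water rising fast",
    "earthquake shakes city centre", "hurricane makes landfall tonight",
    "tornado destroys several homes", "love this new song",
    "traffic was a disaster today", "my exam was an earthquake of stress",
    "watching a movie tonight", "pizza for dinner again",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# 80 percent train, 20 percent test, keeping the class balance in both parts.
train_texts, test_texts, y_train, y_test = train_test_split(
    tweets, labels, test_size=0.2, random_state=42, stratify=labels
)

# Fit the vocabulary on the training text only, then reuse it for the test set.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

print(X_train.shape, X_test.shape)
```

Each row of the resulting sparse matrix is one tweet and each column is one term from the training vocabulary.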
The Support Vector Classifier (SVC) with a linear kernel performed best, because the two classes were clearly linearly separable in high-dimensional space.
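To make the idea concrete, here is a small sketch of a linear-kernel SVC trained on vectorized text with scikit-learn. The training texts, the labels, and the choice of `TfidfVectorizer` are assumptions for illustration only; this is not the exact configuration of the model described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy, made-up training data: 1 = relevant, 0 = not relevant.
train_texts = [
    "wildfire spreading near the town",
    "flood waters rising downtown",
    "earthquake damages buildings",
    "great pizza for dinner",
    "watching a movie with friends",
    "new song is on repeat",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Vectorizer and classifier chained so raw text can be passed straight in.
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
model.fit(train_texts, train_labels)

predictions = model.predict(["wildfire near the highway", "pizza and a movie"])
print(predictions)
```

A linear kernel is a natural first choice for text because the vectorized features are already high-dimensional and sparse.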
Evaluating Model Results
Using the confidence score included in the dataset, we removed tweets with a confidence below 0.7. The dataset shrank from 10,876 to 8,193 tweets, but accuracy improved greatly.
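A filter like this is a one-liner in pandas. The sketch below uses a tiny made-up DataFrame, and the column names `choose_one` and `choose_one:confidence` are assumptions about how the label and its confidence are stored; adjust them to match your copy of the dataset.

```python
import pandas as pd

# Made-up rows standing in for the real dataset; column names are assumed.
df = pd.DataFrame({
    "text": ["fire near school", "this party is fire", "flood warning issued",
             "my phone is dead", "earthquake felt downtown"],
    "choose_one": ["Relevant", "Not Relevant", "Relevant",
                   "Not Relevant", "Relevant"],
    "choose_one:confidence": [1.0, 0.65, 0.7, 0.52, 0.91],
})

MIN_CONFIDENCE = 0.7

# Keep only rows whose label confidence is at least the cutoff.
filtered = df[df["choose_one:confidence"] >= MIN_CONFIDENCE].reset_index(drop=True)

print(len(df), "->", len(filtered))
```

The trade-off is the one described above: fewer training examples, but cleaner labels.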
Metrics for the final model:
Classification Report for further breakdown per class:
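For readers who want to produce the same kind of breakdown, here is a minimal sketch with scikit-learn. The label lists are made-up toy values chosen for illustration; they are not the results of the model in this post.

```python
from sklearn.metrics import accuracy_score, classification_report

# Made-up labels for illustration only: 1 = Relevant, 0 = Not Relevant.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)

# Per-class precision, recall, and F1, labelled with readable class names.
report = classification_report(
    y_true, y_pred, target_names=["Not Relevant", "Relevant"]
)

print("accuracy:", accuracy)
print(report)
```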
Here were some challenges:
- The same thing is written in many different ways. For example, wildfires appear as ‘#FingerRockFire’, ‘#CalWildfires’, ‘wildfire’, and ‘wild fires’.
- Word ambiguity was a problem. For example, ‘Armageddon’ is a movie title, but it also refers to the end of times, which can confuse the model.
Classifying Tweets with TensorFlow 2.0
We also experimented with text classification in TensorFlow 2.0 to compare it against traditional machine learning techniques. This model uses a recurrent neural network (RNN) to learn how the sequence of words in a tweet is used to make a prediction.
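For orientation, a small RNN text classifier in Keras could be sketched as below. The layer sizes, the vocabulary size, and the use of a bidirectional LSTM are assumptions for illustration; the post does not specify the architecture that was actually trained.

```python
import numpy as np
import tensorflow as tf

VOCAB_SIZE = 1000  # hypothetical vocabulary size
SEQ_LEN = 20       # hypothetical number of token ids per tweet

model = tf.keras.Sequential([
    # Map each integer token id to a dense 64-dimensional vector.
    tf.keras.layers.Embedding(VOCAB_SIZE, 64),
    # Read the sequence in both directions and keep the final states.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation="relu"),
    # Single sigmoid unit: a score between 0 and 1 for the "relevant" class.
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Random token ids stand in for two encoded tweets.
batch = np.random.randint(1, VOCAB_SIZE, size=(2, SEQ_LEN))
scores = model(batch).numpy()
print(scores.shape)

# Training would then look like:
# model.fit(train_ids, train_labels, epochs=10, validation_data=(val_ids, val_labels))
```

The names `train_ids`, `train_labels`, `val_ids`, and `val_labels` in the final comment are placeholders for encoded tweets and their labels.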
Encoding the data was a challenge, but we processed the text by adapting the code from here.
Here is an example:
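The sketch below is not the adapted code referred to above. It is a minimal standard-library illustration of what encoding text for an RNN involves: building a vocabulary, mapping tokens to integer ids, and padding to a fixed length. The helper names and the reserved ids are assumptions made for the example.

```python
from collections import Counter

PAD_ID = 0  # reserved id used to pad short tweets
UNK_ID = 1  # reserved id for words not in the vocabulary


def build_vocab(texts, max_size=10000):
    """Map the most frequent tokens to integer ids, starting at 2."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    return {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common(max_size))}


def encode(text, vocab, max_len):
    """Turn a tweet into exactly max_len integer ids (truncate or pad)."""
    ids = [vocab.get(tok, UNK_ID) for tok in text.lower().split()][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


vocab = build_vocab(["fire near the school", "the fire is out"])
encoded = encode("fire near the lake", vocab, max_len=6)
print(encoded)
```

Here "lake" never appeared when the vocabulary was built, so it is mapped to the unknown-word id, and the last two positions are padding.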
Here are the results after training for 10 epochs:
We were never able to get accuracy any higher than 75% no matter what we tried.
Here are some visualizations:
Here are some predictions from the TensorFlow model:
This tweet scores below the 0.5 confidence threshold, so it is classified as ‘Not Relevant’.
This tweet is classified as ‘Relevant’ because it scores above the 0.5 confidence threshold.
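The thresholding rule can be written as a tiny helper. The function name and the example scores are made up for illustration, and treating a score of exactly 0.5 as ‘Not Relevant’ is an assumption, since only the cases above and below the threshold are described here.

```python
THRESHOLD = 0.5


def label_from_score(score, threshold=THRESHOLD):
    """Convert a model score in [0, 1] into a class label."""
    # Assumption: a score exactly at the threshold counts as 'Not Relevant'.
    return "Relevant" if score > threshold else "Not Relevant"


for score in (0.12, 0.87):
    print(score, "->", label_from_score(score))
```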
Traditional Machine Learning Model Summary
- Pro: Fast, highly accurate, and predicted well on new tweets.
- Con: No sequential analysis of sentences.
TensorFlow RNN Classifier Summary
- Pro: No complicated text processing steps to prepare the dataset for modeling.
- Con: Not enough information for the RNN model to learn the sequence of a disaster tweet.
The clear winner is traditional machine learning with SVC due to its high accuracy and simplicity.
Classifying tweets with TensorFlow was a great learning exercise. The framework is extremely powerful, but the RNN approach doesn’t work well with this type of data.
Better use cases would be:
- Medical records
- Government files
- Legal documents
If you are interested in other ways we analyze Twitter data, read our NBA Finals Twitter Analysis Recap, which covers all 6 games of the 2019 NBA Finals.
We’d like to hear from you
If you are growing your business and want to stay competitive in this rapidly changing marketplace with the latest in analytics, Machine Learning, and AI, contact us at [email protected] or tweet us @springmlinc.