Being a data science enthusiast, I spend a lot of my time researching what new and exciting things are going on in the data science and machine learning realm. I actively read research papers and tutorial websites to try to differentiate the methods used by experienced data scientists from those of people who just like to flaunt complex model architectures. In my search for data science tutorials, I have noticed a tendency among some data science bloggers to throw some data into a ridiculously complex neural network, achieve great performance on the dataset, and call it a day.
To them, my professors say: “A good feature is much more valuable than a great model.”
To them, I say: “No, you do not have to flex a bidirectional, convolutional, long short-term memory network with dynamic dropout layers to estimate house prices. Housing prices can be estimated pretty accurately using clean, highly informative features (such as location, square footage, yard size, etc.) and a linear regression model.”
An unnecessarily high level of model complexity is often neither computationally nor predictively optimal, a property governed by the Bias-Variance Tradeoff. This tradeoff describes the “Occam’s Razor”-esque relationship between model complexity and the total error of the model. As seen in the image below, the way to achieve optimal performance is to choose a model that is complex enough to accurately capture the relationships in the data, but not so complex that it learns the relationships in the training data too well and then fails in new, real-world scenarios (a problem called overfitting).
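To make the tradeoff concrete, here is a minimal sketch (all data and constants are synthetic and illustrative): fitting polynomials of increasing degree to noisy quadratic data shows an overly simple model underfitting and an overly complex one fitting the training noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example: the true relationship is quadratic, plus noise.
def true_fn(x):
    return x ** 2

x_train = np.linspace(-1, 1, 20)
y_train = true_fn(x_train) + rng.normal(0, 0.3, x_train.shape)
x_test = np.linspace(-1, 1, 200)
y_test = true_fn(x_test)

errs = {}
for degree in (1, 2, 15):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    errs[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Degree 1 is too rigid (high error everywhere), degree 15 drives the training error down while doing worse on held-out points, and degree 2, which matches the true complexity, generalizes best.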
If you do have access to data that captures all of the relationships that exist between features in the entire population in question, then a complex model can provide state-of-the-art results in real-world applications. The success of real-world deep learning deployments can be seen in areas like autonomous vehicles and medical imaging, among many others.
The caveat of these successes lies in the fact that these deployments have gone through thousands of hours of development, and all have had access to (and produce) an immense volume of data. For example, each self-driving car on the road produces and consumes about 40 terabytes of data in 8 hours of driving, or roughly 75% of the data that the average person produces in a year. Waymo, whose self-driving program dates back to 2009, has accumulated a fleet of 600 vehicles with a collective 1,400 years of driving experience.
The amount of data generated in these drives is enough to capture the complexities that exist in the real-world driving setting, and therefore the products achieve fantastic results. However, if you do not have access to the data and work hours required to develop deployable deep learning models, it does NOT mean that you are out of luck! By effectively using feature engineering to transform your data, you can uncover hidden trends or variables and remove the pesky noise that clouds the true relationships that exist in the real world. Once you have the improved features, you may be able to achieve a much more accurate model.
The use cases for cleaning and transforming your data are abundant and can be used on a variety of kinds of data. As can be seen in the image to the right, data scientists spend the majority of their time collecting, cleaning and organizing their data in a way that makes it easier for models to capture the relationships in the data.
Rather than relying on more complicated models to extract information out of features, data scientists perform operations on the data to make the features more informative. Let’s look at a couple kinds of transformations that data scientists use to enhance the quality of their data to achieve better performance with their model.
1. Discretization
Discretization is a method that puts the values of a continuous variable into bins, turning it into a discrete variable. For example, discretizing height measurements could result in bins of under 5 feet, 5–6 feet, and over 6 feet.
There are pros and cons to using discretization. By using the bin values instead of the actual height values, less weight is given to a person who happens to be, say, 8 feet tall and could otherwise throw off the learning process of some algorithms. Some machine learning algorithms also benefit from drawing inferences from discrete data rather than continuous data. It is known, however, that discretizing data can result in a feature that is less informative. Predicting the shirt size of a person based only on knowing that they are between 5 and 6 feet tall would basically get you nowhere, whereas knowing the person’s actual height could give a decent indicator of what shirt size they would wear.
The loss of information that occurs with discretization can be mitigated by choosing a larger number of buckets. One way to choose the optimal number of buckets is to pick the smallest number of buckets for which the variance of the binned data is at least 80% of the variance of the original data. While the percentage can be raised or lowered based on the data, the basic idea is to preserve most of the information present in the data while still reaping the benefits of discretization.
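A minimal sketch of this 80% rule, assuming equal-width bins and encoding each value as the mean of its bin (other binning and encoding choices are equally valid; the height data is synthetic):

```python
import numpy as np

def discretize(values, n_bins):
    """Equal-width binning; each value is replaced by the mean of its bin."""
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # digitize returns 1..n_bins (and n_bins+1 for the max value); clip and shift to 0-based
    idx = np.clip(np.digitize(values, edges), 1, n_bins) - 1
    binned = np.empty_like(values, dtype=float)
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            binned[mask] = values[mask].mean()
    return binned

def smallest_bin_count(values, keep=0.80, max_bins=50):
    """Smallest number of bins whose binned variance keeps at least
    `keep` of the original variance."""
    target = keep * np.var(values)
    for k in range(2, max_bins + 1):
        if np.var(discretize(values, k)) >= target:
            return k
    return max_bins

rng = np.random.default_rng(42)
heights_in = rng.normal(68, 4, 1000)  # hypothetical heights in inches
k = smallest_bin_count(heights_in)
print(f"smallest bin count preserving 80% of variance: {k}")
```

Because each value is replaced by its bin's mean, the binned variance can never exceed the original variance (law of total variance), so the rule searches upward until enough information is preserved.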
In the experiment above, data was discretized and then appended onto the original data, so the algorithms were trained on the continuous and discrete data together. As can be seen in the results, adding discrete data to the dataset can be beneficial for some algorithms, but harmful to others.
2. Decomposition
Depending on the type of data you are working with, running decomposition transformations on your data can be beneficial for different reasons. Essentially, decomposing your data gives you a better sense of what is going on behind the scenes. You are able to identify the most significant components in the dataset, and you can use that knowledge to reduce dimensionality or identify hidden trends or variables.
One example of where decomposing your data could prove beneficial is when you’re working with a time series, one of the most common kinds of modeling projects in the sales realm. Whenever you’re looking to forecast sales, demand, etc., a time series model could be used. The issue with building a time series model is that there can be a “signal” in the background influencing the behavior of the data without you knowing it. A surprising and timely example of hidden influence in time series data can be seen in the S&P 500 Index’s valuation change over the course of 2020.
When looking at the S&P 500 valuation over the time period shown above, it looks like the index made a gain of about 8%, which happened to be 3% higher than the pre-2020 high. When the largest 5 stocks in the index are removed, however, the rest of the index actually lost 4% of its valuation. In this case, the hidden trends in the data were strong enough to cause a double-digit swing in percentage points, turning a mild loss into a significant gain.
When you have prior knowledge about the influence that these companies have, it’s easy to see just how much impact the individual stocks have on the price. However, it is often unknown how many individual influences act on the data and just how much they impact the result. This is where decomposition methods come in, one of the most common being Principal Component Analysis (PCA). PCA decomposes a dataset into its most important components, allowing you to discard the rest. One issue with PCA is that a “component” does not necessarily refer to a single feature but rather to some combination of multiple features. If a component is not highly correlated with any one feature, then it is hard to interpret its values in the same context as before. Despite this loss of interpretability, PCA has been shown to significantly improve training speed while keeping similar accuracy in some cases, while others have seen large increases in accuracy.
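As a rough sketch of what PCA does, the snippet below uses synthetic data (a single hidden factor driving three observed features, with made-up coefficients) and computes PCA directly via SVD rather than through any particular library:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset: 3 observed features all driven by 1 hidden factor, plus noise.
hidden = rng.normal(size=(500, 1))
X = np.hstack([2.0 * hidden, -1.0 * hidden, 0.5 * hidden]) \
    + rng.normal(0, 0.1, (500, 3))

# PCA = SVD of the mean-centered data matrix
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Fraction of total variance explained by each component
explained = S ** 2 / np.sum(S ** 2)
print("explained variance ratios:", np.round(explained, 4))

# Project onto the first principal component: 3 features reduced to 1
X_reduced = Xc @ Vt[0]
```

Because one hidden factor generates nearly all the variation, the first component captures almost all of the variance, and the dataset can be reduced from three columns to one with little information lost. Note that `Vt[0]` mixes all three original features, which is exactly the interpretability caveat mentioned above.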
3. Kernel Functions
Sometimes data can follow distributions that don’t seem to have any clear relationships, but a significant trend can be found by passing the data through a kernel function. Kernel functions are used in machine learning to transform the distribution of the data so that it is better suited for learning. Support vector machines use kernel functions extensively to create linearly separable data, but kernel-style transformations can be used to change the distribution of the data for any algorithm.
In the image above, the results of kernelization are shown for a classification problem. The same methods can be used for regression. Since linear regression models assume normally distributed errors, transformations can be used to push your data toward satisfying that assumption. If there is an intrinsic normal distribution hiding in your data, then choosing the correct transformation could result in data that can be modeled easily.
The image to the left is a scatterplot of population and area for every sovereign country and dependent territory. From visualizing the raw data, it may not be apparent that a relationship exists. There is a clear relationship, however, after both variables are put through a log transformation. The transformation revealed the hidden (but strong) relationship that exists between the features, making it possible to use a simple linear regression model to accurately estimate area by using only population data.
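A hedged sketch of this kind of log transformation with synthetic stand-in data (the power-law exponent and noise level are invented for illustration, not the real country figures):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic country-level data: area follows a rough power law in
# population, with multiplicative (lognormal) noise.
population = 10 ** rng.uniform(4, 9, 200)                        # 10k to 1B people
area = 0.05 * population ** 0.9 * 10 ** rng.normal(0, 0.3, 200)  # km^2

# In raw units the scatter spans orders of magnitude; after taking logs
# of both variables, the relationship becomes a straight line.
log_pop, log_area = np.log10(population), np.log10(area)
slope, intercept = np.polyfit(log_pop, log_area, 1)
r_log = np.corrcoef(log_pop, log_area)[0, 1]
print(f"log-log fit: slope {slope:.2f}, correlation {r_log:.3f}")
```

After the transform, a simple linear regression recovers the underlying power-law exponent, which would be invisible to a straight-line fit on the raw values.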
Wouldn’t it be great if all data had easily findable hidden relationships as informative as the ones I mentioned in this post? Obviously, real-world data contains problems that don’t exist in these examples. The methods addressed in this piece are just a small sample of the tools that are available. There are a vast number of data science tutorials and blogs that post great content on how to transform raw data into useful, informative features. So, before you decide on using a complex model, take the time to first explore your data. You never know what you may find!