This blog post is meant to help you develop new features for your next data science model. There is a lot of information on the internet about feature selection and how models work, but not much on how to actually generate the features for the model in the first place. Data science is touted as the sexiest job of the 21st century, but unless a clean dataset with plenty of useful features is ready for you, most of the work in data science could involve this very task. This was a task I took on for one of our most recent clients, in a project on detecting airport behavior.
There is no exact science to feature engineering, but I boiled it down to a few key things:
1) Stay curious
This takes a child-like curiosity, which is the most important part of the process. My main objective was to find good features that clearly separate airports. Is it runway length? Is it the number of runways? Is it airport size? Is it the presence of a control tower? Keep asking yourself questions and stay curious.
2) Educate yourself on the domain space.
Some of the features were obvious from the beginning, such as the ICAO and IATA codes that identify the airports. But reading about the domain definitely helped me come up with good features that were not obvious to me at first, such as the traffic pattern altitude, runway gradient, glide slope indicator, and the cross height. I looked up sites such as SkyVector, AirportGuide, and AirNav to learn more about airport data. I studied the terminology to understand why these websites were collecting each piece of information, which inspired ideas for more features.
3) Any idea is fair game
At this stage, any idea for a feature can be good. No idea is a bad idea for now. It is more important at this point to generate as many ideas as possible than to throw any away. There are plenty of ways to test whether a feature is relevant later down the road.
4) Collect the data
Once I decided on an idea for a feature, I checked whether there was enough information about it for all the airports on the list. For example, when I was looking for runway length, I gathered as much information as I could from different web sites and wrote code to scrape it. Even if a feature is sparse, it can still be useful, and there are techniques to handle sparse data later down the road. For now, it is best to go to town on collecting data for the features you have decided on. Don't get stuck on a single idea for a feature either. You never know what could be helpful later in the modeling process.
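To make the "is there enough information for all the airports" check concrete, here is a minimal sketch that reports what fraction of airports have a value for each candidate feature. The function name `feature_coverage`, the field names, and the sample records are all made up for illustration; they are not the data or code from the project.

```python
def feature_coverage(records, feature_names):
    """Return the fraction of records that have a non-missing value per feature.

    A value counts as missing when the key is absent or set to None.
    """
    total = len(records)
    coverage = {}
    for name in feature_names:
        present = sum(1 for record in records if record.get(name) is not None)
        coverage[name] = present / total if total else 0.0
    return coverage


# Made-up records with placeholder identifiers, for illustration only.
airports = [
    {"id": "AAA", "runway_length_ft": 8000, "has_control_tower": True},
    {"id": "BBB", "runway_length_ft": None, "has_control_tower": False},
    {"id": "CCC", "runway_length_ft": 5200},
    {"id": "DDD", "runway_length_ft": 10500, "has_control_tower": None},
]

coverage = feature_coverage(airports, ["runway_length_ft", "has_control_tower"])
for name, fraction in coverage.items():
    print(f"{name}: {fraction:.0%} of airports have a value")
```

A low fraction is not a reason to drop the feature at this stage. It only tells you how much missing data you will have to deal with later.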
5) Use features to make other features.
Existing features can be used to generate new ones. For example, I took the cross height and the glide slope angle and used trigonometry to find the air distance, which is the distance the airplane travels above the runway before it lands. A colleague of mine used existing data to derive median values of airport activity at 20-minute intervals. Other features could be generated with this method as well, but there was limited time to experiment with the concept.
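The two derived features above can be sketched roughly as follows. This is an illustration under assumptions, not the project's actual code. I assume the air distance is the horizontal leg of a right triangle whose vertical leg is the cross height and whose angle is the glide slope, so distance = height / tan(angle). I also assume the activity data is a list of (timestamp, value) pairs grouped into fixed clock-aligned buckets. The function names and sample numbers are hypothetical.

```python
import math
from collections import defaultdict
from datetime import datetime
from statistics import median


def air_distance_ft(cross_height_ft, glide_slope_deg):
    """Horizontal distance covered while descending from the cross height
    to the runway surface along a straight glide path (assumed geometry)."""
    if not 0 < glide_slope_deg < 90:
        raise ValueError("glide slope must be between 0 and 90 degrees")
    return cross_height_ft / math.tan(math.radians(glide_slope_deg))


def median_by_interval(observations, minutes=20):
    """Group (timestamp, value) pairs into fixed-length buckets and return
    the median value per bucket, keyed by the bucket's start time.

    `minutes` should divide 60 evenly so buckets line up with the hour.
    """
    buckets = defaultdict(list)
    for timestamp, value in observations:
        bucket_start = timestamp.replace(
            minute=timestamp.minute - timestamp.minute % minutes,
            second=0,
            microsecond=0,
        )
        buckets[bucket_start].append(value)
    return {start: median(values) for start, values in sorted(buckets.items())}


# Made-up inputs, for illustration only.
distance = air_distance_ft(cross_height_ft=50, glide_slope_deg=3.0)
print(f"air distance: {distance:.0f} ft")

activity = [
    (datetime(2020, 1, 1, 10, 5), 4),
    (datetime(2020, 1, 1, 10, 15), 6),
    (datetime(2020, 1, 1, 10, 25), 10),
]
medians = median_by_interval(activity)
for start, value in medians.items():
    print(start.strftime("%H:%M"), value)
```

With a 50 ft cross height and a 3 degree glide slope, the derived air distance comes out to roughly 954 ft. Neither input is an obvious feature on its own, but the combination describes something physical about the approach.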
By staying curious, educating yourself about the domain, and not being afraid to try new ideas, you can generate lots of great features for your next predictive model. The web scraping and the feature selection process can be automated, but brainstorming ideas for new features is still a challenge.