This blog focuses on feature selection and data exploration.
AzureML provides an easy way to visualize data. Histograms and boxplots are two of the most common ways to visualize data and understand its attributes. A histogram is a graphical representation of the distribution of numerical data: it groups the values into bins and shows how many fall into each one, which gives an estimate of the probability distribution of a continuous (quantitative) variable. A boxplot summarizes the same data through its quartiles. The body of the boxplot consists of a “box” (hence the name), which goes from the first quartile (Q1) to the third quartile (Q3). In a horizontally drawn boxplot, a vertical line inside the box marks Q2, the median of the data set, and two horizontal lines, called whiskers, extend from either end of the box toward the low and high ends of the data.
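To make these definitions concrete, here is a minimal R sketch that computes the numbers behind a histogram and a boxplot without drawing anything. The `rates` vector is made up for illustration and is not taken from the loans data set.

```r
# Hypothetical sample of interest rates (percent); values are invented.
rates <- c(7.9, 8.9, 10.2, 11.1, 11.7, 12.1, 13.1, 13.5, 14.3, 15.3, 17.8, 21.0)

# Histogram: count how many observations fall into each 5-point bin,
# without drawing the plot.
h <- hist(rates, breaks = seq(5, 25, by = 5), plot = FALSE)
print(h$counts)

# Boxplot ingredients: Q1, the median (Q2), and Q3.
q <- quantile(rates, probs = c(0.25, 0.5, 0.75))
print(q)

# boxplot.stats() returns the five numbers a boxplot is drawn from:
# lower whisker, lower hinge, median, upper hinge, upper whisker.
# The hinges can differ slightly from quantile()'s default quartiles.
b <- boxplot.stats(rates)
print(b$stats)
```

The bin width of 5 is an arbitrary choice; a different width changes the shape of the histogram but not the quartiles.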
You can also write R scripts to create plots. For example, the chart below is created with the following line of code:
boxplot(as.numeric(Interest.Rate)~as.numeric(FICO.Score),data=loansData.tmp, main="Box plot grouped by FICO score", xlab="FICO Score", ylab="Interest Rate %")
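If you want to try the call outside AzureML, the sketch below builds a small synthetic stand-in for `loansData.tmp`. The column names match the call above, but the values and the relationship between score and rate are assumptions made purely for illustration. Because the synthetic columns are already numeric, the `as.numeric()` conversions are left out.

```r
# Hypothetical stand-in for loansData.tmp; the real data set is not shown here.
set.seed(42)
fico <- rep(c(660, 700, 740, 780), each = 25)
loansData.tmp <- data.frame(
  FICO.Score = fico,
  # Invented relationship: higher scores get lower rates, plus random noise.
  Interest.Rate = 40 - 0.035 * fico + rnorm(length(fico), sd = 1.5)
)

# One box per distinct FICO score.
bp <- boxplot(Interest.Rate ~ FICO.Score, data = loansData.tmp,
              main = "Box plot grouped by FICO score",
              xlab = "FICO Score", ylab = "Interest Rate %")

# boxplot() invisibly returns the per-group statistics it drew.
print(bp$names)       # group labels
print(bp$stats[3, ])  # median interest rate per group
```

Keeping the return value in `bp` is optional; it is handy when you want the medians or whisker values as numbers in addition to the picture.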
AzureML provides various options for preprocessing data. Preprocessing makes discovery easier and, more importantly, makes the results more accurate.
In my experiment, one of the components I used is the Clean Missing Data module. It provides several ways to deal with missing data, as shown below.
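For readers who prefer to see the idea in code, two common strategies for missing values (dropping incomplete rows and substituting the column mean) can be sketched in plain R. This is only an illustration under assumed data, not how the module is implemented; the `loans` data frame and its values are hypothetical.

```r
# Hypothetical data frame with gaps; the values are invented.
loans <- data.frame(
  Interest.Rate = c(8.9, NA, 13.1, 15.3, NA, 11.7),
  FICO.Score    = c(735, 715, NA, 670, 695, 720)
)

# Strategy 1: drop every row that contains at least one missing value.
rows_removed <- na.omit(loans)

# Strategy 2: replace each missing value with the mean of its column.
mean_filled <- loans
for (col in names(mean_filled)) {
  missing <- is.na(mean_filled[[col]])
  mean_filled[[col]][missing] <- mean(mean_filled[[col]], na.rm = TRUE)
}

print(nrow(rows_removed))           # rows that survive strategy 1
print(colSums(is.na(mean_filled)))  # remaining NAs per column after strategy 2
```

Dropping rows keeps only fully observed records at the cost of losing data, while mean substitution keeps every row but shrinks the variance of the filled column. Which trade-off is acceptable depends on how much data is missing.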