Loan data analysis using Azure ML

Here’s my first AzureML model. The model uses a two-class logistic regression algorithm for binary classification. It is based on a Python-based sample, and it leverages various AzureML Studio components as well as custom R code for data cleansing and feature engineering.

Here are the various steps I followed to get the model working.

  1. Since I am still new to AzureML, I first recreated the model in my local RStudio environment. This is not required, but it helped me get comfortable and made learning AML Studio quicker.
  2. Data exploration. I imported the data into AzureML as a dataset. I knew there were several things I needed to do to clean up the data. I did this as a two-step process:
    1. I used an R script to apply some functions to the data, e.g. to remove the % sign at the end of the interest rate value
    2. I used the “Metadata Editor” AML Studio component to convert some columns from String to Integer
  3. Handling missing values. AML Studio has a component called “Clean Missing Data.” I used it to remove every row that has any missing values. The nice thing about this component is that it offers other “cleaning modes”; for example, it can replace missing values using MICE (Multivariate Imputation by Chained Equations).
  4. I did some further data cleanup using an R script. I noticed that the monthly income column had some outliers, so I used a simple R script to calculate the interquartile range (IQR) and remove the outliers.
  5. For model creation, I used the “Two Class Logistic Regression” component and trained the model on the training data, which was created using the Split function. The trained model was then scored on the test data and evaluated.
  6. The “Evaluate Model” step provides an easy way to get various model metrics, including AUC.
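
The cleaning steps above (stripping the % sign, dropping rows with missing values, and removing income outliers) can be roughly sketched outside AML Studio. The post did this with R scripts and Studio components that are not shown here, so the Python below is only a stand-in. The column names, the sample values, and the 1.5 x IQR cutoff are all assumptions made for illustration, not details from the original experiment.

```python
from statistics import quantiles

# Hypothetical sample rows; column names and values are illustrative only.
raw_rows = [
    {"interest_rate": "8.90%", "monthly_income": "3000"},
    {"interest_rate": "12.12%", "monthly_income": "3500"},
    {"interest_rate": "21.98%", "monthly_income": "4000"},
    {"interest_rate": "9.99%", "monthly_income": "4500"},
    {"interest_rate": "11.71%", "monthly_income": "5000"},
    {"interest_rate": "15.31%", "monthly_income": "5500"},
    {"interest_rate": "7.90%", "monthly_income": "6000"},
    {"interest_rate": "17.14%", "monthly_income": "100000"},  # likely outlier
    {"interest_rate": "19.72%", "monthly_income": ""},        # missing value
]


def parse_rate(value):
    """Strip a trailing % sign and convert to float; None if blank."""
    text = value.strip()
    return float(text.rstrip("%")) if text else None


def parse_number(value):
    """Convert a numeric string to float; None if blank."""
    text = value.strip()
    return float(text) if text else None


def clean_rows(rows):
    """Parse both columns, then drop any row that has a missing value."""
    parsed = [
        {
            "interest_rate": parse_rate(r["interest_rate"]),
            "monthly_income": parse_number(r["monthly_income"]),
        }
        for r in rows
    ]
    return [r for r in parsed if None not in r.values()]


def remove_income_outliers(rows, k=1.5):
    """Keep rows whose income lies within [Q1 - k*IQR, Q3 + k*IQR].

    k=1.5 is a common convention and an assumption here; the post does
    not say which cutoff its R script used.
    """
    incomes = [r["monthly_income"] for r in rows]
    q1, _, q3 = quantiles(incomes, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [r for r in rows if low <= r["monthly_income"] <= high]


cleaned = clean_rows(raw_rows)
filtered = remove_income_outliers(cleaned)
print(len(raw_rows), len(cleaned), len(filtered))
```

Dropping incomplete rows before computing the quartiles mirrors the order of the steps in the list: missing values are handled first, then outliers are removed from what remains.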

In the next post I will go into a little more detail on certain areas of the model creation process described above.