This blog was co-authored by SpringML data scientists Chaitanya Nagpal and Ashish Agarwal.
The world is currently facing a global pandemic caused by the novel coronavirus. Even with governments locking down entire countries, the number of infected patients continues to rise. As of March 30th, 2020, at the time of writing, there are more than 770,000 cases across the world.
In this blog post, we explore how deep learning techniques developed for image analysis and classification can be leveraged to detect whether lungs have been infected by the coronavirus.
This is experimental in nature and showcases SpringML’s ability to leverage public datasets to build Machine Learning models quickly. It’s not intended for diagnostic purposes.
For any machine learning problem, good-quality labeled data is key. COVID-19 lung scan datasets are currently limited, but the best collection we found is the GitHub project COVID-19 open-source dataset. It consists of COVID-19 images scraped from publicly available research, as well as lung images of other pneumonia-causing conditions such as SARS, Streptococcus, and Pneumocystis.
For our detection task, we kept only the COVID-19 X-ray scans with a posteroanterior (PA) view of the lungs. To create a balanced dataset, we added X-ray scans of healthy individuals from Kaggle’s Chest X-Ray Images (Pneumonia) dataset.
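The selection step can be sketched in a few lines of Python. This is only an illustration, not the exact script we ran: the column names `filename`, `finding`, and `view`, and the sample rows, are assumptions about how the dataset's metadata file is laid out.

```python
import csv
import io

# Hypothetical metadata excerpt. The column names and values below are
# assumptions for illustration, not a copy of the real metadata file.
SAMPLE_METADATA = """filename,finding,view
a.png,COVID-19,PA
b.png,COVID-19,AP
c.png,SARS,PA
d.png,COVID-19,PA
"""


def select_covid_pa(metadata_file):
    """Return filenames of rows labeled COVID-19 with a PA view."""
    reader = csv.DictReader(metadata_file)
    return [
        row["filename"]
        for row in reader
        if row["finding"] == "COVID-19" and row["view"] == "PA"
    ]


selected = select_covid_pa(io.StringIO(SAMPLE_METADATA))
print(selected)
```

With a real metadata file, the same function could be called on an open file handle instead of the in-memory sample.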
After gathering our dataset, we were left with 138 images in total, split equally between 69 COVID-19 positive X-rays and 69 healthy patient X-rays. Of the two images shown below, the left one is from a patient who is positive for COVID-19 and the right one is from a healthy individual.
Since the number of images is limited, we augmented the dataset using image pre-processing techniques such as flipping and kernel sharpening. After augmentation, we split the images as follows:
Training – 432 images
Validation – 48 images
Testing – 36 images
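To make the augmentation step concrete, here is a minimal sketch of flipping and kernel sharpening on a grayscale image held as a NumPy array. The specific 3x3 kernel, the edge padding, and the choice of four variants per image are assumptions for illustration; they are not necessarily the settings that produced the counts above.

```python
import numpy as np

# A common 3x3 sharpening kernel (an assumption, chosen for illustration).
# Its weights sum to 1, so flat regions keep their brightness.
SHARPEN_KERNEL = np.array(
    [[0, -1, 0],
     [-1, 5, -1],
     [0, -1, 0]],
    dtype=np.float32,
)


def flip_horizontal(image):
    """Mirror a 2-D image left to right."""
    return image[:, ::-1]


def sharpen(image):
    """Apply the 3x3 sharpening kernel to a 2-D uint8 grayscale image."""
    height, width = image.shape
    # Repeat edge pixels so the output has the same size as the input.
    padded = np.pad(image.astype(np.float32), 1, mode="edge")
    out = np.zeros((height, width), dtype=np.float32)
    for dy in range(3):
        for dx in range(3):
            out += SHARPEN_KERNEL[dy, dx] * padded[dy:dy + height, dx:dx + width]
    return np.clip(out, 0, 255).astype(np.uint8)


def augment(image):
    """Return the original image plus flipped and sharpened variants."""
    flipped = flip_horizontal(image)
    return [image, flipped, sharpen(image), sharpen(flipped)]


demo = np.arange(12, dtype=np.uint8).reshape(3, 4)
variants = augment(demo)
print(len(variants))
```

Augmented copies of the same scan are highly correlated, so it is safer to split the original images into training, validation, and test sets first and augment each set separately.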
We used two approaches to build models for COVID-19 detection. The first is a convolutional neural network (CNN) built with Keras and TensorFlow. The second uses AutoML on Google Cloud Platform.
For the first model, we created a CNN with several convolutional layers as follows.
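For readers who want something to experiment with, below is a minimal Keras sketch of a small binary-classification CNN. It is not the exact architecture we used: the input size, the number of layers, the filter counts, and the function name `build_cnn` are all assumptions made for illustration.

```python
def build_cnn(input_shape=(224, 224, 3)):
    """Build and compile a small CNN for two-class image classification.

    Layer sizes and the input shape are illustrative assumptions.
    """
    # Imported inside the function so this sketch can be loaded even on
    # machines where TensorFlow is not installed.
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        # A single sigmoid unit: output near 1 means "COVID-19 positive".
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model


# Training would then look roughly like this, where train_images,
# train_labels, val_images and val_labels are hypothetical arrays:
# model = build_cnn()
# model.fit(train_images, train_labels, epochs=25,
#           validation_data=(val_images, val_labels))
```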
The model was compiled with the Adam optimizer and trained for 25 epochs. At the end of training, we obtained the following results:
Training loss: 0.0238
Training accuracy: 0.9907
Validation loss: 0.0598
Validation accuracy: 0.9792
We then tested the model on 36 unseen images and achieved a precision of 96 percent.
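Precision is the fraction of scans predicted as positive that really are positive, so it is worth being explicit about how it is derived from predictions. The sketch below uses a small made-up set of labels, not our test results, and the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision(tp, fp):
    """Share of positive predictions that are correct (0.0 if none were made)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Toy labels for illustration only: 1 = COVID-19 positive, 0 = healthy.
toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0]

tp, fp, fn, tn = confusion_counts(toy_true, toy_pred)
print(tp, fp, fn, tn, precision(tp, fp))
```

In a screening setting, recall (how many truly positive scans are caught) matters at least as much as precision, so both are worth reporting.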
For the AutoML approach, we uploaded the COVID-19 and normal images to a Google Cloud Storage (GCS) bucket and trained a model using AutoML Vision.
We were able to achieve a precision of 0.989.
On unseen images, the model classified all but 1 of the 36 scans correctly. The confusion matrix is shown below:
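AutoML Vision image imports are typically driven by a CSV file that lists each image's Cloud Storage URI next to its label. The sketch below builds such a file in memory. The bucket name, folder layout, and file names are hypothetical, and the two-column layout is our assumption about the import format, so check the current AutoML Vision documentation before relying on it.

```python
import csv
import io


def build_import_csv(image_names_by_label, bucket):
    """Return CSV text with one "gs://... ,label" row per image.

    Assumes images are stored under gs://<bucket>/<label>/<name>,
    which is a hypothetical layout chosen for this example.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for label, names in sorted(image_names_by_label.items()):
        for name in names:
            writer.writerow([f"gs://{bucket}/{label}/{name}", label])
    return buffer.getvalue()


# Hypothetical bucket and file names.
csv_text = build_import_csv(
    {"covid": ["a.png"], "normal": ["b.png"]},
    bucket="my-bucket",
)
print(csv_text)
```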
Although both models perform really well on unseen data, the dataset is still too small to draw firm conclusions. This is very much a proof of concept, and we are trying out a few other options to improve the models further. We will post updates here as we get more results.
Check out a demo of this quick turnaround capability in Google Cloud on YouTube delivered by SpringML’s CTO, Girish Reddy.
SpringML delivers data-driven digital transformation outcomes with an experimentation and design thinking mindset. We provide Google Cloud consulting and implementation services and industry-specific analytics solutions that deliver high-impact business value from data. SpringML is a premier Google Cloud partner with capabilities to plan, assess, deploy, and manage data-driven engagements. We have been awarded Google Cloud specialization based on our expertise and customer portfolio for Data Analytics, Machine Learning, and Marketing Analytics.