The SpringML team built a Pneumonia detection model on the Kaggle RSNA Pneumonia detection data set. The model reached a mean average precision of 0.207 on the private leaderboard, which placed it among the top 30 participants. We built a RetinaNet model and implemented post-processing to improve its performance. In this blog, we talk about the data set, our model choices and lessons learned.
Lung Opacities and Pneumonia
Let’s look at a sample image from the data set which shows a chest X-ray.
In an X-ray, tissues with sparse material, such as lungs which are full of air, do not absorb the X-rays and appear black in the image. Dense tissues such as bones absorb the X-rays and appear white in the image. In short –
- Black = Air
- White = Bone
- Grey = Tissue or Fluid
Now let’s look at an image from the data set with lung opacities. Opacity here is loosely defined as any area of the chest X-ray that appears whiter than it should, as shown below. These areas look like “fuzzy clouds”.
Usually the lungs are full of air. When someone has pneumonia, the air in the lungs is replaced by other material – fluids, bacteria, immune system cells, etc. That’s why areas of opacities are areas that are grey but should be more black. When we see them we understand that the lung tissue in that area is probably not healthy.
What makes the problem more challenging is that not all opacities are symptomatic of Pneumonia. The example below shows a lung that is not normal because it has nodular opacities, but these types of opacities are not associated with Pneumonia. Pneumonia-related opacities look more like fuzzy clouds.
About the Data Set
The shared data set has around 21,000 X-ray images similar to those shown above. The images are annotated with bounding boxes that highlight the regions in the X-ray indicative of possible Pneumonia.
However, this data set differs from a typical object detection data set in two ways.
- Many images have no annotations – The competition includes images of healthy lungs as well as images of lungs that are unhealthy but show no sign of Pneumonia. These images are called negative images and make up a big portion of the overall data set.
- The boundary indicative of Pneumonia – A bounding box is used to annotate the boundary of the Pneumonia region, but this can be quite subjective, and there is known variability between radiologists in the interpretation of chest radiographs. One way to improve the competition score was to reduce the predicted bounding box sizes, as that aligns better with the boxes in the test set. More on this later.
The data set provides the X-ray images in DICOM format. We converted these images to JPEG for modeling.
We tried two different object detection models:
First, we trained a Faster RCNN ResNet model from the TensorFlow Object Detection API. One of the gaps we found with the Object Detection API is that it is unable to properly process negative examples. In general it’s not necessary to explicitly include “negative images”: these detection models use the parts of the image that don’t belong to the annotated objects as negatives. However, in this case a big portion of the data set consists of negative images that look different from the positive ones, and the model was not able to train properly on them. The loss function doesn’t penalize detections on negative images. As a result the model produces many false positives, as shown in the image below, where a lung that is not normal but has no Pneumonia is predicted as Pneumonia.
The second model we tried, RetinaNet, is a single-stage detector that uses a Feature Pyramid Network (FPN) and focal loss for training. Convolutional networks produce feature maps layer by layer, and because of the pooling operations the feature maps have a natural pyramidal shape. However, one problem is that there are large semantic gaps between the features at different layers. The high-resolution maps (earlier layers) contain low-level features, which harms their representational capacity for object detection. The Feature Pyramid Network solves this by combining low-resolution, semantically strong features with high-resolution, semantically weak features via a top-down pathway and lateral connections. The net result is a set of feature maps at different scales, taken from multiple levels in the network, which helps both the classifier and the regressor networks.
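To make the top-down pathway concrete, here is a toy sketch in plain Python. It is not the RetinaNet implementation: it uses single-channel feature maps stored as nested lists, assumes the lateral 1x1 convolutions have already been applied, and leaves out the smoothing convolutions. The function names are our own, chosen for illustration.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D feature map (list of rows)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat each row vertically
    return out


def add_maps(a, b):
    """Element-wise sum of two feature maps of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def top_down_merge(laterals):
    """Merge feature maps ordered fine -> coarse.

    Each map is assumed to be half the height and width of the previous one.
    The coarsest map is kept as is; every finer level receives the upsampled
    merged map from the level above, added to its own lateral features.
    """
    merged = [laterals[-1]]
    for lateral in reversed(laterals[:-1]):
        merged.insert(0, add_maps(lateral, upsample2x(merged[0])))
    return merged


# Three toy levels: 4x4 (fine), 2x2, and 1x1 (coarse).
c3 = [[1] * 4 for _ in range(4)]
c4 = [[2] * 2 for _ in range(2)]
c5 = [[3]]
p3, p4, p5 = top_down_merge([c3, c4, c5])
```

In this toy case every cell of the finest map ends up carrying information from all coarser levels, which is the point of the top-down pathway.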
Focal loss is designed to address the extreme imbalance in single-stage object detection, where a very large number of candidate locations are background and only a few contain foreground objects. This imbalance makes training inefficient: most locations are easy negatives that contribute no useful signal, and the massive number of these negative examples overwhelms training and reduces model performance. Focal loss down-weights the loss from well-classified examples so that training focuses on the hard ones.
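The effect is easiest to see on single predictions. The sketch below implements the binary focal loss, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), next to plain cross-entropy. The values alpha = 0.25 and gamma = 2.0 are the defaults from the RetinaNet paper; we are assuming them here for illustration, not reporting the values used in our training run.

```python
import math


def cross_entropy(p, y):
    """Binary cross-entropy for one prediction.

    p: predicted probability of the foreground class, y: 1 (foreground) or 0.
    """
    p_t = p if y == 1 else 1.0 - p
    return -math.log(max(p_t, 1e-7))  # clamp to avoid log(0)


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction (alpha and gamma are assumed defaults)."""
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    p_t = max(p_t, 1e-7)                        # clamp to avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy negative (background scored 0.01) versus a hard negative (scored 0.9).
easy = focal_loss(0.01, 0)
hard = focal_loss(0.9, 0)
print(f"easy negative: CE={cross_entropy(0.01, 0):.5f}  FL={easy:.2e}")
print(f"hard negative: CE={cross_entropy(0.9, 0):.5f}  FL={hard:.5f}")
```

The easy negative's loss shrinks by several orders of magnitude relative to cross-entropy, while the hard negative keeps most of its weight, so thousands of easy background locations no longer dominate the gradient.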
The Keras-based RetinaNet model is able to learn from negative examples. To train it, we first converted the annotations from Kaggle to a format suitable for training. We then trained the RetinaNet model with a ResNet50 backbone for around 25 epochs. The input images were resized to 224×224 and random augmentations were added.
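As an illustration of the annotation conversion, the sketch below turns Kaggle-style label rows into a CSV layout of `path,x1,y1,x2,y2,class_name`, with an all-empty row for negative images. We are assuming this is the layout expected by the Keras RetinaNet CSV generator, that the Kaggle labels have the columns `patientId,x,y,width,height,Target`, and that the original images are 1024×1024. The `images` directory, the `pneumonia` class name and the sample rows are hypothetical.

```python
import csv
import io

ORIG_SIZE = 1024  # assumption: side length of the original X-ray images
NEW_SIZE = 224    # resize target used for training


def convert_rows(kaggle_rows, image_dir="images", class_name="pneumonia"):
    """Convert Kaggle label rows (dicts) to RetinaNet-style CSV rows."""
    scale = NEW_SIZE / ORIG_SIZE
    out = []
    for row in kaggle_rows:
        path = f"{image_dir}/{row['patientId']}.jpg"
        if row["Target"] == "1":
            # Kaggle stores top-left corner plus width/height; convert to
            # corner coordinates and rescale to the resized image.
            x1 = float(row["x"]) * scale
            y1 = float(row["y"]) * scale
            x2 = x1 + float(row["width"]) * scale
            y2 = y1 + float(row["height"]) * scale
            out.append([path, round(x1), round(y1), round(x2), round(y2), class_name])
        else:
            # Negative image: keep the path so the model still sees it,
            # but leave the box and class fields empty.
            out.append([path, "", "", "", "", ""])
    return out


# Hypothetical sample in the Kaggle label layout.
sample = io.StringIO(
    "patientId,x,y,width,height,Target\n"
    "aaa,256,512,256,128,1\n"
    "bbb,,,,,0\n"
)
rows = convert_rows(csv.DictReader(sample))
csv.writer(io.StringIO()).writerows(rows)  # swap in a real file handle to save
```

Keeping the negative images as empty rows is what lets the detector treat the whole image as background during training.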
Model Analysis and Post Processing
The trained RetinaNet model performed quite well – getting to a mean average precision of 0.207 on the Kaggle private leaderboard. We also analyzed model performance by looking at the errors it made in terms of true positive, false positive and false negative predictions.
As shown below, the RetinaNet model correctly made no predictions on images that looked normal.
The model was able to detect Pneumonia in both lungs (shown as green true positive boxes). However, it also returned many extra predictions, shown as red false positive boxes on the left. This is because we kept the score threshold low, at 0.20.
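This kind of error analysis can be sketched as a simple matching routine. Predictions are matched greedily to ground truth boxes by intersection over union (IoU); a matched prediction counts as a true positive, an unmatched prediction as a false positive, and an unmatched ground truth box as a false negative. The IoU threshold of 0.4 and the function names are assumptions for illustration, not the exact analysis code we ran.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_errors(pred_boxes, gt_boxes, iou_thresh=0.4):
    """Return (tp, fp, fn) for one image.

    pred_boxes are assumed to be sorted by descending confidence, so the most
    confident prediction gets first pick of the ground truth boxes.
    """
    matched = set()
    tp = fp = 0
    for pred in pred_boxes:
        best_j, best_iou = None, iou_thresh
        for j, gt in enumerate(gt_boxes):
            if j in matched:
                continue  # each ground truth box can be matched only once
            overlap = iou(pred, gt)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is None:
            fp += 1
        else:
            matched.add(best_j)
            tp += 1
    fn = len(gt_boxes) - len(matched)
    return tp, fp, fn


# Hypothetical image: two annotated regions, one good and one stray prediction.
gt = [(0, 0, 100, 100), (200, 200, 300, 300)]
preds = [(10, 10, 110, 110), (500, 500, 600, 600)]
print(count_errors(preds, gt))
```

On a negative image the ground truth list is empty, so every prediction counts as a false positive.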
Once the model was trained, we found that it performed better with a fairly low score threshold of around 0.20. This resulted in fewer false negatives. However, it also caused the model to predict multiple bounding boxes for each Pneumonia detection. We resolved this by reducing the non-max suppression threshold so that even when boxes overlapped only slightly, only the box with the higher confidence was kept. Overall this strategy worked quite well.
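A minimal sketch of this filtering step is shown below: drop boxes below the score threshold, then run greedy non-max suppression with a low IoU threshold. The score threshold of 0.20 comes from the text above; the NMS threshold of 0.1 and the sample boxes are assumed values for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_detections(boxes, scores, score_thresh=0.20, nms_thresh=0.1):
    """Return indices of boxes kept after score filtering and greedy NMS.

    nms_thresh=0.1 is an assumed value: a low threshold means a box is
    suppressed even if it overlaps a more confident box only slightly.
    """
    # Keep boxes above the score threshold, most confident first.
    order = sorted(
        (i for i, s in enumerate(scores) if s > score_thresh),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep a box only if it barely overlaps every box already kept.
        if all(iou(boxes[i], boxes[j]) <= nms_thresh for j in keep):
            keep.append(i)
    return keep


# Hypothetical detections for one image.
boxes = [(10, 10, 110, 110), (60, 60, 160, 160), (300, 300, 400, 400), (0, 0, 50, 50)]
scores = [0.9, 0.5, 0.3, 0.1]
print(filter_detections(boxes, scores))                  # low NMS threshold
print(filter_detections(boxes, scores, nms_thresh=0.5))  # a more typical threshold
```

With the low threshold the second box is suppressed because it overlaps the first by about 14%, whereas a more typical threshold of 0.5 would keep both.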
Finally, another post-processing technique that worked well was shrinking the predicted boxes a bit. This worked because different radiologists have different opinions on the extent of Pneumonia. Overall, the model’s predicted bounding boxes were somewhat on the larger side, and reducing them resulted in a better score.
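Shrinking a box about its center takes only a few lines. The sketch below scales the width and height by a fixed factor while keeping the center in place. The factor of 0.875 is a hypothetical value for illustration; in practice it would be tuned against validation data.

```python
def shrink_box(box, factor=0.875):
    """Shrink a box (x1, y1, x2, y2) about its center.

    factor is the fraction of the original width and height to keep;
    0.875 is a hypothetical value, not the one used in the competition.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2  # center stays fixed
    half_w = (x2 - x1) * factor / 2
    half_h = (y2 - y1) * factor / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


print(shrink_box((100, 100, 300, 500)))
```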
Working on this challenge was quite fun. The SpringML team learned a lot and performed really well. Our main learnings:
- How Pneumonia shows up in X-ray scans
- Better ways to deal with negative examples
- Training object detection models
- Impact of post-processing on model performance