In this blog post, we talk about a recent project where we built a TensorFlow Object Detection model to classify among 30 unique classes. This project was interesting both for the unique challenges it posed and for what we learned from it.
In this project we were provided around 450 retail shelf images, annotated with 30 classes. Each class was a unique product on the shelf. Each image could contain multiple classes and many products of the same class – for example, 20 Coke cans and 12 Sprite bottles. Our goal was to use this data to train a TensorFlow Object Detection model that could correctly identify each class and its bounding box location with 90%+ accuracy. Sounds tough, right?
We were able to successfully build an object detection model that detected the 30 classes on the test set with an accuracy above 90%. We used a Faster R-CNN Inception model from the TensorFlow Model Zoo.
Let’s talk about the main challenges we faced.
We received shelf images from the client with a JSON file for every image that held the annotations (bounding box coordinates for each item). These images were very high resolution – 4032×3024 pixels, taken with an iPhone 8 camera. This is both good and bad: high resolution images provide a lot of fine detail that is useful for the model, but the image size makes it hard to feed them into a multi-layered model like Faster R-CNN Inception. We decided to downsize the images to a quarter of the original size by resizing the width and height to half. The PIL resize function provides an easy way to do this, and we saved the resized images to visually inspect their quality. But what about the annotations? They were relative to the original image. To ensure the annotations translated correctly, we scaled both the x and y coordinates to half as well. As an extra check, we used OpenCV to draw the rescaled annotations on the resized images to confirm they fit perfectly.
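The resize-and-rescale step above can be sketched as follows. The annotation format (a list of `[xmin, ymin, xmax, ymax]` boxes in original-image pixels) is an assumption for illustration; the original project read these from the client's JSON files.

```python
from PIL import Image

def resize_image_and_boxes(image, boxes, scale=0.5):
    """Resize a PIL image by `scale` and rescale its bounding boxes to match.

    `boxes` is a list of [xmin, ymin, xmax, ymax] lists in pixel coordinates
    of the original image.
    """
    new_size = (int(image.width * scale), int(image.height * scale))
    resized = image.resize(new_size, Image.LANCZOS)
    # Box coordinates shrink by the same factor as the image dimensions.
    scaled_boxes = [[coord * scale for coord in box] for box in boxes]
    return resized, scaled_boxes
```

After this step, drawing `scaled_boxes` on `resized` with `cv2.rectangle` gives the visual sanity check mentioned above.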
In a perfect world, all 30 classes would have the same number of annotations. However, our real-world dataset was quite imbalanced. The more popular products appeared often on the shelves and had many more annotations than the infrequently occurring ones. We did some class balancing to help the model generalize. The first thing we ensured was that all 30 products had a minimum number of annotations, which was around 100 for our use case. Then the top 3 most frequent classes were undersampled by selectively writing annotations to TFRecords.
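The undersampling logic can be sketched as below. The annotation tuple shape and the cap value are illustrative assumptions; in the actual pipeline the surviving annotations would then be serialized into TFRecords.

```python
import random
from collections import defaultdict

def undersample_annotations(annotations, cap, seed=0):
    """Cap the number of annotations kept per class before writing TFRecords.

    `annotations` is a list of (image_id, class_name, box) tuples. Classes
    with more than `cap` annotations are randomly subsampled down to `cap`;
    smaller classes are kept in full.
    """
    random.seed(seed)
    by_class = defaultdict(list)
    for ann in annotations:
        by_class[ann[1]].append(ann)
    kept = []
    for cls, anns in by_class.items():
        if len(anns) > cap:
            anns = random.sample(anns, cap)
        kept.extend(anns)
    return kept
```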
Once the model was trained, we could run it on images to get class and bounding box predictions. Since we had 30 classes, some of which looked quite similar, how do we check the quality of the predictions? We did that by looking at the ground truth image and the predicted image at the same time. An easy trick for doing this was to use np.hstack to stack the 2 images side by side – the ground truth with classes and bounding boxes on the left, and the predicted image with classes and bounding boxes on the right.
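A minimal sketch of this side-by-side view, assuming both images are HxWx3 uint8 arrays of the same height with boxes already drawn on them (the white separator column is a small addition for readability, not from the original post):

```python
import numpy as np

def side_by_side(gt_img, pred_img, gap=10):
    """Stack ground truth (left) and prediction (right) with a white gap."""
    separator = np.full((gt_img.shape[0], gap, 3), 255, dtype=np.uint8)
    return np.hstack([gt_img, separator, pred_img])
```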
We also wanted to assess the accuracy of the model across the whole test set by looking at the true positive, false positive and false negative rates by class. Let’s first define the terms:
True positive – an item in the ground truth image has the same item at the same location in the predicted image.
False positive – an item in the predicted image is not at the corresponding location in the ground truth image. This could be either because that item is not in the ground truth, or it is there but associated with a different class.
False negative – an item is present in the ground truth image but not at the corresponding location in the predicted image. Again, this could be because the item is not present, or it is present but associated with a different class.
As you can imagine, computing these metrics is tricky, since we need to compare bounding boxes across ground truth and predicted images. The predicted image may have a bounding box in the same location, but the shape of the box might be different or the box might be slightly offset. There are multiple ways of comparing 2 rectangles. One option is to look at their area of overlap: if it is above a certain threshold, the 2 boxes are considered comparable. Another approach is to compare the centers of the 2 rectangles: if their Euclidean distance is less than a threshold, they are comparable. We chose the latter option in our calculation and picked the threshold by experimenting with a couple of images.
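The center-distance comparison can be sketched as below. The default threshold of 25 pixels is purely illustrative – as described above, the actual value was tuned by inspecting a few images.

```python
import math

def boxes_match(box_a, box_b, threshold=25.0):
    """Compare two [xmin, ymin, xmax, ymax] boxes by the Euclidean
    distance between their centers (in pixels)."""
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    return math.hypot(ax - bx, ay - by) <= threshold
```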
After some experimentation, our code could go image by image and box by box, marking every box as a true positive, false positive or false negative. Finally, we aggregated the results across the entire test set to see the performance of the model by UPC.
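Putting the definitions together, the per-image scoring pass might look like the sketch below. The greedy first-match strategy, the tuple shapes and the returned dictionary layout are our assumptions, not the post's actual code; `match_fn` would be a box comparator such as the center-distance check described above.

```python
from collections import defaultdict

def score_image(gt, pred, match_fn):
    """Mark predicted boxes as TP or FP and unmatched ground-truth boxes as FN.

    `gt` and `pred` are lists of (class_name, box) pairs for one image;
    `match_fn(box_a, box_b)` decides whether two boxes occupy the same shelf
    location. Returns per-class counts: {class: {"tp": n, "fp": n, "fn": n}}.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    matched = set()  # indices of ground-truth boxes already claimed
    for p_cls, p_box in pred:
        hit = None
        for i, (g_cls, g_box) in enumerate(gt):
            if i not in matched and g_cls == p_cls and match_fn(g_box, p_box):
                hit = i
                break
        if hit is None:
            counts[p_cls]["fp"] += 1  # no matching ground-truth box
        else:
            matched.add(hit)
            counts[p_cls]["tp"] += 1
    for i, (g_cls, _) in enumerate(gt):
        if i not in matched:
            counts[g_cls]["fn"] += 1  # ground-truth box never predicted
    return counts
```

Summing these per-image counts over the test set gives the per-class totals used for the final report.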
Our results showed a true positive rate above 90% for all classes, with minimal false positive and false negative predictions.