So how did I do this?
- I found a video on YouTube that showed streets riddled with potholes and used it as a test case.
- Then I downloaded that video using PickVideo.
- I sliced the video into many images with Python code, following this StackOverflow post.
- I downloaded this repo on GitHub and followed the readme instructions.
- I then ran the existing code used to detect elephants in a photo to get the hang of the code before I tried to build my own data set.
- I did a large amount of labeling. I used LabelImg to mark areas of interest with bounding boxes. Potholes can be tricky, since they come in different shapes, so I spent about a day labeling so the algorithm could detect potholes of different sizes. This process produces XML files, which were later converted into CSV format.
- I then followed this training procedure from the same repo I mentioned earlier.
- After the training was done, I applied it to various snippets of video to highlight the potential of object detection on potholes.
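The XML-to-CSV step in the list above can be sketched as follows. This is a minimal illustration, not the conversion script I actually ran: it assumes the Pascal VOC layout that LabelImg writes, and the column order in `CSV_COLUMNS` is an assumption that should be matched to whatever the training code expects.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumed column layout; adjust to the format your training script reads.
CSV_COLUMNS = ["filename", "width", "height", "class", "xmin", "ymin", "xmax", "ymax"]


def voc_xml_to_rows(xml_text):
    """Return one row per <object> found in a Pascal VOC annotation string."""
    root = ET.fromstring(xml_text)
    filename = root.findtext("filename")
    width = int(root.findtext("size/width"))
    height = int(root.findtext("size/height"))
    rows = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        rows.append([
            filename, width, height, obj.findtext("name"),
            int(box.findtext("xmin")), int(box.findtext("ymin")),
            int(box.findtext("xmax")), int(box.findtext("ymax")),
        ])
    return rows


def rows_to_csv(rows):
    """Serialize the rows, with a header line, into CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(CSV_COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical annotation for a single frame with one labeled pothole.
SAMPLE_XML = """<annotation>
  <filename>frame_0001.jpg</filename>
  <size><width>1280</width><height>720</height><depth>3</depth></size>
  <object>
    <name>pothole</name>
    <bndbox><xmin>100</xmin><ymin>400</ymin><xmax>260</xmax><ymax>480</ymax></bndbox>
  </object>
</annotation>"""

if __name__ == "__main__":
    print(rows_to_csv(voc_xml_to_rows(SAMPLE_XML)))
```

In practice you would loop over every XML file in the annotation folder and write the combined rows to one CSV for training and one for testing.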
Overall, I thought the experimentation was great. But I wanted to take this further with Mask R-CNN, a relatively new addition to object detection that I will talk about in the next section.
This process takes object detection further by adding an object mask in parallel with an existing bounding box for the region of interest shown in the GIF below.
This algorithm masks and detects the object of interest in parallel, which greatly increases the accuracy.
A company that has really been pushing the envelope in this field is Matterport, a firm based in the San Francisco Bay Area. They have kindly open-sourced their work for others to reproduce. A great post from them on how to implement their code can be found here.
The steps to train were very similar to those for object detection. I had already downloaded a video and sliced it into many images. This time, though, I annotated the images with a different tool, VGG Image Annotator (VIA). Instead of drawing bounding boxes, I drew polygons around the potholes and saved the annotations as JSON rather than XML.
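To make the polygon annotations concrete, here is a rough sketch of how a VIA-style JSON export could be turned into a binary mask. It assumes the usual VIA region layout (`shape_attributes` with `all_points_x` and `all_points_y`); the function names and the sample annotation are hypothetical, and a real pipeline would more likely rasterize with an image library than with this pure-Python ray-casting loop.

```python
import json


def load_polygons(via_json_text):
    """Map each filename to a list of polygons, each a list of (x, y) points."""
    annotations = json.loads(via_json_text)
    polygons = {}
    for entry in annotations.values():
        regions = entry["regions"]
        # VIA 1.x stores regions as a dict, VIA 2.x as a list.
        if isinstance(regions, dict):
            regions = list(regions.values())
        shapes = [region["shape_attributes"] for region in regions]
        polygons[entry["filename"]] = [
            list(zip(shape["all_points_x"], shape["all_points_y"]))
            for shape in shapes if shape.get("name") == "polygon"
        ]
    return polygons


def point_in_polygon(x, y, polygon):
    """Even-odd ray-casting test for a point against a closed polygon."""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        if (yi > y) != (yj > y):
            # x coordinate where this edge crosses the horizontal line through y
            x_cross = xi + (y - yi) * (xj - xi) / (yj - yi)
            if x < x_cross:
                inside = not inside
        j = i
    return inside


def polygon_to_mask(polygon, width, height):
    """Rasterize one polygon into a height x width grid of 0/1 (pixel centers)."""
    return [
        [1 if point_in_polygon(col + 0.5, row + 0.5, polygon) else 0
         for col in range(width)]
        for row in range(height)
    ]


# Hypothetical export with a single square "pothole" in a tiny 6x6 image.
SAMPLE_VIA_JSON = json.dumps({
    "frame_0001.jpg12345": {
        "filename": "frame_0001.jpg",
        "size": 12345,
        "regions": [{
            "shape_attributes": {
                "name": "polygon",
                "all_points_x": [1, 4, 4, 1],
                "all_points_y": [1, 1, 4, 4],
            },
            "region_attributes": {},
        }],
        "file_attributes": {},
    }
})

if __name__ == "__main__":
    polygon = load_polygons(SAMPLE_VIA_JSON)["frame_0001.jpg"][0]
    for row in polygon_to_mask(polygon, 6, 6):
        print("".join(str(v) for v in row))
```

Each polygon becomes its own mask, which is what lets the model treat every pothole as a separate instance rather than one merged region.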
I also followed the data inspection and model inspection notebooks, adapting the code for pothole detection. I modified their training file as well and trained the model overnight. There was a concern that a CPU would not be good enough, but training ran just fine on my local machine, which has an Intel Core i5 and 8 GB of RAM.
Here are a few examples of the output after everything was done:
Notes & Statistics:
Mask R-CNN distinguishes the individual objects it detects by assigning each one a random color. However, the colors changed from one image frame to the next, so I modified the code to keep the colors consistent.
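One way to get stable colors, sketched below, is to swap the random palette for a deterministic one. This is only an illustration of the idea and not the exact change I made; `fixed_colors` and `color_for_instance` are hypothetical helper names, and the sketch assumes the flicker comes from the palette being regenerated randomly for each frame.

```python
import colorsys


def fixed_colors(n, brightness=1.0):
    """Return n distinct RGB tuples (values in 0..1) in a repeatable order."""
    if n <= 0:
        return []
    # Spread hues evenly around the color wheel; no shuffling, so the
    # same index always maps to the same color.
    return [colorsys.hsv_to_rgb(i / n, 1.0, brightness) for i in range(n)]


def color_for_instance(index, palette):
    """Pick the palette entry for the index-th detection in a frame."""
    return palette[index % len(palette)]


if __name__ == "__main__":
    palette = fixed_colors(8)
    for frame in range(3):
        # The first detection gets the same color in every frame.
        print(frame, color_for_instance(0, palette))
```

Note that this keeps colors stable by detection order, not by object identity: if two potholes swap places in the detection list between frames, their colors swap too. Fully consistent coloring would need some form of tracking across frames.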
Here are some statistics from both examples to compare.
SSD:
- Train: 1262, Test: 450
- Pretrained model: SSD MobileNet, pre-trained on the ImageNet dataset
- Training Steps: ~ 1000
- Code: TensorFlow
Mask R-CNN:
- Train: 71, Test: 25 (very impressive!)
- Pretrained model: ResNet101 from Keras, pre-trained on the MS COCO dataset
- To find out more about ResNets, check out this article here.
- Training Steps: ~ 600
- Code: TensorFlow and Keras
How the Mask R-CNN code was better than the SSD MobileNet code
- Drawing a bounding box is easy, but I believe a box also captures details around the pothole that do not belong to it. Drawing polygons is more difficult, but each polygon follows the outline of the pothole itself, so this is more of a quality vs. quantity issue.
- I believe that the addition of instance segmentation greatly helps to identify objects with very little labeling, because the additional mask blocks out extra noise around the object during the training process.
- The Mask R-CNN code also uses a Feature Pyramid Network, which can detect the same object at multiple scales. This helps avoid the need for lots of annotations to get good results.
There were also some misclassifications with both methods:
Object Detection Without Masking

Mask R-CNN

Still, I feel the results are fantastic despite a few misclassifications.