From self-driving cars to Instagram filters, it is impressive to see applications based on object detection becoming more available to the general public. For those unfamiliar with it, object detection is a field of computer vision that aims to teach computers how to detect instances of objects in digital images. The following image exemplifies the output of an object detection model, where every instance of the classes of interest must be detected as a bounding box with a class label.
It is not hard to see how object detection can make our lives easier and better across many different applications. However, training a state-of-the-art object detector requires a massive amount of annotated training data, and with the new transformer-based object detection methods, this demand skyrockets.
Annotating data to train an object detector is tedious, costly, and time-consuming. Just imagine all the steps you need to perform to annotate the image above, then multiply that number by thousands of samples. For example, in the image above, an annotator must execute at least 30 actions to annotate the whole image, since it is necessary to create each bounding box and associate it with the correct object class.
This workload could be acceptable for big companies with large annotation teams. However, it could be a dealbreaker for a small company trying to build a new solution. So, what can we do to overcome this problem?
One attractive direction is Weakly Supervised Object Detection (WSOD). WSOD aims to learn how to detect objects in images without bounding-box supervision. The annotator’s job is therefore only to mark which object classes appear in each image, exactly as in image classification, which is considerably faster than drawing every bounding box one at a time.
The following image exemplifies the annotation process for a WSOD method.
Now the annotator only needs to perform three actions, i.e., selecting the classes present in the image.
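To make the difference concrete, here is a small sketch of the two annotation styles. It assumes that fully supervised labeling costs two actions per object (draw the box, pick the class) and that weak labeling costs one action per class present. The image contents, the coordinates, and every name in the code are hypothetical and chosen only so the numbers match the 30 and three actions mentioned above.

```python
from dataclasses import dataclass


@dataclass
class BoxAnnotation:
    """One fully supervised label: a bounding box plus its class."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    label: str


def full_supervision_actions(boxes):
    # Assumption: one action to draw the box, one to assign its class.
    return 2 * len(boxes)


def weak_supervision_actions(boxes):
    # Assumption: one action per distinct class present in the image.
    return len({b.label for b in boxes})


def image_level_labels(boxes, class_names):
    # Multi-hot vector: 1 if the class appears anywhere in the image.
    present = {b.label for b in boxes}
    return [1 if name in present else 0 for name in class_names]


# Hypothetical image: 15 objects from 3 classes (coordinates are made up).
boxes = (
    [BoxAnnotation(10 * i, 20, 10 * i + 8, 60, "person") for i in range(8)]
    + [BoxAnnotation(15 * i, 70, 15 * i + 12, 90, "car") for i in range(5)]
    + [BoxAnnotation(5, 5, 9, 9, "dog"), BoxAnnotation(50, 5, 54, 9, "dog")]
)

print(full_supervision_actions(boxes))  # 30
print(weak_supervision_actions(boxes))  # 3
print(image_level_labels(boxes, ["car", "dog", "person", "bicycle"]))  # [1, 1, 1, 0]
```

The multi-hot vector on the last line is all a WSOD method receives as supervision for this image.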
Anyways, you must be wondering:
WSOD is a hot and active research topic, and most methods frame the problem as a Multiple Instance Learning (MIL) task. In other words, these techniques rely on a set of object proposals and try to learn how to select the best candidates from this set for each class. An object proposal is a pre-computed bounding box with a high probability of containing an object instance. But isn’t it cheating to use these proposals?
No, because they are generated by methods that need no human supervision, such as Selective Search (Uijlings et al., 2013).
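The MIL idea can be sketched in a few lines of plain Python. The example below is loosely inspired by the two-stream design of weakly supervised deep detection networks (Bilen and Vedaldi, listed in the references), but it is only an illustrative toy: the logits are made-up numbers standing in for the outputs of a real network, and all function names are hypothetical.

```python
import math


def softmax(values):
    # Numerically stable softmax over a list of floats.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def score_proposals(cls_logits, det_logits):
    """Combine two per-proposal score tables, both indexed [proposal][class].

    cls_logits: how much each proposal looks like each class.
    det_logits: how much each proposal stands out among all proposals.
    """
    n_props, n_classes = len(cls_logits), len(cls_logits[0])
    # Classification stream: softmax over classes, for each proposal.
    cls_probs = [softmax(row) for row in cls_logits]
    # Detection stream: softmax over proposals, for each class.
    det_probs = [softmax([det_logits[r][c] for r in range(n_props)])
                 for c in range(n_classes)]
    # Final proposal score: element-wise product of both streams.
    prop_scores = [[cls_probs[r][c] * det_probs[c][r] for c in range(n_classes)]
                   for r in range(n_props)]
    # Summing over proposals gives one score in (0, 1) per class for the image.
    image_scores = [sum(prop_scores[r][c] for r in range(n_props))
                    for c in range(n_classes)]
    return prop_scores, image_scores


def image_level_loss(image_scores, labels, eps=1e-7):
    # Binary cross-entropy against image-level labels: the only supervision used.
    return -sum(y * math.log(s + eps) + (1 - y) * math.log(1 - s + eps)
                for s, y in zip(image_scores, labels))


# Made-up logits for 3 proposals and 2 classes.
cls_logits = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
det_logits = [[3.0, 0.0], [0.0, 3.0], [0.0, 0.0]]

prop_scores, image_scores = score_proposals(cls_logits, det_logits)
loss = image_level_loss(image_scores, labels=[1, 1])

# The "detection" for each class is simply its highest-scoring proposal.
best = [max(range(len(prop_scores)), key=lambda r: prop_scores[r][c])
        for c in range(len(image_scores))]
print(best)  # [0, 1]
```

Training only ever compares `image_scores` with the image-level labels. The per-proposal scores, and therefore the boxes, emerge as a by-product, which is why no bounding-box annotation is needed.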
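To keep this post short, let us summarize the two most valuable developments in WSOD in recent years: the weakly supervised deep detection network of Bilen and Vedaldi (2016), and the multiple instance detection network with online instance classifier refinement of Tang et al. (2017).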
There are other contributions to the field, and if you want to dive deeper into the literature, Zhang et al. (2021) published an extensive survey of the state of the art in WSOD. As expected, the bounding boxes learned by WSOD methods are not perfect, and their mean average precision (mAP) is lower than that of their fully supervised counterparts. However, it is possible to use the output of a WSOD model as a starting point and, from there, refine the model in a supervised manner.
You can follow a simple recipe:
Although the results will not be as good as those of a fully supervised method, this approach can considerably reduce annotation time and be a great way to prove a concept to a client.
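One step of such a pipeline, turning WSOD predictions into pseudo ground-truth boxes for the supervised refinement, could look roughly like the sketch below. The selection rule (keep the top-scoring box for each class the annotator marked as present) is a common heuristic and an assumption here, not something prescribed by this post. The detection format and the function name are hypothetical.

```python
def pseudo_ground_truth(detections, image_labels, min_score=0.0):
    """Keep the top-scoring predicted box for each class known to be present.

    detections: list of (box, class_name, score) tuples from a WSOD model
                (hypothetical format; box is (x_min, y_min, x_max, y_max)).
    image_labels: set of class names the annotator marked for this image.
    """
    best = {}
    for box, name, score in detections:
        if name not in image_labels or score < min_score:
            continue  # ignore classes the annotator did not mark, or weak boxes
        if name not in best or score > best[name][1]:
            best[name] = (box, score)
    return {name: box for name, (box, _) in best.items()}


# Made-up predictions for one image labeled with "person" and "car".
detections = [
    ((10, 10, 50, 80), "person", 0.7),
    ((12, 8, 48, 85), "person", 0.9),
    ((0, 0, 30, 30), "cat", 0.8),      # class not in the image-level labels
    ((60, 40, 120, 90), "car", 0.4),
]

pseudo_boxes = pseudo_ground_truth(detections, {"person", "car"})
print(pseudo_boxes)
# {'person': (12, 8, 48, 85), 'car': (60, 40, 120, 90)}
```

Keeping a single box per class misses images with several instances of the same class, so in practice these pseudo boxes are best treated as a draft for an annotator to correct or for a supervised detector to refine.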
This post presented a brief introduction to what WSOD is and how you can apply it to start an object detection project with low-cost annotations. The mAP of WSOD methods has improved considerably in recent years, and this research field seems promising. Looking ahead, the recent progress in self-supervised learning with transformers (Caron et al., 2021) is a hot direction for improvements in WSOD. The best object detectors may one day be trained without any bounding box supervision. Who knows…
Uijlings, Jasper R. R., et al. “Selective search for object recognition.” International Journal of Computer Vision 104.2 (2013): 154-171.
Bilen, Hakan, and Andrea Vedaldi. “Weakly supervised deep detection networks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
Tang, Peng, et al. “Multiple instance detection network with online instance classifier refinement.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
Zhang, Dingwen, et al. “Weakly supervised object localization and detection: A survey.” IEEE Transactions on Pattern Analysis and Machine Intelligence 44.9 (2021): 5866-5885.
Caron, Mathilde, et al. “Emerging properties in self-supervised vision transformers.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.