The applications of computer vision are vast, ranging from traffic monitoring to cancer detection. You may have even heard of Amazon incorporating the technology into its automated grocery store concept, Amazon Go.
So how do we give computers the power to understand images? At a high level, a computer learns through the same process that humans do: repetition. By providing it with a substantial amount of labeled images, the computer learns which features of an image are associated with the different labels.
To accomplish this, most modern computer vision models combine two components: a convolutional neural network (CNN) and a fully connected (deep) network. The CNN is responsible for extracting relevant features from the image, starting with low-level features such as edges and curves and eventually piecing these together into holistic representations of objects. These features are then passed into the fully connected layers, which produce the final prediction.
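To make this concrete, here is a minimal sketch of that two-part structure, assuming PyTorch (any deep learning framework would do); the layer sizes, input resolution, and class count are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Convolutional part: detects low-level features (edges, curves)
        # and combines them into higher-level representations.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Fully connected part: maps the extracted features to class scores.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, 64), nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x):  # x: a batch of 3-channel 32x32 images
        return self.classifier(self.features(x))

logits = TinyClassifier()(torch.randn(1, 3, 32, 32))  # -> shape (1, 10)
```

Training such a network on labeled images is what ties the learned features to the labels.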
As an application of computer vision, I decided to tackle the task of pedestrian detection. Given a video of pedestrians, the goal is to identify when a pedestrian enters the frame and to track their location through the video over time. Although the task is relatively straightforward, its applications are wide-ranging, especially for enforcing occupancy limits during the recent pandemic.
To complete this task, I employed YOLOv4, a pre-trained object detection model that is widely used in industry. Although there have been iterative improvements to YOLO throughout its development, the underlying algorithm remains the same:
1. Divide the image into a grid of equally sized cells.
2. Run detection and classification on each grid cell.
a. Predict bounding box coordinates, relative to the cell's coordinates, for each detected object.
b. Output an object label and confidence score for each detected object.
3. Use non-max suppression to keep only the detections with the highest probabilities and obtain the final bounding boxes.
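In practice, the whole pipeline fits in a few lines of code. The sketch below runs a pre-trained YOLOv4 model on video frames using OpenCV's dnn module and keeps only detections of the "person" class; the file names, video path, and thresholds are illustrative assumptions rather than the exact code from my project.

```python
import cv2

# Load the pre-trained YOLOv4 network (config and weights come from the Darknet repo).
net = cv2.dnn.readNetFromDarknet("yolov4.cfg", "yolov4.weights")
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1 / 255, swapRB=True)

cap = cv2.VideoCapture("pedestrians.mp4")  # assumed input video
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Steps 1-3 above (grid predictions + non-max suppression) happen inside detect().
    class_ids, scores, boxes = model.detect(frame, confThreshold=0.5, nmsThreshold=0.4)
    for class_id, score, box in zip(class_ids, scores, boxes):
        if int(class_id) == 0:  # index 0 is "person" in the COCO label list YOLOv4 uses
            x, y, w, h = box
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("pedestrians", frame)
    if cv2.waitKey(1) == 27:  # press Esc to quit
        break
cap.release()
cv2.destroyAllWindows()
```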
Let’s dive into these techniques in a little more detail to better understand how they function. Specifically, what information is contained in each bounding box, and how does non-max suppression consolidate the predictions made by the individual grid cells?
Each bounding box in the image consists of the following attributes: box width, box height, object class, and bounding box center. The object class denotes the predicted label of the object within the bounding box.
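Concretely, a single detection might be represented like this (the field names are my own, chosen for illustration):

```python
from dataclasses import dataclass

@dataclass
class Detection:
    center_x: float    # bounding box center, relative to the grid cell / image
    center_y: float
    width: float       # box width
    height: float      # box height
    class_name: str    # predicted object class, e.g. "person"
    confidence: float  # prediction probability, used by non-max suppression below
```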
Next, to understand non-max suppression, we must first define the criterion it uses to compare bounding boxes. This value, known as intersection over union (IoU), measures how well two bounding boxes overlap and serves as a metric for the localization error of a prediction. It is calculated by dividing the area of the intersection of two bounding boxes by the area of their union; a perfect prediction has an IoU value of 1.
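As a quick sketch, here is how IoU can be computed for two boxes given in (top-left x, top-left y, width, height) form, the same format the OpenCV snippet above returns:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Corners of the intersection rectangle (empty if the boxes don't overlap).
    x1, y1 = max(ax, bx), max(ay, by)
    x2, y2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 10, 10)))  # 25 / 175 ≈ 0.14
```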
Now, during non-max suppression, YOLO first considers all the bounding boxes predicted for a particular class and takes the box with the highest prediction probability as a starting point. This box is then compared with every other bounding box of that class, with IoU serving as the comparison metric. If the IoU is greater than some predefined threshold, the box with the lower probability is suppressed and excluded from the final prediction, since a high IoU indicates that the two boxes are likely capturing the same object. This process is repeated until every box has either been kept as a prediction or suppressed from the final output.
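Putting this together, the greedy suppression loop for a single class might look like the sketch below, reusing the iou helper above (the 0.5 threshold is an assumed value, not necessarily YOLOv4's default):

```python
def non_max_suppression(boxes, scores, iou_threshold=0.5):
    # Sort box indices by confidence, highest first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # highest-scoring remaining box is kept
        keep.append(best)
        # Suppress every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep  # indices of the boxes that survive as final predictions
```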
The architecture YOLO employs offers a host of advantages over other object detection models such as R-CNN. Most importantly, it's incredibly fast: because the model makes only one overall pass over the image (hence the name, "You Only Look Once"), execution is quite speedy even for video data. The model also learns generalized object representations, meaning it can be applied to contexts that differ from the one on which it was trained. Finally, the model considers the image as a whole when making predictions, avoiding the contextual errors that models like Fast R-CNN are prone to.
In summary, computer vision is a rapidly expanding field of artificial intelligence that deals with helping computers understand images. By automating the “seeing” process for machines, computer vision has augmented and even transformed many industries. In agriculture, for example, crop yields are increasingly predicted by feeding drone and satellite imagery of fields into computer vision models. These models are trained on ample amounts of labeled data, and most rely on a CNN paired with a feedforward network to make predictions. Even though the applications may appear complex, many libraries and pre-trained models have been developed that make this technology easy to use. I encourage you to embark on your own computer vision projects!