When you look at an image, you see objects—perhaps individual people, stoplights, cars, and other items. Whatever the image contains, you see the objects and understand what they are. A computer, however, sees pixels—a 2D image that contains numeric values that translate into color when presented on a screen. In order for a computer to see the objects that you see, it requires some sort of deep learning technology, such as a Convolutional Neural Network (CNN).
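To make this concrete, here's a minimal sketch (using NumPy purely as an illustration; the choice of library is mine, not the text's) of what a tiny image looks like from the computer's point of view: nothing but an array of numbers.

```python
import numpy as np

# A hypothetical 2x3-pixel RGB "image": to the computer, just a grid of numbers.
# Each pixel holds three values (red, green, blue) in the range 0-255.
image = np.array([
    [[255,   0,   0], [  0, 255,   0], [  0,   0, 255]],  # a red, a green, and a blue pixel
    [[255, 255, 255], [128, 128, 128], [  0,   0,   0]],  # a white, a gray, and a black pixel
], dtype=np.uint8)

print(image.shape)  # (2, 3, 3): height, width, color channels
print(image[0, 0])  # [255 0 0] -- the computer sees numbers, not "a red pixel"
```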
The concept of computer vision started in 1966 (yes, that long ago) when Seymour Papert and Marvin Minsky launched the Summer Vision Project, a two-month, ten-person effort to create a computer system using symbolic AI that could identify objects in images. To accomplish this task, the computer would have to move from working with pixels to identifying which pixels belonged to a particular object. Given the technology of the time, the Summer Vision Project didn’t get far.
The next attempt came in 1979, when Japanese scientist Kunihiko Fukushima proposed the Neocognitron. The project was based on neuroscience research into how the visual cortex recognizes patterns, and it attempted to perform its task in a human-like manner. The Neocognitron was successful in a very basic way, but it, too, failed at tasks of any complexity.
The first success came in the 1980s with the efforts of French computer scientist Yann LeCun, who built the CNN, which was inspired by the Neocognitron. CNNs are the building blocks of deep learning–based image recognition, yet they answer only a basic classification need: Given a picture, they can determine whether its content can be associated with a specific image class learned through previous examples. Therefore, when you train a deep neural network to recognize dogs and cats, you can feed it a photo and obtain output that tells you whether the photo contains a dog or a cat. The outputs generally come in two forms:
- If the last network layer is a softmax layer, the network outputs the probability of the photo containing a dog or a cat (the two classes you trained it to recognize), and the two probabilities sum to 100 percent.
- When the last layer is a sigmoid-activated layer, you obtain scores that you can interpret as probabilities of the content belonging to each class independently, so the scores don't have to sum to 100 percent. The short sketch after this list illustrates the difference.
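Here's a minimal sketch of the difference between the two output forms, using NumPy and made-up raw scores (logits) for the two classes; the numbers are purely illustrative:

```python
import numpy as np

# Hypothetical raw scores (logits) a trained network might produce for "dog" and "cat".
logits = np.array([2.0, 0.5])

# Softmax: the two outputs compete, so they always sum to 1 (that is, 100 percent).
softmax = np.exp(logits) / np.sum(np.exp(logits))
print(softmax.round(3))  # [0.818 0.182] -- dog vs. cat probabilities
print(softmax.sum())     # 1.0 (up to floating-point rounding)

# Sigmoid: each output is an independent score between 0 and 1,
# so the two values don't have to sum to anything in particular.
sigmoid = 1 / (1 + np.exp(-logits))
print(sigmoid.round(3))  # [0.881 0.622] -- independent per-class scores
```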
Classification of this sort breaks down, however, in several common situations:
- The main object isn’t something you trained the network to recognize. You may have presented the example neural network with a photo of a raccoon; in this case, the network still outputs an incorrect answer of dog or cat.
- The main object is partially obstructed. For instance, your cat is playing hide-and-seek in the photo you show the network, and the network can’t spot it.
- The photo contains many different objects to detect, perhaps including animals other than cats and dogs. In this case, the output from the network still suggests a single class rather than accounting for all the objects.
To move beyond these limitations, image recognition extends basic classification with three increasingly precise tasks:
- Detection: Determining when an object is present in an image. Detection is different from classification because it involves just a portion of the image, implying that the network can detect multiple objects of the same type and of different types. The capability to spot objects in partial images is called instance spotting.
- Localization: Defining exactly where a detected object appears in an image. Localizations come in different types that, depending on their granularity, distinguish the part of the image containing the detected object (a rectangular bounding box, for instance).
- Segmentation: Classification of objects at the pixel level. Segmentation takes localization to the extreme: this kind of neural model assigns each pixel of the image to a class, or even to an individual entity. For instance, the network marks all the pixels in a picture that belong to dogs and distinguishes each individual dog using a different label (called instance segmentation). The sketch after this list shows how a single model can return all three kinds of output.
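To see how detection, localization, and segmentation fit together in practice, here is a minimal sketch using torchvision's pretrained Mask R-CNN model (the choice of library and model is an assumption for illustration; the text doesn't prescribe one). A single forward pass returns class labels (detection), bounding boxes (localization), and per-pixel masks (instance segmentation) for each object the model finds.

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Load a Mask R-CNN model pretrained on the COCO dataset.
model = maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# A random 3-channel tensor stands in for a real photo in this sketch.
image = torch.rand(3, 480, 640)

with torch.no_grad():
    prediction = model([image])[0]  # the model accepts a list of images

print(prediction["labels"])       # detection: which object classes were found
print(prediction["scores"])       # confidence score for each detection
print(prediction["boxes"])        # localization: one bounding box per detected object
print(prediction["masks"].shape)  # segmentation: one pixel-level mask per detected object
```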
Computer vision is an essential part of the future of deep learning. You can find examples of how to implement basic computer vision in Deep Learning For Dummies, by John Paul Mueller and Luca Massaron (Wiley).