Artificial Intelligence For Image Recognition
The human brain can easily recognize and distinguish the objects in an image. For instance, given images of a cat and a dog, we tell the two apart almost instantly. If a machine can mimic this behavior, it is about as close to Artificial Intelligence as we can get. Accordingly, the field of Computer Vision aims to replicate the human vision system, and numerous milestones have broken barriers in this regard.
In this section, we cover the following 4 pre-trained models for image classification:
1. Very Deep Convolutional Networks for Large-Scale Image Recognition (VGG-16)
VGG-16 is one of the most popular pre-trained models for image classification. Introduced at the famous ILSVRC 2014 conference, it remains a widely used baseline even today. Developed by the Visual Geometry Group at the University of Oxford, VGG-16 comfortably surpassed the then-standard AlexNet and was quickly adopted by researchers and industry for their image classification tasks.
Here is the architecture of VGG-16:
Here is a more intuitive layout of the VGG-16 Model.
The following are the layers of the model:
- Convolutional Layers = 13
- Pooling Layers = 5
- Dense Layers = 3
Let us explore the layers in detail:
- Input: Image of dimensions (224, 224, 3).
- Convolution Layer Conv1:
- Conv1-1: 64 filters
- Conv1-2: 64 filters and Max Pooling
- Image dimensions: (224, 224)
- Convolution layer Conv2: Now, we increase the filters to 128
- Input Image dimensions: (112,112)
- Conv2-1: 128 filters
- Conv2-2: 128 filters and Max Pooling
- Convolution Layer Conv3: Again, double the filters to 256, and now add another convolution layer
- Input Image dimensions: (56,56)
- Conv3-1: 256 filters
- Conv3-2: 256 filters
- Conv3-3: 256 filters and Max Pooling
- Convolution Layer Conv4: Similar to Conv3, but now with 512 filters
- Input Image dimensions: (28, 28)
- Conv4-1: 512 filters
- Conv4-2: 512 filters
- Conv4-3: 512 filters and Max Pooling
- Convolution Layer Conv5: Same as Conv4
- Input Image dimensions: (14, 14)
- Conv5-1: 512 filters
- Conv5-2: 512 filters
- Conv5-3: 512 filters and Max Pooling
- The output dimensions here are (7, 7, 512). At this point, we flatten the output of this layer to generate a feature vector of length 7 * 7 * 512 = 25,088
- Fully Connected/Dense FC1: 4096 nodes, generating a feature vector of size (1, 4096)
- Fully Connected/Dense FC2: 4096 nodes, generating a feature vector of size (1, 4096)
- Fully Connected/Dense FC3: 1000 nodes, one per class, generating 1000 channels for the 1000 ImageNet classes. This is then passed to a Softmax activation function
- Output layer
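As a sanity check on the layer list above, a short sketch in plain Python (layer shapes taken directly from the list) can tally VGG-16's parameter count:

```python
# Tally VGG-16 parameters from the layer list above.
# Every conv layer is 3x3; params = 3*3*in_ch*out_ch weights + out_ch biases.
conv_layers = [
    (3, 64), (64, 64),                     # Conv1-1, Conv1-2
    (64, 128), (128, 128),                 # Conv2-1, Conv2-2
    (128, 256), (256, 256), (256, 256),    # Conv3-1..3
    (256, 512), (512, 512), (512, 512),    # Conv4-1..3
    (512, 512), (512, 512), (512, 512),    # Conv5-1..3
]
conv_params = sum(3 * 3 * c_in * c_out + c_out for c_in, c_out in conv_layers)

# Dense layers: flattened (7, 7, 512) -> 4096 -> 4096 -> 1000 classes.
fc_layers = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]
fc_params = sum(n_in * n_out + n_out for n_in, n_out in fc_layers)

total = conv_params + fc_params
print(total)  # 138357544 -> the ~138 million parameters quoted for VGG-16
```

Note that the three dense layers alone contribute about 124 million of those parameters, which is why later architectures largely abandoned large fully connected heads.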
As you can see, the model is sequential in nature and uses many filters. At each stage, small 3 * 3 filters are used to keep the number of parameters in check, and all the hidden layers use the ReLU activation function. Even then, the total comes to about 138 million parameters, which makes VGG-16 a slower and much larger model to train than many others.
Additionally, there are variations of the VGG16 model, which are essentially improvements to it, such as VGG19 (19 layers).
While researching for this article, one thing was clear: 2014 was an iconic year for the development of popular pre-trained image classification models. While the above VGG-16 secured 2nd place in that year's ILSVRC, 1st place went to none other than Google, via its model GoogLeNet, now better known as Inception.
The original paper proposed the Inceptionv1 model. At only 7 million parameters, it was much smaller than the then-prevalent models like VGG and AlexNet. Add to that a lower error rate, and you can see why it was a breakthrough model. On top of this, the paper's major innovation was another breakthrough: the Inception Module.
As can be seen, in simple terms the Inception Module performs convolutions with different filter sizes on the input, performs Max Pooling, and concatenates the results for the next Inception module. The introduction of the 1 * 1 convolution operation reduces the parameters drastically.
Though the number of layers in Inceptionv1 is 22, the massive reduction in the parameters makes it a formidable model to beat.
The Inceptionv2 model was a major improvement on Inceptionv1, increasing accuracy while making the model less complex. In the same paper as Inceptionv2, the authors introduced the Inceptionv3 model with a few more improvements on v2.
The following are the major improvements included:
- Introduction of Batch Normalisation
- More factorization
- RMSProp Optimiser
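Batch Normalisation, the first of these improvements, can be sketched in a few lines of NumPy: each feature is normalised to zero mean and unit variance over the batch, then rescaled by learnable parameters gamma and beta. This is a minimal sketch of the training-time forward pass only (inference uses running statistics instead):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalise each feature over the batch, then scale and shift."""
    mean = x.mean(axis=0)            # per-feature mean over the batch
    var = x.var(axis=0)              # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta      # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 8))  # a batch of 64 activations
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(4), y.std(axis=0).round(2))  # ~0 mean, ~1 std
```

By keeping each layer's inputs in a stable range, this lets the network train faster and with higher learning rates.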
While it is not possible to provide an in-depth explanation of Inception in this article, you can go through this comprehensive article covering the Inception Model in detail: Deep Learning in the Trenches: Understanding Inception Network from Scratch
As you can see, the number of layers is 42, compared to VGG16's mere 16 layers. Also, Inceptionv3 reduced the error rate to only 4.2%.
Just like Inceptionv3, ResNet50 is not the first model in the ResNet family. The original model, called the Residual Network or ResNet, was another milestone in the CV domain, back in 2015.
The main motivation behind this model was to avoid the drop in accuracy that occurs as a model becomes deeper. Additionally, if you are familiar with Gradient Descent, you will have come across the vanishing gradient issue, which the ResNet model also aimed to tackle. Here is the architecture of the earliest variant, ResNet34 (ResNet50 follows a similar technique, with just more layers):
You can see that after starting off with a single convolutional layer and Max Pooling, there are 4 similar stages with varying filter counts, all of them using the 3 * 3 convolution operation. Also, after every 2 convolutions, we bypass/skip the layers in between. This is the main concept behind ResNet models. These skipped connections are called 'identity shortcut connections' and use what are called residual blocks:
In simple terms, the authors of ResNet propose that fitting a residual mapping is much easier than fitting the actual mapping, and thus they apply it in all the layers. Another interesting point: the authors of ResNet are of the opinion that stacking more layers should not make the model perform worse.
This is contrary to what we saw in Inception, and is almost similar to VGG16 in the sense that it just stacks layers on top of one another. ResNet only changes the underlying mapping.
The ResNet model has many variants, of which the latest is ResNet152. The following is the architecture of the ResNet family in terms of the layers used:
We finally come to the latest of these 4 models, one that has caused waves in this domain, and of course it is from Google. In EfficientNet, the authors propose a new scaling method called Compound Scaling. The long and short of it is this: earlier models like ResNet follow the conventional approach of scaling the dimensions arbitrarily, by adding more and more layers.
However, the paper proposes that if we scale all the dimensions at the same time, by a fixed amount and uniformly, we achieve much better performance. The scaling coefficients can in fact be decided by the user.
Though this scaling technique can be used for any CNN-based model, the authors started off with their own baseline model called EfficientNetB0:
MBConv stands for mobile inverted bottleneck Convolution (similar to MobileNetv2). They also propose the Compound Scaling formula with the following scaling coefficients:
- Depth = 1.20
- Width = 1.10
- Resolution = 1.15
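With the coefficients above, depth, width, and resolution are all scaled together by a single compound coefficient phi: depth grows as 1.2^phi, width as 1.1^phi, and resolution as 1.15^phi. The coefficients were chosen so that FLOPs roughly double for each unit increase in phi. A quick check in Python (the phi = 2 example is illustrative, not the exact recipe for any particular EfficientNet variant):

```python
# EfficientNet compound scaling: one knob (phi) scales all three dimensions.
alpha, beta, gamma = 1.20, 1.10, 1.15   # depth, width, resolution coefficients

def scale(phi):
    """Multipliers for depth, width, and resolution at compound coefficient phi."""
    return alpha ** phi, beta ** phi, gamma ** phi

# FLOPs grow roughly as depth * width^2 * resolution^2, so the paper
# constrains alpha * beta^2 * gamma^2 ~= 2: each step of phi ~doubles FLOPs.
flops_factor = alpha * beta ** 2 * gamma ** 2
print(round(flops_factor, 3))  # 1.92, close to the target of 2

depth, width, resolution = scale(2)   # illustrative: two scaling steps up from B0
print(round(depth, 2), round(width, 2), round(resolution, 2))  # 1.44 1.21 1.32
```

The key design choice is that a single number controls the whole family: rather than hand-tuning each larger model, you pick phi to match your compute budget.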
This formula is used to again build a family of EfficientNets – EfficientNetB0 to EfficientNetB7. The following is a simple graph showing the comparative performance of this family vis-a-vis other popular models:
As you can see, even the baseline B0 model starts at a much higher accuracy, which only increases across the family, and with fewer parameters. For instance, EfficientNetB0 has only 5.3 million parameters!
The simplest way to implement EfficientNet is to install the library; the rest of the steps are similar to what we have seen above.