Hands-on guide to using Vision transformer for Image classification – Analytics India Magazine

Vision transformers are among the most popular transformer architectures in deep learning. Before their introduction, convolutional neural networks were the default choice for complex computer vision tasks. With vision transformers, computer vision gained a similarly powerful model, much as NLP has BERT and GPT for complex language tasks. In this article, we will learn how we can use a vision transformer for an image classification task through a hands-on implementation. Following are the major points to be covered in this article.
Step 1: Initializing setup
Step 2: Building network
Step 3: Building the vision transformer
Step 4: Compiling and training
Let’s start with understanding the vision transformer first.
The vision transformer (ViT) is a transformer used in the field of computer vision that works on the same principles as the transformers used in natural language processing. Internally, a transformer learns by measuring the relationships between pairs of input tokens; in computer vision, patches of an image serve as the tokens. These relationships are learned through attention, applied either in conjunction with a convolutional network or by replacing some components of one. Such networks can then be applied to image classification tasks. The full procedure of image classification using a vision transformer can be explained by the following image.
In the above image, we can see the procedure we are required to follow. In this article, we are going to discuss how we can perform all these steps using the Keras library. 
For this implementation, we will take the following steps. 
In this section, we will perform some of the basic modelling procedures, such as importing the dataset, defining hyperparameters, and setting up data augmentation.
Let’s start by obtaining the data. We will use the CIFAR-10 dataset provided by the Keras library, which contains 50,000 training images of size 32×32 and 10,000 test images of the same size. The images carry the following ten labels: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.
We can load this dataset and check the shapes of its train and test splits using the following lines of code.
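The original snippet is not reproduced on the page; a minimal sketch of this step using the `keras.datasets` API (variable names are our own) looks like this:

```python
# Load CIFAR-10 directly from Keras: the x arrays hold 32x32 RGB images,
# the y arrays hold integer labels in the range 0-9.
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()

# Check the shape of the datasets.
print(f"x_train: {x_train.shape} - y_train: {y_train.shape}")
print(f"x_test: {x_test.shape} - y_test: {y_test.shape}")
```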
In this section, we will define some of the parameters that we will use in the other sub-processes.
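The exact values are not shown in the page; the sketch below assumes the settings named elsewhere in the article (100 epochs, patches projected to vectors of size 64) and fills in the remaining values as plausible defaults:

```python
# Hyperparameters for the walkthrough. The article fixes num_epochs = 100
# and a projection size of 64; the other values are assumptions.
learning_rate = 0.001
batch_size = 256
num_epochs = 100
input_shape = (32, 32, 3)
image_size = 72                  # images are resized to image_size x image_size
patch_size = 6                   # each patch is patch_size x patch_size pixels
num_patches = (image_size // patch_size) ** 2
projection_dim = 64              # patches are projected to vectors of size 64
num_heads = 4
transformer_units = [projection_dim * 2, projection_dim]  # transformer MLP sizes
transformer_layers = 8
mlp_head_units = [2048, 1024]    # sizes of the final classification head
num_classes = 10
```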
Given the above parameters, we will train for 100 epochs, resize the images, and convert each image into patches.
Now, we will call the important libraries.
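The import list is not reproduced on the page; for a Keras-based implementation like this one, it would typically be:

```python
# Core libraries for the implementation: TensorFlow/Keras for the model,
# NumPy for array handling, matplotlib for visualizing sample images.
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt
```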
In this procedure, we will feed augmented images to the transformer. During augmentation, we will normalize and resize the images and then randomly flip them. This is done with a Keras Sequential model built from the preprocessing layers that Keras provides. As the final step of the augmentation setup, we will compute the mean and the variance of the training data for normalization.
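A sketch of such a pipeline with Keras preprocessing layers (the target size and random-transform factors are assumptions):

```python
# Augmentation pipeline: normalize, resize, then apply random transforms.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

image_size = 72  # assumed resize target

data_augmentation = keras.Sequential(
    [
        layers.Normalization(),                 # zero-mean, unit-variance scaling
        layers.Resizing(image_size, image_size),
        layers.RandomFlip("horizontal"),
        layers.RandomRotation(factor=0.02),
        layers.RandomZoom(height_factor=0.2, width_factor=0.2),
    ],
    name="data_augmentation",
)

# Compute the mean and variance of the training data for normalization,
# e.g. with the CIFAR-10 training images loaded earlier:
# data_augmentation.layers[0].adapt(x_train)
```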
Let’s see how an image from the dataset looks.
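One way to render a sample image with matplotlib (the index and file name are arbitrary):

```python
# Display one training image; at 32x32 pixels it will look blurry.
import matplotlib.pyplot as plt
from tensorflow import keras

(x_train, y_train), _ = keras.datasets.cifar10.load_data()

plt.figure(figsize=(4, 4))
plt.imshow(x_train[0])
plt.axis("off")
plt.savefig("sample.png")  # or plt.show() in a notebook
```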
The above output shows an example image from the dataset; since the images are only 32×32 pixels, it is not clearly visible. Now we can proceed to our second step.
In this step, we will build the network components: an MLP block, a layer that separates our images into patches, and a patch encoder that projects the patches into vectors of size 64. Let’s start by building the MLP block.
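A sketch of such an MLP block (the GELU activation is an assumption; the article only specifies dense and dropout layers):

```python
# Small MLP block that alternates Dense and Dropout layers. It is reused
# both inside the transformer blocks and in the classification head.
from tensorflow.keras import layers

def mlp(x, hidden_units, dropout_rate):
    for units in hidden_units:
        x = layers.Dense(units, activation="gelu")(x)
        x = layers.Dropout(dropout_rate)(x)
    return x
```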
In the above code, we can see that the MLP block simply alternates a dense layer and a dropout layer.
In this step, we will define a layer that converts the images into patches. For this, we mainly use TensorFlow’s tf.image.extract_patches function.
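A sketch of such a patch-making layer (the class name follows the common Keras ViT pattern):

```python
# Custom layer that splits a batch of images into flattened square patches
# using tf.image.extract_patches.
import tensorflow as tf
from tensorflow.keras import layers

class Patches(layers.Layer):
    def __init__(self, patch_size):
        super().__init__()
        self.patch_size = patch_size

    def call(self, images):
        batch_size = tf.shape(images)[0]
        patches = tf.image.extract_patches(
            images=images,
            sizes=[1, self.patch_size, self.patch_size, 1],
            strides=[1, self.patch_size, self.patch_size, 1],
            rates=[1, 1, 1, 1],
            padding="VALID",
        )
        # Flatten the spatial grid of patches into a sequence per image.
        return tf.reshape(patches, [batch_size, -1, patches.shape[-1]])
```

For a 72×72 RGB image and 6×6 patches, this yields 144 patches of 108 values each.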
In the above output, we can see that the images have been converted into patches, from which the vision transformer will learn to classify the images.
This patch encoder will perform a linear transformation of the image patches and add a learnable position embedding to each projected vector.
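A sketch of the encoder described above (a dense projection to size 64 plus an `Embedding` layer indexed by patch position):

```python
# Patch encoder: projects each flattened patch to projection_dim and adds
# a learnable position embedding for each patch index.
import tensorflow as tf
from tensorflow.keras import layers

class PatchEncoder(layers.Layer):
    def __init__(self, num_patches, projection_dim):
        super().__init__()
        self.num_patches = num_patches
        self.projection = layers.Dense(units=projection_dim)
        self.position_embedding = layers.Embedding(
            input_dim=num_patches, output_dim=projection_dim
        )

    def call(self, patch):
        positions = tf.range(start=0, limit=self.num_patches, delta=1)
        # Projected patch vector + the embedding of its position.
        return self.projection(patch) + self.position_embedding(positions)
```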
After building this network we are ready to build a vision transformer model.
In this section, we will build the blocks of the vision transformer. As discussed and implemented above, the augmented data will go through the patch-maker block and then through the patch-encoder block. In the transformer block, we will apply a self-attention layer to the patch sequences, and the output from the transformer block will go through a classification head that produces the final outputs. Let’s see the code below.
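Putting the pieces together, a sketch of the full classifier builder follows. The helper definitions and hyperparameters are repeated so the block runs on its own; `Rescaling` stands in for the adapted `Normalization` layer so the sketch builds without training statistics, and the layer counts are assumptions:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Assumed hyperparameters, matching those defined earlier.
input_shape = (32, 32, 3)
image_size, patch_size = 72, 6
num_patches = (image_size // patch_size) ** 2
projection_dim, num_heads = 64, 4
transformer_units = [projection_dim * 2, projection_dim]
transformer_layers = 8
mlp_head_units = [2048, 1024]
num_classes = 10

# Rescaling stands in here for the adapted Normalization layer.
data_augmentation = keras.Sequential(
    [
        layers.Rescaling(1.0 / 255),
        layers.Resizing(image_size, image_size),
        layers.RandomFlip("horizontal"),
    ],
    name="data_augmentation",
)

def mlp(x, hidden_units, dropout_rate):
    for units in hidden_units:
        x = layers.Dense(units, activation="gelu")(x)
        x = layers.Dropout(dropout_rate)(x)
    return x

class Patches(layers.Layer):
    def __init__(self, patch_size):
        super().__init__()
        self.patch_size = patch_size

    def call(self, images):
        batch_size = tf.shape(images)[0]
        patches = tf.image.extract_patches(
            images=images,
            sizes=[1, self.patch_size, self.patch_size, 1],
            strides=[1, self.patch_size, self.patch_size, 1],
            rates=[1, 1, 1, 1],
            padding="VALID",
        )
        return tf.reshape(patches, [batch_size, -1, patches.shape[-1]])

class PatchEncoder(layers.Layer):
    def __init__(self, num_patches, projection_dim):
        super().__init__()
        self.num_patches = num_patches
        self.projection = layers.Dense(projection_dim)
        self.position_embedding = layers.Embedding(num_patches, projection_dim)

    def call(self, patch):
        positions = tf.range(self.num_patches)
        return self.projection(patch) + self.position_embedding(positions)

def create_vit_classifier():
    inputs = layers.Input(shape=input_shape)
    augmented = data_augmentation(inputs)
    patches = Patches(patch_size)(augmented)
    encoded_patches = PatchEncoder(num_patches, projection_dim)(patches)

    for _ in range(transformer_layers):
        # Self-attention sub-block with a residual connection.
        x1 = layers.LayerNormalization(epsilon=1e-6)(encoded_patches)
        attention_output = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=projection_dim, dropout=0.1
        )(x1, x1)
        x2 = layers.Add()([attention_output, encoded_patches])
        # MLP sub-block with a second residual connection.
        x3 = layers.LayerNormalization(epsilon=1e-6)(x2)
        x3 = mlp(x3, hidden_units=transformer_units, dropout_rate=0.1)
        encoded_patches = layers.Add()([x3, x2])

    # Classification head: flatten the patch sequence and map to class logits.
    representation = layers.LayerNormalization(epsilon=1e-6)(encoded_patches)
    representation = layers.Flatten()(representation)
    representation = layers.Dropout(0.5)(representation)
    features = mlp(representation, hidden_units=mlp_head_units, dropout_rate=0.5)
    logits = layers.Dense(num_classes)(features)
    return keras.Model(inputs=inputs, outputs=logits)
```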
Using the above function, we can define a classifier based on the vision transformer, with the data augmentation, patch making, and patch encoding steps all wired in. The encoded patches are the final image representation fed to the transformer, and a Flatten layer reshapes the output for the classification head.
In this section, we will compile and train the model we have created, and after that, we will evaluate it in terms of accuracy.
Using the below lines of code, we can compile the model.
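The article names the Adam optimizer and sparse categorical cross-entropy; a sketch of the compile step, wrapped in a helper function of our own (since the model outputs logits, `from_logits=True` is set, and top-5 accuracy is tracked as well):

```python
from tensorflow import keras

# Compile with Adam + sparse categorical cross-entropy, tracking both
# top-1 and top-5 accuracy (the article reports both).
def compile_model(model, learning_rate=0.001):
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[
            keras.metrics.SparseCategoricalAccuracy(name="accuracy"),
            keras.metrics.SparseTopKCategoricalAccuracy(5, name="top-5-accuracy"),
        ],
    )
    return model
```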
In the compilation, we have used Adam optimizer with sparse categorical cross-entropy loss.
Training of the transformer can be done using the following lines of code:
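A sketch of the training step (the function name, validation split, and checkpoint callback are our own choices; the article only fixes 100 epochs):

```python
from tensorflow import keras

# Train with a 10% held-out validation split and checkpoint the best
# weights by validation accuracy.
def train_model(model, x_train, y_train, batch_size=256, num_epochs=100):
    checkpoint_cb = keras.callbacks.ModelCheckpoint(
        "checkpoint.weights.h5",
        monitor="val_accuracy",
        save_best_only=True,
        save_weights_only=True,
    )
    history = model.fit(
        x_train,
        y_train,
        batch_size=batch_size,
        epochs=num_epochs,
        validation_split=0.1,
        callbacks=[checkpoint_cb],
    )
    return history
```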
In the above output, we can see that the training has started. It may take a significant amount of time, so to speed it up, it is recommended to enable the GPU during training. In Google Colab, the GPU can be enabled under the Runtime menu via the Change runtime type option.
Let’s check the accuracy of the vision transformer in the image classification task.
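A sketch of the evaluation step, assuming the model was compiled with the accuracy and top-5 accuracy metrics shown above (so `evaluate` returns loss, accuracy, and top-5 accuracy, in that order):

```python
# Evaluate on the test set and report accuracy and top-5 accuracy as
# percentages. The helper name is ours.
def evaluate_model(model, x_test, y_test):
    loss, accuracy, top5_accuracy = model.evaluate(x_test, y_test)
    print(f"Test accuracy: {accuracy * 100:.2f}%")
    print(f"Test top-5 accuracy: {top5_accuracy * 100:.2f}%")
    return accuracy, top5_accuracy
```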
Here in the above output, we can see that our model reaches 84.21% test accuracy, and its top-5 accuracy is 99.24%.
In this article, we introduced the vision transformer and the image-classification procedure it follows. We implemented a vision transformer for image classification using the CIFAR-10 dataset, working through each of the steps outlined above, and achieved a very good result in the task.