CLIP vs Vision Language Pre-training vs VisionEncoderDecoder – Analytics India Magazine

Vision-and-language models are among the many recent advances to grow out of language-model research. Major tech and research organisations have released such models, including OpenAI’s CLIP, Hugging Face’s VisionEncoderDecoder and VLP (Vision Language Pre-training). Because these models look similar on the surface, it can be hard to decide which one is the right choice for a particular setting.
Prithivi Damodaran, AVP ML R&D at Antworks, who frequently posts about concepts in data science, AI, and NLP, recently explained on LinkedIn how to tell which of these models fits which situation. Let’s try to understand what CLIP, VLP (Vision Language Pre-training) and VisionEncoderDecoder are and what makes each of them unique.
Released in January 2021, Contrastive Language–Image Pre-training, or CLIP, builds on a large body of work on zero-shot transfer, natural language supervision, and multimodal learning. OpenAI showed that scaling a simple pre-training task is sufficient to achieve competitive zero-shot performance on a wide range of image classification datasets. The method uses an abundantly available source of supervision: the text paired with images found on the internet. This data is used to create a proxy training task for CLIP: given an image, predict which of a set of 32,768 randomly sampled text snippets was actually paired with it in the dataset.
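The proxy task can be pictured as a classification problem over a similarity matrix: each image should score highest against its own caption, and each caption against its own image. The snippet below is a minimal pure-Python sketch of such a symmetric contrastive loss, not OpenAI’s implementation. The 2-D embeddings are made up, the function names are hypothetical, and the temperature is fixed at 0.07 here for simplicity.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over an N x N image-text similarity matrix.

    image_embs[i] and text_embs[i] are assumed to be L2-normalised and to
    form the true pair; every other text in the batch acts as a negative.
    """
    n = len(image_embs)
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_embs]
        for img in image_embs
    ]
    # Image -> text: row i should put its probability mass on column i.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> image: column j should put its probability mass on row j.
    columns = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


# Toy unit-length embeddings, invented purely for illustration.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]   # caption i matches image i
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]   # captions paired with the wrong image

aligned_loss = clip_style_loss(images, aligned_texts)
swapped_loss = clip_style_loss(images, swapped_texts)
```

With correctly paired captions the loss is close to zero, while swapping the captions makes it large. That gap is the signal that pushes matching image and text embeddings together during pre-training.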
To solve this task, OpenAI said, CLIP models have to learn to recognise a huge variety of visual concepts in images and associate them with their names. As a result, they can then be applied to nearly arbitrary visual classification tasks.
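In practice, this kind of zero-shot classification amounts to embedding the candidate class names as text and picking the one closest to the image embedding. The sketch below illustrates the idea with cosine similarity in plain Python. The three-dimensional vectors stand in for real encoder outputs and are invented for illustration, as are the function names and the prompt wording.

```python
import math


def normalise(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_classify(image_emb, label_embs):
    """Return (best_label, scores) using cosine similarity.

    label_embs maps a text prompt to its (hypothetical) text embedding.
    """
    img = normalise(image_emb)
    scores = {
        label: sum(a * b for a, b in zip(img, normalise(emb)))
        for label, emb in label_embs.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Stand-ins for text-encoder outputs; a real model would produce these.
label_embs = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
# Stand-in for the image-encoder output of a dog picture.
image_emb = [0.8, 0.3, 0.1]

predicted, scores = zero_shot_classify(image_emb, label_embs)
print(predicted)
```

No classifier head is trained here. Changing the task only means changing the list of prompts, which is why the approach transfers to new label sets so easily.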
OpenAI said that it designed CLIP to address several problems with standard deep learning approaches to computer vision.
Hugging Face’s VisionEncoderDecoderModel is a generic model class that is instantiated as a transformer architecture, with one of the library’s base vision model classes as the encoder and another base model class as the decoder. Hugging Face says that it can be used to initialise an image-to-text-sequence model with any pretrained vision autoencoding model (such as ViT, BEiT or DeiT) as the encoder and any pretrained language model (such as RoBERTa, GPT2 or BERT) as the decoder.
After such a Vision-Encoder-Text-Decoder model has been trained or fine-tuned, it can be saved/loaded just like any other model.
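As a rough illustration of that workflow, the sketch below pairs a ViT encoder with a BERT decoder, saves the combined model and loads it back. It assumes the `transformers` library (with PyTorch) is installed and that the two checkpoints can be downloaded from the Hugging Face Hub. The function name and the save directory are placeholders chosen for this example.

```python
ENCODER_CHECKPOINT = "google/vit-base-patch16-224-in21k"  # pretrained ViT encoder
DECODER_CHECKPOINT = "bert-base-uncased"                  # pretrained BERT, used as decoder


def build_save_and_reload(save_dir="./vit-bert"):
    """Combine a vision encoder and a text decoder, then round-trip to disk."""
    # Imported lazily so that defining this function needs no heavy dependencies.
    from transformers import VisionEncoderDecoderModel

    # Wire the two pretrained checkpoints into one image-to-text model.
    model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
        ENCODER_CHECKPOINT, DECODER_CHECKPOINT
    )
    # Saving and loading works like any other model in the library.
    model.save_pretrained(save_dir)
    return VisionEncoderDecoderModel.from_pretrained(save_dir)
```

Calling `build_save_and_reload()` returns the reloaded model. Note that the cross-attention layers connecting the decoder to the encoder are newly initialised, so a model assembled this way still needs fine-tuning on image-text data before it produces useful captions.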
Damodaran says that unified VLP models are typically pre-trained on a large number of image-text pairs with “creative” self-supervised objectives and loss functions. They can give better vision and language alignment than a vision encoder and a language decoder that were trained in isolation.
To understand better what VLP means, consider the paper titled “Unified Vision-Language Pre-Training for Image Captioning and VQA”. The authors describe the model as unified in two ways. First, it can be fine-tuned for either vision-language generation or vision-language understanding. Second, it uses a shared multi-layer transformer network for both encoding and decoding, unlike many existing methods where the encoder and decoder are implemented as separate models.
The unified VLP model is pre-trained on a large number of image-text pairs using the unsupervised learning objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction.
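One way to picture how a single shared transformer can serve both objectives is through its self-attention mask. The sketch below builds the two masks for a short input of image regions followed by caption words. It assumes a UniLM-style scheme in which only the mask differs between the objectives, and it omits special tokens. The function names and sizes are illustrative and not taken from the paper’s code.

```python
def bidirectional_mask(num_regions, num_words):
    """Every position may attend to every other position."""
    n = num_regions + num_words
    return [[1] * n for _ in range(n)]


def seq2seq_mask(num_regions, num_words):
    """Rows are queries, columns are keys; 1 means attention is allowed.

    Image regions attend only to image regions. Each word attends to all
    regions, to itself and to earlier words, but never to future words.
    """
    n = num_regions + num_words
    mask = [[0] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if k < num_regions:
                mask[q][k] = 1  # image regions are visible to everyone
            elif q >= num_regions and k <= q:
                mask[q][k] = 1  # words see themselves and the words before them
    return mask


# Two image regions followed by three caption words.
bi = bidirectional_mask(2, 3)
s2s = seq2seq_mask(2, 3)
for row in s2s:
    print(row)
```

The bidirectional mask suits understanding tasks, where the whole input is available at once. The seq2seq mask enforces left-to-right generation, which is what captioning needs.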
The authors added that VLP is the first reported model to achieve state-of-the-art results on both vision-language generation and understanding tasks (image captioning and visual question answering) across three benchmark datasets: COCO Captions, Flickr30k Captions, and VQA 2.0.
© Analytics India Magazine Pvt Ltd 2022