Facebook Open-Sources Computer Vision Model Multiscale Vision Transformers – InfoQ.com

Sep 21, 2021 3 min read
Anthony Alford
Facebook AI Research (FAIR) recently open-sourced Multiscale Vision Transformers (MViT), a deep-learning model for computer vision based on the Transformer architecture. MViT contains several internal resolution-reduction stages and outperforms other Transformer vision models while requiring less compute power, achieving new state-of-the-art accuracy on several benchmarks.
The FAIR team described the model and several experiments in a blog post. MViT modifies the standard Transformer attention scheme by incorporating a pooling mechanism that reduces the visual resolution while increasing the feature representation, or channel, dimension. In contrast to other computer vision (CV) models based on the Transformer, MViT does not require pre-training and contains fewer parameters, thus requiring less compute power at inference time. In a series of experiments, FAIR showed that MViT outperforms previous work on common video-understanding datasets, including Kinetics, Atomic Visual Actions (AVA), Charades, and Something-Something. According to the researchers:
"Though much more work is needed, the advances enabled by MViT could significantly improve detailed human action understanding, which is a crucial component in real-world AI applications such as robotics and autonomous vehicles. In addition, innovations in video recognition architectures are an essential component of robust, safe, and human-centric AI."
Most deep-learning CV models are based on the Convolutional Neural Network (CNN) architecture. Inspired by the structure of the animal visual cortex, a CNN contains several hidden layers that reduce the spatial dimension of image input while increasing the channel dimension; the output of each layer is called a feature map. Video-processing models are often based on CNNs that are extended in the time dimension, taking multiple image frames as input. With the recent success of the Transformer architecture in natural language processing (NLP) tasks, many researchers have explored applying Transformers to vision problems; one example is Google's Vision Transformer (ViT). Unlike CNNs, however, these Transformer-based architectures do not change the resolution of their internal feature maps. The result is models with very large numbers of parameters that require extensive pre-training on large datasets.
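To make the contrast concrete, the following minimal sketch lists the feature-map shape at each stage for a constant-resolution (ViT-style) layout and for a pyramid (CNN-style) layout. The function names, the number of stages, and the starting grid sizes and channel widths are illustrative assumptions, not values taken from the FAIR blog post or code.

```python
def isotropic_schedule(side, channels, num_stages):
    """ViT-style layout: every stage keeps the same grid and channel width."""
    return [(side * side, channels) for _ in range(num_stages)]


def multiscale_schedule(side, channels, num_stages):
    """Pyramid layout: each later stage halves the grid side (4x fewer
    positions) and doubles the channel width."""
    stages = []
    for _ in range(num_stages):
        stages.append((side * side, channels))
        side //= 2
        channels *= 2
    return stages


if __name__ == "__main__":
    # Hypothetical starting shapes, chosen only for illustration.
    layouts = {
        "constant resolution": isotropic_schedule(14, 768, 4),
        "multiscale pyramid": multiscale_schedule(56, 96, 4),
    }
    for name, schedule in layouts.items():
        print(name)
        for stage, (positions, channels) in enumerate(schedule, start=1):
            print(f"  stage {stage}: {positions} positions x {channels} channels")
```

In the pyramid layout, early stages work on many positions with few channels and later stages on few positions with many channels, whereas the constant-resolution layout carries the full channel width at every stage.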
The key idea of MViT is to combine the attention mechanism of Transformers with the multi-scale feature maps of CNN-based models. MViT accomplishes this by introducing a scale stage after a sequence of Transformer attention blocks. The scale stage reduces the spatial resolution of its input by a factor of 4 by applying a pooling operation before applying attention; the combined operation is termed Multi-head Pooling Attention (MHPA). The output of the MHPA layer is then passed through a multilayer perceptron (MLP) layer that doubles the channel dimension. The combination of these two operations "roughly preserves the computational complexity across stages."
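The scale stage described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not FAIR's implementation: it uses a single attention head, 2x2 average pooling, no learned query/key/value projections, and a single randomly initialized linear layer standing in for the MLP. The names `avg_pool_2x2`, `pooling_attention`, and `double_channels` are hypothetical.

```python
import math
import random


def avg_pool_2x2(grid):
    """Average-pool an H x W grid of feature vectors (2x2 window, stride 2)."""
    h, w, c = len(grid), len(grid[0]), len(grid[0][0])
    pooled = []
    for i in range(0, h - 1, 2):
        row = []
        for j in range(0, w - 1, 2):
            window = [grid[i + di][j + dj] for di in (0, 1) for dj in (0, 1)]
            row.append([sum(vec[k] for vec in window) / 4.0 for k in range(c)])
        pooled.append(row)
    return pooled


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pooling_attention(grid):
    """Single-head attention computed on pooled tokens.

    Assumption: queries, keys, and values are all the pooled input itself
    (no learned projections), which keeps the sketch short.
    """
    pooled = avg_pool_2x2(grid)
    tokens = [vec for row in pooled for vec in row]
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append([sum(wt * v[d] for wt, v in zip(weights, tokens))
                    for d in range(dim)])
    width = len(pooled[0])
    return [out[r * width:(r + 1) * width] for r in range(len(pooled))]


def double_channels(grid, seed=0):
    """Placeholder for the MLP: one random linear map from C to 2C channels."""
    rng = random.Random(seed)
    c = len(grid[0][0])
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(c)] for _ in range(2 * c)]
    return [[[sum(w * x for w, x in zip(w_row, vec)) for w_row in weights]
             for vec in row] for row in grid]


if __name__ == "__main__":
    # Hypothetical 8 x 8 grid of 4-channel features.
    features = [[[float(i + j + k) for k in range(4)] for j in range(8)]
                for i in range(8)]
    attended = pooling_attention(features)
    widened = double_channels(attended)
    print(len(features) * len(features[0]), "positions x",
          len(features[0][0]), "channels in")
    print(len(widened) * len(widened[0]), "positions x",
          len(widened[0][0]), "channels out")
```

One way to read the quoted claim about complexity: the cost of a per-position linear layer grows with the number of positions times the square of the channel width, so a quarter as many positions with twice the channels gives N/4 x (2C)^2 = N x C^2, the same as before the scale stage.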
The research team trained MViT models of various sizes and compared their benchmark performance to a baseline "off-the-shelf" ViT model. The small MViT model outperformed the baseline by 7.5 percentage points on the Kinetics-400 dataset while using 5.5x fewer FLOPs. On the Kinetics-600 dataset, the large MViT model set a new state-of-the-art accuracy of 83.4% with 8.4x fewer parameters and 56.0x fewer FLOPs than the baseline. The team also investigated transfer learning by pre-training their models on the Kinetics datasets and then evaluating them on AVA, Charades, and Something-Something. In all these scenarios, MViT outperformed previous models. Finally, the team showed that MViT could perform as an image-recognition system simply by using a single input frame. Again, MViT outperformed other Transformer models while using fewer parameters and FLOPs.
In a Twitter discussion about the MViT paper, AI researcher Łukasz Borchmann mentioned his similar model, the Pyramidion, published last year:
"In the Pyramidion, a trainable pooling was applied between layers, introducing the bottleneck gradually along the encoding process… As in MViT, it leads to better results & complexity."
The MViT code and pre-trained models are available as part of FAIR's PySlowFast video-understanding codebase.

InfoQ.com and all content copyright © 2006-2021 C4Media Inc. InfoQ.com hosted at Contegix, the best ISP we’ve ever worked with.