2021
DOI: 10.48550/arxiv.2107.14795
Preprint

Perceiver IO: A General Architecture for Structured Inputs & Outputs

Abstract: The recently proposed Perceiver model obtains good results on several domains (images, audio, multimodal, point clouds) while scaling linearly in compute and memory with the input size. While the Perceiver supports many kinds of inputs, it can only produce very simple outputs such as class scores. Perceiver IO overcomes this limitation without sacrificing the original's appealing properties by learning to flexibly query the model's latent space to produce outputs of arbitrary size and semantics. Perceiver IO s…
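The decoding mechanism the abstract describes can be made concrete with a short sketch: an array of output queries cross-attends to the fixed-size latent array, so the number and meaning of the outputs is set by the queries rather than by the input. The module below is a minimal, hypothetical PyTorch rendering of that idea, not the authors' implementation; all names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QueryDecoder(nn.Module):
    """Hypothetical sketch of Perceiver IO-style query decoding."""
    def __init__(self, latent_dim: int, query_dim: int, num_heads: int = 8):
        super().__init__()
        # Output queries attend to the latent array; keys and values
        # come from the latents produced by the encoder.
        self.attn = nn.MultiheadAttention(
            embed_dim=query_dim, num_heads=num_heads,
            kdim=latent_dim, vdim=latent_dim, batch_first=True)

    def forward(self, queries, latents):
        # queries: (batch, num_outputs, query_dim), one query per output element
        # latents: (batch, num_latents, latent_dim), fixed-size bottleneck
        out, _ = self.attn(queries, latents, latents)
        return out  # (batch, num_outputs, query_dim): output size tracks queries

# Example: decode 196 output elements from 128 latents.
decoder = QueryDecoder(latent_dim=512, query_dim=256)
queries = torch.randn(2, 196, 256)  # e.g. learned per-position queries
latents = torch.randn(2, 128, 512)  # encoder output, independent of input size
print(decoder(queries, latents).shape)  # torch.Size([2, 196, 256])
```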

Cited by 54 publications (97 citation statements)
References 63 publications (93 reference statements)
“…Note that the only parameters introduced into the model from this fusion method are the linear projections of bottleneck tokens from one view to the next, and the bottleneck tokens themselves which are learned from random initialization. We also note that "bottleneck" tokens have also been used by [31,49].…”
Section: Cross-view Attention (CVA)
Citation type: mentioning, confidence: 99%
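The bottleneck-token fusion this statement describes can be sketched briefly: a small set of randomly initialized tokens is processed jointly with one view's tokens, linearly projected to the next view's width, and processed again there. The following is a hedged, illustrative sketch of that pattern; BottleneckFusion and all dimensions are assumptions, not the cited papers' code.

```python
import torch
import torch.nn as nn

class BottleneckFusion(nn.Module):
    """Illustrative sketch: fuse two views through a few bottleneck tokens."""
    def __init__(self, dim_a: int, dim_b: int, num_tokens: int = 4):
        super().__init__()
        # Bottleneck tokens, learned from random initialization.
        self.bottleneck = nn.Parameter(torch.randn(1, num_tokens, dim_a))
        # The only other fusion parameters: a linear projection carrying
        # the bottleneck tokens from view A's width to view B's width.
        self.proj = nn.Linear(dim_a, dim_b)
        self.enc_a = nn.TransformerEncoderLayer(dim_a, nhead=4, batch_first=True)
        self.enc_b = nn.TransformerEncoderLayer(dim_b, nhead=4, batch_first=True)

    def forward(self, tokens_a, tokens_b):
        b, n = tokens_a.size(0), self.bottleneck.size(1)
        # View A attends jointly over its own tokens and the bottleneck tokens.
        x = self.enc_a(torch.cat([tokens_a, self.bottleneck.expand(b, -1, -1)], 1))
        btl = self.proj(x[:, -n:])  # project updated bottleneck to view B's width
        # View B sees its own tokens plus the projected bottleneck summary.
        y = self.enc_b(torch.cat([tokens_b, btl], 1))
        return y[:, :-n]  # keep only view B's tokens

fusion = BottleneckFusion(dim_a=128, dim_b=256)
out = fusion(torch.randn(2, 50, 128), torch.randn(2, 60, 256))
print(out.shape)  # torch.Size([2, 60, 256])
```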
“…It was noted a decade ago that big data along with simple architectures are "unreasonably effective" [13] at solving many perception problems, and subsequent progress has only reinforced this [50]. Computer vision has moved from architectures like ConvNets, which are highly general image processors [26], to methods that are based on Transformers [54] such as ViT [9] and Perceivers [20,21], where the underlying Transformer can be equally effective across multiple domains like sound and language. Unifying architectures is useful because architectural improvements can be propagated across tasks and domains trivially.…”
Section: Related Work
Citation type: mentioning, confidence: 99%
“…1), like a loadable software solution instead of a more rigid hardware solution. As the general model we employ the recently published Perceiver IO [20] and as domain we focus on multiview geometry and 3D reconstruction, an area of computer vision where architectural specialization is particularly exuberant [17,19,27,33,41,62,65].…”
Section: Introduction
Citation type: mentioning, confidence: 99%
“…In this case random modality dropout could be used during training as e.g. done in AVSlowfast [41] or Perceiver [21].…”
Section: Multi-modal Fusion Transformer
Citation type: mentioning, confidence: 99%
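For concreteness, random modality dropout of the kind this statement mentions can be sketched as follows: during training, each modality's token block is independently zeroed with some probability, so the fusion model cannot over-rely on any single stream. The function name, drop probability, and zero-masking scheme below are assumptions for illustration, not the cited implementations.

```python
import torch

def modality_dropout(modalities, p=0.5, training=True):
    """Randomly zero out whole modalities during training.

    modalities: list of (batch, tokens_i, dim) tensors, one per modality.
    Masking here is per batch for simplicity; per-sample masks are a
    straightforward extension.
    """
    if not training:
        return modalities
    keep = [bool(torch.rand(()) >= p) for _ in modalities]
    if not any(keep):  # guarantee at least one modality survives
        keep[int(torch.randint(len(keep), (1,)))] = True
    return [x if k else torch.zeros_like(x) for x, k in zip(modalities, keep)]

# Example: audio and video token streams.
audio = torch.randn(2, 32, 128)
video = torch.randn(2, 196, 128)
audio, video = modality_dropout([audio, video], p=0.3)
```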
“…Although modality-agnostic transformers such as Perceiver IO [21] have been proposed, they have been constructed with a different goal of learning a latent space that can cover multiple tasks in different domains. Compared to our work, the latent space in such cases mainly serves the purpose of compressing multiple inputs and tasks in one model.…”
Section: Introduction
Citation type: mentioning, confidence: 99%