2021
DOI: 10.48550/arxiv.2107.14795
Preprint

Perceiver IO: A General Architecture for Structured Inputs & Outputs

Abstract: The recently proposed Perceiver model obtains good results on several domains (images, audio, multimodal, point clouds) while scaling linearly in compute and memory with the input size. While the Perceiver supports many kinds of inputs, it can only produce very simple outputs such as class scores. Perceiver IO overcomes this limitation without sacrificing the original's appealing properties by learning to flexibly query the model's latent space to produce outputs of arbitrary size and semantics. Perceiver IO s…
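The decoding mechanism the abstract describes can be made concrete with a short sketch: an array of output queries cross-attends to the fixed-size latent array, so the number and meaning of the outputs is set by the queries rather than by the input. The module below is a minimal, hypothetical PyTorch rendering of that idea, not the authors' implementation; all names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QueryDecoder(nn.Module):
    """Hypothetical sketch of Perceiver IO-style query decoding."""
    def __init__(self, latent_dim: int, query_dim: int, num_heads: int = 8):
        super().__init__()
        # Output queries attend to the latent array; keys and values
        # come from the latents produced by the encoder.
        self.attn = nn.MultiheadAttention(
            embed_dim=query_dim, num_heads=num_heads,
            kdim=latent_dim, vdim=latent_dim, batch_first=True)

    def forward(self, queries, latents):
        # queries: (batch, num_outputs, query_dim), one query per output element
        # latents: (batch, num_latents, latent_dim), fixed-size bottleneck
        out, _ = self.attn(queries, latents, latents)
        return out  # (batch, num_outputs, query_dim): output size tracks queries

# Example: decode 196 output elements from 128 latents.
decoder = QueryDecoder(latent_dim=512, query_dim=256)
queries = torch.randn(2, 196, 256)  # e.g. learned per-position queries
latents = torch.randn(2, 128, 512)  # encoder output, independent of input size
print(decoder(queries, latents).shape)  # torch.Size([2, 196, 256])
```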

Cited by 54 publications (97 citation statements)
References 63 publications (93 reference statements)
“…Note that the only parameters introduced into the model from this fusion method are the linear projections of bottleneck tokens from one view to the next, and the bottleneck tokens themselves which are learned from random initialization. We also note that "bottleneck" tokens have also been used by [31,49].…”
Section: Cross-view Attention (CVA)
Citation type: mentioning, confidence: 99%
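The bottleneck-token fusion this statement describes can be sketched briefly: a small set of randomly initialized tokens is processed jointly with one view's tokens, linearly projected to the next view's width, and processed again there. The following is a hedged, illustrative sketch of that pattern; BottleneckFusion and all dimensions are assumptions, not the cited papers' code.

```python
import torch
import torch.nn as nn

class BottleneckFusion(nn.Module):
    """Illustrative sketch: fuse two views through a few bottleneck tokens."""
    def __init__(self, dim_a: int, dim_b: int, num_tokens: int = 4):
        super().__init__()
        # Bottleneck tokens, learned from random initialization.
        self.bottleneck = nn.Parameter(torch.randn(1, num_tokens, dim_a))
        # The only other fusion parameters: a linear projection carrying
        # the bottleneck tokens from view A's width to view B's width.
        self.proj = nn.Linear(dim_a, dim_b)
        self.enc_a = nn.TransformerEncoderLayer(dim_a, nhead=4, batch_first=True)
        self.enc_b = nn.TransformerEncoderLayer(dim_b, nhead=4, batch_first=True)

    def forward(self, tokens_a, tokens_b):
        b, n = tokens_a.size(0), self.bottleneck.size(1)
        # View A attends jointly over its own tokens and the bottleneck tokens.
        x = self.enc_a(torch.cat([tokens_a, self.bottleneck.expand(b, -1, -1)], 1))
        btl = self.proj(x[:, -n:])  # project updated bottleneck to view B's width
        # View B sees its own tokens plus the projected bottleneck summary.
        y = self.enc_b(torch.cat([tokens_b, btl], 1))
        return y[:, :-n]  # keep only view B's tokens

fusion = BottleneckFusion(dim_a=128, dim_b=256)
out = fusion(torch.randn(2, 50, 128), torch.randn(2, 60, 256))
print(out.shape)  # torch.Size([2, 60, 256])
```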
“…It was noted a decade ago that big data along with simple architectures are "unreasonably effective" [13] at solving many perception problems, and subsequent progress has only reinforced this [50]. Computer vision has moved from architectures like ConvNets, which are highly general image processors [26], to methods that are based on Transformers [54] such as ViT [9] and Perceivers [20,21], where the underlying Transformer can be equally effective across multiple domains like sound and language. Unifying architectures is useful because architectural improvements can be propagated across tasks and domains trivially.…”
Section: Related Work
Citation type: mentioning, confidence: 99%
“…1), like a loadable software solution instead of a more rigid hardware solution. As the general model we employ the recently published Perceiver IO [20] and as domain we focus on multiview geometry and 3D reconstruction, an area of computer vision where architectural specialization is particularly exuberant [17,19,27,33,41,62,65].…”
Section: Introduction
Citation type: mentioning, confidence: 99%
“…In this case random modality dropout could be used during training as e.g. done in AVSlowfast [41] or Perceiver [21].…”
Section: Multi-modal Fusion Transformer
Citation type: mentioning, confidence: 99%
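For concreteness, random modality dropout of the kind this statement mentions can be sketched as follows: during training, each modality's token block is independently zeroed with some probability, so the fusion model cannot over-rely on any single stream. The function name, drop probability, and zero-masking scheme below are assumptions for illustration, not the cited implementations.

```python
import torch

def modality_dropout(modalities, p=0.5, training=True):
    """Randomly zero out whole modalities during training.

    modalities: list of (batch, tokens_i, dim) tensors, one per modality.
    Masking here is per batch for simplicity; per-sample masks are a
    straightforward extension.
    """
    if not training:
        return modalities
    keep = [bool(torch.rand(()) >= p) for _ in modalities]
    if not any(keep):  # guarantee at least one modality survives
        keep[int(torch.randint(len(keep), (1,)))] = True
    return [x if k else torch.zeros_like(x) for x, k in zip(modalities, keep)]

# Example: audio and video token streams.
audio = torch.randn(2, 32, 128)
video = torch.randn(2, 196, 128)
audio, video = modality_dropout([audio, video], p=0.3)
```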
“…Although modality-agnostic transformers such as Perceiver IO [21] have been proposed, they have been constructed with a different goal of learning a latent space that can cover multiple tasks in different domains. Compared to our work, the latent space in such cases mainly serves the purpose of compressing multiple inputs and tasks in one model.…”
Section: Introduction
Citation type: mentioning, confidence: 99%