2021
DOI: 10.48550/arxiv.2103.03206
Preprint

Perceiver: General Perception with Iterative Attention

Cited by 55 publications (79 citation statements). References: 0 publications.
“…The literature on multimodal processing usually relies on modality-specific feature extractors (Kaiser et al, 2017; Alayrac et al, 2020). The recently introduced Perceiver (Jaegle et al, 2021b;a), however, uses a shared architecture for processing a wide range of data modalities, which bears similarities to our setup. However, it is applied only to array representations of data, and applying it to functa for downstream tasks would be an interesting research direction.…”
Section: Related Work
confidence: 99%
“…This means that best-practice models cannot be used in different domains without modification. The Perceiver [217] is an interesting solution for handling data of different shapes; it builds on the Transformer [218], a sequence transduction model that relies entirely on the attention mechanism. The use of Transformers in computer vision has demonstrated their efficiency in classification tasks while using considerably fewer computational resources.…”
Section: G Modality Agnostic Learning
confidence: 99%
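
As background for the quoted claim, the following minimal NumPy sketch (all names, shapes, and sizes here are illustrative assumptions, not the authors' implementation) shows the cross-attention bottleneck that lets the Perceiver consume inputs of any shape: a small, fixed-size latent array queries an arbitrarily long, flattened input array, so compute scales linearly with input length rather than quadratically.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(latents, inputs, Wq, Wk, Wv):
    # Perceiver-style read: N latent queries attend over M input tokens.
    q = latents @ Wq                          # (N, d)
    k = inputs @ Wk                           # (M, d)
    v = inputs @ Wv                           # (M, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (N, M): cost linear in M
    return softmax(scores) @ v                # (N, d): fixed-size output

rng = np.random.default_rng(0)
M, C = 4096, 32   # many flattened input tokens (pixels, audio samples, ...)
N, d = 64, 32     # small latent bottleneck, independent of the modality
inputs  = rng.normal(size=(M, C))
latents = rng.normal(size=(N, d))
Wq, Wk, Wv = rng.normal(size=(d, d)), rng.normal(size=(C, d)), rng.normal(size=(C, d))
out = cross_attend(latents, inputs, Wq, Wk, Wv)
print(out.shape)  # (64, 32) regardless of input length or modality

Because the output shape depends only on the latent array, the same downstream stack can sit on top of images, audio, or point clouds without modality-specific redesign.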
“…Instance code. Inspired by the works (Jaegle et al 2021; Wang et al 2021a) that use a latent space to encode task-specific information, we introduce an instance code e, an L × D matrix, for the video instance segmentation (VIS) task, where L is the maximum number of detected instances in a frame and D is the feature dimension of each instance. Our instance code represents both the class and mask information of one instance per slot in an order-aware fashion; thus, we can directly use slot indices to represent instance identities.…”
Section: Hybrid Representation For Video Frame
confidence: 99%
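
To make the slot mechanism concrete, here is a small sketch under stated assumptions (hypothetical names, shapes, and a plain single-head attention, not the paper's code): the L × D instance code acts as L order-aware queries attending over flattened frame features, so slot i consistently carries the class/mask embedding of instance i.

import numpy as np

rng = np.random.default_rng(0)
L, D = 10, 64   # max detected instances per frame, feature dimension
M = 4096        # flattened per-frame feature tokens

instance_code = rng.normal(size=(L, D))   # learned parameters in practice
frame_feats   = rng.normal(size=(M, D))

# each slot queries the frame features (single head, no learned projections)
scores = instance_code @ frame_feats.T / np.sqrt(D)   # (L, M)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
slots = attn @ frame_feats                            # (L, D)
# slot i keeps index i in every frame, so slot indices double as instance IDs
print(slots.shape)  # (10, 64)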