2021
DOI: 10.48550/arxiv.2103.03206
Preprint

Perceiver: General Perception with Iterative Attention

Cited by 55 publications (79 citation statements). References: 0 publications.
“…The literature on multimodal processing usually relies on modality-specific feature extractors (Kaiser et al, 2017; Alayrac et al, 2020). The recently introduced Perceiver (Jaegle et al, 2021b;a), however, uses a shared architecture for processing a wide range of data modalities, which bears similarities to our setup. However, it is applied only to array representations of data, and applying it to functa for downstream tasks would be an interesting research direction.…”
Section: Related Work
confidence: 99%
“…This means that best-practice models cannot be used in different domains without modification. The Perceiver [217] is an interesting solution for handling data of different shapes; it builds on the Transformer [218], a sequence transduction model that relies entirely on the attention mechanism. The use of Transformers in computer vision has demonstrated their efficiency in classification tasks while using considerably fewer computational resources.…”
Section: G Modality Agnostic Learning
confidence: 99%
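
As background for the quoted claim, the following minimal NumPy sketch (all names, shapes, and sizes here are illustrative assumptions, not the authors' implementation) shows the cross-attention bottleneck that lets the Perceiver consume inputs of any shape: a small, fixed-size latent array queries an arbitrarily long, flattened input array, so compute scales linearly with input length rather than quadratically.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(latents, inputs, Wq, Wk, Wv):
    # Perceiver-style read: N latent queries attend over M input tokens.
    q = latents @ Wq                          # (N, d)
    k = inputs @ Wk                           # (M, d)
    v = inputs @ Wv                           # (M, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (N, M): cost linear in M
    return softmax(scores) @ v                # (N, d): fixed-size output

rng = np.random.default_rng(0)
M, C = 4096, 32   # many flattened input tokens (pixels, audio samples, ...)
N, d = 64, 32     # small latent bottleneck, independent of the modality
inputs  = rng.normal(size=(M, C))
latents = rng.normal(size=(N, d))
Wq, Wk, Wv = rng.normal(size=(d, d)), rng.normal(size=(C, d)), rng.normal(size=(C, d))
out = cross_attend(latents, inputs, Wq, Wk, Wv)
print(out.shape)  # (64, 32) regardless of input length or modality

Because the output shape depends only on the latent array, the same downstream stack can sit on top of images, audio, or point clouds without modality-specific redesign.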
“…Instance code. Inspired by the works (Jaegle et al 2021; Wang et al 2021a) that use a latent space to encode task-specific information, we introduce an instance code e, an L × D matrix, for the video instance segmentation (VIS) task, where L is the maximum number of detected instances in a frame and D is the feature dimension of each instance. Our instance code represents both the class and mask information of one instance per slot in an order-aware fashion; thus, we can directly use slot indices to represent instance identities.…”
Section: Hybrid Representation For Video Frame
confidence: 99%
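
To make the slot mechanism concrete, here is a small sketch under stated assumptions (hypothetical names, shapes, and a plain single-head attention, not the paper's code): the L × D instance code acts as L order-aware queries attending over flattened frame features, so slot i consistently carries the class/mask embedding of instance i.

import numpy as np

rng = np.random.default_rng(0)
L, D = 10, 64   # max detected instances per frame, feature dimension
M = 4096        # flattened per-frame feature tokens

instance_code = rng.normal(size=(L, D))   # learned parameters in practice
frame_feats   = rng.normal(size=(M, D))

# each slot queries the frame features (single head, no learned projections)
scores = instance_code @ frame_feats.T / np.sqrt(D)   # (L, M)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
slots = attn @ frame_feats                            # (L, D)
# slot i keeps index i in every frame, so slot indices double as instance IDs
print(slots.shape)  # (10, 64)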