2022
DOI: 10.48550/arxiv.2202.11423
Preprint

ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers

Abstract: Automatically understanding human behaviour allows household robots to identify the most critical needs and plan how to assist the human according to the current situation. However, the majority of such methods are developed under the assumption that a large number of labelled training examples is available for all concepts-of-interest. Robots, on the other hand, operate in constantly changing unstructured environments, and need to adapt to novel action categories from very few samples. Methods for data-effici…
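The few-shot setting described in the abstract is what the prototype-based feature augmentation in the title targets. As background only, below is a minimal sketch of the standard class-prototype idea from prototypical networks (mean support embedding per class, nearest-prototype classification); the function names, tensor shapes, and classification rule are illustrative assumptions and do not reproduce ProFormer's actual augmentation scheme.

    import torch

    # Illustrative sketch only (assumed names and shapes); not ProFormer's method.
    def class_prototypes(support_feats, support_labels, num_classes):
        # support_feats: (N, D) embeddings of the few labelled support examples
        # support_labels: (N,) integer class ids in [0, num_classes)
        return torch.stack([
            support_feats[support_labels == c].mean(dim=0)
            for c in range(num_classes)
        ])  # (num_classes, D)

    def nearest_prototype(query_feats, prototypes):
        # Assign each query to the class whose prototype is closest in feature space.
        dists = torch.cdist(query_feats, prototypes)  # (Q, num_classes)
        return dists.argmin(dim=1)                    # (Q,) predicted class ids

With only a handful of labelled examples per novel action category, each prototype is simply the mean of their embeddings, which is why prototype-based methods remain usable when large labelled sets are unavailable.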

Cited by 1 publication (1 citation statement)
References 47 publications
“…In recent years, transformer backbones have shown a strong capacity for establishing long-range dependencies in image and video data [16], which proves beneficial for many downstream tasks. Building on the pioneering Vision Transformer (ViT) [36] for image recognition, architectures for dense prediction transformers [37], [38] and video classification transformers [5], [11], [13], [39], [40] have been developed. In the activity recognition area, Trear [41] proposes a transformer-based RGB-D egocentric activity recognition framework that adapts self-attention to model temporal structure from different modalities.…”
Section: Related Work
confidence: 99%
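For context on the transformer components named in this excerpt, the sketch below applies self-attention over a temporal sequence of per-frame features, the general mechanism that ViT-style video models and Trear build on. The module, its dimensions, and the mean-pooled clip representation are illustrative assumptions, not the cited architectures.

    import torch
    import torch.nn as nn

    # Minimal, assumed sketch of temporal self-attention over per-frame features;
    # not a reproduction of ViT, the dense-prediction/video transformers, or Trear.
    class TemporalSelfAttention(nn.Module):
        def __init__(self, feat_dim=256, num_heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(feat_dim)

        def forward(self, frame_feats):
            # frame_feats: (batch, num_frames, feat_dim), one embedding per frame
            attended, _ = self.attn(frame_feats, frame_feats, frame_feats)
            fused = self.norm(frame_feats + attended)  # residual connection + norm
            return fused.mean(dim=1)                   # clip-level representation

    # Example: 8 clips, 16 frames each, 256-d per-frame features
    clip_repr = TemporalSelfAttention()(torch.randn(8, 16, 256))  # (8, 256)

Each output position attends to every frame in the clip, which is how such models capture the long-range temporal dependencies the excerpt refers to.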