Assembly101: A Large-Scale Multi-View Video Dataset for Understanding Procedural Activities

Sener, Fadime; Chatterjee, Dibyadip; Shelepov, Daniel; He, Kai; Singhania, Dipika; Wang, Robert; Yao, Angela

doi:10.1109/cvpr52688.2022.02042

Cited by 39 publications

(33 citation statements)

References 32 publications

“…While some of these tasks have been studied in previous works, none of them has been studied in industrial scenarios from the egocentric perspective also considering multimodal observations. Moreover, there are only a few datasets publicly available [15,41,81] which can be used to study different tasks simultaneously and to develop a complete system for human behavior understanding taking into account different aspects (e.g., actions, interactions, objects, future intentions).…”

Section: Benchmarks and Baseline Results (mentioning)

confidence: 99%

“…Inspired by the first version of the MECCANO dataset [74], [81] proposed Assembly101, which is a procedural activity dataset comprising multi-view videos in which subjects assemble and disassemble toys. Contextually, they benchmarked three action understanding tasks (i.e., action recognition, action anticipation and temporal segmentation) and proposed a new task related to mistake detection.…”

Section: Datasets For Human Behavior Understanding (mentioning)

confidence: 99%

MECCANO: A Multimodal Egocentric Dataset for Humans Behavior Understanding in the Industrial-like Domain

Ragusa, Furnari, Farinella

2022

Preprint

Wearable cameras allow the acquisition of images and videos from the user's perspective. These data can be processed to understand human behavior. Although human behavior analysis has been thoroughly investigated in third-person vision, it is still understudied in egocentric settings and in particular in industrial scenarios. To encourage research in this field, we present MECCANO, a multimodal dataset of egocentric videos to study human behavior understanding in industrial-like settings. The multimodality is characterized by the presence of gaze signals, depth maps and RGB videos acquired simultaneously with a custom headset. The dataset has been explicitly labeled for fundamental tasks in the context of human behavior understanding from a first-person view, such as recognizing and anticipating human-object interactions. With the MECCANO dataset, we explored five different tasks: 1) Action Recognition, 2) Active Objects Detection and Recognition, 3) Egocentric Human-Objects Interaction Detection, 4) Action Anticipation and 5) Next-Active Objects Detection. We propose a benchmark aimed at studying human behavior in the considered industrial-like scenario, which demonstrates that the investigated tasks and the considered scenario are challenging for state-of-the-art algorithms. To support research in this field, we publicly release the dataset at https://iplab.dmi.unict.it/MECCANO/.

“…Large-scale narrated instructional video datasets [6,17,25,30,31] have paved the way for learning joint video-language representations and task structure from videos. More recently, datasets such as Assembly-101 [21] and Ikea ASM [3] provide videos of people assembling and disassembling toys and furniture. Assembly-101 also contains annotations for detecting mistakes in the video.…”

Section: Related Work (mentioning)

confidence: 99%

Learning and Verification of Task Structure in Instructional Videos

Narasimhan, Yu, Bell, et al.

2023

Preprint

Given the enormous number of instructional videos available online, learning a diverse array of multi-step task models from videos is an appealing goal. We introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantics and structure of instructional videos. We pre-train VideoTaskformer using a simple and effective objective: predicting weakly supervised textual labels for steps that are randomly masked out from an instructional video (masked step modeling). Compared to prior work, which learns step representations locally, our approach involves learning them globally, leveraging video of the entire surrounding task as context. From these learned representations, we can verify if an unseen video correctly executes a given task, as well as forecast which steps are likely to be taken after a given step. We introduce two new benchmarks for detecting mistakes in instructional videos, to verify if there is an anomalous step and if steps are executed in the right order. We also introduce a long-term forecasting benchmark, where the goal is to predict long-range future steps from a given step. Our method outperforms previous baselines on these tasks, and we believe the tasks will be a valuable way for the community to measure the quality of step representations. Additionally, we evaluate VideoTaskformer on 3 existing benchmarks (procedural activity recognition, step classification, and step forecasting) and demonstrate on each that our method outperforms existing baselines and achieves new state-of-the-art performance.

“…This field has important applications in egocentric robotics vision [27] and virtual reality [17]. Unfortunately, despite the availability of several related benchmarks [16,29,40,49], current Ego-HOI works often require bulky laboratory equipment like headset cameras for data collection.…”

Section: Introduction (mentioning)

confidence: 99%

“…To validate our approach, we further define new benchmark settings called Ego-HOI-XView, which utilizes third-person videos during pre-training to help learn HOI knowledge for cross-view fine-tuning and inference in egocentric videos. The benchmarks are based on two multi-view datasets, Assembly101 [49] and H2O [29], and are designed to evaluate cross-view egocentric human-object interaction recognition. We conduct extensive experiments and analyses on these benchmarks to verify the transferable ability of our model across different views.…”

Section: Introduction (mentioning)

confidence: 99%

POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object Interaction in the Multi-view World

Xu, Zheng, Jin

2023

Proceedings of the 31st ACM International Conference on Multimedia

We humans are good at translating third-person observations of hand-object interactions (HOI) into an egocentric view. However, current methods struggle to replicate this ability of view adaptation from third-person to first-person. Although some approaches attempt to learn view-agnostic representation from large-scale video datasets, they ignore the relationships among multiple third-person views. To this end, we propose a Prompt-Oriented View-agnostic learning (POV) framework in this paper, which enables this view adaptation with few egocentric videos. Specifically, we introduce interactive masking prompts at the frame level to capture fine-grained action information, and view-aware prompts at the token level to learn view-agnostic representation. To verify our method, we establish two benchmarks for transferring from multiple third-person views to the egocentric view. Our extensive experiments on these benchmarks demonstrate the efficiency and effectiveness of our POV framework and prompt tuning techniques in terms of view adaptation and view generalization. Our code is available at https://github.com/xuboshen/pov_acmmm2023.

CCS Concepts: Computing methodologies → Activity recognition and understanding; Scene understanding; Transfer learning.
