2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01842
Ego4D: Around the World in 3,000 Hours of Egocentric Video

Abstract: We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards, with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume o…

Citations: Cited by 160 publications (127 citation statements)
References: 198 publications (326 reference statements)
“…Although R3M is proposed as a general feature representation, we also compare against using Euclidean distance in the representation space for defining dense rewards. We used the ResNet-50 model checkpoint from [19], trained on the much larger Ego4D [37] (3,500 hours) rather than SSv2 (200 hours). As shown in Figure 6, HOLD-C outperforms R3M in both RLV tasks despite having been trained on less data and requiring no language descriptions.…”
Section: Baseline Comparisons (mentioning)
Confidence: 99%
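The excerpt above compares reward signals defined as distances in a learned visual feature space. As a rough illustration of that idea (not the cited papers' implementations), the sketch below embeds frames with a frozen, ImageNet-pretrained ResNet-50 from torchvision and uses the negative Euclidean distance to a goal-frame embedding as a dense reward; the cited works instead use checkpoints pretrained on egocentric video such as Ego4D.

```python
# Minimal sketch: dense reward as negative Euclidean distance between the
# embedding of the current frame and that of a goal frame. The encoder,
# preprocessing, and function names are illustrative assumptions.
import torch
import torchvision.models as models
import torchvision.transforms as T

# Frozen visual encoder (ImageNet-pretrained ResNet-50 with the classification
# head removed; the cited works use egocentric-video-pretrained checkpoints).
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.Resize(224),
    T.CenterCrop(224),
    T.ConvertImageDtype(torch.float32),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(frame: torch.Tensor) -> torch.Tensor:
    """Map an HxWx3 uint8 frame to a 2048-d feature vector."""
    x = preprocess(frame.permute(2, 0, 1)).unsqueeze(0)  # 1x3x224x224
    return backbone(x).squeeze(0)

@torch.no_grad()
def dense_reward(frame: torch.Tensor, goal_frame: torch.Tensor) -> float:
    """Reward increases as the current embedding approaches the goal embedding."""
    return -torch.norm(embed(frame) - embed(goal_frame)).item()
```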
“…ment from a point of view that is the most similar to that of human beings. In the FPV domain, understanding the interactions between a camera wearer and the surrounding objects is a fundamental problem (Bertasius et al., 2017a, 2017b; Cao et al., 2020; Cai et al., 2016; Damen et al., 2016, 2018; Grauman et al., 2022; Liu et al., 2020; Ragusa et al., 2020; Wang et al., 2020). To model such interactions, continuous knowledge of where an object of interest is located inside the video frame is advantageous.…”
Section: Introduction (mentioning)
Confidence: 99%
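The quoted passage argues that continuously knowing where an object of interest sits in the frame helps model camera-wearer/object interactions. A minimal single-object tracking loop along those lines is sketched below; it assumes OpenCV with the contrib modules (for the CSRT tracker), a placeholder video path, and a hand-picked initial box, and is not taken from any of the cited tracking approaches.

```python
# Illustrative sketch: track one object through a first-person clip and
# report its bounding box in every frame. Path and initial box are placeholders.
import cv2

cap = cv2.VideoCapture("egocentric_clip.mp4")   # placeholder path
ok, frame = cap.read()

tracker = cv2.TrackerCSRT_create()               # requires opencv-contrib-python
init_box = (100, 100, 80, 80)                    # (x, y, w, h), placeholder
tracker.init(frame, init_box)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    found, box = tracker.update(frame)
    if found:
        x, y, w, h = map(int, box)
        # The per-frame box is the "continuous object location" that
        # downstream interaction models can consume.
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
cap.release()
```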
“…For example, visual trackers have been exploited in solutions to comprehend social interactions through faces (Aghaei et al., 2016a, 2016b; Grauman et al., 2022), to improve the performance of hand detection for rehabilitation purposes (Visee et al., 2020), to capture hand movements for action recognition, and to forecast human-object interactions through the analysis of hand trajectories (Liu et al., 2020). Such applications have been made possible through the development of customized tracking approaches to track specific target categories, such as people (Alletto et al., 2015; Nigam & Rameshan, 2017), people's faces (Aghaei et al., 2016a; Grauman et al., 2022), or hands (Han et al., 2020; Liu et al., 2020; Mueller et al., 2017; Sun et al., 2010; Visee et al., 2020), from a first-person perspective.…”
Section: Introduction (mentioning)
Confidence: 99%
“…The Ego4D [1] NLQ task aims to localize a temporal moment in a long first-person video according to a natural language (NL) question. It resembles the video temporal grounding task, since both share the same problem definition.…”
Section: Introduction (mentioning)
Confidence: 99%
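To make the NLQ formulation concrete, here is a sliding-window sketch that scores candidate intervals of precomputed clip features against a question embedding and returns the best-matching (start, end) pair. The `text_encoder` callable, feature shapes, and window sizes are hypothetical stand-ins, not the Ego4D baseline code.

```python
# Hypothetical sketch of NLQ-style temporal grounding: choose the video window
# whose pooled clip features are most similar to the question embedding.
from typing import Callable, Sequence, Tuple
import numpy as np

def localize_query(
    clip_features: np.ndarray,                  # (T, D) precomputed per-clip features
    question: str,
    text_encoder: Callable[[str], np.ndarray],  # str -> (D,) embedding (assumed)
    window_sizes: Sequence[int] = (4, 8, 16),   # candidate window widths, in clips
) -> Tuple[int, int]:
    """Return (start, end) clip indices of the best-scoring candidate window."""
    q = text_encoder(question)
    q = q / (np.linalg.norm(q) + 1e-8)
    best, best_score = (0, 1), -np.inf
    num_clips = clip_features.shape[0]
    for width in window_sizes:
        for start in range(max(1, num_clips - width + 1)):
            segment = clip_features[start:start + width].mean(axis=0)
            segment = segment / (np.linalg.norm(segment) + 1e-8)
            score = float(segment @ q)          # cosine similarity
            if score > best_score:
                best, best_score = (start, start + width), score
    return best
```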