2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01842
Ego4D: Around the World in 3,000 Hours of Egocentric Video

Abstract: We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards, with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume o…

Citations: Cited by 160 publications (127 citation statements)
References: 198 publications (326 reference statements)
“…Although R3M is proposed as a general feature representation, we also compare against using Euclidean distance in the representation space for defining dense rewards. We used the ResNet-50 model checkpoint from [19], trained on the much larger Ego4D [37] (3,500 hours) rather than SSv2 (200 hours). As shown in Figure 6, HOLD-C outperforms R3M in both RLV tasks despite having been trained on less data and requiring no language descriptions.…”
Section: Baseline Comparisons (mentioning)
Confidence: 99%
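The excerpt above compares reward signals defined as distances in a learned visual feature space. As a rough illustration of that idea (not the cited papers' implementations), the sketch below embeds frames with a frozen, ImageNet-pretrained ResNet-50 from torchvision and uses the negative Euclidean distance to a goal-frame embedding as a dense reward; the cited works instead use checkpoints pretrained on egocentric video such as Ego4D.

```python
# Minimal sketch: dense reward as negative Euclidean distance between the
# embedding of the current frame and that of a goal frame. The encoder,
# preprocessing, and function names are illustrative assumptions.
import torch
import torchvision.models as models
import torchvision.transforms as T

# Frozen visual encoder (ImageNet-pretrained ResNet-50 with the classification
# head removed; the cited works use egocentric-video-pretrained checkpoints).
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.Resize(224),
    T.CenterCrop(224),
    T.ConvertImageDtype(torch.float32),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(frame: torch.Tensor) -> torch.Tensor:
    """Map an HxWx3 uint8 frame to a 2048-d feature vector."""
    x = preprocess(frame.permute(2, 0, 1)).unsqueeze(0)  # 1x3x224x224
    return backbone(x).squeeze(0)

@torch.no_grad()
def dense_reward(frame: torch.Tensor, goal_frame: torch.Tensor) -> float:
    """Reward increases as the current embedding approaches the goal embedding."""
    return -torch.norm(embed(frame) - embed(goal_frame)).item()
```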
“…ment from a point of view that is the most similar to that of human beings. In the FPV domain, understanding the interactions between a camera wearer and the surrounding objects is a fundamental problem (Bertasius et al., 2017a, 2017b; Cao et al., 2020; Cai et al., 2016; Damen et al., 2016, 2018; Grauman et al., 2022; Liu et al., 2020; Ragusa et al., 2020; Wang et al., 2020). To model such interactions, continuous knowledge of where an object of interest is located inside the video frame is advantageous.…”
Section: Introduction (mentioning)
Confidence: 99%
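The quoted passage argues that continuously knowing where an object of interest sits in the frame helps model camera-wearer/object interactions. A minimal single-object tracking loop along those lines is sketched below; it assumes OpenCV with the contrib modules (for the CSRT tracker), a placeholder video path, and a hand-picked initial box, and is not taken from any of the cited tracking approaches.

```python
# Illustrative sketch: track one object through a first-person clip and
# report its bounding box in every frame. Path and initial box are placeholders.
import cv2

cap = cv2.VideoCapture("egocentric_clip.mp4")   # placeholder path
ok, frame = cap.read()

tracker = cv2.TrackerCSRT_create()               # requires opencv-contrib-python
init_box = (100, 100, 80, 80)                    # (x, y, w, h), placeholder
tracker.init(frame, init_box)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    found, box = tracker.update(frame)
    if found:
        x, y, w, h = map(int, box)
        # The per-frame box is the "continuous object location" that
        # downstream interaction models can consume.
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
cap.release()
```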
“…For example, visual trackers have been exploited in solutions to comprehend social interactions through faces (Aghaei et al., 2016a, 2016b; Grauman et al., 2022), to improve the performance of hand detection for rehabilitation purposes (Visee et al., 2020), to capture hand movements for action recognition, and to forecast human-object interactions through the analysis of hand trajectories (Liu et al., 2020). Such applications have been made possible through the development of customized tracking approaches to track specific target categories, such as people (Alletto et al., 2015; Nigam & Rameshan, 2017), people's faces (Aghaei et al., 2016a; Grauman et al., 2022), or hands (Han et al., 2020; Liu et al., 2020; Mueller et al., 2017; Sun et al., 2010; Visee et al., 2020), from a first-person perspective.…”
Section: Introduction (mentioning)
Confidence: 99%
“…The Ego4D [1] NLQ task aims to localize a temporal moment in a long first-person video according to a natural language (NL) question. It resembles the video temporal grounding task, since both share the same problem definition.…”
Section: Introduction (mentioning)
Confidence: 99%
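To make the NLQ formulation concrete, here is a sliding-window sketch that scores candidate intervals of precomputed clip features against a question embedding and returns the best-matching (start, end) pair. The `text_encoder` callable, feature shapes, and window sizes are hypothetical stand-ins, not the Ego4D baseline code.

```python
# Hypothetical sketch of NLQ-style temporal grounding: choose the video window
# whose pooled clip features are most similar to the question embedding.
from typing import Callable, Sequence, Tuple
import numpy as np

def localize_query(
    clip_features: np.ndarray,                  # (T, D) precomputed per-clip features
    question: str,
    text_encoder: Callable[[str], np.ndarray],  # str -> (D,) embedding (assumed)
    window_sizes: Sequence[int] = (4, 8, 16),   # candidate window widths, in clips
) -> Tuple[int, int]:
    """Return (start, end) clip indices of the best-scoring candidate window."""
    q = text_encoder(question)
    q = q / (np.linalg.norm(q) + 1e-8)
    best, best_score = (0, 1), -np.inf
    num_clips = clip_features.shape[0]
    for width in window_sizes:
        for start in range(max(1, num_clips - width + 1)):
            segment = clip_features[start:start + width].mean(axis=0)
            segment = segment / (np.linalg.norm(segment) + 1e-8)
            score = float(segment @ q)          # cosine similarity
            if score > best_score:
                best, best_score = (start, start + width), score
    return best
```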