2021
DOI: 10.48550/arxiv.2110.07058
Preprint

Ego4D: Around the World in 3,000 Hours of Egocentric Video

Abstract: We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,025 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 855 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards, with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of…

Cited by 15 publications (19 citation statements) · References 163 publications
Citation statements: 0 supporting, 19 mentioning, 0 contrasting
“…The increasing accuracy of monocular and multi-view automated methods for face, pose, and hand estimation has contributed to reducing the annotation effort. Still, the largest available datasets, which provide thousands of hours of audiovisual material and feature the widest spectrum of behaviors, do not provide such annotations (Carreira et al., 2019; Zhao et al., 2019; Monfort et al., 2020; Grauman et al., 2021). In contrast, automated methods for recognizing high-level representations, such as feedback responses or atomic action labels, are not accurate enough to significantly aid their annotation.…”
Section: Discussion (mentioning)
confidence: 99%
“…Thanks to camera portability during collection, egocentric datasets can record social behavior in less constrained environments. Very recently, Grauman et al. (2021) released more than 3,000 hours of in-the-wild egocentric recordings of human actions, which also include social interactions. Finally, a computer-mediated recording setup elicits very particular behavior due to the idiosyncrasies of the communication channel (McKeown et al., 2010; Ringeval et al., 2013; Cafaro et al., 2017; Feng et al., 2017; Kossaifi et al., 2019).…”
Section: Datasets (mentioning)
confidence: 99%
“…HPS [22] reconstructs the body pose and shape of a subject wearing a head-mounted camera while moving in a large 3D scene, but with few social interactions. Recently, Ego4D [19] collected a massive amount of egocentric video for various tasks, including action and social behavior understanding, making significant advances in stimulating future research in the egocentric domain. Our dataset is complementary to Ego4D in that we provide 3D human pose and shape ground truth for the camera wearer and their interaction partner.…”
Section: Related Work (mentioning)
confidence: 99%
“…The community's interest has grown quickly in recent years [16,17,19,83], thanks to the possibilities these data open for the evaluation and understanding of human behavior, leading to the design of novel architectures [30,51,52,91,104]. While the use of optical flow has been the de facto procedure in FPAR [14-17,19,41], interest has recently shifted towards more lightweight alternatives, such as gaze [27,59,71], audio [9,52,77], depth [32], skeleton [32], and inertial measurements [41], to enable motion modeling in online settings. These, when combined with traditional modalities, produce encouraging results, but not enough to make them viable alternatives on their own.…”
Section: Related Work (mentioning)
confidence: 99%
“…With the advent of novel large-scale datasets [14,15], new tasks are being proposed, such as wearer's pose estimation [105] and egocentric video anonymization [95]. This trend will grow in the coming years thanks to the very recent release of Ego4D [41], a massive-scale egocentric dataset.…”
[Figure 1: N-EPIC-Kitchens, the first event-based dataset for egocentric action recognition.]
Section: Introduction (mentioning)
confidence: 99%