We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards, with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception.
We present a new computational model for gaze prediction in egocentric videos by exploring patterns in the temporal shift of gaze fixations (attention transition) that are dependent on egocentric manipulation tasks. Our assumption is that the high-level context of how a task is performed strongly influences attention transition and should be modeled for gaze prediction in natural dynamic scenes. Specifically, we propose a hybrid model based on deep neural networks which integrates task-dependent attention transition with bottom-up saliency prediction. In particular, the task-dependent attention transition is learned with a recurrent neural network to exploit the temporal context of gaze fixations, e.g., looking at a cup after moving gaze away from a grasped bottle. Experiments on public egocentric activity datasets show that our model significantly outperforms state-of-the-art gaze prediction methods and is able to learn meaningful transitions of human attention.
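As a concrete illustration of this hybrid design, the PyTorch sketch below fuses a per-frame bottom-up saliency branch with an LSTM that models task-dependent attention transition over time. The module names, feature sizes, and the simple additive fusion are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridGazePredictor(nn.Module):
    """Sketch: bottom-up saliency fused with an RNN attention-transition branch."""

    def __init__(self, feat_dim=64, hidden_dim=128):
        super().__init__()
        # Bottom-up saliency branch: a per-frame convolutional saliency map.
        self.saliency = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, 1, 1),
        )
        # Task-dependent branch: pooled frame features feed an LSTM that
        # models how fixations shift over time (attention transition).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.to_map = nn.Linear(hidden_dim, 32 * 32)  # coarse transition map

    def forward(self, frames):  # frames: (B, T, 3, H, W)
        B, T, C, H, W = frames.shape
        flat = frames.reshape(B * T, C, H, W)
        sal = self.saliency(flat)                      # (B*T, 1, H, W)
        feats = self.encoder(flat).reshape(B, T, -1)   # (B, T, feat_dim)
        hidden, _ = self.lstm(feats)                   # temporal gaze context
        trans = self.to_map(hidden).reshape(B * T, 1, 32, 32)
        trans = F.interpolate(trans, size=(H, W), mode='bilinear',
                              align_corners=False)
        # Fuse the bottom-up and transition cues into a gaze probability map.
        return torch.sigmoid(sal + trans).reshape(B, T, 1, H, W)
```

The key design point is that the LSTM sees the whole fixation history, so the predicted map can anticipate task-driven shifts (e.g., toward the cup after the bottle is grasped) that a purely bottom-up model would miss.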
Object co-segmentation is the task of segmenting the same objects from multiple images. In this paper, we propose Attention-Based Object Co-Segmentation, which utilizes a novel attention mechanism in the bottleneck layer of a deep neural network to select semantically related features. Furthermore, we take advantage of the learned attention and propose an algorithm that segments multiple input images in linear time complexity. Experimental results demonstrate that our model achieves state-of-the-art performance on multiple datasets, with a significant reduction in computational time.
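The linear-time property can be made concrete with a minimal sketch: if the attention weights are computed once from a pooled group descriptor and then applied to every image's bottleneck features, the cost grows linearly with the number of inputs rather than quadratically with pairwise comparisons. The PyTorch code below illustrates this under those assumptions; the layer sizes and the channel-attention form are hypothetical, not the paper's exact model.

```python
import torch
import torch.nn as nn

class CoSegNet(nn.Module):
    """Sketch: channel attention at the bottleneck of an encoder-decoder."""

    def __init__(self, ch=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Attention learner: maps the pooled group feature to channel weights
        # that emphasize features shared across the image set.
        self.attend = nn.Sequential(
            nn.Linear(ch, ch), nn.ReLU(), nn.Linear(ch, ch), nn.Sigmoid(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(ch, 1, 4, stride=2, padding=1),
        )

    def forward(self, images):                # images: (N, 3, H, W)
        feats = self.encoder(images)          # (N, ch, H/4, W/4)
        # One pooled descriptor for the whole group: O(N), not O(N^2).
        group = feats.mean(dim=(0, 2, 3))     # (ch,)
        weights = self.attend(group)          # channel attention weights
        gated = feats * weights.view(1, -1, 1, 1)
        return self.decoder(gated)            # (N, 1, H, W) mask logits
```

Because the group descriptor is computed once and shared, adding another image to the set adds only one more encode/gate/decode pass.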
Scope of Reproducibility - The following work is a reproducibility report for CLRNet: Cross Layer Refinement Network for Lane Detection [1]. The basic code was made available by the authors at this https url. The paper proposes a novel Cross Layer Refinement Network that utilizes both high-level and low-level features for lane detection. The authors assert that the proposed technique sets a new state of the art on three lane-detection benchmarks.

Methodology - The proposed model employs a two-stage approach to lane detection. Initially, coarse lane detection is achieved through the extraction of high-level semantic features. This is followed by refinement of the output based on low-level features, aimed at enhancing the localization accuracy of the model. The authors' code was used to benchmark the claims, and some further experiments were investigated thereafter. Kaggle, a free-to-use platform for deep learning experiments, was used to train these models. We reproduced the code base in PyTorch Lightning and found consistent results across the board.

Results - The central claims presented by the authors were subject to reproduction and verification. The validity of the claims was evaluated using two of the three datasets referenced in the original paper. The results obtained from the CULane dataset showed close agreement with the original findings, with deviations of less than 1% on most metrics, suggesting that the authors' claims are reproducible and reliable. However, in experiments on the TuSimple dataset, substantial disparities were noted between our results and those reported in the original paper. The probable causes of these inconsistencies are discussed in the study.

What was easy - Obtaining the proposed results on the CULane dataset was readily achievable. The codebase provided by the authors was well documented and functional. Owing to the modularity of the code, further experiments could be run with minimal changes. Porting the codebase to PyTorch Lightning was also straightforward.

What was difficult - Using the LLAMAS dataset proved to be a challenge for resource-constrained students owing to its size, and we were ultimately unable to set up experiments on that dataset. Limited computational resources proved to be a challenge even for the other datasets, with each epoch taking over 2 hours on CULane. Total training time
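To make the two-stage methodology concrete, the sketch below shows a minimal coarse-to-fine head: coarse lane parameters are regressed from high-level semantic features and then corrected with a residual computed from low-level, spatially detailed features. This is a heavily simplified illustration of the refinement concept, not the CLRNet architecture; all names and shapes are hypothetical, and the authors' repository should be consulted for the real implementation.

```python
import torch
import torch.nn as nn

class CoarseToFineLaneHead(nn.Module):
    """Sketch: coarse lane regression refined by low-level features."""

    def __init__(self, high_ch=256, low_ch=64, n_params=4):
        super().__init__()
        # Stage 1: coarse lane parameters from high-level semantic features.
        self.coarse = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(high_ch, n_params),
        )
        # Stage 2: a residual correction conditioned on low-level detail,
        # which carries the fine spatial cues needed for localization.
        self.refine = nn.Linear(low_ch + n_params, n_params)

    def forward(self, high_feat, low_feat):
        # high_feat: (B, high_ch, h, w); low_feat: (B, low_ch, H, W)
        params = self.coarse(high_feat)              # coarse lane estimate
        pooled = low_feat.mean(dim=(2, 3))           # (B, low_ch)
        delta = self.refine(torch.cat([pooled, params], dim=1))
        return params + delta                        # refined lane parameters
```

The residual formulation mirrors the reported intuition: high-level features decide roughly where a lane is, and low-level features nudge that estimate toward pixel-accurate positions.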