We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards, with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception.
We present a new computational model for gaze prediction in egocentric videos by exploring patterns in the temporal shift of gaze fixations (attention transition) that are dependent on egocentric manipulation tasks. Our assumption is that the high-level context of how a task is performed strongly influences attention transition and should be modeled for gaze prediction in natural dynamic scenes. Specifically, we propose a hybrid model based on deep neural networks which integrates task-dependent attention transition with bottom-up saliency prediction. In particular, the task-dependent attention transition is learned with a recurrent neural network to exploit the temporal context of gaze fixations, e.g., looking at a cup after moving gaze away from a grasped bottle. Experiments on public egocentric activity datasets show that our model significantly outperforms state-of-the-art gaze prediction methods and is able to learn meaningful transitions of human attention.
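As a concrete illustration of this hybrid design, the PyTorch sketch below fuses a per-frame bottom-up saliency branch with an LSTM that models task-dependent attention transition over time. The module names, feature sizes, and the simple additive fusion are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridGazePredictor(nn.Module):
    """Sketch: bottom-up saliency fused with an RNN attention-transition branch."""

    def __init__(self, feat_dim=64, hidden_dim=128):
        super().__init__()
        # Bottom-up saliency branch: a per-frame convolutional saliency map.
        self.saliency = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, 1, 1),
        )
        # Task-dependent branch: pooled frame features feed an LSTM that
        # models how fixations shift over time (attention transition).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.to_map = nn.Linear(hidden_dim, 32 * 32)  # coarse transition map

    def forward(self, frames):  # frames: (B, T, 3, H, W)
        B, T, C, H, W = frames.shape
        flat = frames.reshape(B * T, C, H, W)
        sal = self.saliency(flat)                      # (B*T, 1, H, W)
        feats = self.encoder(flat).reshape(B, T, -1)   # (B, T, feat_dim)
        hidden, _ = self.lstm(feats)                   # temporal gaze context
        trans = self.to_map(hidden).reshape(B * T, 1, 32, 32)
        trans = F.interpolate(trans, size=(H, W), mode='bilinear',
                              align_corners=False)
        # Fuse the bottom-up and transition cues into a gaze probability map.
        return torch.sigmoid(sal + trans).reshape(B, T, 1, H, W)
```

The key design point is that the LSTM sees the whole fixation history, so the predicted map can anticipate task-driven shifts (e.g., toward the cup after the bottle is grasped) that a purely bottom-up model would miss.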
Object co-segmentation is the task of segmenting the same objects from multiple images. In this paper, we propose Attention-Based Object Co-Segmentation, which utilizes a novel attention mechanism in the bottleneck layer of a deep neural network to select semantically related features. Furthermore, we take advantage of the learned attention and propose an algorithm that segments multiple input images in linear time complexity. Experimental results demonstrate that our model achieves state-of-the-art performance on multiple datasets, with a significant reduction in computational time.
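The linear-time property can be made concrete with a minimal sketch: if the attention weights are computed once from a pooled group descriptor and then applied to every image's bottleneck features, the cost grows linearly with the number of inputs rather than quadratically with pairwise comparisons. The PyTorch code below illustrates this under those assumptions; the layer sizes and the channel-attention form are hypothetical, not the paper's exact model.

```python
import torch
import torch.nn as nn

class CoSegNet(nn.Module):
    """Sketch: channel attention at the bottleneck of an encoder-decoder."""

    def __init__(self, ch=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Attention learner: maps the pooled group feature to channel weights
        # that emphasize features shared across the image set.
        self.attend = nn.Sequential(
            nn.Linear(ch, ch), nn.ReLU(), nn.Linear(ch, ch), nn.Sigmoid(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(ch, 1, 4, stride=2, padding=1),
        )

    def forward(self, images):                # images: (N, 3, H, W)
        feats = self.encoder(images)          # (N, ch, H/4, W/4)
        # One pooled descriptor for the whole group: O(N), not O(N^2).
        group = feats.mean(dim=(0, 2, 3))     # (ch,)
        weights = self.attend(group)          # channel attention weights
        gated = feats * weights.view(1, -1, 1, 1)
        return self.decoder(gated)            # (N, 1, H, W) mask logits
```

Because the group descriptor is computed once and shared, adding another image to the set adds only one more encode/gate/decode pass.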
Scope of Reproducibility - The following work is a reproducibility report for CLRNet: Cross Layer Refinement Network for Lane Detection [1]. The basic code was made available by the authors at this https url. The paper proposes a novel Cross Layer Refinement Network that utilizes both high-level and low-level features for lane detection. The authors assert that the proposed technique sets a new state of the art on three lane-detection benchmarks.

Methodology - The proposed model employs a two-stage approach to lane detection. Initially, coarse lane detection is achieved through the extraction of high-level semantic features. This is followed by refinement of the output based on low-level features, aimed at enhancing the localization accuracy of the model. The authors' code was used to benchmark the claims, and some further experiments were investigated thereafter. Kaggle, a free-to-use platform for deep learning experiments, was used to train these models. We reproduced the code base in PyTorch Lightning and found consistent results across the board.

Results - The central claims presented by the authors were subject to reproduction and verification. The validity of the claims was evaluated using two of the three datasets referenced in the original paper. The results obtained from the CULane dataset showed close agreement with the original findings, with deviations of less than 1% on most metrics, suggesting that the authors' claims are reproducible and reliable. However, in experiments on the TuSimple dataset, substantial disparities were noted between our results and those reported in the original paper. The probable causes of these inconsistencies are discussed in the study.

What was easy - Obtaining the proposed results on the CULane dataset was readily achievable. The codebase provided by the authors was well documented and functional. Owing to the modularity of the code, further experiments could be run with minimal changes. Porting the codebase to PyTorch Lightning was also straightforward.

What was difficult - Using the LLAMAS dataset proved to be a challenge for resource-constrained students owing to its size, and we were ultimately unable to set up experiments on that dataset. Limited computational resources proved to be a challenge even for the other datasets, with each epoch taking over 2 hours on CULane. Total training time
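To make the two-stage methodology concrete, the sketch below shows a minimal coarse-to-fine head: coarse lane parameters are regressed from high-level semantic features and then corrected with a residual computed from low-level, spatially detailed features. This is a heavily simplified illustration of the refinement concept, not the CLRNet architecture; all names and shapes are hypothetical, and the authors' repository should be consulted for the real implementation.

```python
import torch
import torch.nn as nn

class CoarseToFineLaneHead(nn.Module):
    """Sketch: coarse lane regression refined by low-level features."""

    def __init__(self, high_ch=256, low_ch=64, n_params=4):
        super().__init__()
        # Stage 1: coarse lane parameters from high-level semantic features.
        self.coarse = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(high_ch, n_params),
        )
        # Stage 2: a residual correction conditioned on low-level detail,
        # which carries the fine spatial cues needed for localization.
        self.refine = nn.Linear(low_ch + n_params, n_params)

    def forward(self, high_feat, low_feat):
        # high_feat: (B, high_ch, h, w); low_feat: (B, low_ch, H, W)
        params = self.coarse(high_feat)              # coarse lane estimate
        pooled = low_feat.mean(dim=(2, 3))           # (B, low_ch)
        delta = self.refine(torch.cat([pooled, params], dim=1))
        return params + delta                        # refined lane parameters
```

The residual formulation mirrors the reported intuition: high-level features decide roughly where a lane is, and low-level features nudge that estimate toward pixel-accurate positions.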