Hazel Doughty scite author profile

First-person vision is gaining interest as it offers a unique viewpoint on people's interaction with objects, their attention, and even intention. However, progress in this challenging domain has been relatively slow due to the lack of sufficiently large datasets. In this paper, we introduce EPIC-KITCHENS, a large-scale egocentric video benchmark recorded by 32 participants in their native kitchen environments. Our videos depict non-scripted daily activities: we simply asked each participant to start recording every time they entered their kitchen. Recording took place in 4 cities (in North America and Europe) by participants belonging to 10 different nationalities, resulting in highly diverse cooking styles. Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labelled for a total of 39.6K action segments and 454.3K object bounding boxes. Our annotation is unique in that we had the participants narrate their own videos (after recording), thus reflecting true intention, and we crowd-sourced ground-truths based on these. We describe our object, action and anticipation challenges, and evaluate several baselines over two test splits, seen and unseen kitchens.

Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100

Damen et al. 2021

This paper introduces the pipeline to extend the largest dataset in egocentric vision, EPIC-KITCHENS. The effort culminates in EPIC-KITCHENS-100, a collection of 100 hours, 20M frames, 90K actions in 700 variable-length videos, capturing long-term unscripted activities in 45 environments, using head-mounted cameras. Compared to its previous version (Damen et al., Scaling Egocentric Vision, ECCV 2018), EPIC-KITCHENS-100 has been annotated using a novel pipeline that allows denser (54% more actions per minute) and more complete (128% more action segments) annotations of fine-grained actions. This collection enables new challenges such as action detection and evaluating the “test of time”, i.e. whether models trained on data collected in 2018 can generalise to new footage collected two years later. The dataset is aligned with 6 challenges: action recognition (full and weak supervision), action detection, action anticipation, cross-modal retrieval (from captions), as well as unsupervised domain adaptation for action recognition. For each challenge, we define the task, provide baselines and evaluation metrics.

The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines

Damen, Doughty, Farinella et al. 2021. IEEE Trans. Pattern Anal. Mach. Intell.

Who's Better? Who's Best? Pairwise Deep Ranking for Skill Determination

Doughty, Damen, Mayol-Cuevas 2018

This paper presents a method for assessing skill from video, applicable to a variety of tasks, ranging from surgery to drawing and rolling pizza dough. We formulate the problem as pairwise (who's better?) and overall (who's best?) ranking of video collections, using supervised deep ranking. We propose a novel loss function that learns discriminative features when a pair of videos exhibit variance in skill, and learns shared features when a pair of videos exhibit comparable skill levels. Results demonstrate our method is applicable across tasks, with the percentage of correctly ordered pairs of videos ranging from 70% to 83% for four datasets. We demonstrate the robustness of our approach via sensitivity analysis of its parameters. We see this work as an effort toward the automated organization of how-to video collections and overall, generic skill determination in video.
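To make the "percentage of correctly ordered pairs" metric concrete, here is a minimal Python sketch. It is not the paper's loss or code: it uses a plain hinge-style margin ranking loss as a stand-in, and the function names, scores, and annotated pairs are all hypothetical values chosen for illustration.

```python
from typing import Sequence, Tuple


def margin_ranking_loss(score_better: float, score_worse: float,
                        margin: float = 1.0) -> float:
    """Hinge loss: zero once the better video outscores the worse by `margin`."""
    return max(0.0, margin - (score_better - score_worse))


def pairwise_accuracy(scores: Sequence[float],
                      pairs: Sequence[Tuple[int, int]]) -> float:
    """Fraction of (better, worse) index pairs whose scores are correctly ordered."""
    if not pairs:
        raise ValueError("need at least one annotated pair")
    correct = sum(1 for better, worse in pairs if scores[better] > scores[worse])
    return correct / len(pairs)


# Hypothetical predicted skill scores for four videos (assumption, not real data).
scores = [0.9, 0.4, 0.7, 0.1]
# Hypothetical annotations: each tuple is (index of better video, index of worse video).
pairs = [(0, 1), (2, 1), (3, 2), (0, 3)]

print(pairwise_accuracy(scores, pairs))   # three of the four pairs are ordered correctly
print(margin_ranking_loss(0.9, 0.4))      # positive: the gap is smaller than the margin
```

Pairwise accuracy only checks the ordering within each annotated pair, so it can be computed without any absolute skill labels.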

The Pros and Cons: Rank-Aware Temporal Attention for Skill Determination in Long Videos

Doughty, Mayol-Cuevas, Damen 2019

We present a new model to determine relative skill from long videos, through learnable temporal attention modules. Skill determination is formulated as a ranking problem, making it suitable for common and generic tasks. However, for long videos, parts of the video are irrelevant for assessing skill, and there may be variability in the skill exhibited throughout a video. We therefore propose a method which assesses the relative overall level of skill in a long video by attending to its skill-relevant parts. Our approach trains temporal attention modules, learned with only video-level supervision, using a novel rank-aware loss function. In addition to attending to task-relevant video parts, our proposed loss jointly trains two attention modules to separately attend to video parts which are indicative of higher (pros) and lower (cons) skill. We evaluate our approach on the EPIC-Skills dataset and additionally annotate a larger dataset from YouTube videos for skill determination with five previously unexplored tasks. Our method outperforms previous approaches and classic softmax attention on both datasets by over 4% pairwise accuracy, and as much as 12% on individual tasks. We also demonstrate our model's ability to attend to rank-aware parts of the video.

StopWatch: The Preliminary Evaluation of a Smartwatch-Based System for Passive Detection of Cigarette Smoking

Skinner, Stone, Doughty et al. 2018

We present a low-cost, smartwatch-based system for passive detection of cigarette smoking. It uses data from the motion sensors in the watch to identify the signature hand movements of cigarette smoking. The system will provide the detailed measures of individual smoking behaviour needed for context-triggered just-in-time smoking cessation support systems, and to enable just-in-time adaptive interventions. More broadly, the system will enable researchers to obtain detailed measures of individual smoking behaviour in free-living conditions that are free from the recall errors and reporting biases associated with self-report of smoking.

On Semantic Similarity in Video Retrieval

Wray, Doughty, Damen 2021

Action Modifiers: Learning From Adverbs in Instructional Videos

Doughty, Laptev, Mayol-Cuevas et al. 2020

Scaling Egocentric Vision: The EPIC-KITCHENS Dataset