Scaling Egocentric Vision: The EPIC-KITCHENS Dataset

Damen, Dima; Doughty, Hazel; Farinella, Giovanni Maria; Fidler, Sanja; Furnari, Antonino; Kazakos, Evangelos; Moltisanti, Davide; Munro, Jonathan; Perrett, Toby; Price, Will; Wray, Michael

doi:10.1007/978-3-030-01225-0_44

Cited by 482 publications (931 citation statements)

References 44 publications

Citation statement types: Supporting, Mentioning (926), Contrasting, Unclassified


“…Dataset. Our previous work, EPIC Kitchens [8], offers a unique opportunity to test domain adaptation for fine-grained action recognition, as it is recorded in 32 environments. Similar to previous works for action recognition [14,19], we evaluate on pairs of domains.…”

Section: Implementation Details (mentioning)

confidence: 99%

“…In testing, as in [58], we use an average over 5 temporal windows, equidistant within the segment. We use the RGB and Optical Flow frames provided publicly [8]. The output of F is the result of the final average pooling layer of I3D, with 1024 dimensions.…”

Section: Implementation Details (mentioning)

confidence: 99%
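
The test-time protocol quoted above (an average over 5 temporal windows, equidistant within the segment) can be sketched as follows. This is a minimal illustration, not the citing authors' code: the window length of 16 frames, the placement of the first and last windows at the segment boundaries, the averaging of per-class scores, and the names `equidistant_window_starts`, `average_window_scores` and `score_fn` are all assumptions made for the example.

```python
from typing import Callable, List, Sequence


def equidistant_window_starts(num_frames: int, window_len: int,
                              num_windows: int = 5) -> List[int]:
    """Start indices of `num_windows` windows spread evenly over a segment.

    Assumption: the first window starts at frame 0 and the last window
    ends at the final frame of the segment.
    """
    if num_frames < window_len:
        raise ValueError("segment is shorter than one window")
    last_start = num_frames - window_len
    if num_windows == 1:
        return [last_start // 2]  # single centred window
    return [round(i * last_start / (num_windows - 1))
            for i in range(num_windows)]


def average_window_scores(frames: Sequence,
                          score_fn: Callable[[Sequence], List[float]],
                          window_len: int,
                          num_windows: int = 5) -> List[float]:
    """Average the per-class scores that `score_fn` returns for each window.

    `score_fn` is a hypothetical stand-in for a trained classifier.
    """
    starts = equidistant_window_starts(len(frames), window_len, num_windows)
    per_window = [score_fn(frames[s:s + window_len]) for s in starts]
    num_classes = len(per_window[0])
    return [sum(scores[c] for scores in per_window) / len(per_window)
            for c in range(num_classes)]


if __name__ == "__main__":
    segment = list(range(100))  # dummy frame indices
    print(equidistant_window_starts(len(segment), window_len=16))
```

Averaging several windows rather than scoring a single one makes the prediction less sensitive to where in the segment the action actually happens.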

“…However, due to the difficulty in collecting and annotating such fine-grained actions, many datasets collect long untrimmed sequences. These contain several fine-grained actions from a single [43,50] or few [8,47] environments. Figure 2 shows the recent surge in large-scale fine-grained action datasets.…”

Section: Introduction (mentioning)

confidence: 99%

“…• We show that multi-modal self-supervision, applied to both source and unlabelled target data, can be used for Figure 2: Fine-grained action datasets [8,17,26,28,38,42,46,47,50], x-axis: number of action segments per environment (ape), y-axis: dataset size divided by ape. EPIC-Kitchens [8] offers the largest ape relative to its size.…”

Section: Introduction (mentioning)

confidence: 99%


Multi-Modal Domain Adaptation for Fine-Grained Action Recognition

Munro, Damen. 2019. 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)

Self Cite


Fine-grained action recognition datasets exhibit environmental bias, where multiple video sequences are captured from a limited number of environments. Training a model in one environment and deploying in another results in a drop in performance due to an unavoidable domain shift. Unsupervised Domain Adaptation (UDA) approaches have frequently utilised adversarial training between the source and target domains. However, these approaches have not explored the multi-modal nature of video within each domain. In this work we exploit the correspondence of modalities as a self-supervised alignment approach for UDA in addition to adversarial alignment (Fig. 1). We test our approach on three kitchens from our large-scale dataset, EPIC-Kitchens [8], using two modalities commonly employed for action recognition: RGB and Optical Flow. We show that multi-modal self-supervision alone improves the performance over source-only training by 2.4% on average. We then combine adversarial training with multi-modal self-supervision, showing that our approach outperforms other UDA methods by 3%.


“…In particular, our contributions can be summarized as follows: (i) We provide an extensive evaluation and comparison with published methods of the proposed multimodal architecture on the EPIC-Kitchens dataset [12] (ii) In addition to action performance, we provide for the first time a detailed results on the object and verb components. The rest of the paper is organized as follows.…”

Section: Introduction (mentioning)

confidence: 99%

Seeing and Hearing Egocentric Actions: How Much Can We Learn?

Cartas, Luque, Radeva, et al. 2019. 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)


Our interaction with the world is an inherently multimodal experience. However, the understanding of human-to-object interactions has historically been addressed focusing on a single modality. In particular, a limited number of works have considered integrating the visual and audio modalities for this purpose. In this work, we propose a multimodal approach for egocentric action recognition in a kitchen environment that relies on audio and visual information. Our model combines a sparse temporal sampling strategy with a late fusion of audio, spatial, and temporal streams. Experimental results on the EPIC-Kitchens dataset show that multimodal integration leads to better performance than unimodal approaches. In particular, we achieved a 5.18% improvement over the state of the art on verb classification.
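
The late fusion of audio, spatial, and temporal streams mentioned in this abstract can be illustrated with a small sketch. The abstract does not state the fusion rule, so the weighted averaging of per-stream softmax scores used here is an assumption, as are the names `softmax` and `late_fusion`, the stream keys, and the example logits.

```python
import math
from typing import Dict, List, Optional


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one stream's class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(stream_logits: Dict[str, List[float]],
                weights: Optional[Dict[str, float]] = None) -> List[float]:
    """Fuse streams by a weighted average of their class probabilities.

    Each stream is scored independently and only the final predictions
    are combined, which is what makes the fusion "late".
    """
    names = sorted(stream_logits)
    if weights is None:
        weights = {name: 1.0 for name in names}  # equal weights by default
    total_weight = sum(weights[name] for name in names)
    num_classes = len(stream_logits[names[0]])
    fused = [0.0] * num_classes
    for name in names:
        probs = softmax(stream_logits[name])
        for c, p in enumerate(probs):
            fused[c] += weights[name] * p / total_weight
    return fused


if __name__ == "__main__":
    # Hypothetical logits for three classes from three streams.
    scores = late_fusion({
        "audio": [2.0, 0.0, 0.0],
        "spatial": [0.0, 1.0, 0.0],
        "temporal": [2.0, 0.0, 1.0],
    })
    print(scores.index(max(scores)))  # index of the fused top class
```

Because the streams interact only at the score level, each one can be trained and evaluated on its own, which also makes unimodal and multimodal results easy to compare.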


In the Eye of Beholder: Joint Learning of Gaze and Actions in First Person Video

Liu, Rehg. 2018. Computer Vision – ECCV 2018



We address the task of jointly determining what a person is doing and where they are looking based on the analysis of video captured by a head-worn camera. To facilitate our research, we first introduce the EGTEA Gaze+ dataset. Our dataset comes with videos, gaze tracking data, hand masks and action annotations, thereby providing the most comprehensive benchmark for First Person Vision (FPV). Moving beyond the dataset, we propose a novel deep model for joint gaze estimation and action recognition in FPV. Our method describes the participant's gaze as a probabilistic variable and models its distribution using stochastic units in a deep network. We further sample from these stochastic units, generating an attention map to guide the aggregation of visual features for action recognition. Our method is evaluated on our EGTEA Gaze+ dataset and achieves a performance level that exceeds the state-of-the-art by a significant margin. More importantly, we demonstrate that our model can be applied to a larger-scale FPV dataset, EPIC-Kitchens, even without using gaze, offering new state-of-the-art results on FPV action recognition.
