2015 IEEE International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2015.226

Lending A Hand: Detecting Hands and Recognizing Activities in Complex Egocentric Interactions

Abstract: Hands appear very often in egocentric video, and their appearance and pose give important cues about what people are doing and what they are paying attention to. But existing work in hand detection has made strong assumptions that work well in only simple scenarios, such as with limited interaction with other people or in lab settings. We develop methods to locate and distinguish between hands in egocentric video using strong appearance models with Convolutional Neural Networks, and introduce a simple candidate…
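The abstract mentions a simple candidate-region generation scheme. As a rough, hypothetical sketch of that idea (not the authors' implementation; the KDE prior, the normalized box encoding, and all parameter values below are assumptions), one could sample windows from a density fitted to ground-truth hand boxes:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical training data: normalized (cx, cy, w, h) of ground-truth
# hand boxes, one column per box, as gaussian_kde expects shape (d, n).
train_boxes = np.random.rand(4, 500)  # stand-in for real annotations

# Fit a kernel density estimate over box location and scale.
kde = gaussian_kde(train_boxes)

def sample_candidates(n_windows=2500, frame_w=1280, frame_h=720):
    """Draw candidate windows from the learned location/scale prior."""
    samples = kde.resample(n_windows)          # shape (4, n_windows)
    cx, cy, w, h = np.clip(samples, 0.0, 1.0)  # clamp normalized values
    x1 = (cx - w / 2) * frame_w
    y1 = (cy - h / 2) * frame_h
    return np.stack([x1, y1, w * frame_w, h * frame_h], axis=1)

proposals = sample_candidates()
print(proposals.shape)  # (2500, 4) boxes as (x, y, width, height)
```

The appeal of sampling from a learned prior, rather than exhaustive sliding windows, is that far fewer windows need to be scored by the CNN, which is where the computational savings claimed in the abstract would come from.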

Cited by 358 publications (344 citation statements)
References 29 publications
“…We collected a dataset of first-person video from interacting subjects [1], using Google Glass to capture video (720p30) from each person’s viewpoint, as illustrated in Figure 1. The subjects were asked to perform four different activities: (1) playing cards; (2) playing fast chess; (3) solving a jigsaw puzzle; and (4) playing Jenga (a 3D puzzle game).…”
Section: Hand Interactions (mentioning, confidence: 99%)
“…We briefly describe the approach here; more details, as well as an in-depth quantitative evaluation, are presented elsewhere [1]. The hand extraction process consists of two major steps: detection, which tries to coarsely locate hands in each frame, and segmentation, which estimates the fine-grained pixel-level shape of each hand.…”
Section: Hand Interactions (mentioning, confidence: 99%)
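A minimal sketch of the two-step pipeline quoted above, assuming OpenCV is available; the dummy detector and the GrabCut-based refinement are illustrative stand-ins under those assumptions, not the exact procedure evaluated in [1]:

```python
import cv2
import numpy as np

def detect_hands(frame):
    """Stand-in for the coarse detection step (a real system would score
    CNN window proposals here); returns (x, y, w, h) boxes. This sketch
    just returns one box over the lower-center of the frame, where
    egocentric hands often appear."""
    h, w = frame.shape[:2]
    return [(w // 3, h // 2, w // 3, h // 2)]

def segment_hands(frame, boxes, iterations=5):
    """Refine each detected box into a fine-grained pixel-level mask
    with GrabCut, one plausible choice for the segmentation step."""
    masks = []
    for rect in boxes:
        mask = np.zeros(frame.shape[:2], np.uint8)
        bgd = np.zeros((1, 65), np.float64)
        fgd = np.zeros((1, 65), np.float64)
        cv2.grabCut(frame, mask, rect, bgd, fgd,
                    iterations, cv2.GC_INIT_WITH_RECT)
        # Definite and probable foreground pixels form the hand mask.
        masks.append(np.isin(mask, (cv2.GC_FGD, cv2.GC_PR_FGD)))
    return masks

frame = cv2.imread("frame.jpg")  # any egocentric video frame
masks = segment_hands(frame, detect_hands(frame))
```

Splitting the problem this way lets the expensive appearance model run only on coarse boxes, while the cheaper per-box segmentation recovers the pixel-level shape.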
“…[23,20] reason about state changes in household objects, and [17,54] reason about human-object interactions. Adding mid-level cues such as face, gaze, and hands has also been investigated by [37,19,18,49,6]. Hybrid approaches [43,38,56] utilize both object and motion information.…”
Section: Related Work (mentioning, confidence: 99%)
“…According to the survey of [11], the most commonly explored objective of egocentric vision is object recognition and tracking. Furthermore, hands are among the most common objects in the user’s field of view, and their reliable detection, localization, and tracking can serve as a key input for other objectives, such as gesture recognition, understanding hand-object interactions, and activity recognition [5, 12-20]. Recently, egocentric pixel-level hand detection has attracted increasing attention.…”
Section: Related Work (mentioning, confidence: 99%)
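For concreteness, a naive pixel-level hand detector, a common color-threshold baseline far simpler than the methods cited above (the YCrCb bounds are conventional skin-tone values, not taken from any of these papers), might look like:

```python
import cv2
import numpy as np

def skin_mask(frame_bgr):
    """Naive pixel-level skin/hand detector: threshold the Cr and Cb
    channels with commonly used skin-tone bounds. The egocentric methods
    cited above go well beyond this; it only makes the task concrete."""
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    lower = np.array([0, 133, 77], dtype=np.uint8)
    upper = np.array([255, 173, 127], dtype=np.uint8)
    mask = cv2.inRange(ycrcb, lower, upper)
    # Remove speckle noise with a small morphological opening.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
```

Such color-only baselines fail under the skin-like distractors and lighting changes of realistic egocentric footage, which is precisely the gap the CNN-based approaches in this literature aim to close.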