FIGURE 9 | Point clouds of the four activity-relevant objects involved in Activity 1 were segmented into multiple regions for finer spatial resolution: (A) pitcher, (B) pitcher lid, (C) spoon, and (D) mug.

“…Second, action recognition models in the literature rely on computer vision-based approaches to analyze 2D videos recorded by an egocentric camera, e.g., (Fathi et al., 2011, 2012; Fathi and Rehg, 2013; Matsuo et al., 2014; Soran et al., 2015; Ma et al., 2016; Li et al., 2018; Furnari and Farinella, 2019; Sudhakaran et al., 2019; Liu et al., 2020). Whether using hand-crafted features (Fathi et al., 2011, 2012; Fathi and Rehg, 2013; Matsuo et al., 2014; Soran et al., 2015; Ma et al., 2016; Furnari and Farinella, 2019) or learning end-to-end models (Li et al., 2018; Sudhakaran et al., 2019; Liu et al., 2020), the computer vision-based approaches to action recognition must also address the challenges of identifying and tracking activity-relevant objects.…”
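To make the identification-and-tracking challenge concrete, the following is a minimal sketch of a per-frame detect-and-localize loop over a 2D video, assuming OpenCV is available. It is not the method of any of the cited works: the HSV color range standing in for an object detector, and the input file name `egocentric_clip.mp4`, are hypothetical placeholders. Real egocentric pipelines must additionally cope with hand occlusion, motion blur, and camera egomotion that such naive color-based tracking cannot handle.

```python
import cv2
import numpy as np

# Hypothetical HSV range standing in for one activity-relevant object
# (e.g., a brightly colored mug); a real system would use a learned detector.
LOWER_HSV = np.array([100, 120, 70])   # assumed lower bound (blue-ish hue)
UPPER_HSV = np.array([130, 255, 255])  # assumed upper bound


def detect_object(frame):
    """Return the bounding box (x, y, w, h) of the largest color blob, or None."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, LOWER_HSV, UPPER_HSV)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    largest = max(contours, key=cv2.contourArea)
    return cv2.boundingRect(largest)


def track_video(path):
    """Detect the target in every frame and print its centroid trajectory."""
    cap = cv2.VideoCapture(path)
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        box = detect_object(frame)
        if box is not None:
            x, y, w, h = box
            print(f"frame {frame_idx}: object at ({x + w // 2}, {y + h // 2})")
        frame_idx += 1
    cap.release()


if __name__ == "__main__":
    track_video("egocentric_clip.mp4")  # hypothetical input clip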