2015
DOI: 10.1109/tmm.2015.2482819
Deep Head Pose: Gaze-Direction Estimation in Multimodal Video

Abstract: In this paper we present a Convolutional Neural Network based model for human head pose estimation in low-resolution multi-modal RGB-D data. We pose the problem as one of classification of human gazing direction. We further fine-tune a regressor based on the learned deep classifier. Next we combine the two models (classification and regression) to estimate approximate regression confidence. We present state-of-the-art results on datasets that span the range of high-resolution Human Robot Interaction (close up f…
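The abstract's idea of combining a discrete gaze-direction classifier with a regressor to obtain an approximate regression confidence can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the binning scheme, the `regression_confidence` helper, and the use of the classifier's softmax mass over the bin containing the regressed angle are all assumptions made for the example.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over classifier logits
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def regression_confidence(class_logits, regressed_angle, bin_edges):
    """Approximate confidence for a regressed head-pose angle:
    the probability mass the direction classifier assigns to the
    bin that contains the regression output (an assumption for
    this sketch, not the paper's exact scheme)."""
    probs = softmax(np.asarray(class_logits, dtype=float))
    # Index of the angular bin the regressed angle falls into
    bin_idx = int(np.digitize(regressed_angle, bin_edges)) - 1
    bin_idx = min(max(bin_idx, 0), len(probs) - 1)
    return probs[bin_idx]

# Example: 8 gaze-direction bins covering [-180, 180) degrees
edges = np.linspace(-180, 180, 9)
logits = np.array([0.1, 0.2, 4.0, 0.3, 0.1, 0.0, 0.2, 0.1])
conf = regression_confidence(logits, -70.0, edges)
```

When the regressed angle lands in the bin the classifier strongly favors, the confidence is high; disagreement between the two models yields a low score, which is the intuition behind fusing the classification and regression heads.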

Cited by 170 publications (66 citation statements)
References 39 publications
“…1. According to [24], we can classify existing computer-vision-based head pose estimation methods into two categories: learning-based methods [1][2][3][16][25][26][27][28][29][30][31][32][33][34][35][36][37][38], which need large amounts of training data and computational resources, and geometry-based methods [4][5][6][7][8][9][10][39][40][41][42][43][44][45][46][47][48][49], which are fast but slightly less accurate; see Section II for details. In this paper, as shown in Fig.…”
mentioning
confidence: 99%
“…For example, Fathi et al [35] built a probabilistic generative model to simultaneously predict the sequence of gaze locations and the respective action label from first-person-view videos. Mukherjee and Robertson [38] estimated the gaze direction based on the head pose in multimodal videos, and managed to recover human-human/scene interactions. In the image domain, Recasens et al [36] proposed a method to detect the object regions fixated on by humans in the scene.…”
Section: Related Work
mentioning
confidence: 99%
“…The gaze direction has been detected from combined video and depth signals [24,25] and utilized in the visual attention model to estimate human-to-human interaction. Human gaze has also been used for semantic mapping of human attention in the 3D environment [26].…”
Section: Kinect Sensor and Its Usage
mentioning
confidence: 99%