2018
DOI: 10.1109/tpami.2017.2782819

Tracking Gaze and Visual Focus of Attention of People Involved in Social Interaction

Abstract: The visual focus of attention (VFOA) has been recognized as a prominent conversational cue. We are interested in estimating and tracking the VFOAs associated with multi-party social interactions. We note that in this type of situation the participants either look at each other or at an object of interest; therefore their eyes are not always visible. Consequently, both gaze and VFOA estimation cannot be based on eye detection and tracking. We propose a method that exploits the correlation between eye gaze and h…
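
The abstract truncates here, but its stated premise, that the eyes are often invisible so head cues must stand in for gaze, lends itself to a simple geometric illustration. The sketch below is a minimal, hypothetical reduction of that idea (the function and variable names are ours, and the paper's actual gaze/head model is certainly richer): assign the VFOA to whichever candidate target, person or object, best aligns with the head direction.

```python
import numpy as np

def estimate_vfoa(head_pos, head_dir, targets):
    """Assign the VFOA to the candidate target whose direction from the
    head best aligns with the head orientation (cosine similarity).
    Hypothetical helper, not the paper's actual model."""
    best_name, best_score = None, -np.inf
    for name, pos in targets.items():
        to_target = pos - head_pos
        to_target = to_target / np.linalg.norm(to_target)  # unit vector head -> target
        score = float(np.dot(head_dir, to_target))         # alignment with head direction
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score

# Toy scene: the head points mostly toward person B, so B is selected.
targets = {"person_B": np.array([1.0, 0.0, 0.0]),
           "screen": np.array([0.0, 0.0, 2.0])}
head_dir = np.array([0.9, 0.0, 0.1])
head_dir = head_dir / np.linalg.norm(head_dir)
print(estimate_vfoa(np.zeros(3), head_dir, targets))
```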

Cited by 70 publications (38 citation statements). References 39 publications (102 reference statements).

“…Depending on the available groundtruth annotations, we measure AP at frame level, considering each pair as an independent sample, or at shot-level, if more detailed annotations are not available. Frame level is used for UCO-LAEO and AVA-LAEO and, following previous work [16,18], shot level for TVHID.…”
Section: Evaluation Protocols and Scoring Methodology
confidence: 99%
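
For context, frame-level AP as described in this quote treats every (frame, person-pair) as an independent binary sample scored by the model. A minimal sketch with toy labels and scores (scikit-learn is our choice here, not necessarily the cited work's):

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Frame-level evaluation: each (frame, person-pair) is one sample with a
# binary "looking at each other" label and a predicted score; AP is
# computed over all samples. Values below are illustrative only.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                   # ground-truth per pair
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.6, 0.1, 0.8, 0.3])  # model confidence per pair
print(f"frame-level AP: {average_precision_score(y_true, y_score):.3f}")
```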
“…This problem is addressed in [23] with a deep learning model that reasons about human gaze and 3D geometrical relationships between different views of the same scene. The authors of [18] consider scenarios where multiple people are involved in a social interaction. Given that the eyes of a person are not always visible (e.g.…”
Section: Related Work
confidence: 99%
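
The gaze-plus-3D-geometry reasoning attributed to [23] can be caricatured as a purely geometric "looking at each other" (LAEO) test. The sketch below is such a stand-in under strong assumptions (known 3D head positions, unit gaze vectors, a fixed cosine threshold), whereas [23] learns this reasoning with a deep model:

```python
import numpy as np

def looking_at_each_other(p1, g1, p2, g2, cos_thresh=0.9):
    """Return True if two people are approximately looking at each other.
    p1, p2: 3D head positions; g1, g2: unit gaze (or head-direction) vectors.
    Illustrative fixed-threshold test, not the learned model of [23]."""
    d12 = p2 - p1
    d12 = d12 / np.linalg.norm(d12)  # unit vector from person 1 to person 2
    # Each person's gaze must point toward the other person's head.
    return np.dot(g1, d12) > cos_thresh and np.dot(g2, -d12) > cos_thresh

# Toy check: two people one metre apart, facing each other -> True.
print(looking_at_each_other(np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0]),
                            np.array([1.0, 0.0, 0.0]), np.array([-1.0, 0.0, 0.0])))
```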
“…Both the training and testing data involved 64 2D+t sequences; herein, the training size is 25, while the testing size is 39. The data had varying spatial (0.27-0.77 mm) and temporal resolution (11-30 Hz). The training data were annotated by CLUST as ground truth (center of blood vessels) of fiducial features throughout the acquisition sequence.…”
Section: A. Liver Ultrasound Data and Attention-aware Video Generation
confidence: 99%
“…Yun et al. (2012) evaluate two-person interaction based on a wide variety of geometric body features, such as joint keypoints and distances, and joint-to-plane distances. Massé et al. (2017) propose a framework where the correlation between head pose and eye gaze is used to estimate the VFOA. The authors of some of these works also address the importance of these features for estimating attention in the field of HCI.…”
Section: Vision-based Attention Estimation
confidence: 99%
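
As an illustration of the geometric body features Yun et al. describe, here is a minimal sketch computing one such family, all joint-to-joint distances between two people's skeletons (the paper's exact feature set and layout differ):

```python
import numpy as np

def pairwise_joint_distances(kp_a, kp_b):
    """Interaction features in the spirit of Yun et al. (2012): Euclidean
    distances between every joint of person A and every joint of person B.
    kp_a, kp_b: (J, 3) arrays of 3D joint keypoints. Illustrative only."""
    diffs = kp_a[:, None, :] - kp_b[None, :, :]   # (J, J, 3) pairwise offsets
    return np.linalg.norm(diffs, axis=-1).ravel()  # flatten to one feature vector

# Toy usage with 15-joint skeletons: 15 * 15 = 225 distance features.
rng = np.random.default_rng(0)
feat = pairwise_joint_distances(rng.normal(size=(15, 3)), rng.normal(size=(15, 3)))
print(feat.shape)  # (225,)
```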