2020
DOI: 10.1109/tits.2019.2915540
How Do Drivers Allocate Their Potential Attention? Driving Fixation Prediction via Convolutional Neural Networks

Cited by 83 publications (48 citation statements). References 34 publications.
“…We trained a convolution-deconvolution gaze network (Palazzi et al. 2018; Zhang et al. 2018b; 2018a; Deng et al. 2019) with KL divergence (ε = 1e−10) as the loss function to predict human gaze positions. A separate network is trained for each game.…”
Section: Baseline Model and Results
confidence: 99%
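The loss named in the excerpt above is the standard choice for dense fixation-map prediction. Below is a minimal sketch, assuming PyTorch tensors of shape (batch, H, W) for both the predicted and ground-truth gaze maps; the function name, the per-map normalization, and the use of the quoted ε = 1e−10 purely as a numerical guard are illustrative assumptions, not details taken from the cited implementations.

```python
import torch

def kl_divergence_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-10) -> torch.Tensor:
    """KL divergence between predicted and ground-truth gaze maps (sketch).

    Both maps are flattened and normalized to probability distributions over
    pixels; eps guards against division by zero and log(0).
    """
    pred = pred.view(pred.size(0), -1)
    target = target.view(target.size(0), -1)
    pred = pred / (pred.sum(dim=1, keepdim=True) + eps)
    target = target / (target.sum(dim=1, keepdim=True) + eps)
    # KL(target || pred), averaged over the batch.
    kl = (target * torch.log(target / (pred + eps) + eps)).sum(dim=1)
    return kl.mean()
```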
“…Considerable evidence has shown that human gaze can be considered an overt behavioral signal that encodes a wealth of information about both the motivation behind an action and the anticipated reward of an action (Hayhoe and Ballard 2005; Johnson et al. 2014). Recent work has also proposed learning visual attention models from human gaze as an intermediate step towards learning the decision policy, and this intermediate signal has been shown to improve policy learning (Li, Liu, and Rehg 2018; Zhang et al. 2018b; Xia et al. 2019; Chen et al. 2019; Liu et al. 2019; Deng et al. 2019). Addressing the demands and challenges described above, we collected a large-scale dataset of humans playing Atari video games, one of the most widely used task domains in RL and IL research. The dataset is named Atari-HEAD (Atari Human Eye-Tracking And Demonstration).…”
Section: Introduction
confidence: 99%
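The excerpt above describes using a learned attention model as an intermediate signal for policy learning without specifying how the two are coupled. One common realization is to weight the input frame by the predicted gaze map before it reaches the policy network; the sketch below assumes that design, and the module names, the sigmoid normalization, and the blending weight alpha are hypothetical rather than drawn from the cited works.

```python
import torch
import torch.nn as nn

class GazeModulatedPolicy(nn.Module):
    """Sketch: weight input frames by a predicted gaze map before the policy net."""

    def __init__(self, gaze_net: nn.Module, policy_net: nn.Module, alpha: float = 0.5):
        super().__init__()
        self.gaze_net = gaze_net      # saliency predictor, frozen or jointly trained
        self.policy_net = policy_net  # maps modulated frames to action logits
        self.alpha = alpha            # blend between raw and gaze-weighted input

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # Gaze map in [0, 1], shape (B, 1, H, W), broadcast over the channel dim.
        gaze = torch.sigmoid(self.gaze_net(frames))
        modulated = self.alpha * frames + (1 - self.alpha) * frames * gaze
        return self.policy_net(modulated)
```

Keeping a residual share of the raw frame (alpha > 0) is one way to avoid discarding pixels the attention model misses; the cited papers may couple gaze and policy differently.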
“…Ye Xia et al present in [51] an in-lab dataset called Berkeley DeepDrive Attention (BDD-A), together with a system which includes a convolutional LSTM network and a method to show relevant frames to the model more frequently. Tao Deng et al [52] provide a traffic driving video dataset with fixations, together with a saliency detection model based on compact convolutional-deconvolutional neural networks (CDNN). Either BDD-A [51] or CDNN [52] databases have more diverse fixations than DR(eye)VE [49], but recorded in a laboratory and not under real driving conditions.…”
Section: B Visual Attention For Autonomous Vehiclesmentioning
confidence: 99%
“…Tao Deng et al. [52] provide a traffic driving video dataset with fixations, together with a saliency detection model based on a compact convolutional-deconvolutional neural network (CDNN). Both the BDD-A [51] and CDNN [52] datasets have more diverse fixations than DR(eye)VE [49], but were recorded in a laboratory rather than under real driving conditions. Moreover, DR(eye)VE [49] provides labels for contextual conditions, which enables us to demonstrate our model's capabilities.…”
Section: B. Visual Attention for Autonomous Vehicles
confidence: 99%
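The excerpts name CDNN as a compact convolutional-deconvolutional saliency model but do not reproduce its layer configuration, so the encoder-decoder below is only an illustrative sketch of that family of architectures; the channel widths, kernel sizes, and strides are assumptions, not the published CDNN design.

```python
import torch
import torch.nn as nn

class ConvDeconvSaliency(nn.Module):
    """Illustrative compact encoder-decoder for driving fixation-map prediction."""

    def __init__(self, in_channels: int = 3):
        super().__init__()
        # Encoder: three stride-2 convolutions downsample by 8x.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Decoder: three transposed convolutions restore the input resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Single-channel saliency map; normalization is left to the loss function.
        return self.decoder(self.encoder(x))
```

A network of this size can be trained end to end with the KL-divergence loss sketched earlier, which matches the compact, frame-to-fixation-map setup the excerpts describe.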