In recent years, several authors have reported that spectral saliency detection methods provide state-of-the-art performance in predicting human gaze in images (see, e.g., [13]). We systematically integrate and evaluate quaternion DCT-and FFT-based spectral saliency detection [3,4], weighted quaternion color space components [5], and the use of multiple resolutions [1]. Furthermore, we propose the use of the eigenaxes and eigenangles for spectral saliency models that are based on the quaternion Fourier transform. We demonstrate the outstanding performance on the Bruce-Tsotsos (Toronto), Judd (MIT), and KootstraSchomacker eye-tracking data sets.
We combine and extend the previous work on DCT-based image signatures and face detection to determine the visual saliency. To this end, we transfer the scalar definition of image signatures to quaternion images and thus introduce a novel saliency method using quaternion type-II DCT image signatures. Furthermore, we use MCT-based face detection to model the important influence of faces on the visual saliency using rotated elliptical Gaussian weight functions and evaluate several integration schemes. In order to demonstrate the performance of the proposed methods, we evaluate our approach on the Bruce-Tsotsos (Toronto) [2] and Cerf (FIFA) [3] benchmark eye-tracking data sets. Additionally, we present evaluation results on the BruceTsotsos data set of the most important spectral saliency approaches. We achieve state-of-the-art results in terms of the well-established area under curve (AUC) measure on the Bruce-Tsotsos data set and come close to the ideal AUC on the Cerf data set -with less than one millisecond to calculate the bottom-up QDCT saliency map.
Identifying verbally and non-verbally referred-to objects is an important aspect of human-robot interaction. Most importantly, it is essential to achieve a joint focus of attention and, thus, a natural interaction behavior. In this contribution, we introduce a saliencybased model that reflects how multi-modal referring acts influence the visual search, i.e. the task to find a specific object in a scene. Therefore, we combine positional information obtained from pointing gestures with contextual knowledge about the visual appearance of the referred-to object obtained from language. The available information is then integrated into a biologically-motivated saliency model that forms the basis for visual search. We prove the feasibility of the proposed approach by presenting the results of an experimental evaluation.
Abstract-When persons interact, non-verbal cues are used to direct the attention of persons towards objects of interest. Achieving joint attention this way is an important aspect of natural communication. Most importantly, it allows to couple verbal descriptions with the visual appearance of objects, if the referred-to object is non-verbally indicated. In this contribution, we present a system that utilizes bottom-up saliency and pointing gestures to efficiently identify pointed-at objects. Furthermore, the system focuses the visual attention by steering a pan-tilt-zoom camera towards the object of interest and thus provides a suitable model-view for SIFT-based recognition and learning. We demonstrate the practical applicability of the proposed system through experimental evaluation in different environments with multiple pointers and objects.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.