Abstract. Training an object class detector typically requires a large set of images annotated with bounding-boxes, which is expensive and time-consuming to create. We propose a novel approach to annotating object locations which can substantially reduce annotation time. We first track the eye movements of annotators instructed to find the object and then propose a technique for deriving object bounding-boxes from these fixations. To validate our idea, we collected eye tracking data for the trainval part of 10 object classes of Pascal VOC 2012 (6,270 images, 5 observers). Our technique correctly produces bounding-boxes in 50% of the images, while reducing the total annotation time by a factor of 6.8× compared to drawing bounding-boxes. Any standard object class detector can be trained on the bounding-boxes predicted by our model. Our large-scale eye tracking dataset is available at groups.inf.ed.ac.uk/calvin/eyetrackdataset/.
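The abstract does not spell out how a box is derived from fixations (the paper learns this mapping from data), so the following is only an illustrative baseline, not the authors' method: fit a padded axis-aligned box around the recorded fixation points. The function name and the margin parameter are hypothetical.

```python
import numpy as np

def box_from_fixations(fixations, margin=0.1):
    """Naive baseline: enclose the fixation cluster in a padded axis-aligned box.

    fixations : iterable of (x, y) fixation coordinates in pixels.
    margin    : fractional padding added to each side (hypothetical value).
    Returns (x_min, y_min, x_max, y_max).
    """
    pts = np.asarray(fixations, dtype=float)
    x_min, y_min = pts.min(axis=0)
    x_max, y_max = pts.max(axis=0)
    pad_x = margin * (x_max - x_min)
    pad_y = margin * (y_max - y_min)
    return (x_min - pad_x, y_min - pad_y, x_max + pad_x, y_max + pad_y)

# Five fixations recorded while an observer searched for the object.
print(box_from_fixations([(120, 80), (135, 95), (150, 90), (142, 110), (128, 100)]))
```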
Referring expression generation (REG) presents the converse problem to visual search: given a scene and a specified target, how does one generate a description that would allow somebody else to quickly and accurately locate the target? Previous work in psycholinguistics and natural language processing has failed to find an important and integrated role for vision in this task. That previous work, which relies largely on simple scenes, tends to treat vision as a pre-process for extracting feature categories that are relevant to disambiguation. However, the visual search literature suggests that some descriptions are better than others at enabling listeners to search efficiently within complex stimuli. This paper presents a study testing whether participants are sensitive to visual features that allow them to compose such “good” descriptions. Our results show that visual properties (salience, clutter, area, and distance) influence REG for targets embedded in images from the Where's Wally? books. Referring expressions for large targets are shorter than those for smaller targets, and expressions about targets in highly cluttered scenes use more words. We also find that participants are more likely to mention non-target landmarks that are large, salient, and in close proximity to the target. These findings identify a key role for visual salience in language production decisions and highlight the importance of scene complexity for REG.
Previous work has demonstrated that search for a target in noise is consistent with the predictions of the optimal search strategy, both in the spatial distribution of fixation locations and in the number of fixations observers require to find the target. In this study we describe a challenging visual-search task and compare the number of fixations required by human observers to find the target to predictions made by a stochastic search model. This model relies on a target-visibility map based on human performance in a separate detection task. If the model does not detect the target, then it selects the next saccade by randomly sampling from the distribution of saccades that human observers made. We find that a memoryless stochastic model matches human performance in this task. Furthermore, we find that the previously reported similarity in the distribution of fixation locations between human observers and the ideal observer does not replicate: rather than producing the signature doughnut-shaped distribution predicted by the ideal search strategy, the fixations made by observers are best described by a central bias. We conclude that, when searching for a target in noise, humans use an essentially random strategy, which achieves near-optimal behavior due to biases in the distribution of saccades we tend to make. The findings reconcile the existence of highly efficient human search performance with recent studies demonstrating clear failures of optimality in single- and multiple-saccade tasks.
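A minimal sketch of such a memoryless stochastic searcher, assuming the target-visibility map is available as a detection-probability function and that recorded human saccade vectors can be resampled; the toy visibility function and all parameter values below are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_search(target_pos, visibility, saccade_pool, max_fixations=50):
    """Memoryless stochastic search (sketch under assumed interfaces).

    visibility(fix, target) -> detection probability at the current fixation,
    standing in for the target-visibility map from the separate detection task.
    saccade_pool : (N, 2) array of recorded human saccade vectors to sample from.
    Returns the number of fixations needed to find the target (capped).
    """
    fix = np.array([0.0, 0.0])            # start at the image centre
    for n in range(1, max_fixations + 1):
        if rng.random() < visibility(fix, target_pos):
            return n                       # target detected on this fixation
        # No detection: draw the next saccade at random from the human pool.
        fix = fix + saccade_pool[rng.integers(len(saccade_pool))]
    return max_fixations

# Toy visibility map: detection probability falls off with eccentricity.
vis = lambda fix, tgt: np.exp(-np.linalg.norm(fix - tgt) / 5.0)
saccades = rng.normal(0.0, 3.0, size=(1000, 2))   # stand-in for recorded saccades
print(stochastic_search(np.array([4.0, -2.0]), vis, saccades))
```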
Object detection and identification are fundamental to human vision, and there is mounting evidence that objects guide the allocation of visual attention. However, the role of objects in tasks involving multiple modalities is less clear. To address this question, we investigate object naming, a task in which participants have to verbally identify objects they see in photorealistic scenes. We report an eye-tracking study that investigates which features (attentional, visual, and linguistic) influence object naming. We find that the amount of visual attention directed toward an object, its position and saliency, along with linguistic factors such as word frequency, animacy, and semantic proximity, significantly influence whether the object will be named or not. We then ask how features from different modalities are combined during naming, and find significant interactions between saliency and position, saliency and linguistic features, and attention and position. We conclude that when the cognitive system performs tasks such as object naming, it uses input from one modality to constrain or enhance the processing of other modalities, rather than processing each input modality independently.
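The abstract does not name the statistical model used; one plausible form of such an analysis is a logistic regression over per-object predictors with interaction terms. The sketch below uses synthetic data and hypothetical variable names purely to illustrate that form, not the study's actual analysis.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400

# Synthetic per-object predictors (hypothetical names, one row per scene object).
df = pd.DataFrame({
    "saliency":  rng.random(n),               # visual saliency of the object
    "position":  rng.random(n),               # normalised distance from image centre
    "attention": rng.exponential(0.5, n),     # total fixation time on the object (s)
    "word_freq": rng.normal(3.0, 1.0, n),     # log frequency of the object's name
})

# Simulate whether each object was named, with main effects and one interaction.
logit_p = (-1.0 + 2.0 * df.saliency + 1.5 * df.attention - 1.0 * df.position
           + 0.4 * df.word_freq + 1.0 * df.saliency * df.position)
df["named"] = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit_p))).astype(int)

# Logistic regression including the interactions the abstract reports as significant.
model = smf.logit(
    "named ~ saliency * position + saliency * word_freq + attention * position",
    data=df,
).fit(disp=False)
print(model.params.round(2))
```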
Over the last thirty years, evaluation of texture analysis algorithms has been dominated by two databases: Brodatz has typically been used to provide single images of approximately 100 texture classes, while CUReT consists of multiple images of 61 physical samples captured under a variety of illumination conditions. While many highly successful approaches have been developed for classification, the challenging question of measuring perceived inter-class texture similarity has rarely been addressed. In this paper we introduce a new texture database which includes a collection of 334 samples together with perceptual similarity data collected from experiments with 30 human observers. We have tested four of the leading texture algorithms: they achieve accurate (≈100%) performance in a CUReT-style classification task; however, a second experiment shows that the resulting inter-class distances do not correlate well with the perceptual data.
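As an illustration of the second experiment's comparison (not the paper's evaluation code), an algorithm's inter-class distance matrix can be correlated with a human similarity matrix, for example via Spearman's rank correlation over the off-diagonal entries; the function name and toy matrices below are assumptions.

```python
import numpy as np
from scipy.stats import spearmanr

def distance_vs_perception(feature_dist, perceptual_sim):
    """Correlate algorithmic inter-class distances with perceived similarity.

    feature_dist   : (C, C) symmetric distance matrix from a texture algorithm.
    perceptual_sim : (C, C) symmetric matrix of human similarity judgements.
    Returns Spearman's rho over the upper triangle, sign-flipped so that a
    large positive value means distances agree with perception.
    """
    iu = np.triu_indices_from(feature_dist, k=1)
    rho, _ = spearmanr(feature_dist[iu], perceptual_sim[iu])
    return -rho

# Toy 5-class example: similarity is roughly the inverse of distance, plus noise.
rng = np.random.default_rng(0)
d = rng.random((5, 5)); d = (d + d.T) / 2.0; np.fill_diagonal(d, 0.0)
s = 1.0 - d + rng.normal(0.0, 0.05, (5, 5)); s = (s + s.T) / 2.0
print(distance_vs_perception(d, s))
```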