Humans rely on multiple sensory modalities when examining and reasoning about images. In this paper, we describe a new multimodal dataset that consists of gaze measurements and spoken descriptions collected in parallel during an image inspection task. The task was performed by multiple participants on 100 general-domain images showing everyday objects and activities. We demonstrate the usefulness of the dataset by applying an existing visual-linguistic data fusion framework to label important image regions with appropriate linguistic labels.
Understanding and characterizing perceptual expertise is a major bottleneck in developing intelligent systems. In knowledge-rich domains such as dermatology, perceptual expertise influences the diagnostic inferences made based on the visual input. This study uses eye movement data from 12 dermatology experts and 12 undergraduate novices while they inspected 34 dermatological images. This work investigates the differences in global and local temporal fixation patterns between the two groups using recurrence quantification analysis (RQA). The RQA measures reveal significant differences in both global and local temporal patterns between the two groups. Results show that experts tended to refixate previously inspected areas less often than did novices, and their refixations were more widely separated in time. Experts were also less likely to follow extended scan paths repeatedly than were novices. These results suggest the potential value of RQA measures in characterizing perceptual expertise. We also discuss potential use of the RQA method in understanding the interactions between experts' visual and linguistic behavior.
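To make the RQA measures referred to above concrete, the following is a minimal sketch of fixation-based recurrence quantification analysis, assuming fixations are given as (x, y) pixel coordinates. The radius, the minimum line length, and the toy data are illustrative assumptions, not values from the study; recurrence rate corresponds to how often areas are refixated, the center of recurrence mass (CORM) to how widely separated in time refixations are, and determinism to repeated sub-scanpaths.

```python
import numpy as np

def rqa_measures(fixations, radius=64.0, min_line=2):
    """Compute simple RQA measures over a fixation sequence (illustrative)."""
    fix = np.asarray(fixations, dtype=float)
    n = len(fix)
    # Recurrence matrix: fixations i and j "recur" if they fall within `radius`.
    dists = np.linalg.norm(fix[:, None, :] - fix[None, :, :], axis=-1)
    rec = dists <= radius
    iu = np.triu_indices(n, k=1)          # count each pair (i < j) once
    recurrent = rec[iu]
    r = int(recurrent.sum())
    total_pairs = n * (n - 1) / 2
    recurrence_rate = 100.0 * r / total_pairs if total_pairs else 0.0
    # CORM: mean temporal separation (lag) of recurrent refixations,
    # normalized by the maximum possible lag.
    lags = (iu[1] - iu[0])[recurrent]
    corm = 100.0 * lags.mean() / (n - 1) if r else 0.0
    # Determinism: share of recurrent points lying on diagonal line segments
    # of length >= min_line, i.e. repeated extended scan paths.
    det_points = 0
    for i, j in zip(iu[0], iu[1]):
        if not rec[i, j]:
            continue
        length = 1
        a, b = i - 1, j - 1               # extend the diagonal backward
        while a >= 0 and rec[a, b]:
            length += 1
            a -= 1
            b -= 1
        a, b = i + 1, j + 1               # and forward
        while b < n and rec[a, b]:
            length += 1
            a += 1
            b += 1
        if length >= min_line:
            det_points += 1
    determinism = 100.0 * det_points / r if r else 0.0
    return {"recurrence": recurrence_rate, "corm": corm, "determinism": determinism}

# Toy usage with made-up fixation coordinates.
print(rqa_measures([(100, 100), (400, 300), (110, 95), (500, 500), (405, 310)]))
```

Group-level comparisons (e.g., experts vs. novices) would then aggregate these per-trial measures across participants and images.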
Human image understanding is reflected by individuals' visual and linguistic behaviors, but the meaningful computational integration and interpretation of their multimodal representations remain a challenge. In this paper, we expand a framework for capturing image-region annotations in dermatology, a domain in which interpreting an image is influenced by experts' visual perception skills, conceptual domain knowledge, and task-oriented goals. Our work explores the hypothesis that eye movements can help us understand experts' perceptual processes and that spoken language descriptions can reveal conceptual elements of image inspection tasks. We cast the problem of meaningfully integrating visual and linguistic data as unsupervised bitext alignment. Using alignment, we create meaningful mappings between physicians' eye movements, which reveal key areas of images, and spoken descriptions of those images. The resulting alignments are then used to annotate image regions with medical concept labels. Our alignment accuracy exceeds baselines using both exact and delayed temporal correspondence. Additionally, a comparison of alignment accuracy between a method that identifies image clusters from eye movements and a method that identifies clusters from image features suggests that the two approaches perform well on different types of images and concept labels. This indicates that an image annotation framework should integrate information from more than one technique to handle heterogeneous images. We also investigate the performance of the proposed aligner for dermatological primary morphology concept labels, as well as for categories of images based on lesion size or type and on distribution.
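To illustrate the bitext-alignment framing described above, here is a minimal sketch that treats each image as a "sentence pair": a sequence of gaze-cluster IDs (regions the physician fixated) and the sequence of concept labels extracted from the spoken description. The IBM Model 1 EM procedure, the cluster IDs, and the toy label vocabulary below are generic stand-ins for illustration, not the aligner or data used in the paper.

```python
from collections import defaultdict
from itertools import product

def ibm_model1(sentence_pairs, iterations=10):
    """Learn p(label | region) from parallel region/label sequences via EM."""
    regions = {r for rs, _ in sentence_pairs for r in rs}
    labels = {l for _, ls in sentence_pairs for l in ls}
    # Uniform initialization of translation probabilities.
    t = {(l, r): 1.0 / len(labels) for l, r in product(labels, regions)}
    for _ in range(iterations):
        count = defaultdict(float)   # expected label/region co-occurrence counts
        total = defaultdict(float)   # per-region normalizers
        for rs, ls in sentence_pairs:
            for l in ls:
                z = sum(t[(l, r)] for r in rs)        # E-step: soft alignment
                for r in rs:
                    count[(l, r)] += t[(l, r)] / z
                    total[r] += t[(l, r)] / z
        for (l, r) in t:                              # M-step: re-estimate
            if total[r] > 0:
                t[(l, r)] = count[(l, r)] / total[r]
    return t

def annotate(regions, labels, t):
    """Annotate each gaze cluster with its most probable concept label."""
    return {r: max(labels, key=lambda l: t[(l, r)]) for r in regions}

# Toy usage: two images, gaze clusters c0..c2, and spoken concept labels.
pairs = [(["c0", "c1"], ["plaque", "erythema"]),
         (["c1", "c2"], ["erythema", "scale"])]
t = ibm_model1(pairs)
print(annotate(["c0", "c1"], ["plaque", "erythema", "scale"], t))
```

The learned translation table plays the role of the alignment model: regions that reliably co-occur with a concept label across images end up annotated with that label.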
Multimodal integration of visual and linguistic data is a longstanding and crucial challenge for modeling human understanding. We propose a framework that uses an unsupervised bitext alignment method to integrate visual and linguistic data. We present an empirical study of the framework's parameters. Our results exceed baselines using both exact and delayed temporal correspondence. The resulting alignments can be used for image classification and retrieval.
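For intuition about the exact and delayed temporal-correspondence baselines mentioned above, the following is a minimal sketch, assuming fixations and spoken words carry timestamps in seconds. The one-second eye-voice delay and the data structures are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Fixation:
    cluster: str   # gaze-cluster / image-region ID
    start: float
    end: float

@dataclass
class Word:
    text: str
    onset: float

def fixated_cluster(fixations: List[Fixation], time: float) -> Optional[str]:
    """Return the region being fixated at `time`, if any."""
    for f in fixations:
        if f.start <= time <= f.end:
            return f.cluster
    return None

def temporal_baseline(fixations, words, delay=0.0):
    """Pair each spoken word with the region fixated `delay` seconds earlier."""
    return {w.text: fixated_cluster(fixations, w.onset - delay) for w in words}

# Toy usage: exact correspondence (delay=0) vs. a delayed baseline that assumes
# the eyes lead the voice by roughly one second.
fix = [Fixation("c0", 0.0, 1.2), Fixation("c1", 1.2, 2.5)]
words = [Word("plaque", 1.5), Word("erythema", 2.2)]
print(temporal_baseline(fix, words, delay=0.0))
print(temporal_baseline(fix, words, delay=1.0))
```

An unsupervised aligner can outperform such baselines because it is not tied to a single fixed lag between looking at a region and naming it.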
Digital medical image data are growing rapidly in both quantity and heterogeneity. There is a great need to organize medical image archives so as to facilitate diagnostics and preventive medicine. To this end, great effort has been invested over the past few decades in applying content-based image retrieval (CBIR) techniques to medical images. However, several critical challenges remain. Recently, CBIR research has become intertwined with the fundamental problem of image understanding, and it is now recognized that computing solutions that bridge the "semantic gap" must capture the higher-level domain knowledge of medical end users. We are investigating the incorporation of state-of-the-art visual categorization techniques into conventional CBIR approaches. The visual attention deployment strategies of medical experts serve as an objective measure that helps us understand the perceptual and conceptual processes involved in identifying key visual features and selecting diagnostic regions of images. Understanding these processes will inform and direct feature selection for medical images, such as the dermatological images used in our study. We also explore systematic and effective methods for integrating image data and semantic descriptions, with the long-term goal of building efficient, human-centered, multimodal interactive CBIR systems.
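As a minimal sketch of the conventional CBIR step discussed above: represent each archive image by a global feature vector and rank images by similarity to a query. The color-histogram features, cosine ranking, and random toy images below are generic illustrations, not the descriptors or data used in the study.

```python
import numpy as np

def color_histogram(image, bins=8):
    """Per-channel color histogram for an RGB image array, L1-normalized."""
    hists = [np.histogram(image[..., c], bins=bins, range=(0, 255))[0]
             for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

def retrieve(query_feat, archive_feats, k=3):
    """Return indices of the k archive images most similar to the query."""
    a = np.asarray(archive_feats)
    sims = a @ query_feat / (np.linalg.norm(a, axis=1) * np.linalg.norm(query_feat))
    return np.argsort(-sims)[:k]

# Toy usage with random "images"; in practice these would be dermatological
# images from the archive, and the histogram would be replaced by features
# informed by experts' visual attention.
rng = np.random.default_rng(0)
archive = [rng.integers(0, 256, size=(64, 64, 3)) for _ in range(5)]
feats = [color_histogram(im) for im in archive]
query = color_histogram(archive[2])
print(retrieve(query, feats))
```

Eye-movement data from experts would enter this pipeline at the feature-selection stage, weighting or restricting features toward diagnostically relevant regions rather than the whole image.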