Connecting What to Say With Where to Look by Modeling Human Attention Traces

Meng, Zihang; Yu, Licheng; Zhang, Ning; Berg, Tamara L.; Damavandi, Babak; Singh, Vikas; Bearman, Amy

doi:10.1109/cvpr46437.2021.01249

Cited by 20 publications

(8 citation statements)

References 35 publications

(66 reference statements)

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Narrative annotation focuses on the description of the relationship between entities, and entity relationships are collected during the annotation phase. Attributes, relationships, and entities in the same image are often closely related (29)(30)(31)(32). Localized Narratives (30) connect vision and language by artificially using mouse scribing to join action connections between entities and make the captioning in content more hierarchical.…”

Section: Narrative Annotation Modelmentioning

confidence: 99%

PathNarratives: Data annotation for pathological human-AI collaborative diagnosis

Zhang

He²,

Wu³

et al. 2023

Front. Med.

View full text Add to dashboard Cite

Pathology is the gold standard of clinical diagnosis. Artificial intelligence (AI) in pathology becomes a new trend, but it is still not widely used due to the lack of necessary explanations for pathologists to understand the rationale. Clinic-compliant explanations besides the diagnostic decision of pathological images are essential for AI model training to provide diagnostic suggestions assisting pathologists practice. In this study, we propose a new annotation form, PathNarratives, that includes a hierarchical decision-to-reason data structure, a narrative annotation process, and a multimodal interactive annotation tool. Following PathNarratives, we recruited 8 pathologist annotators to build a colorectal pathological dataset, CR-PathNarratives, containing 174 whole-slide images (WSIs). We further experiment on the dataset with classification and captioning tasks to explore the clinical scenarios of human-AI-collaborative pathological diagnosis. The classification tasks show that fine-grain prediction enhances the overall classification accuracy from 79.56 to 85.26%. In Human-AI collaboration experience, the trust and confidence scores from 8 pathologists raised from 3.88 to 4.63 with providing more details. Results show that the classification and captioning tasks achieve better results with reason labels, provide explainable clues for doctors to understand and make the final decision and thus can support a better experience of human-AI collaboration in pathological diagnosis. In the future, we plan to optimize the tools for the annotation process, and expand the datasets with more WSIs and covering more pathological domains.

show abstract

Section: Narrative Annotation Modelmentioning

confidence: 99%

PathNarratives: Data annotation for pathological human-AI collaborative diagnosis

Zhang

He²,

Wu³

et al. 2023

Front. Med.

View full text Add to dashboard Cite

show abstract

“…In order to generate the word-to-box alignment from the provided gaze trace points, our model divides the trace into several boxes, each box associated with a word, and generates a bounding box aligned on the axis of the gaze points. This trace transformation is performed inspired by the mouse trace analysis [17]. After that we extract three kinds of features, visual, captioning and gaze features.…”

Section: Preprocessing For Model Trainingmentioning

confidence: 99%

“…After that we extract three kinds of features, visual, captioning and gaze features. For calculating visual features, we use pre-trained Faster R-CNN [17] provided by detectron2 [18] to compute the visual features of the detected regions. Next, for calculating captioning features, we sum up the positional embeddings and the word embeddings via LXMERT proposed in the previous method [19].…”

Section: Preprocessing For Model Trainingmentioning

confidence: 99%

“…The CGT model, which includes three modules, image encoder, caption encoder-decoder and gaze trace encoder-decoder, is constructed based on the previous method [17]. Figure 2 shows the overview of our proposed CGT model.…”

Section: Connect Caption and Gaze Trace Modelmentioning

confidence: 99%

“…λ * are the balance parameters between all loss functions. This settings are based on the previous method [17].…”

Section: Loss Functionmentioning

confidence: 99%

See 2 more Smart Citations

Human-Centric Image Retrieval with Gaze-Based Image Captioning

Feng

Maeda

Ogawa

et al. 2022

2022 IEEE International Conference on Image Processing (ICIP)

View full text Add to dashboard Cite

This paper presents human-centric image retrieval with gaze-based image captioning. Although the development of cross-modal embedding techniques has enabled advanced image retrieval, many methods have focused only on the information obtained from the contents such as image and text. For further extending the image retrieval, it is necessary to construct retrieval techniques that directly reflect human intentions. In this paper, we propose a new retrieval approach via image captioning based on gaze information by focusing on the fact that the gaze information obtained from humans contains semantic information. Specifically, we construct a transformer, connect caption and gaze trace (CGT) model that learns the relationship among images, captioning provided by humans and gaze traces. Our CGT model enables transformer-based learning by dividing the gaze traces into several bounding boxes, and thus, gaze-based image captioning becomes feasible. By using the obtained captioning for cross-modal retrieval, we can achieve human-centric image retrieval. The technical contribution of this paper is transforming the gaze trace into the captioning via the transformer-based encoder. In the experiments, by comparing the cross-modal embedding method, the effectiveness of the proposed method is proved.

show abstract