Describing and Localizing Multiple Changes with Transformers

Qiu, Yue; Yamamoto, Shozo; Nakashima, Kodai; Suzuki, Ryoichi; Iwata, Kenji; Kataoka, Hirokatsu; Satoh, Yutaka

doi:10.1109/iccv48922.2021.00198

Cited by 39 publications

(15 citation statements)

References 34 publications

“…To date, no prior research has specifically addressed the "image difference question answering" problem. Only a few studies have focused on the general image difference caption task, such as MCCFormers [25] and IDCPCL [31]. Therefore, our work serves as the first step in this new direction and provides a valuable contribution to the research community.…”

Section: Baselines (mentioning)

confidence: 97%

“…Within the language generation and vision research domain, the most related works to the medical image difference VQA task is image difference captioning [20,25,31], which is designed to identify object movements and changes within a spatial context such as a static or complex background. As shown in the left Fig.…”

Section: Anatomical Structure-aware Graph Construction and Feature Le... (mentioning)

confidence: 99%

“…2. MCCFormers is proposed to handle the image difference captioning task [25]. It achieved state-of-the-art performance on the CLEVR-Change dataset [22], a famous image difference captioning dataset.…”

Section: Baselines (mentioning)

confidence: 99%

Expert Knowledge-Aware Image Difference Graph Representation Learning for Difference-Aware Medical Visual Question Answering

et al. 2023

Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

To contribute to automating the medical vision-language model, we propose a novel Chest-Xray Difference Visual Question Answering (VQA) task. Given a pair of main and reference images, this task attempts to answer several questions on both diseases and, more importantly, the differences between them. This is consistent with the radiologist's diagnosis practice that compares the current image with the reference before concluding the report. We collect a new dataset, namely MIMIC-Diff-VQA, including 700,703 QA pairs from 164,324 pairs of main and reference images. Compared to existing medical VQA datasets, our questions are tailored to the Assessment-Diagnosis-Intervention-Evaluation treatment procedure used by clinical professionals. Meanwhile, we also propose a novel expert knowledge-aware graph representation learning model to address this task. The proposed baseline model leverages expert knowledge such as anatomical structure prior, semantic, and spatial knowledge to construct a multi-relationship graph, representing the image differences between two images for the image difference VQA task. The dataset and code can be found at https://github.com/Holipori/MIMIC-Diff-VQA. We believe this work would further push forward the medical vision language model.

“…Most of the existing works in the change detection literature belong to this category. [18,17,21,12] tackle the change captioning problem where the goal is to describe the changes in an image pair in natural language. These methods mainly evaluate their approach on the STD [10] (images from fixed video surveillance camera), or CLEVR-based change datasets [18,21,12] (synthetic images of 3D objects of primitive shapes).…”

Section: Related Work (mentioning)

confidence: 99%

The Change You Want to See

Sachdeva, Zisserman 2023

2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

“…According to the number of input images, they can be divided into two categories: Two-image based and Group-based captioning. Two-image based captioning tends to describe the common [36] or different [28,30,37,49] parts between the two images. Thus, the two images in their settings always have strong correlations.…”

Section: Related Work (mentioning)

confidence: 99%

Rethinking the Reference-based Distinctive Image Captioning

Mao, Chen, Jiang et al. 2022

Preprint

Distinctive Image Captioning (DIC), generating distinctive captions that describe the unique details of a target image, has received considerable attention over the last few years. A recent DIC work proposes to generate distinctive captions by comparing the target image with a set of semantic-similar reference images, i.e., reference-based DIC (Ref-DIC). It aims to make the generated captions tell apart the target and reference images. Unfortunately, reference images used by existing Ref-DIC works are easy to distinguish: these reference images only resemble the target image at scene-level and have few common objects, such that a Ref-DIC model can trivially generate distinctive captions even without considering the reference images. For example, if the target image contains objects "towel" and "toilet" while all reference images are without them, then a simple caption "A bathroom with a towel and a toilet" is distinctive enough to tell apart target and reference images. To ensure Ref-DIC models really perceive the unique objects (or attributes) in target images, we first propose two new Ref-DIC benchmarks. Specifically, we design a two-stage matching mechanism, which strictly controls the similarity between the target and reference images at object-/attribute-level (vs. scene-level). Secondly, to generate distinctive captions, we develop a strong Transformer-based Ref-DIC baseline, dubbed as TransDIC. It not only extracts visual features from the target image, but also encodes the differences between objects in the target and reference images. Finally, for more trustworthy benchmarking, we propose a new evaluation metric named DisCIDEr for Ref-DIC, which evaluates both the accuracy and distinctiveness of the generated captions. Experimental results demonstrate that our TransDIC can generate distinctive captions. Besides, it outperforms several state-of-the-art models on the two new benchmarks over different metrics.
