Group-based Distinctive Image Captioning with Memory Attention

Wang, Jiuniu; Xu, Wenjia; Wang, Qingzhong; Chan, Antoni B.

doi:10.1145/3474085.3475215

Cited by 16 publications

(24 citation statements)

References 34 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Wang et al. [59] improved the distinctiveness of image captions using a group‐based distinctive captioning model, and proposed a new evaluation metric DisWordRate to measure the distinctiveness of captions.…”

Section: The Recent Deep Learning Methodsmentioning

confidence: 99%

A thorough review of models, evaluation metrics, and datasets on image captioning

Luo

Cheng

Chao

et al. 2021

IET Image Processing

View full text Add to dashboard Cite

Image captioning means generate descriptive sentences from a query image automatically. It has recently received widespread attention from the computer vision and natural language processing communities as an emerging visual task. Currently, both components have evolved considerably by exploiting object regions, attributes, attention mechanism methods, entity recognition with novelties, and training strategies. However, despite the impressive results, the research has not yet come to a conclusive answer. This survey aims to provide a comprehensive overview of image captioning methods, from technical architectures to benchmark datasets, evaluation metrics, and comparison of state-of-theart methods. In particular, image captioning methods are divided into different categories based on the technique adopted. Representative methods in each class are summarized, and their advantages and limitations are discussed. Moreover, many related state-of-the-art studies were quantitatively compared to determine the recent trends and future directions in image captioning. The ultimate goal of this work is to serve as a tool for understanding the existing literature and highlighting future directions in the area of image captioning for Computer Vision and Natural Language Processing communities may benefit from.

show abstract

Section: The Recent Deep Learning Methodsmentioning

confidence: 99%

A thorough review of models, evaluation metrics, and datasets on image captioning

Luo

Cheng

Chao

et al. 2021

IET Image Processing

View full text Add to dashboard Cite

show abstract

“…The multiperspective captioning task [236] considers text and image inputs from different viewpoints. The distinctive image captioning task [237] is defined to describe the unique object or context of an image to distinguish it from other semantically similar images. The diverse caption and rich image generation task [238] proposes a bidirectional image and text generation task to align rich images and their corresponding multiple different titles, aiming to achieve multiple sentences from one image uniformly and multiple sentences to generate more suitable images.…”

Section: Multimodal Task Diversitymentioning

confidence: 99%

Deep Vision Multimodal Learning: Methodology, Benchmark, and Trend

Chai

Wang

2022

Applied Sciences

View full text Add to dashboard Cite

Deep vision multimodal learning aims at combining deep visual representation learning with other modalities, such as text, sound, and data collected from other sensors. With the fast development of deep learning, vision multimodal learning has gained much interest from the community. This paper reviews the types of architectures used in multimodal learning, including feature extraction, modality aggregation, and multimodal loss functions. Then, we discuss several learning paradigms such as supervised, semi-supervised, self-supervised, and transfer learning. We also introduce several practical challenges such as missing modalities and noisy modalities. Several applications and benchmarks on vision tasks are listed to help researchers gain a deeper understanding of progress in the field. Finally, we indicate that pretraining paradigm, unified multitask framework, missing and noisy modality, and multimodal task diversity could be the future trends and challenges in the deep vision multimodal learning field. Compared with existing surveys, this paper focuses on the most recent works and provides a thorough discussion of methodology, benchmarks, and future trends.

show abstract

“…For human-like distinctive captioning, a recent work proposes to study the DIC task based on a group of semantic-similar reference images, dubbed Reference-based DIC (Ref-DIC). Different [43]. (b): Selected reference images for the same target image using our two-stage matching mechanism.…”

Section: Introductionmentioning

confidence: 99%

“…Compared to Single-DIC, the generated captions are only asked to distinguish the target image from the group of reference images, i.e., grouplevel distinctiveness. Unfortunately, the reference images used in existing Ref-DIC works [43] can be trivially distinguished: these reference images only resemble the target image at the scene-level and have few common objects, thus Ref-DIC models can simply generate distinctive captions even without considering the reference images. For example in Figure 1 (a), target and reference images have no object in common (e.g., "towel", "shower curtain", or "toilet"), each object in target image is unique, such that the Ref-DIC model can trivially generate "a bathroom with a towel" to tell the target and reference images apart.…”

Section: Introductionmentioning

confidence: 99%

“…In this paper, we propose two new benchmarks for the Ref-DIC task: COCO-DIC and Flickr30K-DIC. To strictly control the unique details between target and reference images, we propose a two-stage matching mechanism, which can measure image similarity at the object-/attribute-level (vs. scene-level in [43]), and deliberately make target and reference images have some common objects. Under this mechanism, Ref-DIC models can learn to focus on the unique attributes and objects in target image.…”

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

Rethinking the Reference-based Distinctive Image Captioning

Mao,

Chen,

Jiang

et al. 2022

Preprint

View full text Add to dashboard Cite

Distinctive Image Captioning (DIC) -generating distinctive captions that describe the unique details of a target image -has received considerable attention over the last few years. A recent DIC work proposes to generate distinctive captions by comparing the target image with a set of semantic-similar reference images, i.e., reference-based DIC (Ref-DIC). It aims to make the generated captions can tell apart the target and reference images. Unfortunately, reference images used by existing Ref-DIC works are easy to distinguish: these reference images only resemble the target image at scene-level and have few common objects, such that a Ref-DIC model can trivially generate distinctive captions even without considering the reference images. For example, if the target image contains objects "towel" and "toilet" while all reference images are without them, then a simple caption "A bathroom with a towel and a toilet" is distinctive enough to tell apart target and reference images. To ensure Ref-DIC models really perceive the unique objects (or attributes) in target images, we first propose two new Ref-DIC benchmarks. Specifically, we design a two-stage matching mechanism, which strictly controls the similarity between the target and reference images at object-/attribute-level (vs. scene-level). Secondly, to generate distinctive captions, we develop a strong Transformer-based Ref-DIC baseline, dubbed as TransDIC. It not only extracts visual features from the target image, but also encodes the differences between objects in the target and reference images. Finally, for more trustworthy benchmarking, we propose a new evaluation metric named DisCIDEr for Ref-DIC, which evaluates both the accuracy and distinctiveness of the generated captions. Experimental results demonstrate that our TransDIC can generate distinctive captions. Besides, it outperforms several state-of-the-art models on the two new benchmarks over different metrics.

show abstract

Group-based Distinctive Image Captioning with Memory Attention

Cited by 16 publications

References 34 publications

A thorough review of models, evaluation metrics, and datasets on image captioning

A thorough review of models, evaluation metrics, and datasets on image captioning

Deep Vision Multimodal Learning: Methodology, Benchmark, and Trend

Rethinking the Reference-based Distinctive Image Captioning

Contact Info

Product

Resources

About