Proceedings of the 29th ACM International Conference on Multimedia 2021
DOI: 10.1145/3474085.3475215
Group-based Distinctive Image Captioning with Memory Attention

Abstract: Describing images using natural language is widely known as image captioning, which has made consistent progress thanks to developments in computer vision and natural language generation techniques. Though conventional captioning models achieve high accuracy on popular metrics, e.g., BLEU, CIDEr, and SPICE, the ability of a caption to distinguish the target image from other similar images is under-explored. To generate distinctive captions, a few pioneers employ contrastive learning or re-weight the gro…

Cited by 16 publications (24 citation statements)
References 34 publications
“…Wang et al. [59] improved the distinctiveness of image captions using a group-based distinctive captioning model, and proposed a new evaluation metric, DisWordRate, to measure the distinctiveness of captions.…”
Section: The Recent Deep Learning Methods
Mentioning confidence: 99%
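The excerpt names the DisWordRate metric but does not define it. As a hedged illustration only, one plausible reading is the fraction of a caption's words that appear in no caption of the similar images; the following minimal sketch follows that assumption (the function name and whitespace tokenization are guesses, not the paper's definition):

```python
def dis_word_rate(caption, reference_captions):
    """Illustrative sketch of a distinctive-word rate.

    Assumption (not from the cited paper): the score is the fraction of
    words in `caption` that occur in none of the reference captions.
    """
    # Collect every word used by the captions of similar images.
    ref_words = set()
    for ref in reference_captions:
        ref_words.update(ref.lower().split())

    words = caption.lower().split()
    if not words:
        return 0.0

    # Words unique to this caption are counted as distinctive.
    distinctive = [w for w in words if w not in ref_words]
    return len(distinctive) / len(words)
```

Under this reading, a caption scores higher the more of its vocabulary is absent from captions of semantically similar images.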
“…The multi-perspective captioning task [236] considers text and image inputs from different viewpoints. The distinctive image captioning task [237] is defined to describe the unique objects or context of an image so as to distinguish it from other semantically similar images. The diverse caption and rich image generation task [238] proposes a bidirectional image and text generation task that aligns rich images with their multiple corresponding captions, aiming both to generate multiple distinct sentences from one image and to generate more suitable images from multiple sentences.…”
Section: Multimodal Task Diversity
Mentioning confidence: 99%
“…For human-like distinctive captioning, a recent work proposes to study the DIC task based on a group of semantically similar reference images, dubbed Reference-based DIC (Ref-DIC).…”
[Figure 1 caption fragment: comparison with [43]; (b) selected reference images for the same target image using a two-stage matching mechanism.]
Section: Introduction
Mentioning confidence: 99%
“…Compared to Single-DIC, the generated captions are only asked to distinguish the target image from the group of reference images, i.e., group-level distinctiveness. Unfortunately, the reference images used in existing Ref-DIC works [43] can be trivially distinguished: they resemble the target image only at the scene level and share few common objects, so Ref-DIC models can generate distinctive captions even without considering the reference images. For example, in Figure 1 (a), the target and reference images have no object in common (e.g., "towel", "shower curtain", or "toilet"); since each object in the target image is unique, the Ref-DIC model can trivially generate "a bathroom with a towel" to tell the target and reference images apart.…”
Section: Introduction
Mentioning confidence: 99%
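The argument quoted above, that a caption is trivially distinctive when it names an object absent from every reference image, can be illustrated with a small sketch; the function name and the object-set input representation are hypothetical, not from the cited work:

```python
def is_trivially_distinctive(caption_objects, reference_object_sets):
    """Illustrative check for the trivial case described in the excerpt.

    `caption_objects`: set of object words mentioned in the caption.
    `reference_object_sets`: one set of object words per reference image.
    """
    # Objects appearing in any reference image.
    ref_objects = set().union(*reference_object_sets) if reference_object_sets else set()
    # The caption trivially tells the images apart if it names an
    # object that no reference image contains.
    return any(obj not in ref_objects for obj in caption_objects)
```

On the excerpt's bathroom example, a caption mentioning "towel" already succeeds when no reference image contains a towel, which is why such reference groups make the Ref-DIC task too easy.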