Case Relation Transformer: A Crossmodal Language Generation Model for Fetching Instructions

Kambara, Motonari; Sugiura, Komei

doi:10.1109/lra.2021.3107026

Cited by 7 publications

(8 citation statements)

References 17 publications

“…It introduces linguistic and generation branches to model the relationship between subwords and achieves subword-level attention. Case Relation Transformer [13] is a model that generates fetching instruction sentences including the spatial referring expressions of target objects and destinations. It introduces a transformer-based encoder-decoder architecture to fuse the visual and geometric features of the objects in images.…”

Section: Related Work (mentioning)

confidence: 99%

Affective Image Captioning for Visual Artworks Using Emotion-Based Cross-Attention Mechanisms

Ishikawa, Sugiura

2023

IEEE Access

Self Cite

Within the museum community, the automatic generation of artwork description is expected to accelerate the improvement of accessibility for visually impaired visitors. Captions that describe artworks should be based on emotions because art is inseparable from viewers' emotional reactions. By contrast, artworks typically do not have unique interpretations; thus, it is difficult for systems to reflect the specified emotions in captions precisely. Most existing methods attempt to leverage predicted emotion labels from images to generate emotion-oriented captions; however, they do not allow users to specify arbitrary emotions. We propose an affective visual encoder, which integrates emotion attributes and cross-modal joint features of images into visual information over all encoder blocks. Moreover, we introduce affective tokens that fuse grid- and region-based image features to cover both contextual and object-level information. We validated our method on the ArtEmis dataset, and the results demonstrated that our method outperformed baseline methods on all metrics in the emotion-conditioned task.

“…Numerous studies have been conducted in the field of image captioning (Xu et al, 2015; Herdade et al, 2019; Cornia et al, 2020; Luo et al, 2021; Li et al, 2022), a crucial area of research that has been further extended and applied in the sphere of robotics (Magassouba et al, 2019; Ogura et al, 2020; Kambara et al, 2021). Multi-ABN (Magassouba et al, 2019) is a model for generating fetching instructions for domestic service robots using multiple images from various viewpoints.…”

Section: B Applications Of Image Captioning (mentioning)

confidence: 99%

“…CRT (Kambara et al, 2021) is a model for generating fetching instructions including the spatial referring expressions of target objects and destinations. It introduces Transformer-based encoder-decoder architecture to fuse the visual and geometric features of the objects in images.…”

Section: B Applications Of Image Captioning (mentioning)

confidence: 99%

JaSPICE: Automatic Evaluation Metric Using Predicate-Argument Structures for Image Captioning Models

Wada, Kaneda, Sugiura

2023

Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL)

Image captioning studies heavily rely on automatic evaluation metrics such as BLEU and METEOR. However, such n-gram-based metrics have been shown to correlate poorly with human evaluation, leading to the proposal of alternative metrics such as SPICE for English; however, no equivalent metrics have been established for other languages. Therefore, in this study, we propose an automatic evaluation metric called JaSPICE, which evaluates Japanese captions based on scene graphs. The proposed method generates a scene graph from dependencies and the predicate-argument structure, and extends the graph using synonyms. We conducted experiments employing 10 image captioning models trained on STAIR Captions and PFN-PIC and constructed the Shichimi dataset, which contains 103,170 human evaluations. The results showed that our metric outperformed the baseline metrics for the correlation coefficient with the human evaluation.

“…Image captioning has been extensively studied and applied to various applications in society, such as generating fetching instructions for robots, assisting blind people, and answering questions from images (Magassouba et al, 2019; Ogura et al, 2020; Kambara et al, 2021; Gurari et al, 2020; White et al, 2021; Fisch et al, 2020). In this field, it is important that the quality of the generated captions is evaluated appropriately.…”

Section: Introduction (mentioning)

confidence: 99%

Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL)

2023

Language models (LMs) have been argued to overlap substantially with human beings in grammaticality judgment tasks. But when humans systematically make errors in language processing, should we expect LMs to behave like cognitive models of language and mimic human behavior? We answer this question by investigating LMs' more subtle judgments associated with "language illusions": sentences that are vague in meaning, implausible, or ungrammatical but receive unexpectedly high acceptability judgments by humans. We looked at three illusions: the comparative illusion (e.g. "More people have been to Russia than I have"), the depth-charge illusion (e.g. "No head injury is too trivial to be ignored"), and the negative polarity item (NPI) illusion (e.g. "The hunter who no villager believed to be trustworthy will ever shoot a bear"). We found that probabilities represented by LMs were more likely to align with human judgments of being "tricked" by the NPI illusion, which examines a structural dependency, compared to the comparative and the depth-charge illusions, which require sophisticated semantic understanding. No single LM or metric yielded results that are entirely consistent with human behavior. Ultimately, we show that LMs are limited both in their construal as cognitive models of human language processing and in their capacity to recognize nuanced but critical information in complicated language materials.
