2021
DOI: 10.1109/lra.2021.3107026

Case Relation Transformer: A Crossmodal Language Generation Model for Fetching Instructions

Citations: cited by 7 publications (8 citation statements)
References: 17 publications
“…It introduces linguistic and generation branches to model the relationship between subwords and achieves subword-level attention. Case Relation Transformer [13] is a model that generates fetching instruction sentences including the spatial referring expressions of target objects and destinations. It introduces a transformer-based encoder-decoder architecture to fuse the visual and geometric features of the objects in images.…”
Section: Related Work (mentioning)
confidence: 99%
“…Numerous studies have been conducted in the field of image captioning (Xu et al., 2015; Herdade et al., 2019; Cornia et al., 2020; Luo et al., 2021; Li et al., 2022), a crucial area of research that has been further extended and applied in the sphere of robotics (Magassouba et al., 2019; Ogura et al., 2020; Kambara et al., 2021). Multi-ABN (Magassouba et al., 2019) is a model for generating fetching instructions for domestic service robots using multiple images from various viewpoints.…”
Section: B. Applications of Image Captioning (mentioning)
confidence: 99%
“…CRT (Kambara et al., 2021) is a model for generating fetching instructions including the spatial referring expressions of target objects and destinations. It introduces Transformer-based encoder-decoder architecture to fuse the visual and geometric features of the objects in images.…”
Section: B. Applications of Image Captioning (mentioning)
confidence: 99%
“…Image captioning has been extensively studied and applied to various applications in society, such as generating fetching instructions for robots, assisting blind people, and answering questions from images (Magassouba et al., 2019; Ogura et al., 2020; Kambara et al., 2021; Gurari et al., 2020; White et al., 2021; Fisch et al., 2020). In this field, it is important that the quality of the generated captions is evaluated appropriately.…”
Section: Introduction (mentioning)
confidence: 99%