Proceedings of the 13th International Conference on Natural Language Generation 2020
DOI: 10.18653/v1/2020.inlg-1.38
From “Before” to “After”: Generating Natural Language Instructions from Image Pairs in a Simple Visual Domain

Robin Rojowiec,
Jana Götze,
Philipp Sadler
et al.

Abstract: While certain types of instructions can be compactly expressed via images, there are situations where one might want to verbalise them, for example when directing someone. We investigate the task of Instruction Generation from Before/After Image Pairs, which is to derive from images an instruction for effecting the implied change. For this, we make use of prior work on instruction following in a visual environment. We take an existing dataset, the BLOCKS data collected by Bisk et al. (2016), and investigate whe…

Cited by 1 publication (2 citation statements). References 22 publications.
“…We present a transformer-based generation model with a simple but novel difference attention head designed to visually ground complex locative expressions and target-landmark references in image pairs. We show that our model clearly exceeds the performance of Rojowiec et al. (2020)'s existing baseline models on this task, in greatly improving the accuracy of generated target and landmark references. In contrast to other recent instruction generation models (Fried et al., 2017; Köhn et al., 2020; Schumann and Riezler, 2021), our approach does not use any symbolic representations of scene states and trajectories.…”
Section: Introduction
confidence: 77%
“…For landmarks, there might be several blocks mentioned by different crowd-workers. Since the blocks are generally referred to by their logos, the targets in BLOCKS can be detected in human and generated captions with a simple, rule-based instruction parser (Rojowiec et al., 2020). In Spot-the-diff, there might be several target objects referred to by a more complex vocabulary, e.g.…”
Section: Training and Hyperparameters
confidence: 99%