Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.408
Decoding Language Spatial Relations to 2D Spatial Arrangements

Abstract: We address the problem of multimodal spatial understanding by decoding a set of language-expressed spatial relations to a set of 2D spatial arrangements in a multi-object and multi-relationship setting. We frame the task as arranging a scene of clip-arts given a textual description. We propose a simple and effective model architecture SPATIAL-REASONING BERT (SR-BERT), trained to decode text to 2D spatial arrangements in a non-autoregressive manner. SR-BERT can decode both explicit and implicit language to 2D spa…
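The abstract's key idea is non-autoregressive decoding: every object's 2D position is predicted in a single forward pass, rather than one object at a time conditioned on previous predictions. A minimal sketch of that decoding pattern, assuming a pooled text embedding and a hypothetical learned projection `W` (this is an illustration, not the authors' SR-BERT architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def non_autoregressive_decode(text_emb, n_objects, W):
    """Predict all object positions in one pass (no left-to-right loop).

    text_emb : (d,) pooled embedding of the scene description (assumed given)
    W        : (n_objects * 2, d) hypothetical learned projection
    Returns an (n_objects, 2) array of (x, y) coordinates in (0, 1).
    """
    logits = W @ text_emb                   # one shot for every object at once
    coords = 1.0 / (1.0 + np.exp(-logits))  # sigmoid squashes to the unit canvas
    return coords.reshape(n_objects, 2)

d, n = 16, 3
W = rng.normal(size=(n * 2, d))
emb = rng.normal(size=d)
coords = non_autoregressive_decode(emb, n, W)
print(coords.shape)  # (3, 2)
```

Because no object waits on another's prediction, decoding cost does not grow with an autoregressive chain over objects, which is the practical appeal of this formulation.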

Cited by 8 publications (4 citation statements)
References 34 publications
“…In HICO-DET the gain is less outspoken but note that because of their architecture CNN features already encode some spatial information. A multi-layer perceptron provided similar results as a classifier. Late fusion was additionally considered, which fared slightly worse than early fusion with either type of classifier.…”
Section: Spatial Plus Visual
confidence: 85%
“…When predicting the pose offset for an object, it only has information about the language instruction and previously predicted objects. This baseline is similar to the previous transformer models used in floor plan generation and clip-art generation work [51], [52], where the placements of objects are less spatially constrained. • No Structure: This variant of our pose generator directly predicts 6-DoF pose offset of each object in the world frame without predicting and using the virtual structure frame.…”
Section: A Model Component Testing
confidence: 91%
“…The applications of MHA in computer vision are rapidly expanding [22]. So far, MHA has been applied in conjunction with a CNN [6,15,31,42], as a stand-alone MHA over low level, raw image pixels [5,10,13,47], for vision + text tasks [28,43], or tasks involving spatial reasoning [36] to name a few. In contrast, we apply MHA: (i) over high level, abstract, spatio-temporal layouts, and (ii) to fuse the features of two distinct modalities (layout and appearance).…”
Section: Related Work
confidence: 99%
“…Some of the limitations include: (i) With a two-branch model, fusion is performed by concatenation, not fully exploiting the complementarity of the spatio-temporal layouts and the video appearance [4,30,49], and (ii) the layout module is treated as a peripheral component [20,49], so it remains unclear to what extent in different evaluation settings (compositional, few-shot, background cluttered videos), a well assembled layout-based model can recognize human actions. At the same time, a multi-head attention model [45] has been demonstrated to be a powerful common-sense reasoning tool over sets of spatially distributed objects in images for visual question-answering [28,43], layout generation [36], etc. By applying multiple heads of beyond-pairwise spatial reasoning, it encapsulates the scene's global spatial context, which is indicative of its semantics, to a certain extent.…”
Section: Introduction
confidence: 99%
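The last two excerpts describe multi-head attention (MHA) applied over sets of spatially distributed objects, where each head reasons about pairwise relations between object layout features. A minimal sketch of that mechanism, using identity projections for brevity (a real model learns per-head Q/K/V projections; the feature matrix `X` here is a hypothetical stand-in for per-object layout features):

```python
import numpy as np

def multi_head_attention(X, n_heads):
    """Scaled dot-product attention over a set of object layout features.

    X : (n_objects, d) one feature vector per object (e.g. box + class),
        standing in for the spatio-temporal layouts the excerpts mention.
    Identity projections are used for brevity; real models learn Q/K/V maps.
    Returns an (n_objects, d) array of context-enriched object features.
    """
    n, d = X.shape
    dh = d // n_heads
    out = []
    for h in range(n_heads):
        Q = K = V = X[:, h * dh:(h + 1) * dh]       # split features into heads
        scores = Q @ K.T / np.sqrt(dh)              # pairwise object relations
        attn = np.exp(scores - scores.max(axis=1, keepdims=True))
        attn /= attn.sum(axis=1, keepdims=True)     # softmax over the object set
        out.append(attn @ V)                        # aggregate global context
    return np.concatenate(out, axis=1)              # (n_objects, d)

X = np.arange(12.0).reshape(3, 4)
Y = multi_head_attention(X, n_heads=2)
print(Y.shape)  # (3, 4)
```

Each output row mixes information from every other object, which is why the excerpts describe MHA as capturing the scene's global spatial context rather than only pairwise cues.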