2021
DOI: 10.48550/arxiv.2103.13942
Preprint

Visual Grounding Strategies for Text-Only Natural Language Processing

Abstract: Visual grounding is a promising path toward more robust and accurate Natural Language Processing (NLP) models. Many multimodal extensions of BERT (e.g., VideoBERT, LXMERT, VL-BERT) allow a joint modeling of texts and images that leads to state-of-the-art results on multimodal tasks such as Visual Question Answering. Here, we leverage multimodal modeling for purely textual tasks (language modeling and classification) with the expectation that the multimodal pretraining provides a grounding that can improve text p…

Cited by 2 publications (4 citation statements)
References 27 publications
“…(transformers) on current sentence-level language tasks is still under debate (Yun et al., 2021; Iki and Aizawa, 2021; Tan and Bansal, 2020). While some approaches report slight improvements (Sileo, 2021), it is mostly believed that visually grounded transformer models such as VL-BERT (Su et al., 2019) not only bring no improvements for language tasks but they might distort the linguistic knowledge obtained from textual corpora for solving the natural language understanding tasks (Tan and Bansal, 2020; Yun et al., 2021). The main backbone of all transformers is stacking multiple attention layers (Vaswani et al., 2017), briefly explained in Section 6.…”
Section: Contextualized Visual Grounding (mentioning)
confidence: 99%
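
The passage above refers to the standard transformer backbone of stacked attention layers (Vaswani et al., 2017). A minimal sketch of such stacking, using PyTorch's built-in encoder modules with purely illustrative (assumed) dimensions, might look as follows:

```python
import torch
import torch.nn as nn

# Minimal sketch of a transformer encoder as a stack of attention layers
# (Vaswani et al., 2017). The dimensions below are illustrative assumptions,
# not taken from any of the models cited above.
d_model, n_heads, n_layers = 768, 12, 12

encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model,
    nhead=n_heads,
    dim_feedforward=4 * d_model,
    batch_first=True,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

tokens = torch.randn(2, 16, d_model)  # (batch, sequence, hidden) dummy embeddings
contextualized = encoder(tokens)      # same shape, contextualized by the stacked attention layers
print(contextualized.shape)           # torch.Size([2, 16, 768])
```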
“…Visual grounding has also been explored for machine translation; for instance, Elliott and Kádár (2017) add an auxiliary visual prediction loss in addition to the regular seq2seq objective which is shown to improve performance. More recently, Sileo (2021) investigates the extent to which visual-linguistic pretraining of multimodal transformers can improve performance on a set of text-only tasks. While these approaches suggest that visual grounding can be helpful for language tasks, our work more explicitly targets the question of how the additional modality can complement the textual signal.…”
Section: Visual Grounding For Improved NLP (mentioning)
confidence: 99%
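
The auxiliary-loss setup mentioned above (Elliott and Kádár, 2017) can be illustrated with a short multi-task objective sketch. The weighting factor lambda_vis and the cosine-distance form of the visual term are assumptions for illustration, not the cited paper's exact formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of a multi-task objective in the spirit of Elliott and Kádár (2017):
# a regular seq2seq (translation) loss plus an auxiliary visual prediction loss.
ce_loss = nn.CrossEntropyLoss()

def joint_loss(decoder_logits, target_ids, predicted_image_vec, true_image_vec,
               lambda_vis=0.5):
    # Standard token-level cross-entropy for the seq2seq objective.
    seq2seq = ce_loss(decoder_logits.view(-1, decoder_logits.size(-1)),
                      target_ids.view(-1))
    # Auxiliary grounding term: predict the image feature vector from the text
    # (cosine distance used here as an illustrative choice).
    visual = 1.0 - F.cosine_similarity(predicted_image_vec, true_image_vec,
                                       dim=-1).mean()
    return seq2seq + lambda_vis * visual
```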
“…There is previous work investigating the potential of leveraging multimodal information during training to enable a model to generate or retrieve additional multimodal information at inference time for a pure text input. Sileo (2021) uses the term associative grounding, which can be based on synthesis or retrieval. The main difference between our work and Sileo's is that he develops a model based on retrieval, while we use feature synthesis.…”
Section: Augmenting Input Using Feature Prediction (mentioning)
confidence: 99%
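
The retrieval-versus-synthesis distinction drawn in the passage above can be made concrete with a small sketch. Both variants attach a visual representation to a text-only input at inference time; the names below (image_index, text_to_image_net) are hypothetical stand-ins, not identifiers from either paper:

```python
import numpy as np

# Sketch of the two associative-grounding variants discussed above.
# `image_index` is a hypothetical list of (caption_embedding, image_feature)
# pairs; `text_to_image_net` stands in for a trained synthesis model.

def ground_by_retrieval(text_vec, image_index):
    # Retrieval: return the stored image feature whose caption embedding is
    # closest (by cosine similarity) to the input text embedding.
    captions = np.stack([c for c, _ in image_index])
    images = np.stack([i for _, i in image_index])
    sims = captions @ text_vec / (
        np.linalg.norm(captions, axis=1) * np.linalg.norm(text_vec) + 1e-8)
    return images[np.argmax(sims)]

def ground_by_synthesis(text_vec, text_to_image_net):
    # Synthesis: generate a visual feature vector directly from the text.
    return text_to_image_net(text_vec)

# In either case, the grounded vector would be combined with the text
# representation before the downstream (text-only) task head.
```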