Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) 2020
DOI: 10.18653/v1/2020.emnlp-main.162

Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision

Abstract: Humans learn language by listening, speaking, writing, reading, and also, via interaction with the multimodal real world. Existing language pre-training frameworks show the effectiveness of text-only self-supervision while we explore the idea of a visually-supervised language model in this paper. We find that the main reason hindering this exploration is the large divergence in magnitude and distributions between the visually-grounded language datasets and pure-language corpora. Therefore, we develop a techniq…

Cited by 61 publications (90 citation statements)
References 47 publications (47 reference statements)
“…Finally, while predicting words across multiple timescales may be an effective learning objective for language acquisition, it is by no means the only feasible objective. Thus, it is likely that the brain relies on additional simple objectives at different timescales to facilitate learning 59,63.…”
Section: Discussion (mentioning)
confidence: 99%
“…Future studies, however, should assess whether these cognitively plausible, prediction-based, feedback signals are indeed available at a young age as we learn language, and whether the brain can use such predictive signals to guide language acquisition. Further, while next-word prediction may be an effective learning objective for language acquisition, it is not the only feasible objective - the brain may optimize additional simple objectives, at different timescales, to facilitate learning 18,62.…”
Section: Next-word Prediction (mentioning)
confidence: 99%
“…Distributional evidence used in extrapolations presented and quoted in this paper belongs to the World Scope of the written world. Bringing in perceptual grounding has - to some extent - been done in the visually supervised language model (Tan & Bansal, 2020), with work on interactive and interpersonal grounding largely still to follow (Bisk et al., 2020, pp. 8721-8725).…”
Section: Discussion (mentioning)
confidence: 99%
“…We refer interested readers in this research direction to Bisk et al. [156]. Vision-language learning has recently improved GLUE benchmark performance [157]. ConVIRT [97] is an interesting direction to bring a language "grounding" to medical image processing.…”
Section: Discussion (mentioning)
confidence: 99%