2022
DOI: 10.33774/coe-2022-dfw80
Preprint

Learning Functional Distributional Semantics with Visual Data

Abstract: Functional Distributional Semantics is a recently proposed framework for learning distributional semantics that provides linguistic interpretability. It models the meaning of a word as a binary classifier rather than a numerical vector. In this work, we propose a method to train a Functional Distributional Semantics model with grounded visual data. We train it on the Visual Genome dataset, which is closer to the kind of data encountered in human language acquisition than a large text corpus. On four external e…
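To make the abstract's central contrast concrete, here is a minimal sketch (an illustration of the word-as-classifier idea, not the authors' implementation) in which each word's meaning is a binary classifier over a latent entity representation, returning the probability that the word truthfully applies to that entity. The names SemanticFunction and pixie_dim are assumptions introduced only for this sketch.

```python
import torch
import torch.nn as nn

# Minimal sketch (assumed, not the paper's code): a word's meaning is a
# binary classifier over a latent entity representation, rather than a
# point in a vector space.
class SemanticFunction(nn.Module):
    def __init__(self, pixie_dim: int):
        super().__init__()
        # Simplest possible truth-conditional classifier: one linear layer
        # followed by a sigmoid; richer parameterisations are possible.
        self.classifier = nn.Linear(pixie_dim, 1)

    def forward(self, entity: torch.Tensor) -> torch.Tensor:
        # Probability that the word is true of the given entity.
        return torch.sigmoid(self.classifier(entity)).squeeze(-1)

# Usage: one classifier per word in the vocabulary.
pixie_dim = 64
semantic_functions = {w: SemanticFunction(pixie_dim) for w in ["cat", "dog", "red"]}

entity = torch.randn(pixie_dim)            # a latent entity representation
p_cat = semantic_functions["cat"](entity)  # P("cat" is true of this entity)
print(float(p_cat))
```

The sketch only shows the word-as-classifier interface that distinguishes the framework from word-as-vector models; the full framework additionally trains these classifiers jointly with a probabilistic model over entities.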

Cited by 1 publication (4 citation statements)
References 3 publications
“…However, this is inefficient due to the low bandwidth of natural language communication: given the diversity of real-world visual tasks, describing all of the potential task-relevant information within a single image requires a huge number of language tokens. Therefore, many efforts opt to connect compact latent visual representations through a dense connector via visual instruction tuning, such as MiniGPT-4 (Zhu et al 2023), LLaVA (Liu, Emerson, and Collier 2022), Multimodal-GPT (Gong et al 2023), LLaMA-Adapter (Zhang et al 2023), Otter, mPLUG-Owl (Ye et al 2023), and InstructBLIP (Dai et al 2023). These models use linear projectors or perceivers as the connector between visual models and the LLM, and thus have a much larger information bandwidth than prompt-based natural language communication.…”
Section: Related Work (mentioning)
confidence: 99%
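To make the quoted distinction concrete, the sketch below shows a linear projector used as the connector between a vision encoder and an LLM: frozen patch features are mapped into the LLM's token-embedding space and concatenated with the text embeddings. This is a generic pattern, not the code of any model cited above; the names LinearProjector, vision_dim, and llm_dim, and the dimensions used, are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hedged sketch of a linear-projector connector: map vision-encoder patch
# features into the LLM embedding space so they can be fed to the LLM as
# "visual tokens" alongside the embedded text prompt.
class LinearProjector(nn.Module):
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        # returns:        (batch, num_patches, llm_dim)
        return self.proj(patch_features)

# Illustrative dimensions only (assumed, not taken from the cited papers).
vision_dim, llm_dim = 1024, 4096
projector = LinearProjector(vision_dim, llm_dim)

patches = torch.randn(2, 256, vision_dim)   # features from a frozen vision encoder
visual_tokens = projector(patches)          # now in the LLM embedding space
text_embeds = torch.randn(2, 32, llm_dim)   # embedded text prompt
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)
print(llm_inputs.shape)                     # torch.Size([2, 288, 4096])
```

A perceiver-style connector would replace the single linear layer with cross-attention over a fixed set of learned queries, compressing the visual input into a fixed, smaller number of tokens.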
“…However, it is unclear what data is optimal for training these connectors to ensure that they propagate visual information faithfully. Existing attempts include generating self-instruct (Wang et al 2022b) data (i.e., LLaVA (Liu, Emerson, and Collier 2022)), using image-text captioning datasets (e.g., COCO (Chen et al 2015), SBU (Ordonez, Kulkarni, and Berg 2011), CC-3M (Sharma et al 2018)), and unifying downstream vision-language datasets (e.g., VQA and visual reasoning datasets). Although the GPT-4-generated LLaVA dataset enjoys very high quality, its scale remains insufficient, and it cannot encourage fine-grained vision-language alignment, as it does not "make V in VQA matter" (Goyal et al 2017).…”
Section: Related Work (mentioning)
confidence: 99%
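As a purely illustrative view of what "unifying" such datasets can mean, the sketch below assumes a generic image/instruction/response record and shows how caption pairs and VQA pairs might be mapped into it; the schema and helper names (InstructionSample, from_caption, from_vqa) are hypothetical and not taken from any cited work.

```python
from dataclasses import dataclass

# Hedged illustration: a generic record format for connector training data,
# into which heterogeneous vision-language datasets can be converted.
@dataclass
class InstructionSample:
    image_path: str
    instruction: str
    response: str

def from_caption(image_path: str, caption: str) -> InstructionSample:
    # A caption pair becomes a "describe the image" instruction.
    return InstructionSample(image_path, "Describe the image briefly.", caption)

def from_vqa(image_path: str, question: str, answer: str) -> InstructionSample:
    # A VQA pair is already in instruction/response form.
    return InstructionSample(image_path, question, answer)

samples = [
    from_caption("coco/000000001.jpg", "A dog sleeping on a couch."),
    from_vqa("vqa/000000042.jpg", "What color is the couch?", "Blue"),
]
print(samples[0])
```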