2020
DOI: 10.48550/arxiv.2001.06354
Preprint

Modality-Balanced Models for Visual Dialogue

Abstract: The Visual Dialog task requires a model to exploit both image and conversational context information to generate the next response to the dialogue. However, via manual analysis, we find that a large number of conversational questions can be answered by only looking at the image without any access to the context history, while others still need the conversation context to predict the correct answers. We demonstrate that due to this reason, previous joint-modality (history and image) models over-rely on and are …

Cited by 2 publications (1 citation statement)
References 18 publications
“…Audio-visual learning: similar to its applications in natural language processing (NLP) and visual question & answering systems [Hannan et al. 2020; Kim et al. 2020, 2016], multi-modal learning using both audio and visual sensory inputs has also been used for classification tasks [Sterling et al. 2018; Wilson et al. 2019], audio-visual zooming [Nair et al. 2019], and sound source separation [Ephrat et al. 2018; Lee and Seung 2000], which have also isolated waves for specific generation tasks. Although similar in spirit, our audio-visual method, "Echoreconstruction," differs from the existing methods by learning absorption and reflectance properties to detect a reflective surface, its depth, and material.…”
Section: Acoustic Imaging and Audio-based Classifiers
Citation type: mentioning (confidence: 99%)