2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018)
DOI: 10.1109/cvpr.2018.00522

Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering

Abstract: A number of studies have found that today's Visual Question Answering (VQA) models are heavily driven by superficial correlations in the training data and lack sufficient image grounding. To encourage development of models geared towards the latter, we propose a new setting for VQA where for every question type, train and test sets have different prior distributions of answers. Specifically, we present new splits of the VQA v1 and VQA v2 datasets, which we call Visual Question Answering under Changing Priors (…

Cited by 465 publications (719 citation statements)
References 25 publications
“…A similar phenomenon was observed in reading comprehension, where systems performed non-trivially well by using only the final sentence in the passage or ignoring the passage altogether (Kaushik & Lipton, 2018). Finally, multiple studies found nontrivial performance in visual question answering (VQA) by using only the question, without access to the image, due to question biases (Kafle & Kanan, 2016, 2017; Goyal et al., 2017; Agrawal et al., 2017).…”
Section: Fine-tuning On Target Datasets (mentioning)
confidence: 58%
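The statement above refers to question-only ("blind") baselines that exploit answer priors in VQA training data. Below is a minimal illustrative sketch, not any cited paper's model: it predicts the most frequent training answer for each crude question type (here, simply the first two words of the question), ignoring the image entirely. The toy data, the question_type heuristic, and the function names are all assumptions for illustration.

```python
# Minimal sketch of a "blind" VQA baseline: answer from question-conditioned
# answer priors alone, never looking at the image. Toy data is illustrative only.
from collections import Counter, defaultdict

def question_type(question: str, n_words: int = 2) -> str:
    """Crude question-type key: the first few words of the question."""
    return " ".join(question.lower().split()[:n_words])

def fit_prior_baseline(train_qa):
    """Map each question type to its most frequent training answer."""
    answers_by_type = defaultdict(Counter)
    for question, answer in train_qa:
        answers_by_type[question_type(question)][answer] += 1
    return {t: c.most_common(1)[0][0] for t, c in answers_by_type.items()}

def predict(baseline, question, fallback="yes"):
    """Answer without the image; fall back to a common answer for unseen types."""
    return baseline.get(question_type(question), fallback)

if __name__ == "__main__":
    train = [
        ("What color is the banana?", "yellow"),
        ("What color is the sky?", "blue"),
        ("What color is the grass?", "green"),
        ("Is the man smiling?", "yes"),
        ("Is the door open?", "yes"),
    ]
    test = [("What color is the lemon?", "yellow"), ("Is the cat sleeping?", "no")]
    baseline = fit_prior_baseline(train)
    correct = sum(predict(baseline, q) == a for q, a in test)
    print(f"blind-baseline accuracy on toy test set: {correct}/{len(test)}")
```

Even this trivial prior-based predictor scores above chance whenever test-time answer priors match those of training, which is precisely the loophole the VQA-CP splits are designed to close by changing the priors between train and test.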
“…It could also be interesting to extend this generic approach to other forms of captioning such as visual storytelling [38] or stylized captioning [56] by utilizing the easily available and weakly labelled data from the web.…”
Section: Results (mentioning)
confidence: 99%
“…In [11], the authors define that two captions are different if the ratio of common words between them is smaller than a threshold (3% is used in the paper). In [3], from the set of all the candidate captions, the authors compute the number of unique n-grams (n = 1, 2, 4) at each position, starting from the beginning up to position 13. We plot diversity using [11] in Figure 5d.…”
Section: Diversity (mentioning)
confidence: 99%
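Since the quoted passage describes two concrete diversity measures, a small hedged sketch may help. The tokenization, the choice of denominator in the word-overlap ratio, and all function names are assumptions; only the 3% threshold, the n-gram sizes (1, 2, 4), and the position cutoff of 13 are taken from the statement.

```python
# Sketch of the two caption-diversity measures described above (assumed details noted inline).
from itertools import combinations

def common_word_ratio(cap_a: str, cap_b: str) -> float:
    """Shared unique words relative to the smaller caption's vocabulary (denominator is an assumption)."""
    words_a, words_b = set(cap_a.lower().split()), set(cap_b.lower().split())
    return len(words_a & words_b) / max(1, min(len(words_a), len(words_b)))

def are_different(cap_a: str, cap_b: str, threshold: float = 0.03) -> bool:
    """Per [11]'s rule: captions count as different only if their common-word ratio is below the threshold."""
    return common_word_ratio(cap_a, cap_b) < threshold

def unique_ngrams_at_positions(captions, n=1, max_pos=13):
    """Per [3]'s idea: number of distinct n-grams starting at each position across all candidate captions."""
    counts = []
    for pos in range(max_pos):
        grams = set()
        for cap in captions:
            tokens = cap.lower().split()
            if pos + n <= len(tokens):
                grams.add(tuple(tokens[pos:pos + n]))
        counts.append(len(grams))
    return counts

if __name__ == "__main__":
    caps = [
        "a dog runs on the beach",
        "a puppy plays in the sand",
        "two people walk along the shore",
    ]
    # With a 3% threshold, captions sharing even a couple of words are treated as the same.
    print([are_different(a, b) for a, b in combinations(caps, 2)])
    print(unique_ngrams_at_positions(caps, n=2, max_pos=5))
```

The design choice worth noting is that the threshold-based measure judges pairwise distinctness, while the positional n-gram count summarizes diversity of the whole candidate set at once; the two are complementary rather than interchangeable.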
“…As a consequence, even a "blind" model can achieve satisfying results without truly understanding the questions and images. Many efforts, such as building more balanced datasets [120], [121] and enforcing more transparent model designs, have been made to alleviate this issue. Multi-modal fusion.…”
Section: B Exemplar Applications Of Data and Knowledge Fusion (mentioning)
confidence: 99%