MIMOQA: Multimodal Input Multimodal Output Question Answering

Singh, Hrituraj; Nasery, Anshul; Mehta, Denil; Agarwal, Aishwarya; Lamba, Jatin; Srinivasan, Balaji Vasan

doi:10.18653/v1/2021.naacl-main.418

Cited by 15 publications

(7 citation statements)

References 30 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…MANYMODALQA (Hannan et al, 2020) requires reasoning over prior knowledge, images, and databases. MIMOQA (Singh et al, 2021b) is an example of multimodal responses, where answers are image-text pairs.…”

Section: Antol Et Al (2015) Andmentioning

confidence: 99%

Multimodal Conversational AI: A Survey of Datasets and Approaches

Sundar¹,

Heck²

2022

Proceedings of the 4th Workshop on NLP for Conversational AI

View full text Add to dashboard Cite

As humans, we experience the world with all our senses or modalities (sound, sight, touch, smell, and taste). We use these modalities, particularly sight and touch, to convey and interpret specific meanings. Multimodal expressions are central to conversations; a rich set of modalities amplify and often compensate for each other. A multimodal conversational AI system answers questions, fulfills tasks, and emulates human conversations by understanding and expressing itself via multiple modalities. This paper motivates, defines, and mathematically formulates the multimodal conversational research objective. We provide a taxonomy of research required to solve the objective: multimodal representation, fusion, alignment, translation, and co-learning. We survey state-of-the-art datasets and approaches for each research area and highlight their limiting assumptions. Finally, we identify multimodal co-learning as a promising direction for multimodal conversational AI research.

show abstract

Section: Antol Et Al (2015) Andmentioning

confidence: 99%

Multimodal Conversational AI: A Survey of Datasets and Approaches

Sundar¹,

Heck²

2022

Proceedings of the 4th Workshop on NLP for Conversational AI

View full text Add to dashboard Cite

show abstract

“…Even though there are some available multimodal QA datasets in non-clinical domains (Hannan et al, 2020;Chen et al, 2020;Talmor et al, 2021), but there are no existing multimodal QA datasets which uses structured with unstructured EHR data to answer questions. There are some existing works in the clinical genre on multimodal understanding from text-image pairs (Moon et al, 2021;Khare et al, 2021;Li et al, 2020) as well as clinical QA (Singh et al, 2021) on text-image data. But as far as the authors' knowledge, so far there is no multi-modal clinical dataset that encorporates structured and unstructured EHR data for QA.…”

Section: Related Workmentioning

confidence: 99%

DrugEHRQA: A Question Answering Dataset on Structured and Unstructured Electronic Health Records For Medicine Related Queries

Bardhan¹,

Colas²,

Roberts³

et al. 2022

Preprint

View full text Add to dashboard Cite

This paper develops the first question answering dataset (DrugEHRQA) containing question-answer pairs from both structured tables and unstructured notes from a publicly available Electronic Health Record (EHR). EHRs contain patient records, stored in structured tables and unstructured clinical notes. The information in structured and unstructured EHRs is not strictly disjoint: information may be duplicated, contradictory, or provide additional context between these sources. Our dataset has medication-related queries, containing over 70,000 question-answer pairs. To provide a baseline model and help analyze the dataset, we have used a simple model (MultimodalEHRQA) which uses the predictions of a modality selection network to choose between EHR tables and clinical notes to answer the questions. This is used to direct the questions to the table-based or text-based state-of-the-art QA model. In order to address the problem arising from complex, nested queries, this is the first time Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers (RAT-SQL) has been used to test the structure of query templates in EHR data. Our goal is to provide a benchmark dataset for multi-modal QA systems, and to open up new avenues of research in improving question answering over EHR structured data by using context from unstructured clinical data.

show abstract

“…Later, OK-VQA (Marino et al, 2019) enlarged VQA's scope to annotate questions requiring both image and implicit textual/common-sense knowledge to answer. More recently, MuMuQA (Reddy et al, 2021), ManyModelQA (Hannan et al, 2020) and MIMOQA (Singh et al, 2021) provide questions which require reasoning over images and explicitly provided text snippets. However, these datasets are restricted to dealing with given text and images without requiring any retrieval from the web: they are analogous to machine-reading approaches to QA from text like SQuAD, rather than open-book QA.…”

Section: Related Workmentioning

confidence: 99%

MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text

Chen¹,

Hu²,

Chen³

et al. 2022

Preprint

View full text Add to dashboard Cite

While language Models store a massive amount of world knowledge implicitly in their parameters, even very large models often fail to encode information about rare entities and events, while incurring huge computational costs. Recently, retrieval-augmented models, such as REALM, RAG, and RETRO, have incorporated world knowledge into language generation by leveraging an external nonparametric index and have demonstrated impressive performance with constrained model sizes. However, these methods are restricted to retrieving only textual knowledge, neglecting the ubiquitous amount of knowledge in other modalities like images -much of which contains information not covered by any text. To address this limitation, we propose the first Multimodal Retrieval-Augmented Transformer (MuRAG), which accesses an external non-parametric multimodal memory to augment language generation. MuRAG is pretrained with a mixture of large-scale imagetext and text-only corpora using a joint contrastive and generative loss. We perform experiments on two different datasets that require retrieving and reasoning over both images and text to answer a given query: We-bQA, and MultimodalQA. Our results show that MuRAG achieves state-of-the-art accuracy, outperforming existing models by 10-20% absolute on both datasets and under both distractor and full-wiki settings.

show abstract

MIMOQA: Multimodal Input Multimodal Output Question Answering

Cited by 15 publications

References 30 publications

Multimodal Conversational AI: A Survey of Datasets and Approaches

Multimodal Conversational AI: A Survey of Datasets and Approaches

DrugEHRQA: A Question Answering Dataset on Structured and Unstructured Electronic Health Records For Medicine Related Queries

MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text

Contact Info

Product

Resources

About