2022
DOI: 10.48550/arxiv.2201.12888
Preprint

A Dataset for Medical Instructional Video Classification and Question Answering

Abstract: This paper introduces a new challenge and datasets to foster research toward designing systems that can understand medical videos and provide visual answers to natural language questions. We believe medical videos may provide the best possible answers to many first aid, medical emergency, and medical education questions. Toward this, we created the MedVidCL and MedVidQA datasets and introduce the tasks of Medical Video Classification (MVC) and Medical Visual Answer Localization (MVAL), two tasks that focus on …

Cited by 3 publications (15 citation statements)
References 23 publications
“…The Medical Video Question Answering (MedVidQA) dataset [6] is the first video question answering (VQA) dataset [45] constructed for natural language video localization (NLVL) [21,46]; it pairs medical instructional videos with text question queries. Three medical informatics experts were asked to formulate medical and health-related instructional questions by watching the given videos.…”
Section: Datasets
Confidence: 99%
“…Following prior works [6,21,26,47,48], we adopt "R@n, IoU = 𝜇" and "mIoU" as the evaluation metrics, which treat localization of frames in the video as a span prediction task, similar to answer span prediction [49,50] in text-based question answering. "R@n, IoU = 𝜇" denotes the percentage of language queries having at least one result among the top-n retrieved moments whose Intersection over Union (IoU) with the ground truth is larger than 𝜇.…”
Section: Evaluation Metrics
Confidence: 99%
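The metrics quoted above can be sketched in a few lines. This is a minimal illustration, not the evaluation code of the cited works: the function names (`temporal_iou`, `recall_at_n`, `mean_iou`) and the sample spans are hypothetical, and segments are assumed to be (start, end) pairs in seconds with predictions ranked by score.

```python
def temporal_iou(pred, gold):
    """IoU between two (start, end) temporal segments."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_n(predictions, golds, n, mu):
    """"R@n, IoU = mu": fraction of queries with at least one of the
    top-n predicted moments whose IoU with the ground truth exceeds mu."""
    hits = sum(
        any(temporal_iou(p, g) >= mu for p in preds[:n])
        for preds, g in zip(predictions, golds)
    )
    return hits / len(golds)

def mean_iou(predictions, golds):
    """mIoU: average IoU of each query's top-1 predicted moment."""
    return sum(
        temporal_iou(preds[0], g) for preds, g in zip(predictions, golds)
    ) / len(golds)
```

For example, with ground truths `[(10, 20), (0, 5)]` and top-1 predictions `[(12, 22)], [(6, 9)]`, the first query overlaps by 8 s over a 12 s union (IoU ≈ 0.67) and the second not at all, so R@1 at 𝜇 = 0.5 is 0.5 and mIoU is about 0.33.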