Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022)
DOI: 10.18653/v1/2022.emnlp-main.280

Information-Theoretic Text Hallucination Reduction for Video-grounded Dialogue

Abstract: Video-grounded Dialogue (VGD) aims to decode an answer sentence to a question regarding a given video and dialogue context. Despite the recent success of multi-modal reasoning in generating answer sentences, existing dialogue systems still suffer from a text hallucination problem, which denotes indiscriminate text-copying from input texts without an understanding of the question. This arises from learning spurious correlations, since answer sentences in the dataset usually include the words of the input texts…
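The abstract is truncated before the method itself, but the information-theoretic framing named in the title can be illustrated with a minimal, hedged sketch: penalize an estimate of the mutual information between the generated answer's representation and the input-text representation, so the decoder is discouraged from indiscriminate copying. This is not the paper's implementation; CopyCritic, mi_lower_bound, and lambda_mi below are hypothetical names, and the estimator is a generic MINE-style (Donsker-Varadhan) lower bound.

```python
# Hedged sketch of an information-theoretic "anti-copying" regularizer for a
# sequence-to-sequence dialogue model. Not the THAM implementation; all names
# here are illustrative assumptions.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CopyCritic(nn.Module):
    """Small critic scoring (answer, input-text) representation pairs."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, ans_repr: torch.Tensor, txt_repr: torch.Tensor) -> torch.Tensor:
        # ans_repr, txt_repr: (batch, dim) pooled representations
        return self.net(torch.cat([ans_repr, txt_repr], dim=-1)).squeeze(-1)

def mi_lower_bound(critic: CopyCritic, ans_repr: torch.Tensor,
                   txt_repr: torch.Tensor) -> torch.Tensor:
    """Donsker-Varadhan lower bound on I(answer; input text).

    Joint pairs use aligned rows; marginal pairs shuffle the input-text
    representations within the batch.
    """
    joint = critic(ans_repr, txt_repr)                            # (B,)
    perm = torch.randperm(txt_repr.size(0), device=txt_repr.device)
    marginal = critic(ans_repr, txt_repr[perm])                   # (B,)
    # E_joint[T] - log E_marginal[exp(T)]
    return joint.mean() - (torch.logsumexp(marginal, dim=0) - math.log(marginal.size(0)))

def training_loss(lm_logits: torch.Tensor, labels: torch.Tensor,
                  ans_repr: torch.Tensor, txt_repr: torch.Tensor,
                  critic: CopyCritic, lambda_mi: float = 0.1) -> torch.Tensor:
    """Standard generation cross-entropy plus an MI penalty that discourages
    the decoder from blindly copying the dialogue history."""
    ce = F.cross_entropy(lm_logits.view(-1, lm_logits.size(-1)),
                         labels.view(-1), ignore_index=-100)
    mi = mi_lower_bound(critic, ans_repr, txt_repr)
    return ce + lambda_mi * mi
```

In such a setup the critic would typically be trained to maximize the bound (so it remains a tight MI estimate) while the generator minimizes the combined loss; the weight lambda_mi trades answer fluency against copying behavior.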

Cited by 2 publications (4 citation statements)
References 17 publications (27 reference statements)
“…HEAR shows state-of-the-art performance on all the metrics compared to previous works (please refer to Related Work for their detailed descriptions). Our baseline DLM is the T5 Transformer (Raffel et al., 2020), which is the same baseline (i.e., T5RLM) as THAM (Yoon et al., 2022c), but here our proposed SAL shows larger gains, and further improvements are obtained by applying RLE. As our proposed HEAR operates in a model-agnostic manner, we also validate other VGD models with HEAR in Table 2.…”
Section: Results on AVSD Benchmark (mentioning)
confidence: 99%
“…For a sensible decision in SAL, we introduce two technical contributions: (1) Keyword-based Audio Sensing and (2) a Semantic Neural Estimator. HEAR is applied to current runner models (Hori et al., 2019a; Yoon et al., 2022c; Li et al., 2021b) in a model-agnostic manner, and its effectiveness is validated on VGD datasets (i.e., AVSD@DSTC7, AVSD@DSTC8) with steady performance gains on natural language generation metrics.…”
Section: Our Experimental Evidence (mentioning)
confidence: 99%