Improving Audio-Language Learning with MixGen and Multi-Level Test-Time Augmentation

Kim, Eungbeom; Kim, Jinhee; Oh, Yoori; Kim, Kyungsu; Park, Minju; Sim, Jaeheon; Lee, Jun Young; Lee, Kyogu

doi:10.48550/arxiv.2210.17143

Cited by 2 publications

(2 citation statements)

References 15 publications


“…TTA has been widely demonstrated to enhance model accuracy and robustness (Krizhevsky, Sutskever & Hinton, 2012; Matsunaga et al., 2017b; Cohen, Rosenfeld & Kolter, 2019b), address distribution shift issues (Zhang, Levine & Finn, 2022), and defend against adversarial attacks (Prakash et al., 2018; Gao et al., 2020). Researchers have proposed various TTA methods in different domains, including image segmentation (Moshkov et al., 2020), text grammar correction (Yang et al., 2022), text classification (Lu et al., 2022), audio-text retrieval (Kim et al., 2022), theoretical research (Kimura, 2021; Kim, Kim & Kim, 2020), and uncertainty estimation (Conde et al., 2023; Conde & Premebida, 2022).…”

Section: Related Work (mentioning)

confidence: 99%

STTA: enhanced text classification via selective test-time augmentation

Xiong, Zhang, Yang et al. 2023. PeerJ Computer Science


Test-time augmentation (TTA) is a well-established technique that involves aggregating transformed examples of test inputs during the inference stage. The goal is to enhance model performance and reduce the uncertainty of predictions. Despite its advantages of not requiring additional training or hyperparameter tuning, and being applicable to any existing model, TTA is still in its early stages in the field of NLP. This is partly due to the difficulty of discerning the contribution of different transformed samples, which can negatively impact predictions. In order to address these issues, we propose Selective Test-Time Augmentation, called STTA, which aims to select the most beneficial transformed samples for aggregation by identifying reliable samples. Furthermore, we analyze and empirically verify why TTA is sensitive to some text data augmentation methods and reveal why some data augmentation methods lead to erroneous predictions. Through extensive experiments, we demonstrate that STTA is a simple and effective method that can produce promising results in various text classification tasks.
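The aggregation step described in this abstract can be sketched as follows. This is a minimal illustration, not the STTA method itself: `tta_predict`, `toy_model`, the two transforms, and the `min_confidence` filter are all hypothetical names and choices introduced here, and the confidence threshold is only a stand-in for a selection rule, since the abstract does not specify how STTA identifies reliable samples.

```python
from typing import Callable, List, Sequence

Probs = List[float]


def tta_predict(
    model: Callable[[str], Probs],
    x: str,
    transforms: Sequence[Callable[[str], str]],
    min_confidence: float = 0.0,
) -> Probs:
    """Average class probabilities over the input and its transformed copies.

    min_confidence is an assumed, illustrative filter: a transformed copy is
    aggregated only if the model's top probability on it reaches the threshold.
    """
    # The prediction on the untransformed input is always kept.
    kept = [model(x)]
    for transform in transforms:
        probs = model(transform(x))
        if max(probs) >= min_confidence:
            kept.append(probs)
    # Element-wise mean across all kept predictions.
    return [sum(column) / len(kept) for column in zip(*kept)]


def toy_model(text: str) -> Probs:
    # Hypothetical stand-in classifier: P(positive) grows with the count of "good".
    p = min(1.0, 0.25 + 0.25 * text.lower().split().count("good"))
    return [1.0 - p, p]


# Hypothetical text transforms used only for this demonstration.
transforms = [str.upper, lambda s: s + " good"]

print(tta_predict(toy_model, "a good film", transforms))
print(tta_predict(toy_model, "a good film", transforms, min_confidence=0.7))
```

With `min_confidence=0.0` every transformed copy contributes equally, which corresponds to plain TTA; raising the threshold drops low-confidence copies before averaging.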


“…The multi-modal retrieval task has been studied using various modalities such as image-text retrieval (Zhang et al., 2020; Cheng et al., 2022; Luo et al., 2022; Xuan and Chen, 2023), video-text (Gorti et al., 2022), audio-image (Xu, 2020; Nakatsuka et al., 2023), video-audio (Surís et al., 2018; Gu et al., 2023), and audio-text (Kim et al., 2022; Xin et al., 2023). Particularly, CLIP4CLIP (Luo et al., 2022) performs well in the video-text retrieval task by calculating the similarities between the features of each modality obtained from the encoder, and X-CLIP expands CLIP4CLIP and proposes a multi-grained regulation function to improve performance.…”

Section: Related Work (mentioning)

confidence: 99%

Sound of Story: Multi-modal Storytelling with Audio

Bae, Jeong, Kang et al. 2023. Findings of the Association for Computational Linguistics: EMNLP 2023


Storytelling is multi-modal in the real world. When one tells a story, one may use all of the visualizations and sounds along with the story itself. However, prior studies on storytelling datasets and tasks have paid little attention to sound even though sound also conveys meaningful semantics of the story. Therefore, we propose to extend story understanding and telling areas by establishing a new component called background sound which is story context-based audio without any linguistic information. For this purpose, we introduce a new dataset, called Sound of Story (SoS), which has paired image and text sequences with corresponding sound or background music for a story. To the best of our knowledge, this is the largest well-curated dataset for storytelling with sound. Our SoS dataset consists of 27,354 stories with 19.6 images per story and 984 hours of speech-decoupled audio such as background music and other sounds. As benchmark tasks for storytelling with sound and the dataset, we propose retrieval tasks between modalities, and audio generation tasks from image-text sequences, introducing strong baselines for them. We believe the proposed dataset and tasks may shed light on the multi-modal understanding of storytelling in terms of sound. The dataset and baseline code for each task will be released at: https://github.com/Sosdatasets/SoS_Dataset.
