2022
DOI: 10.48550/arxiv.2210.17143
Preprint

Improving Audio-Language Learning with MixGen and Multi-Level Test-Time Augmentation

Abstract: In this paper, we propose two novel augmentation methods for audio-language learning: 1) audio-language MixGen (AL-MixGen) and 2) multi-level test-time augmentation (Multi-TTA). Inspired by MixGen, which was originally applied to vision-language learning, we introduce an augmentation method for the audio-language domain. We also explore the impact of test-time augmentations and present Multi-TTA, which generalizes test-time augmentation over multiple layers of a deep learning model. Incorporating AL-MixGen and Mul…
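
The two ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the details of AL-MixGen or Multi-TTA, so the sketch assumes the recipe of the original vision-language MixGen (interpolate the two inputs, concatenate the two captions) and shows only plain input-level test-time augmentation (average the outputs over augmented views), not the multi-level variant. The names `mixgen_pair`, `tta_average`, `lam`, `embed`, and `augmentations` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Waveform = List[float]


def mixgen_pair(
    wave_a: Sequence[float],
    cap_a: str,
    wave_b: Sequence[float],
    cap_b: str,
    lam: float = 0.5,
) -> Tuple[Waveform, str]:
    """MixGen-style sample: interpolate two signals, concatenate their captions."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    # Assumption: clips of unequal length are truncated to the shorter one
    # (zip stops at the end of the shorter sequence).
    mixed = [lam * a + (1.0 - lam) * b for a, b in zip(wave_a, wave_b)]
    return mixed, f"{cap_a} {cap_b}"


def tta_average(
    embed: Callable[[Waveform], List[float]],
    wave: Sequence[float],
    augmentations: Sequence[Callable[[Waveform], Waveform]],
) -> List[float]:
    """Input-level TTA: embed the original and each augmented view, then average."""
    views = [list(wave)] + [aug(list(wave)) for aug in augmentations]
    embeddings = [embed(view) for view in views]
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
```

For example, `mixgen_pair([1.0, 1.0], "a dog barks", [0.0, 0.0, 0.0], "rain falls")` yields a two-sample mixture paired with the caption `"a dog barks rain falls"`. In a retrieval setting, `embed` would be the audio encoder and the averaged embedding would replace the single-view embedding when computing similarities.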

Cited by 2 publications (2 citation statements)
References 15 publications
“…TTA has been widely demonstrated to enhance model accuracy and robustness (Krizhevsky, Sutskever & Hinton, 2012; Matsunaga et al, 2017b; Cohen, Rosenfeld & Kolter, 2019b), address distribution shift issues (Zhang, Levine & Finn, 2022), and defend against adversarial attacks (Prakash et al, 2018; Gao et al, 2020). Researchers have proposed various TTA methods in different domains, including image segmentation (Moshkov et al, 2020), text grammar correction (Yang et al, 2022), text classification (Lu et al, 2022), audio-text retrieval (Kim et al, 2022), theoretical research (Kimura, 2021; Kim, Kim & Kim, 2020), and uncertainty estimation (Conde et al, 2023; Conde & Premebida, 2022).…”
Section: Related Work (mentioning)
confidence: 99%
“…The multi-modal retrieval task has been studied using various modalities such as image-text retrieval (Zhang et al, 2020; Cheng et al, 2022; Luo et al, 2022; Xuan and Chen, 2023), video-text (Gorti et al, 2022), audio-image (Xu, 2020; Nakatsuka et al, 2023), video-audio (Surís et al, 2018; Gu et al, 2023) and audio-text (Kim et al, 2022; Xin et al, 2023). Particularly, CLIP4CLIP (Luo et al, 2022), which performs well in the video-text retrieval task by calculating the similarities between the features of each modality obtained from the encoder, and X-CLIP expands CLIP4CLIP and proposes a multi-grained regulation function to improve performance.…”
Section: Related Work (mentioning)
confidence: 99%