2021
DOI: 10.31219/osf.io/zepsq
Preprint

Investigating Local and Global Information for Automated Audio Captioning with Transfer Learning

Abstract: Automated audio captioning (AAC) aims at generating summarizing descriptions for audio clips. Multitudinous concepts are described in an audio caption, ranging from local information such as sound events to global information like acoustic scenery. Currently, the mainstream paradigm for AAC is the end-to-end encoder-decoder architecture, expecting the encoder to learn all levels of concepts embedded in the audio automatically. This paper first proposes a topic model for audio descriptions, comprehensively analy…


Cited by 8 publications (15 citation statements)
References 14 publications (15 reference statements)
“…We compare our proposed approach with five baseline methods, namely, the TopDown-AlignedAtt [8] model, the CNN10-AT model [11] which uses pre-trained Audio Tagging model as the encoder, the Audio Captioning Transformer (ACT) [3], which is the first convolution-free architecture, the model in [15] that uses frozen GPT-2 and audio-based similar caption retrieval, and finally the current state-of-the-art model [17] on AudioCaps based on BART and AudioSet tags.…”
Section: A Comparison With Baseline Methods (citation type: mentioning)
confidence: 99%
“…To address the data scarcity issue of audio captioning, transferring knowledge from pre-trained audio models has been widely investigated. Xu et al [11] propose an approach that uses transfer learning to exploit local and global information from audio tagging and acoustic scene classification, respectively. Pre-trained Audio Neural Networks (PANNs) [12] are the models pre-trained on AudioSet [13], which have achieved great success as the encoder [5]- [7], [14] in the audio captioning system.…”
Section: Introduction (citation type: mentioning)
confidence: 99%
“…• SPIDEr is the linear combination of CIDEr and SPICE. The first three metrics are proposed for machine translation systems but are also widely used to evaluate AAC systems in the previous works [1,2,4,11]. The last three metrics are specifically used for captioning task [30,31,32].…”
Section: Metrics (citation type: mentioning)
confidence: 99%
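The quoted metrics passage defines SPIDEr as the linear combination of CIDEr and SPICE; in the standard definition the two scores are weighted equally. A minimal sketch (function name and example scores are illustrative, not taken from the paper):

```python
def spider(cider: float, spice: float) -> float:
    """SPIDEr = (CIDEr + SPICE) / 2: the equal-weight linear
    combination of the two captioning metrics."""
    return 0.5 * (cider + spice)

# Example with hypothetical per-system scores:
print(round(spider(0.70, 0.18), 2))
```

CIDEr rewards consensus with reference n-grams while SPICE scores semantic propositional content, so averaging them balances fluency against semantic accuracy.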
“…Most AAC approaches typically adopt encoder-decoder structures [3], where the audio encoder extracts acoustic features from raw audio inputs, and the text decoder generates corresponding descriptive captions. Recent work [4] observed that it is difficult to train a strong encoder for audio inputs because the supervision only comes from captions, which is quite limited. To overcome such problem, prior studies [2,5,6,7,8,9] proposed transfer learning to pre-train audio encoders on Audioset [10] for better acoustic features.…”
Section: Introduction (citation type: mentioning)
confidence: 99%
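The quoted passage describes the standard AAC encoder-decoder structure: an audio encoder pools acoustic features from the input, and a text decoder emits caption tokens. A deliberately toy sketch of that data flow (all names, embeddings, and the tiny vocabulary are invented stand-ins for learned parameters, not the cited models):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<eos>", "dog", "barks", "rain", "falls"]
word_emb = rng.normal(size=(len(vocab), 8))  # hypothetical word embeddings

def encode(frames: np.ndarray) -> np.ndarray:
    """Toy encoder: mean-pool T x 8 frame-level features into one clip embedding."""
    return frames.mean(axis=0)

def decode(clip: np.ndarray, max_len: int = 3) -> list:
    """Toy greedy decoder: score words by dot product with the clip embedding.
    Real decoders condition on previously generated tokens (RNN/Transformer);
    here we simply mask already-emitted words to avoid repetition."""
    caption, used = [], set()
    for _ in range(max_len):
        scores = word_emb @ clip
        for i in used:
            scores[i] = -np.inf
        idx = int(scores.argmax())
        if vocab[idx] == "<eos>":
            break
        used.add(idx)
        caption.append(vocab[idx])
    return caption

frames = rng.normal(size=(10, 8))  # 10 fake audio frames
caption = decode(encode(frames))
print(caption)
```

The point of the sketch is only the division of labor: everything before `decode` sees raw audio features, everything after sees a fixed-size clip representation, which is why weak caption-only supervision limits how strong the encoder can get without pre-training.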
“…K. Chen et al [10] used the combination of a pre-trained encoder and a Transformer decoder which makes the latent variable result more efficient in generating captions. X. Xu et al [11] investigated the effect of local and global information on the audio captioning task by comparing two pre-training tasks. The semantic information is also investigated in order to improve the audio captioning task's performance.…”
Section: Related Work (citation type: mentioning)
confidence: 99%