2021
DOI: 10.31219/osf.io/zepsq
Preprint

Investigating Local and Global Information for Automated Audio Captioning with Transfer Learning

Abstract: Automated audio captioning (AAC) aims at generating summarizing descriptions for audio clips. Multitudinous concepts are described in an audio caption, ranging from local information such as sound events to global information like acoustic scenery. Currently, the mainstream paradigm for AAC is the end-to-end encoder-decoder architecture, expecting the encoder to learn all levels of concepts embedded in the audio automatically. This paper first proposes a topic model for audio descriptions, comprehensively analy…


Cited by 8 publications (15 citation statements)
References 14 publications (15 reference statements)
“…We compare our proposed approach with five baseline methods, namely, the TopDown-AlignedAtt [8] model, the CNN10-AT model [11] which uses pre-trained Audio Tagging model as the encoder, the Audio Captioning Transformer (ACT) [3], which is the first convolution-free architecture, the model in [15] that uses frozen GPT-2 and audio-based similar caption retrieval, and finally the current state-of-the-art model [17] on AudioCaps based on BART and AudioSet tags.…”
Section: A Comparison With Baseline Methods (citation type: mentioning)
confidence: 99%
“…To address the data scarcity issue of audio captioning, transferring knowledge from pre-trained audio models has been widely investigated. Xu et al [11] propose an approach that uses transfer learning to exploit local and global information from audio tagging and acoustic scene classification, respectively. Pre-trained Audio Neural Networks (PANNs) [12] are the models pre-trained on AudioSet [13], which have achieved great success as the encoder [5]- [7], [14] in the audio captioning system.…”
Section: Introduction (citation type: mentioning)
confidence: 99%
“…• SPIDEr is the linear combination of CIDEr and SPICE. The first three metrics are proposed for machine translation systems but are also widely used to evaluate AAC systems in the previous works [1,2,4,11]. The last three metrics are specifically used for captioning task [30,31,32].…”
Section: Metrics (citation type: mentioning)
confidence: 99%
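The quoted metrics passage defines SPIDEr as the linear combination of CIDEr and SPICE; in the standard definition the two scores are weighted equally. A minimal sketch (function name and example scores are illustrative, not taken from the paper):

```python
def spider(cider: float, spice: float) -> float:
    """SPIDEr = (CIDEr + SPICE) / 2: the equal-weight linear
    combination of the two captioning metrics."""
    return 0.5 * (cider + spice)

# Example with hypothetical per-system scores:
print(round(spider(0.70, 0.18), 2))
```

CIDEr rewards consensus with reference n-grams while SPICE scores semantic propositional content, so averaging them balances fluency against semantic accuracy.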
“…Most AAC approaches typically adopt encoder-decoder structures [3], where the audio encoder extracts acoustic features from raw audio inputs, and the text decoder generates corresponding descriptive captions. Recent work [4] observed that it is difficult to train a strong encoder for audio inputs because the supervision only comes from captions, which is quite limited. To overcome such problem, prior studies [2,5,6,7,8,9] proposed transfer learning to pre-train audio encoders on Audioset [10] for better acoustic features.…”
Section: Introduction (citation type: mentioning)
confidence: 99%
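The quoted passage describes the standard AAC encoder-decoder structure: an audio encoder pools acoustic features from the input, and a text decoder emits caption tokens. A deliberately toy sketch of that data flow (all names, embeddings, and the tiny vocabulary are invented stand-ins for learned parameters, not the cited models):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<eos>", "dog", "barks", "rain", "falls"]
word_emb = rng.normal(size=(len(vocab), 8))  # hypothetical word embeddings

def encode(frames: np.ndarray) -> np.ndarray:
    """Toy encoder: mean-pool T x 8 frame-level features into one clip embedding."""
    return frames.mean(axis=0)

def decode(clip: np.ndarray, max_len: int = 3) -> list:
    """Toy greedy decoder: score words by dot product with the clip embedding.
    Real decoders condition on previously generated tokens (RNN/Transformer);
    here we simply mask already-emitted words to avoid repetition."""
    caption, used = [], set()
    for _ in range(max_len):
        scores = word_emb @ clip
        for i in used:
            scores[i] = -np.inf
        idx = int(scores.argmax())
        if vocab[idx] == "<eos>":
            break
        used.add(idx)
        caption.append(vocab[idx])
    return caption

frames = rng.normal(size=(10, 8))  # 10 fake audio frames
caption = decode(encode(frames))
print(caption)
```

The point of the sketch is only the division of labor: everything before `decode` sees raw audio features, everything after sees a fixed-size clip representation, which is why weak caption-only supervision limits how strong the encoder can get without pre-training.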
“…K. Chen et al [10] used the combination of a pre-trained encoder and a Transformer decoder which makes the latent variable result more efficient in generating captions. X. Xu et al [11] investigated the effect of local and global information on the audio captioning task by comparing two pre-training tasks. The semantic information is also investigated in order to improve the audio captioning task's performance.…”
Section: Related Work (citation type: mentioning)
confidence: 99%