2021
DOI: 10.1142/s1793351x21400018
Audio Captioning with Composition of Acoustic and Semantic Information

Abstract: Generating audio captions is a new research area that combines audio and natural language processing to create meaningful textual descriptions for audio clips. To address this problem, previous studies mostly use encoder–decoder-based models without considering semantic information. To fill this gap, we present a novel encoder–decoder architecture using bi-directional Gated Recurrent Units (BiGRU) with audio and semantic embeddings. We extract semantic embedding by obtaining subjects and verbs from the aud…
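The abstract describes a BiGRU encoder whose output is combined with a semantic embedding. A minimal numpy sketch of that idea is below; the layer sizes, the absence of biases, and the simple concatenation of the semantic vector are all illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell (no bias terms) for illustration only."""
    def __init__(self, input_dim, hidden_dim, rng):
        s = 1.0 / np.sqrt(hidden_dim)
        # Three weight pairs: update gate, reset gate, candidate state.
        self.W = rng.uniform(-s, s, (3, hidden_dim, input_dim))
        self.U = rng.uniform(-s, s, (3, hidden_dim, hidden_dim))

    def step(self, x, h):
        z = sigmoid(self.W[0] @ x + self.U[0] @ h)        # update gate
        r = sigmoid(self.W[1] @ x + self.U[1] @ h)        # reset gate
        h_tilde = np.tanh(self.W[2] @ x + self.U[2] @ (r * h))
        return (1 - z) * h + z * h_tilde

def bigru_encode(frames, fwd, bwd):
    """Run forward and backward GRUs over the audio frames and
    concatenate their final hidden states."""
    hf = np.zeros(fwd.U.shape[1])
    hb = np.zeros(bwd.U.shape[1])
    for x in frames:
        hf = fwd.step(x, hf)
    for x in frames[::-1]:
        hb = bwd.step(x, hb)
    return np.concatenate([hf, hb])

rng = np.random.default_rng(0)
audio_dim, hid, sem_dim = 64, 32, 16                      # assumed sizes
frames = rng.normal(size=(10, audio_dim))                 # stand-in audio features
semantic = rng.normal(size=sem_dim)                       # stand-in subject/verb embedding
fwd, bwd = GRUCell(audio_dim, hid, rng), GRUCell(audio_dim, hid, rng)
encoding = np.concatenate([bigru_encode(frames, fwd, bwd), semantic])
print(encoding.shape)  # (80,) = 2*hid + sem_dim
```

A real decoder would then condition on `encoding` to generate the caption word by word; that step is omitted here.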

Cited by 2 publications (6 citation statements)
References 23 publications (34 reference statements)
“…The growing presence of publicly available datasets has led to increasing research in the AAC task. Several studies have addressed audio captioning on the Clotho [17]- [19] and AudioCaps [18], [20] datasets.…”
Section: B. Audio Captioning (mentioning; confidence: 99%)
“…A transformer model with keyword estimation is proposed in [4]. [18] improves audio captioning performance by extracting subject-verb keywords from the captions.…”
Section: B. Audio Captioning (mentioning; confidence: 99%)
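The citation above notes that the paper extracts subject–verb keywords from captions. The tooling the authors used is not stated here, so the sketch below substitutes a toy part-of-speech lookup (a real pipeline would use a POS tagger or dependency parser); the `TOY_POS` table and `subject_verb_keywords` helper are hypothetical.

```python
# Toy POS lookup standing in for a real tagger; purely illustrative.
TOY_POS = {
    "dog": "NOUN", "man": "NOUN", "rain": "NOUN", "car": "NOUN",
    "barks": "VERB", "speaks": "VERB", "falls": "VERB", "passes": "VERB",
}

def subject_verb_keywords(caption):
    """Return (nouns, verbs) found in a caption via the toy lookup."""
    nouns, verbs = [], []
    for token in caption.lower().strip(".").split():
        tag = TOY_POS.get(token)
        if tag == "NOUN":
            nouns.append(token)
        elif tag == "VERB":
            verbs.append(token)
    return nouns, verbs

print(subject_verb_keywords("A dog barks while rain falls."))
# → (['dog', 'rain'], ['barks', 'falls'])
```

The resulting keyword lists would then be mapped to embedding vectors and fed to the encoder alongside the audio features.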
“…For instance, [62] used a basic sequence-to-sequence Bi-LSTM [63] model for encoding hidden states of the audio and text. Similarly, GRUs have also been used to encode audio data with less success [11,64] used to apply non-linear transformations to the features extracted by the self-attention mechanism. GPT-2 was trained on a large corpus of text data [21], and its objective was to predict the next word in a sequence given the preceding context.…”
Section: Automated Audio Captioning (mentioning; confidence: 99%)