2021
DOI: 10.1142/s1793351x21400018
Audio Captioning with Composition of Acoustic and Semantic Information

Abstract: Generating audio captions is a new research area that combines audio and natural language processing to create meaningful textual descriptions for audio clips. To address this problem, previous studies mostly use encoder–decoder-based models without considering semantic information. To fill this gap, we present a novel encoder–decoder architecture using bi-directional Gated Recurrent Units (BiGRU) with audio and semantic embeddings. We extract semantic embedding by obtaining subjects and verbs from the aud…
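The abstract describes a BiGRU encoder whose output is combined with a semantic embedding. A minimal numpy sketch of that idea is below; the layer sizes, the absence of biases, and the simple concatenation of the semantic vector are all illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell (no bias terms) for illustration only."""
    def __init__(self, input_dim, hidden_dim, rng):
        s = 1.0 / np.sqrt(hidden_dim)
        # Three weight pairs: update gate, reset gate, candidate state.
        self.W = rng.uniform(-s, s, (3, hidden_dim, input_dim))
        self.U = rng.uniform(-s, s, (3, hidden_dim, hidden_dim))

    def step(self, x, h):
        z = sigmoid(self.W[0] @ x + self.U[0] @ h)        # update gate
        r = sigmoid(self.W[1] @ x + self.U[1] @ h)        # reset gate
        h_tilde = np.tanh(self.W[2] @ x + self.U[2] @ (r * h))
        return (1 - z) * h + z * h_tilde

def bigru_encode(frames, fwd, bwd):
    """Run forward and backward GRUs over the audio frames and
    concatenate their final hidden states."""
    hf = np.zeros(fwd.U.shape[1])
    hb = np.zeros(bwd.U.shape[1])
    for x in frames:
        hf = fwd.step(x, hf)
    for x in frames[::-1]:
        hb = bwd.step(x, hb)
    return np.concatenate([hf, hb])

rng = np.random.default_rng(0)
audio_dim, hid, sem_dim = 64, 32, 16                      # assumed sizes
frames = rng.normal(size=(10, audio_dim))                 # stand-in audio features
semantic = rng.normal(size=sem_dim)                       # stand-in subject/verb embedding
fwd, bwd = GRUCell(audio_dim, hid, rng), GRUCell(audio_dim, hid, rng)
encoding = np.concatenate([bigru_encode(frames, fwd, bwd), semantic])
print(encoding.shape)  # (80,) = 2*hid + sem_dim
```

A real decoder would then condition on `encoding` to generate the caption word by word; that step is omitted here.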

Cited by 2 publications (6 citation statements)
References 23 publications (34 reference statements)
“…The growing presence of publicly available datasets has led to increasing research in the AAC task. Several studies have addressed audio captioning on the Clotho [17]- [19] and AudioCaps [18], [20] datasets.…”
Section: B. Audio Captioning (mentioning; confidence: 99%)
“…A transformer model with keyword estimation is proposed in [4]. [18] improves audio captioning performance by extracting subject-verb keywords from the captions.…”
Section: B. Audio Captioning (mentioning; confidence: 99%)
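The citation above notes that the paper extracts subject–verb keywords from captions. The tooling the authors used is not stated here, so the sketch below substitutes a toy part-of-speech lookup (a real pipeline would use a POS tagger or dependency parser); the `TOY_POS` table and `subject_verb_keywords` helper are hypothetical.

```python
# Toy POS lookup standing in for a real tagger; purely illustrative.
TOY_POS = {
    "dog": "NOUN", "man": "NOUN", "rain": "NOUN", "car": "NOUN",
    "barks": "VERB", "speaks": "VERB", "falls": "VERB", "passes": "VERB",
}

def subject_verb_keywords(caption):
    """Return (nouns, verbs) found in a caption via the toy lookup."""
    nouns, verbs = [], []
    for token in caption.lower().strip(".").split():
        tag = TOY_POS.get(token)
        if tag == "NOUN":
            nouns.append(token)
        elif tag == "VERB":
            verbs.append(token)
    return nouns, verbs

print(subject_verb_keywords("A dog barks while rain falls."))
# → (['dog', 'rain'], ['barks', 'falls'])
```

The resulting keyword lists would then be mapped to embedding vectors and fed to the encoder alongside the audio features.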
“…For instance, [62] used a basic sequence-to-sequence Bi-LSTM [63] model for encoding hidden states of the audio and text. Similarly, GRUs have also been used to encode audio data with less success [11,64] used to apply non-linear transformations to the features extracted by the self-attention mechanism. GPT-2 was trained on a large corpus of text data [21], and its objective was to predict the next word in a sequence given the preceding context.…”
Section: Automated Audio Captioning (mentioning; confidence: 99%)