2021 29th European Signal Processing Conference (EUSIPCO) 2021
DOI: 10.23919/eusipco54536.2021.9616340
WaveTransformer: An Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information

Abstract: Automated audio captioning (AAC) is a novel task, where a method takes as an input an audio sample and outputs a textual description (i.e., a caption) of its contents. Most AAC methods are adapted from the image captioning or machine translation fields. In this work, we present a novel AAC method, explicitly focused on the exploitation of the temporal and time-frequency patterns in audio. We employ three learnable processes for audio encoding, two for extracting the temporal and time-frequency information, and one to…
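The abstract describes three learnable encoding processes: one for temporal patterns, one for time-frequency patterns, and one that merges them. The sketch below illustrates that three-branch idea in NumPy; the shapes, kernels, and random weights are illustrative assumptions, not the paper's actual architecture (which uses learned convolutional blocks feeding a Transformer-based decoder, per the title).

```python
import numpy as np

rng = np.random.default_rng(0)

# Input: log-mel spectrogram, shape (T, F) = (time frames, mel bands).
T_frames, F_bands, D = 64, 40, 16
X = rng.standard_normal((T_frames, F_bands))

def conv1d_time(x, kernel):
    """1D convolution along the time axis, applied per mel band."""
    k = len(kernel)
    pad = np.pad(x, ((k // 2, k - 1 - k // 2), (0, 0)), mode="edge")
    return np.stack([np.tensordot(kernel, pad[t:t + k], axes=1)
                     for t in range(x.shape[0])])

def conv2d(x, kernel):
    """2D convolution over the time-frequency plane (same-size output)."""
    kt, kf = kernel.shape
    pad = np.pad(x, ((kt // 2, kt - 1 - kt // 2),
                     (kf // 2, kf - 1 - kf // 2)), mode="edge")
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        for f in range(x.shape[1]):
            out[t, f] = np.sum(kernel * pad[t:t + kt, f:f + kf])
    return out

# Branch 1 -- temporal: 1D conv along time, then project to D features.
kernel_t = rng.standard_normal(5)
W_temporal = rng.standard_normal((F_bands, D))
E_temporal = conv1d_time(X, kernel_t) @ W_temporal        # (T, D)

# Branch 2 -- time-frequency: 2D conv over the spectrogram, then project.
kernel_tf = rng.standard_normal((3, 3))
W_tf = rng.standard_normal((F_bands, D))
E_tf = conv2d(X, kernel_tf) @ W_tf                        # (T, D)

# Branch 3 -- merge: concatenate both encodings and project back to D,
# yielding one embedding per time frame for a caption decoder.
W_merge = rng.standard_normal((2 * D, D))
E = np.concatenate([E_temporal, E_tf], axis=1) @ W_merge  # (T, D)

print(E.shape)  # (64, 16)
```

In the real model the convolutional weights are learned end-to-end with the decoder; here they are random only to keep the shape flow visible.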


Cited by 9 publications (5 citation statements). References 17 publications.
“…The success in the image captioning field has become a massive motivation for generating captions from one-dimensional signals. One of the most considerable 1D signals is audio, with several attempts to access it, such as the convolution-augmented Transformer (Conformer) [24], WaveTransformer [25], and adversarial training [26]. Some suggestions in these studies still need to be applied in medical signals, such as augmentation by mixing signals and generative adversarial network (GAN) to increase the diversity of the generated captions.…”
Section: Related Work (mentioning)
confidence: 99%
“…An encoder-decoder model with caption decoder and content word decoder is presented in [19] to solve infrequent class problems in the captions. A transformer model is presented in [3] using temporal and time-frequency information in audio clips. Another transformer-based architecture is proposed in [21] to learn information with a continuously adapting approach.…”
Section: B. Audio Captioning (mentioning)
confidence: 99%
“…Since the Clotho dataset has five captions for each audio clip, we can obtain up to 50 topics for an audio clip. We have experimented with different numbers of topics (2, 3, 10) for an audio caption using the BERTopic to explore how many topics we should use in the model for each caption. Let k be the number of topics obtained from the topic model for five captions, T = [t 1 , ..., t k ] is the topic vector with the length of k. When we experiment with two topics for each…”
Section: B. Topic Modeling With BERTopic (mentioning)
confidence: 99%
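The topic-vector construction quoted above can be sketched as follows. This is a minimal illustration under stated assumptions, not the cited paper's code: the topic IDs are invented, and BERTopic's clustering step is replaced by hard-coded per-caption topic assignments.

```python
# Sketch: building the topic vector T = [t_1, ..., t_k] for one audio clip
# from per-caption topic assignments (two topics requested per caption, so
# five captions yield at most k = 10 distinct topics). The IDs are made up;
# in practice they would come from BERTopic fitted on the caption texts.
per_caption_topics = [
    [3, 7],    # caption 1
    [3, 12],   # caption 2
    [7, 1],    # caption 3
    [3, 7],    # caption 4
    [12, 1],   # caption 5
]

# Collect the distinct topics across the clip's five captions,
# preserving first-seen order.
T = []
for topics in per_caption_topics:
    for t in topics:
        if t not in T:
            T.append(t)

k = len(T)
print(T, k)  # [3, 7, 12, 1] 4
```

With 10 topics requested per caption instead of 2, the same loop would produce up to the 50 topics per clip mentioned in the quote.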
“…Unlike the field of computer vision, where considerable research has been carried out on higher levels of semantic understanding of visual tasks (e.g., visual question answering: Agrawal et al., 2017; Yang et al., 2016; image captioning: Xu et al., 2015; Lu et al., 2017), only a few works have been realised in the audio domain. One example is the recent work described in Drossos et al. (2017), followed by their current approach in Tran et al. (2020), in which an encoder–decoder neural network is used to process a sequence of Mel-band energies and to compute a sequence of words that describe a given audio segment.…”
Section: State-of-the-art in Audio Analysis (mentioning)
confidence: 99%