2020 IEEE International Symposium on Multimedia (ISM)
DOI: 10.1109/ism.2020.00014

Audio Captioning Based on Combined Audio and Semantic Embeddings

Cited by 20 publications (17 citation statements)
References 17 publications
“…The semantic attributes were originally used in [12], where AudioSet labels were used as semantic attributes by taking the labels of the nearest video clip. Eren et al. [13] used an audio encoder to obtain audio embeddings and a text encoder to obtain subject-verb embeddings, combined these embeddings, and decoded them in the decoder.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
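For context, a minimal sketch of the scheme described in the statement above, assuming PyTorch and entirely hypothetical module names and dimensions (this is not the cited paper's implementation): an audio embedding and an averaged subject-verb embedding are concatenated and used to initialize an RNN caption decoder.

```python
# Toy sketch of fusing audio and semantic (subject-verb) embeddings for captioning.
# All module names, sizes, and vocabularies are illustrative assumptions.
import torch
import torch.nn as nn

class FusedCaptioner(nn.Module):
    def __init__(self, audio_dim=128, sem_dim=128, hidden_dim=256, vocab_size=5000):
        super().__init__()
        # Audio encoder: summarizes a sequence of log-mel frames into one embedding.
        self.audio_encoder = nn.GRU(64, audio_dim, batch_first=True)
        # Text encoder: embeds subject-verb tags predicted for the clip.
        self.sem_embedding = nn.Embedding(1000, sem_dim)
        # Fusion layer and RNN decoder conditioned on the fused embedding.
        self.fuse = nn.Linear(audio_dim + sem_dim, hidden_dim)
        self.decoder = nn.GRUCell(hidden_dim, hidden_dim)
        self.word_embedding = nn.Embedding(vocab_size, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, mel_frames, sv_tags, caption_tokens):
        # mel_frames: (B, T, 64), sv_tags: (B, K), caption_tokens: (B, L)
        _, audio_emb = self.audio_encoder(mel_frames)        # (1, B, audio_dim)
        audio_emb = audio_emb.squeeze(0)
        sem_emb = self.sem_embedding(sv_tags).mean(dim=1)    # average tag embeddings
        h = torch.tanh(self.fuse(torch.cat([audio_emb, sem_emb], dim=-1)))
        logits = []
        for t in range(caption_tokens.size(1)):              # teacher-forced decoding
            h = self.decoder(self.word_embedding(caption_tokens[:, t]), h)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                    # (B, L, vocab_size)
```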
“…In addition, 1-D CNNs have also been incorporated to better exploit temporal patterns. For example, Eren et al. [35] and Han et al. [33] used the Wavegram-Logmel-CNN adapted from PANNs [23], which applies 1-D convolution to the raw waveform and 2-D convolution to the spectrogram, and combines the outputs of the 1-D and 2-D convolutional layers in the deeper layers. Tran et al. [36] also proposed using 1-D and 2-D convolutions to extract temporal and time-frequency information; however, they used only the spectrogram as input and reshaped it for 1-D convolution.…”
Section: CNNs (citation type: mentioning)
confidence: 99%
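The dual-branch idea mentioned above can be illustrated with the following toy sketch, assuming PyTorch; it is not the actual Wavegram-Logmel-CNN from PANNs, and all layer sizes are arbitrary assumptions. A 1-D branch convolves the raw waveform, a 2-D branch convolves the log-mel spectrogram, and the two feature maps are concatenated before the deeper layers.

```python
# Simplified dual-branch encoder: 1-D convs on the waveform, 2-D convs on the
# log-mel spectrogram, fused channel-wise in deeper layers. Illustrative only.
import torch
import torch.nn as nn

class DualBranchEncoder(nn.Module):
    def __init__(self, out_dim=256):
        super().__init__()
        # 1-D branch on the raw waveform: strided convs learn a wavegram-like map.
        self.wave_branch = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=11, stride=5, padding=5), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=3, stride=4, padding=1), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, stride=4, padding=1), nn.ReLU(),
        )
        # 2-D branch on the log-mel spectrogram.
        self.mel_branch = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Deeper layers applied after the two branches are concatenated.
        self.deep = nn.Sequential(
            nn.Conv1d(128, out_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )

    def forward(self, waveform, logmel):
        # waveform: (B, 1, samples); logmel: (B, 1, n_mels, frames)
        w = self.wave_branch(waveform)              # (B, 64, T_w)
        m = self.mel_branch(logmel).mean(dim=2)     # pool mel axis -> (B, 64, frames)
        # Align the time axes before concatenating channel-wise.
        t = min(w.size(-1), m.size(-1))
        w = nn.functional.adaptive_avg_pool1d(w, t)
        m = nn.functional.adaptive_avg_pool1d(m, t)
        fused = torch.cat([w, m], dim=1)            # (B, 128, t)
        return self.deep(fused).squeeze(-1)         # (B, out_dim) clip-level embedding
```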
“…Tran et al. [36] also proposed using 1-D and 2-D convolutions to extract temporal and time-frequency information; however, they used only the spectrogram as input and reshaped it for 1-D convolution. To obtain global audio features, some methods apply global pooling after the last convolutional block to summarize the feature maps into a fixed-size vector [35], while others keep the time axis to obtain fine-grained temporal features and use an attention module to attend to the informative features during language decoding [31,32].…”
Section: CNNs (citation type: mentioning)
confidence: 99%
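The two pooling strategies contrasted above could be sketched as follows, assuming PyTorch; names and dimensions are illustrative assumptions rather than any specific paper's code. One option collapses the CNN feature map with a global average; the other keeps per-frame features and attends over them with the decoder state as the query.

```python
# (a) Global pooling to a fixed-size clip vector vs. (b) per-frame features with
# additive temporal attention queried by the decoder state. Illustrative sketch.
import torch
import torch.nn as nn

def global_pool(feature_map):
    # feature_map: (B, C, T) from the last convolutional block.
    return feature_map.mean(dim=-1)                  # (B, C) clip-level vector

class TemporalAttention(nn.Module):
    """Additive attention over per-frame features, queried by the decoder state."""
    def __init__(self, feat_dim, dec_dim, attn_dim=128):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, attn_dim)
        self.proj_dec = nn.Linear(dec_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, features, dec_state):
        # features: (B, T, feat_dim); dec_state: (B, dec_dim)
        energy = torch.tanh(self.proj_feat(features) + self.proj_dec(dec_state).unsqueeze(1))
        weights = torch.softmax(self.score(energy).squeeze(-1), dim=-1)  # (B, T)
        # Weighted sum of frame features -> context vector for this decoding step.
        return torch.bmm(weights.unsqueeze(1), features).squeeze(1)      # (B, feat_dim)
```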
“…Recently, several audio captioning datasets have been introduced, such as CLOTHO [9], which was used in the DCASE automated audio captioning challenge 2020 [27], Audio Caption [28], and AUDIOCAPS [8]. Multiple works have addressed automatic audio captioning on the AUDIOCAPS dataset [29,30,31]. In this work, we use the AUDIOCAPS and CLOTHO datasets for crossmodal retrieval.…”
Section: Related Work (citation type: mentioning)
confidence: 99%