2020 IEEE International Symposium on Multimedia (ISM)
DOI: 10.1109/ism.2020.00014

Audio Captioning Based on Combined Audio and Semantic Embeddings

Cited by 20 publications (17 citation statements)
References 17 publications
“…The semantic attributes were originally used in [12], where AudioSet labels were used as semantic attributes by taking the labels of the nearest video clip. Eren et al. [13] used an audio encoder to obtain audio embeddings and a text encoder to obtain subject-verb embeddings, combined these embeddings, and decoded them in the decoder.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
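For context, a minimal sketch of the scheme described in the statement above, assuming PyTorch and entirely hypothetical module names and dimensions (this is not the cited paper's implementation): an audio embedding and an averaged subject-verb embedding are concatenated and used to initialize an RNN caption decoder.

```python
# Toy sketch of fusing audio and semantic (subject-verb) embeddings for captioning.
# All module names, sizes, and vocabularies are illustrative assumptions.
import torch
import torch.nn as nn

class FusedCaptioner(nn.Module):
    def __init__(self, audio_dim=128, sem_dim=128, hidden_dim=256, vocab_size=5000):
        super().__init__()
        # Audio encoder: summarizes a sequence of log-mel frames into one embedding.
        self.audio_encoder = nn.GRU(64, audio_dim, batch_first=True)
        # Text encoder: embeds subject-verb tags predicted for the clip.
        self.sem_embedding = nn.Embedding(1000, sem_dim)
        # Fusion layer and RNN decoder conditioned on the fused embedding.
        self.fuse = nn.Linear(audio_dim + sem_dim, hidden_dim)
        self.decoder = nn.GRUCell(hidden_dim, hidden_dim)
        self.word_embedding = nn.Embedding(vocab_size, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, mel_frames, sv_tags, caption_tokens):
        # mel_frames: (B, T, 64), sv_tags: (B, K), caption_tokens: (B, L)
        _, audio_emb = self.audio_encoder(mel_frames)        # (1, B, audio_dim)
        audio_emb = audio_emb.squeeze(0)
        sem_emb = self.sem_embedding(sv_tags).mean(dim=1)    # average tag embeddings
        h = torch.tanh(self.fuse(torch.cat([audio_emb, sem_emb], dim=-1)))
        logits = []
        for t in range(caption_tokens.size(1)):              # teacher-forced decoding
            h = self.decoder(self.word_embedding(caption_tokens[:, t]), h)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                    # (B, L, vocab_size)
```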
“…In addition, 1-D CNNs have also been incorporated to better exploit temporal patterns. For example, Eren et al. [35] and Han et al. [33] used the Wavegram-Logmel-CNN adapted from PANNs [23], which applies 1-D convolution to the raw waveform and 2-D convolution to the spectrogram, and combines the outputs of the 1-D and 2-D convolutional layers in the deeper layers. Tran et al. [36] also proposed using 1-D and 2-D convolutions to extract temporal and time-frequency information; however, they used only the spectrogram as input and reshaped it for 1-D convolution.…”
Section: CNNs (citation type: mentioning)
confidence: 99%
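The dual-branch idea mentioned above can be illustrated with the following toy sketch, assuming PyTorch; it is not the actual Wavegram-Logmel-CNN from PANNs, and all layer sizes are arbitrary assumptions. A 1-D branch convolves the raw waveform, a 2-D branch convolves the log-mel spectrogram, and the two feature maps are concatenated before the deeper layers.

```python
# Simplified dual-branch encoder: 1-D convs on the waveform, 2-D convs on the
# log-mel spectrogram, fused channel-wise in deeper layers. Illustrative only.
import torch
import torch.nn as nn

class DualBranchEncoder(nn.Module):
    def __init__(self, out_dim=256):
        super().__init__()
        # 1-D branch on the raw waveform: strided convs learn a wavegram-like map.
        self.wave_branch = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=11, stride=5, padding=5), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=3, stride=4, padding=1), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, stride=4, padding=1), nn.ReLU(),
        )
        # 2-D branch on the log-mel spectrogram.
        self.mel_branch = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Deeper layers applied after the two branches are concatenated.
        self.deep = nn.Sequential(
            nn.Conv1d(128, out_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )

    def forward(self, waveform, logmel):
        # waveform: (B, 1, samples); logmel: (B, 1, n_mels, frames)
        w = self.wave_branch(waveform)              # (B, 64, T_w)
        m = self.mel_branch(logmel).mean(dim=2)     # pool mel axis -> (B, 64, frames)
        # Align the time axes before concatenating channel-wise.
        t = min(w.size(-1), m.size(-1))
        w = nn.functional.adaptive_avg_pool1d(w, t)
        m = nn.functional.adaptive_avg_pool1d(m, t)
        fused = torch.cat([w, m], dim=1)            # (B, 128, t)
        return self.deep(fused).squeeze(-1)         # (B, out_dim) clip-level embedding
```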
“…Tran et al. [36] also proposed using 1-D and 2-D convolutions to extract temporal and time-frequency information; however, they used only the spectrogram as input and reshaped it for 1-D convolution. To obtain global audio features, some methods apply global pooling after the last convolutional block to summarize the feature maps into a fixed-size vector [35], while others keep the time axis to obtain fine-grained temporal features and use an attention module to attend to the informative features during language decoding [31,32].…”
Section: CNNs (citation type: mentioning)
confidence: 99%
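The two pooling strategies contrasted above could be sketched as follows, assuming PyTorch; names and dimensions are illustrative assumptions rather than any specific paper's code. One option collapses the CNN feature map with a global average; the other keeps per-frame features and attends over them with the decoder state as the query.

```python
# (a) Global pooling to a fixed-size clip vector vs. (b) per-frame features with
# additive temporal attention queried by the decoder state. Illustrative sketch.
import torch
import torch.nn as nn

def global_pool(feature_map):
    # feature_map: (B, C, T) from the last convolutional block.
    return feature_map.mean(dim=-1)                  # (B, C) clip-level vector

class TemporalAttention(nn.Module):
    """Additive attention over per-frame features, queried by the decoder state."""
    def __init__(self, feat_dim, dec_dim, attn_dim=128):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, attn_dim)
        self.proj_dec = nn.Linear(dec_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, features, dec_state):
        # features: (B, T, feat_dim); dec_state: (B, dec_dim)
        energy = torch.tanh(self.proj_feat(features) + self.proj_dec(dec_state).unsqueeze(1))
        weights = torch.softmax(self.score(energy).squeeze(-1), dim=-1)  # (B, T)
        # Weighted sum of frame features -> context vector for this decoding step.
        return torch.bmm(weights.unsqueeze(1), features).squeeze(1)      # (B, feat_dim)
```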
“…Recently, several audio captioning datasets have been introduced, such as CLOTHO [9], which was used in the DCASE automated audio captioning challenge 2020 [27], Audio Caption [28], and AUDIOCAPS [8]. Multiple works have addressed automatic audio captioning on the AUDIOCAPS dataset [29,30,31]. In this work, we use the AUDIOCAPS and CLOTHO datasets for crossmodal retrieval.…”
Section: Related Work (citation type: mentioning)
confidence: 99%