Clotho: an Audio Captioning Dataset

Drossos, Konstantinos; Lipping, Samuel; Virtanen, Tuomas

doi:10.1109/icassp40776.2020.9052990

Cited by 114 publications

(115 citation statements)

References 11 publications

(24 reference statements)

Supporting

Mentioning

114

Contrasting


“…[49][50][51][52][53][54][55][56][60,61]. Furthermore, recent audio and audiovisual captioning trends can offer additional semantic conceptualization meta-data [62][63][64][65]. These meta-information augmentation perspectives can accompany the above-discussed sustainable growth and well-being indicators, suggesting added-value innovative services for soundscape preservation and their engaging promotion at environmental, ecological, and heritage views.…”

Section: Related Work (mentioning)

confidence: 99%

“…The proposed modular architecture allows the attachment of multi-channeled ambisonics sensors to the client terminal (i.e., soundfield microphones), to apply more sophisticated spatiotemporal localization and mapping that could facilitate the audiovisual content description and management [49][50][51][74,75]. On the other side, more demanding semantic analysis can be performed on a batch processing mode, as a cloud service, making use of recent advantages on Convolutional Neural Networks (CNN), Deep Learning (DL), and multimodal decision-making systems [58][59][60][61][62][63][64][65]. The focus here lies in the discrimination of time-concurrent audio events in a hierarchical classification taxonomy.…”

Section: Integration Of State-of-the-art Audio and Soundscape Semanti… (mentioning)

confidence: 99%


Semantic Crowdsourcing of Soundscapes Heritage: A Mojo Model for Data-Driven Storytelling

et al. 2021


The current paper focuses on the development of an enhanced Mobile Journalism (MoJo) model for soundscape heritage crowdsourcing, data-driven storytelling, and management in the era of big data and the semantic web. Soundscapes and environmental sound semantics have a great impact on cultural heritage, also affecting the quality of human life, from multiple perspectives. In this view, context- and location-aware mobile services can be combined with state-of-the-art machine and deep learning approaches to offer multilevel semantic analysis monitoring of sound-related heritage. The targeted utilities can offer new insights toward sustainable growth of both urban and rural areas. Much emphasis is also put on the multimodal preservation and auralization of special soundscape areas and open ancient theaters with remarkable acoustic behavior, representing important cultural artifacts. For this purpose, a pervasive computing architecture is deployed and investigated, utilizing both client- and cloud-wise semantic analysis services, to implement and evaluate the envisioned MoJo methodology. Elaborating on previous/baseline MoJo tools, research hypotheses and questions are stated and put to test as part of the human-centered application design and development process. In this setting, primary algorithmic backend services on sound semantics are implemented and thoroughly validated, providing a convincing proof of concept of the proposed model.


“…their corresponding physical properties, temporal information of these sound events, and their relationship with other events, and high-level knowledge-rich auditory understanding. For instance, a typical caption from the DCASE benchmark dataset Clotho [7] "people talking in a small and empty room" describes the sound event "people talking" and its global scene "in a room", where high-level auditory knowledge is processed to infer that the room is small and empty, a visual description.…”

Section: Introduction (mentioning)

confidence: 99%

Investigating Local and Global Information for Automated Audio Captioning with Transfer Learning

Xu, Dinkel, Wu et al. 2021

Preprint


Automated audio captioning (AAC) aims at generating summarizing descriptions for audio clips. Multitudinous concepts are described in an audio caption, ranging from local information such as sound events to global information like acoustic scenery. Currently, the mainstream paradigm for AAC is the end-to-end encoder-decoder architecture, expecting the encoder to learn all levels of concepts embedded in the audio automatically. This paper first proposes a topic model for audio descriptions, comprehensively analyzing the hierarchical audio topics that are commonly covered. We then explore a transfer learning scheme to access local and global information. Two source tasks are identified to respectively represent local and global information, being Audio Tagging (AT) and Acoustic Scene Classification (ASC). Experiments are conducted on the AAC benchmark dataset Clotho and Audiocaps, amounting to a vast increase in all eight metrics with topic transfer learning. Further, it is discovered that local information and abstract representation learning are more crucial to AAC than global information and temporal relationship learning.


“…Audio classification is a well-studied research field [1][2][3][4][5] with a wide variety of applications such as multimedia search and retrieval [4], urban sound monitoring [6], bioacoustic monitoring [7], and audio captioning [8]. Most recent audio classification methods employ a standard supervised learning approach applied to deep neural networks.…”

Section: Introduction (mentioning)

confidence: 99%

Few-Shot Continual Learning for Audio Classification

Wang, Bryan, Cartwright et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Supervised learning for audio classification typically imposes a fixed class vocabulary, which can be limiting for real-world applications where the target class vocabulary is not known a priori or changes dynamically. In this work, we introduce a few-shot continual learning framework for audio classification, where we can continuously expand a trained base classifier to recognize novel classes based on only few labeled data at inference time. This enables fast and interactive model updates by end-users with minimal human effort. To do so, we leverage the dynamic few-shot learning technique and adapt it to a challenging multi-label audio classification scenario. We incorporate a recent state-of-the-art audio feature extraction model as a backbone and perform a comparative analysis of our approach on two popular audio datasets (ESC-50 and AudioSet). We conduct an in-depth evaluation to illustrate the complexities of the problem and show that, while there is still room for improvement, our method outperforms three baselines on novel class detection while maintaining its performance on base classes.
