Context-based environmental audio event recognition for scene understanding

Lü, Tong; Wang, Gongyou; Su, Feng

doi:10.1007/s00530-014-0424-7

Cited by 6 publications

(2 citation statements)

References 37 publications

“…Automatic Audio Captioning (AAC) is an inter-modal translation task, where the objective is to generate a textual description for a corresponding input audio signal [1]. Audio captioning is a critical step towards machine intelligence and many applications in daily scenarios, such as audio retrieval [2], scene understanding [3] [4], applications for the hearing impaired patients [5], detailed audio surveillance etc. Unlike an Automatic Speech Recognition (ASR) task, the output is a description rather than a transcription of the contents within the audio sample.…”

Section: Introduction (mentioning), confidence: 99%

Automatic Audio Captioning using Attention weighted Event based Embeddings

Bhosale, Chakraborty, Kopparapu

2022

Preprint

Automatic Audio Captioning (AAC) refers to the task of translating audio into natural language text that describes the audio events, the sources of those events, and their relationships. The limited number of samples in current AAC datasets has set a trend of incorporating transfer learning, with Audio Event Detection (AED) as the parent task. In this direction, we propose an encoder-decoder architecture with lightweight (i.e., with fewer learnable parameters) Bi-LSTM recurrent layers for AAC and compare the performance of two state-of-the-art pre-trained AED models as embedding extractors. Our results show that an efficient AED-based embedding extractor, combined with temporal attention and augmentation techniques, is able to surpass existing approaches that use computationally intensive architectures. Further, we provide evidence that the non-uniform attention-weighted encoding generated as part of our model helps the decoder glance over specific sections of the audio while generating each token.

“…Automatic Audio Captioning (AAC) is an inter-modal translation task, where the objective is to generate a textual description for a corresponding input audio signal [2]. Audio captioning is a critical step towards machine intelligence with multiple applications in daily scenarios, ranging from audio retrieval [3], scene understanding [4,5] to assist the hearing impaired [6] and audio surveillance. Unlike an Automatic Speech Recognition (ASR) task, the output is a description rather than a transcription of the linguistic content in the audio sample.…”

Section: Introduction (mentioning), confidence: 99%

Text-to-Audio Grounding Based Novel Metric for Evaluating Audio Caption Similarity

Bhosale, Chakraborty, Kopparapu

2022

Preprint

Automatic Audio Captioning (AAC) refers to the task of translating an audio sample into natural language (NL) text that describes the audio events, the sources of those events, and their relationships. Unlike NL text generation tasks, which rely on evaluation metrics based on lexical semantics such as BLEU, ROUGE and METEOR, AAC evaluation requires the ability to map NL text (phrases) that correspond to similar sounds, in addition to lexical semantics. Current metrics used to evaluate AAC tasks lack an understanding of the perceived properties of sound represented by text. In this paper, we propose a novel metric based on Text-to-Audio Grounding (TAG), which is useful for evaluating cross-modal tasks like AAC. Experiments on a publicly available AAC dataset show that our evaluation metric performs better than existing metrics used in the NL text and image captioning literature.

Approaches to Complex Sound Scene Analysis

Benetos, Stowell, Plumbley

2017

Computational Analysis of Sound Scenes and Events

Preface: Recent progress in machine learning and signal processing has enabled the development of technologies for the automatic analysis of sound scenes and events by computational means. This has attracted several research groups and companies to investigate this new field, which has potential in several applications and also poses several research challenges. This book aims to present the state-of-the-art methodology in the field, to serve as baseline material for people wishing to enter it or to learn more about it.
