2023
DOI: 10.48550/arxiv.2302.09719
Preprint

Synergy between human and machine approaches to sound/scene recognition and processing: An overview of ICASSP special session

Abstract: Machine Listening, as usually formalized, attempts to perform a task that is, from our perspective, fundamentally human-performable, and performed by humans. Current automated models of Machine Listening vary from purely data-driven approaches to approaches imitating human systems. In recent years, the most promising approaches have been hybrid in that they have used data-driven approaches informed by models of the perceptual, cognitive, and semantic processes of the human system. Not only does the guidance pro…

Cited by 2 publications (2 citation statements). References 13 publications (13 reference statements).
“…However, supervised learning depends on a large amount of manually annotated and time-stamped data for every sound of interest. This is necessary to train models capable of producing reliable predictions [29,23,24,30,31]. Assembling sufficient manually annotated audio data presents challenges mainly due to: 1) the ambiguity in defining the precise beginning and end of bioacoustic events, 2) the need for specialized domain expertise in bioacoustics for accurate annotations, and 3) the typically extended duration of bioacoustic recordings.…”
Section: Introduction | Confidence: 99%
“…In bioacoustics applications, a common approach to both detection and subsequent classification relies on computer vision and deep learning techniques such as Convolutional Neural Networks (CNN) [31,32] or Visual Transformers (ViT) [33,34]. However, this approach has the critical limitation of predominantly relying on supervised learning protocols, which require large quantities of annotated and time-stamped data on every sound of interest to train models that can generate effective predictions [35,29,30,36,37]. Preparing sufficient quantities of manually annotated audio data is challenging due to factors such as: 1) the ambiguity in defining the start and end of bioacoustics events, 2) overlapping events in both time and frequency, 3) the strong requirement for domain expertise in bioacoustics annotation, 4) the typical length of bioacoustics audio recordings, and 5) limited human resources for bioacoustics annotation.…”
Section: Introduction | Confidence: 99%
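The citation statements above stress that supervised bioacoustic models depend on manually annotated, time-stamped data. As a minimal sketch of what that dependency looks like in practice, the snippet below converts human event annotations (onset/offset times) into the frame-level labels a supervised detector would train on; the function name and integer-millisecond convention are illustrative assumptions, not taken from any cited work.

```python
# Minimal sketch (hypothetical helper): turn time-stamped annotations
# for one sound class into per-frame 0/1 training labels.
# Times are in integer milliseconds to keep the arithmetic exact.

def frame_labels(events_ms, clip_ms, hop_ms):
    """events_ms: list of (onset_ms, offset_ms) annotated events.
    Returns one 0/1 label per analysis frame of length hop_ms."""
    n_frames = clip_ms // hop_ms
    labels = [0] * n_frames
    for onset, offset in events_ms:
        start = onset // hop_ms                       # first frame the event touches
        end = min(n_frames, (offset + hop_ms - 1) // hop_ms)  # ceiling of offset
        for i in range(start, end):
            labels[i] = 1
    return labels

# Example: two annotated calls in a 1-second clip, 100 ms hop.
labels = frame_labels([(150, 300), (700, 850)], 1000, 100)
print(labels)  # [0, 1, 1, 0, 0, 0, 0, 1, 1, 0]
```

Even this toy version surfaces the ambiguity the cited works mention: an event boundary that falls mid-frame forces an arbitrary rounding choice, and every labeled frame presupposes an annotator decided exactly where the event begins and ends.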