Interspeech 2022
DOI: 10.21437/interspeech.2022-10670
Visually-aware Acoustic Event Detection using Heterogeneous Graphs

Abstract: Perception of auditory events is inherently multimodal, relying on both audio and visual cues. Many existing multimodal approaches process each modality with a modality-specific model and then fuse the embeddings to encode the joint information. In contrast, we employ heterogeneous graphs to explicitly capture the spatial and temporal relationships between the modalities and represent detailed information about the underlying signal. Using heterogeneous graph approaches to address the task of visually-aware acoustic event detection…
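As a concrete illustration of the structure the abstract describes, the sketch below builds a toy heterogeneous audio-visual graph: nodes are (modality, time) pairs, temporal edges connect consecutive segments within a modality, and cross-modal edges link co-occurring audio and visual segments. The node and edge naming here is illustrative, not the paper's actual graph construction.

```python
def build_av_graph(n_segments):
    """Toy heterogeneous audio-visual graph: nodes are
    (modality, time) pairs; edges are grouped by type."""
    nodes = [(m, t) for t in range(n_segments)
             for m in ("audio", "visual")]
    edges = {"cross": [], "temporal": []}
    for t in range(n_segments):
        # cross-modal edge: audio and visual segments at the same time
        edges["cross"].append((("audio", t), ("visual", t)))
    for t in range(n_segments - 1):
        # temporal edges within each modality
        edges["temporal"].append((("audio", t), ("audio", t + 1)))
        edges["temporal"].append((("visual", t), ("visual", t + 1)))
    return nodes, edges

nodes, edges = build_av_graph(4)
```

For 4 segments this yields 8 nodes, 4 cross-modal edges, and 6 temporal edges; a real model would attach feature vectors to the nodes and learn type-specific message passing over the two edge types.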

Cited by 2 publications (8 citation statements)
References 34 publications
“…In our experiment, we selected 33 types of data with high rater confidence scores (0.7, 1.0), resulting in a training set of 82,410 audiovisual clips. For a fair comparison with the baseline method, we used the original evaluation set, which contained 85,487 test clips [69]. The dataset was split into three sets for training: a train set (70%), an evaluation set (10%), and a test set (20%).…”
Section: Dataset
confidence: 99%
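The filtering-and-splitting setup quoted above can be sketched as follows. The clip representation, the 0.7 confidence threshold, and the 70/10/20 ratios mirror the quoted description; the function itself is a hypothetical illustration, not the citing paper's code.

```python
import random

def filter_and_split(clips, seed=0):
    """Keep clips whose rater confidence is at least 0.7, then
    shuffle and split 70/10/20 into train/eval/test sets."""
    kept = [cid for cid, conf in clips if conf >= 0.7]
    rng = random.Random(seed)
    rng.shuffle(kept)
    n = len(kept)
    n_train, n_eval = int(0.7 * n), int(0.1 * n)
    train = kept[:n_train]
    evaluation = kept[n_train:n_train + n_eval]
    test = kept[n_train + n_eval:]
    return train, evaluation, test
```

Fixing the shuffle seed keeps the split reproducible across runs, which matters when comparing against a baseline on the same data.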
“…VAED [69] uses heterogeneous graphs to explicitly capture the relationships between modalities, providing detailed information about the underlying signal.…”
Section: Baselines
confidence: 99%
“…In our past work [15], we noted that heterogeneous audiovisual graphs can effectively capture relationships within and across the audio and visual modalities, and can outperform other multimodal learning approaches. However, the success of this approach relies to a large extent on constructing the 'right' graph.…”
Section: Introduction
confidence: 99%
“…Our model, HGCN, thus allows for both independent processing of each modality and fusing information in the crossmodal layer. The idea presented in this paper is significantly different from previous graph-based approaches used for representation learning [15,18] as it avoids manually connecting nodes and makes end-to-end learning possible. In summary, our contributions are as follows:…”
Section: Introduction
confidence: 99%
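A minimal numeric sketch of the two-stage idea quoted above: each modality is first processed independently, then a crossmodal layer fuses co-occurring features. The per-modality scaling and the element-wise mean fusion are deliberately simplistic stand-ins for learned HGCN layers, not the paper's implementation.

```python
def modality_layer(feats, weight):
    """Stand-in for independent, modality-specific processing:
    scale every feature value by a per-modality weight."""
    return [[weight * x for x in vec] for vec in feats]

def crossmodal_layer(audio, visual):
    """Stand-in for the crossmodal fusion layer: element-wise
    mean of co-occurring audio and visual features."""
    return [[(a + b) / 2 for a, b in zip(va, vb)]
            for va, vb in zip(audio, visual)]

audio = [[1.0, 2.0], [3.0, 4.0]]   # two time steps, 2-dim features
visual = [[5.0, 6.0], [7.0, 8.0]]
fused = crossmodal_layer(modality_layer(audio, 0.5),
                         modality_layer(visual, 0.5))
```

Keeping the per-modality transforms separate before fusion is what lets each branch specialize to its own signal statistics, while the crossmodal step combines the two views without any manually specified node connections.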