Interspeech 2022
DOI: 10.21437/interspeech.2022-10670
Visually-aware Acoustic Event Detection using Heterogeneous Graphs

Abstract: Perception of auditory events is inherently multimodal, relying on both audio and visual cues. Many existing multimodal approaches process each modality with a modality-specific model and then fuse the embeddings to encode the joint information. In contrast, we employ heterogeneous graphs to explicitly capture the spatial and temporal relationships between the modalities and represent detailed information about the underlying signal. Using heterogeneous graph approaches to address the task of visually-aware acoustic event detection…
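As a concrete illustration of the structure the abstract describes, the sketch below builds a toy heterogeneous audio-visual graph: nodes are (modality, time) pairs, temporal edges connect consecutive segments within a modality, and cross-modal edges link co-occurring audio and visual segments. The node and edge naming here is illustrative, not the paper's actual graph construction.

```python
def build_av_graph(n_segments):
    """Toy heterogeneous audio-visual graph: nodes are
    (modality, time) pairs; edges are grouped by type."""
    nodes = [(m, t) for t in range(n_segments)
             for m in ("audio", "visual")]
    edges = {"cross": [], "temporal": []}
    for t in range(n_segments):
        # cross-modal edge: audio and visual segments at the same time
        edges["cross"].append((("audio", t), ("visual", t)))
    for t in range(n_segments - 1):
        # temporal edges within each modality
        edges["temporal"].append((("audio", t), ("audio", t + 1)))
        edges["temporal"].append((("visual", t), ("visual", t + 1)))
    return nodes, edges

nodes, edges = build_av_graph(4)
```

For 4 segments this yields 8 nodes, 4 cross-modal edges, and 6 temporal edges; a real model would attach feature vectors to the nodes and learn type-specific message passing over the two edge types.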

Cited by 2 publications (8 citation statements)
References 34 publications
“…In our experiment, we selected 33 types of data with high rater confidence scores (0.7, 1.0), resulting in a training set of 82,410 audiovisual clips. For a fair comparison with the baseline method, we used the original evaluation set, which contained 85,487 test clips [69]. The dataset was split into three sets for training: a train set (70%), an evaluation set (10%), and a test set (20%).…”
Section: Dataset
confidence: 99%
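The filtering-and-splitting setup quoted above can be sketched as follows. The clip representation, the 0.7 confidence threshold, and the 70/10/20 ratios mirror the quoted description; the function itself is a hypothetical illustration, not the citing paper's code.

```python
import random

def filter_and_split(clips, seed=0):
    """Keep clips whose rater confidence is at least 0.7, then
    shuffle and split 70/10/20 into train/eval/test sets."""
    kept = [cid for cid, conf in clips if conf >= 0.7]
    rng = random.Random(seed)
    rng.shuffle(kept)
    n = len(kept)
    n_train, n_eval = int(0.7 * n), int(0.1 * n)
    train = kept[:n_train]
    evaluation = kept[n_train:n_train + n_eval]
    test = kept[n_train + n_eval:]
    return train, evaluation, test
```

Fixing the shuffle seed keeps the split reproducible across runs, which matters when comparing against a baseline on the same data.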
“…VAED [69] uses heterogeneous graphs to explicitly capture the relationships between modalities, providing detailed information about the underlying signal.…”
Section: Baselines
confidence: 99%
“…In our past work [15], we noted that heterogeneous audiovisual graphs can effectively capture relationships within and across the audio and visual modalities, and can outperform other multimodal learning approaches. However, the success of this approach relies to a large extent on constructing the 'right' graph.…”
Section: Introduction
confidence: 99%
“…Our model, HGCN, thus allows for both independent processing of each modality and fusing information in the crossmodal layer. The idea presented in this paper is significantly different from previous graph-based approaches used for representation learning [15,18] as it avoids manually connecting nodes and makes end-to-end learning possible. In summary, our contributions are as follows:…”
Section: Introduction
confidence: 99%
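A minimal numeric sketch of the two-stage idea quoted above: each modality is first processed independently, then a crossmodal layer fuses co-occurring features. The per-modality scaling and the element-wise mean fusion are deliberately simplistic stand-ins for learned HGCN layers, not the paper's implementation.

```python
def modality_layer(feats, weight):
    """Stand-in for independent, modality-specific processing:
    scale every feature value by a per-modality weight."""
    return [[weight * x for x in vec] for vec in feats]

def crossmodal_layer(audio, visual):
    """Stand-in for the crossmodal fusion layer: element-wise
    mean of co-occurring audio and visual features."""
    return [[(a + b) / 2 for a, b in zip(va, vb)]
            for va, vb in zip(audio, visual)]

audio = [[1.0, 2.0], [3.0, 4.0]]   # two time steps, 2-dim features
visual = [[5.0, 6.0], [7.0, 8.0]]
fused = crossmodal_layer(modality_layer(audio, 0.5),
                         modality_layer(visual, 0.5))
```

Keeping the per-modality transforms separate before fusion is what lets each branch specialize to its own signal statistics, while the crossmodal step combines the two views without any manually specified node connections.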