2022
DOI: 10.1109/jstsp.2022.3190083
Self-Supervised Graphs for Audio Representation Learning With Limited Labeled Data

Cited by 4 publications (2 citation statements)
References 49 publications

“…The Spectrogram-VGG model is the same as configuration A in [34] with only one change: the final layer is a softmax with 33 units. The feature for each audio input to…”

Comparison results embedded in the quoted passage (model: first score ± std, second score ± std, parameter count):
VATT [30]: 0.39 ± 0.02, -, 87M
SSL graph [31]: 0.42 ± 0.02, -, 218K
Wave-Logmel [32]: 0.43 ± 0.04, -, 81M
AST [33]: 0.44 ± 0.00, -, 88M
VAED [15]: 0.50 ± 0.01, 0.93 ± 0.00, 2.1M

Section: Results and Analysis (mentioning)
confidence: 99%

“…The VATT [30] is a self-supervised multimodal transformer with a modality-agnostic, single-backbone Transformer that shares weights between the audio and video modalities. We also compared our method with recent graph-based works [31, 15]. The Wave-Logmel [32] is a supervised CNN model that takes both the waveform and the log mel spectrogram as input.…”
Section: Results and Analysis (mentioning)
confidence: 99%