“…In computer vision, GCNs have been successfully applied to scene graph generation [22,31,38,52,56], 3D understanding [16,29,49,51], and action recognition in video [20,53,55]. In MAAS, we design a DeepGCN-like architecture [27,28,30] that addresses a special scenario, namely the multi-modal nature of audiovisual data. We rely on the well-known EdgeConv operator [49] to model interactions between different modalities on graph nodes identified across multiple frames.…”
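
For readers unfamiliar with EdgeConv, the following is a minimal sketch of the operator in its commonly used form, where each node i is updated as the channel-wise maximum over its neighbors j of ReLU(θ·(x_j − x_i) + φ·x_i). It is not the MAAS implementation: the excerpt gives no layer details, so the function names, the toy graph (one hypothetical audio node linked to two hypothetical visual nodes), and the identity weights are all assumptions made for illustration.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]  # one row per output channel


def matvec(weights: Matrix, x: Sequence[float]) -> Vector:
    """Multiply a weight matrix by a feature vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def edge_conv(
    features: List[Vector],
    neighbors: List[List[int]],
    theta: Matrix,
    phi: Matrix,
) -> List[Vector]:
    """Generic EdgeConv update (illustrative, not the MAAS code).

    For each node i: max over j in neighbors[i] of
    ReLU(theta @ (x_j - x_i) + phi @ x_i), taken per channel.
    """
    updated = []
    for i, x_i in enumerate(features):
        if not neighbors[i]:
            raise ValueError(f"node {i} has no neighbors to aggregate")
        local = matvec(phi, x_i)  # term that depends only on the node itself
        messages = []
        for j in neighbors[i]:
            diff = [a - b for a, b in zip(features[j], x_i)]
            relative = matvec(theta, diff)  # term that depends on the edge
            messages.append([max(0.0, r + l) for r, l in zip(relative, local)])
        # Channel-wise max over all incoming edge messages.
        updated.append([max(channel) for channel in zip(*messages)])
    return updated


# Hypothetical toy graph: node 0 stands in for an audio node,
# nodes 1 and 2 for visual nodes from two frames; all are connected.
features = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
neighbors = [[1, 2], [0, 2], [0, 1]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder for learned weights

new_features = edge_conv(features, neighbors, identity, identity)
```

Because each message mixes the node's own feature with its difference to a neighbor, an edge between nodes of different modalities lets the update depend on both, which is the kind of cross-modal interaction the passage describes.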