ObjectGraphs: Using Objects and a Graph Convolutional Network for the Bottom-up Recognition and Explanation of Events in Video

Gkalelis, Nikolaos; Goulas, Andreas; Galanopoulos, Damianos; Mezaris, Vasileios

doi:10.1109/cvprw53098.2021.00376

Cited by 10 publications

(33 citation statements)

References 29 publications

“…The introduction of deep learning approaches has offered major performance leaps in video event recognition [5]-[14]. Most of these methods operate in a top-down fashion [6], [7], [10]-[14], i.e.…”

Section: Introduction (mentioning)

confidence: 99%

“…Motivated by cognitive and psychological studies as described above, recent bottom-up action and event recognition approaches [5], [9] represent a video frame using not only features extracted from the entire frame but also features representing the main objects of the frame. More specifically, they utilize an object detector to derive a set of objects depicting semantically coherent regions of the video frames, a backbone network to derive a feature representation of these objects, and an attention mechanism combined with a graph neural network (GNN) to classify the video.…”

Section: Introduction (mentioning)

confidence: 99%
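
The bottom-up step described in the statement above (object features passed through an attention mechanism and a graph neural network to obtain a frame representation) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the function names, the scaled dot-product attention used to build the adjacency matrix, the single ReLU graph-convolution layer, and the mean pooling are hypothetical choices, not the cited authors' implementation. Object features would normally come from an object detector and a backbone network; random vectors stand in for them.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_adjacency(obj_feats):
    """Build a row-normalized adjacency matrix from object features.

    Assumption: scaled dot-product similarity between object features,
    followed by a row-wise softmax, so each row sums to 1.
    """
    dim = obj_feats.shape[1]
    scores = obj_feats @ obj_feats.T / np.sqrt(dim)
    return softmax(scores, axis=1)

def graph_conv(obj_feats, adj, weight):
    """One graph-convolution layer: aggregate neighbors, project, apply ReLU."""
    return np.maximum(adj @ obj_feats @ weight, 0.0)

def frame_feature(obj_feats, weight):
    """Encode one frame: attention adjacency -> graph conv -> mean pooling."""
    adj = attention_adjacency(obj_feats)
    hidden = graph_conv(obj_feats, adj, weight)
    return hidden.mean(axis=0), adj

# Stand-in data: 5 detected objects with 8-dimensional features.
rng = np.random.default_rng(0)
objects = rng.normal(size=(5, 8))
W = rng.normal(size=(8, 4))          # hypothetical layer weights
feat, adj = frame_feature(objects, W)
print(feat.shape, adj.shape)
```

In a full pipeline, one such frame feature would be computed per frame and the resulting sequence handed to a temporal model for classification.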

“…7 and the related ablation study concerning the effect of the number of frames in the action recognition performance). In [5], the 3D-CNN backbone of [9] is replaced by a 2D-CNN (i.e., ResNet [31]), and an attention mechanism [32] with a GNN are used to encode the bottom-up spatial information at each frame only; the sequence of feature vectors is then processed by an LSTM [33] to classify the video.…”

Section: Introduction (mentioning)

confidence: 99%

“…Therefore, in contrast to [9], the above architecture factorizes the processing of the video along the spatial and temporal dimension, thus, effectively removing the memory restrictions imposed in [9] by the use of expensive 3D-CNN and the construction of the large spatiotemporal attention matrix. Moreover, the authors in [5] make a first attempt at exploiting the weighted in-degrees (WiDs) of the graph convolutional network's (GCN's) adjacency matrix to propose eXplainable AI (XAI) criteria and provide object-level (i.e., spatial) explanations concerning the recognized event [5]. However, despite the fact that this architecture can process long sequences of video frames, it is well known that the LSTM struggles to model long-term temporal dependencies [10]-[12], [14], [16], [29], [30].…”

Section: Introduction (mentioning)

confidence: 99%
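
The weighted in-degrees (WiDs) mentioned in the statement above can be computed directly from a weighted adjacency matrix. The sketch below is a hedged illustration rather than the cited method: it assumes the convention that `adj[i, j]` is the weight of the edge from node `i` to node `j`, so the WiD of node `j` is the sum of column `j`. The function names and the toy matrix are hypothetical, and the exact XAI criteria built on top of the WiDs are not reproduced here.

```python
import numpy as np

def weighted_in_degrees(adj):
    """Return the weighted in-degree of every node.

    Assumed convention: adj[i, j] is the weight of the edge i -> j,
    so the in-degree of node j is the sum over column j.
    """
    adj = np.asarray(adj, dtype=float)
    if adj.ndim != 2 or adj.shape[0] != adj.shape[1]:
        raise ValueError("adjacency matrix must be square")
    return adj.sum(axis=0)

def top_k_objects(adj, k):
    """Indices of the k nodes (objects) with the largest weighted in-degree."""
    wid = weighted_in_degrees(adj)
    order = np.argsort(-wid, kind="stable")
    return order[:k].tolist()

# Toy adjacency over three objects (rows sum to 1, as after a softmax).
adj = np.array([
    [0.1, 0.7, 0.2],
    [0.2, 0.6, 0.2],
    [0.4, 0.5, 0.1],
])
wid = weighted_in_degrees(adj)   # column sums: 0.7, 1.8, 0.5
salient = top_k_objects(adj, 2)  # objects ranked by how much weight they receive
print(wid, salient)
```

An object that receives a large share of the other objects' weight ranks highest, which is the intuition behind using WiDs as an object-level explanation signal.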

ViGAT: Bottom-Up Event Recognition and Explanation in Video Using Factorized Graph Attention Network

2022

Self Cite

In this paper a pure-attention bottom-up approach, called ViGAT, that utilizes an object detector together with a Vision Transformer (ViT) backbone network to derive object and frame features, and a head network to process these features for the task of event recognition and explanation in video, is proposed. The ViGAT head consists of graph attention network (GAT) blocks factorized along the spatial and temporal dimensions in order to capture effectively both local and long-term dependencies between objects or frames. Moreover, using the weighted in-degrees (WiDs) derived from the adjacency matrices at the various GAT blocks, we show that the proposed architecture can identify the most salient objects and frames that explain the decision of the network. A comprehensive evaluation study is performed, demonstrating that the proposed approach provides state-of-the-art results on three large, publicly available video datasets (FCVID, MiniKinetics, ActivityNet). Source code and trained models will be made available upon acceptance.

Index terms: video event recognition, eXplainable AI (XAI), graph attention network, factorized attention, bottom-up.

“…Graph Neural Networks (GNNs) have achieved state-of-the-art performance in learning over such relational data in various graph-based machine learning tasks, such as node classification, link prediction, and graph classification [9,20,29,61,68,69]. Due to their superior performance, GNNs are now widely used in many applications such as recommendation systems, credit issuing, traffic forecasting, drug discovery, and medical diagnosis [3,7,17,27,37,63].…”

Section: Introduction (mentioning)

confidence: 99%

GAP: Differentially Private Graph Neural Networks with Aggregation Perturbation

Sajadmanesh, Shamsabadi, Bellet, et al. 2022

Preprint

Graph Neural Networks (GNNs) are powerful models designed for graph data that learn node representation by recursively aggregating information from each node's local neighborhood. However, despite their state-of-the-art performance in predictive graph-based applications, recent studies have shown that GNNs can raise significant privacy concerns when graph data contain sensitive information. As a result, in this paper, we study the problem of learning GNNs with Differential Privacy (DP). We propose GAP, a novel differentially private GNN that safeguards the privacy of nodes and edges using aggregation perturbation, i.e., adding calibrated stochastic noise to the output of the GNN's aggregation function, which statistically obfuscates the presence of a single edge (edge-level privacy) or a single node and all its adjacent edges (node-level privacy). To circumvent the accumulation of privacy cost at every forward pass of the model, we tailor the GNN architecture to the specifics of private learning. In particular, we first precompute private aggregations by recursively applying neighborhood aggregation and perturbing the output of each aggregation step. Then, we privately train a deep neural network on the resulting perturbed aggregations for any node-wise classification task. A major advantage of GAP over previous approaches is that we guarantee edge-level and node-level DP not only for training, but also at inference time with no additional costs beyond the training's privacy budget. We theoretically analyze the formal privacy guarantees of GAP using Rényi DP. Empirical experiments conducted over three real-world graph datasets demonstrate that GAP achieves a favorable privacy-accuracy trade-off and significantly outperforms existing approaches.
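
The idea of aggregation perturbation described in this abstract (adding noise to the output of the neighborhood aggregation) might be sketched as follows. This is only an illustration under stated assumptions, not the GAP implementation: the function names are hypothetical, sum aggregation with Gaussian noise is assumed, and row-normalizing the features to unit L2 norm is an assumed way to bound how much a single edge can change the aggregate. Calibrating `sigma` to a concrete privacy budget (for example via Rényi DP accounting) is deliberately left out.

```python
import numpy as np

def row_normalize(feats, eps=1e-12):
    """Scale every node's feature vector to unit L2 norm (assumed bounding step)."""
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    return feats / np.maximum(norms, eps)

def perturbed_aggregation(adj, feats, sigma, rng):
    """One noisy aggregation step.

    adj:   (n, n) binary adjacency matrix, adj[i, j] = 1 if j is a neighbor of i
    feats: (n, d) node features
    sigma: noise standard deviation; its calibration to a privacy budget
           is NOT shown here and must be done separately
    """
    x = row_normalize(feats)
    aggregated = adj @ x                          # sum over each node's neighbors
    noise = rng.normal(0.0, sigma, size=aggregated.shape)
    return aggregated + noise

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 1],
                [1, 0, 0],
                [1, 0, 0]], dtype=float)
feats = rng.normal(size=(3, 4))

noisy = perturbed_aggregation(adj, feats, sigma=1.0, rng=rng)
clean = perturbed_aggregation(adj, feats, sigma=0.0, rng=rng)  # no noise, for reference
print(noisy.shape)
```

Applying this step recursively and caching the outputs would correspond to the precomputation described in the abstract, after which a separate network is trained on the cached, perturbed aggregations.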
