Pivot Correlational Neural Network for Multimodal Video Categorization

Kang, Sunghun; Kim, Junyeong; Choi, Hyun-Soo; Kim, Sungjin; Yoo, Chang D.

doi:10.1007/978-3-030-01264-9_24

Cited by 12 publications

(12 citation statements)

References 17 publications


“…The proposed approach is evaluated on FCVID using the mean average precision (mAP) and compared against the top-scoring approaches of the literature, i.e. PivotCorrNN [15], LiteEval [30], AdaFrame [31], SCSampler [17], ST-VLAD [22] and AR-Net [19]. On YLI-MED, the top-1 accuracy is utilized, and the comparison is performed against the top-scoring literature approaches for this dataset, i.e.…”

Section: Results (mentioning)

confidence: 99%
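The mean average precision (mAP) metric named in the statement above can be illustrated with a minimal sketch. It assumes the common non-interpolated definition (precision averaged over the ranks of the positive items, then averaged over classes). The exact protocol used on FCVID may differ, and the function names and toy scores are hypothetical.

```python
def average_precision(scores, labels):
    """Non-interpolated AP for one class: mean precision at each positive's rank."""
    # Rank items from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    # A class with no positives contributes 0.0 in this sketch (an assumption).
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(per_class):
    """Mean of per-class AP values; per_class is a list of (scores, labels) pairs."""
    aps = [average_precision(scores, labels) for scores, labels in per_class]
    return sum(aps) / len(aps)


# Toy data (hypothetical): two classes, three videos each.
toy = [
    ([0.9, 0.8, 0.3], [1, 0, 1]),
    ([0.2, 0.7, 0.4], [0, 1, 0]),
]
toy_map = mean_average_precision(toy)
```

In the toy data the first class has positives at ranks 1 and 3 and the second has its only positive at rank 1, so the result is the average of the two per-class values.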

“…Spatiotemporal VLAD (ST-VLAD) is presented in [22], encoding convolutional features across different segments to represent the video. In [15], PivotCorrNN is proposed, exploiting correlations among different video modalities. S2L is introduced in [32], utilizing a pretrained ResNet and an LSTM to model separately the spatial and temporal video information.…”

Section: Related Work (mentioning)

confidence: 99%

“…The training is performed using Adam optimizer, batch size 16, exponential schedule with initial learning rate 10^−4, decay factor 0.9 at every epoch, and 30 epochs in total.

Method | mAP (%)
ST-VLAD [22] | 77.5
PivotCorrNN [15] | 77.6
LiteEval [30] | 80.0
AdaFrame [31] | 80.2
SCSampler [17] | 81.0
AR-Net (ResNet backbone) [19] | 81.3
AR-Net (EfficientNet backbone) [19] | 84.4
ObjectGraphs (proposed; ResNet backbone) | 84.6…”

Section: Setup (mentioning)

confidence: 99%
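The learning-rate schedule quoted in the Setup statement above (initial rate 10^−4, decay factor 0.9 at every epoch, 30 epochs) can be illustrated with a minimal sketch. The function name `exponential_lr` is hypothetical, and the code assumes the decay is applied once per completed epoch with a 0-based epoch index, which the quote does not spell out.

```python
def exponential_lr(epoch: int, initial_lr: float = 1e-4, decay: float = 0.9) -> float:
    """Learning rate in effect during `epoch` (0-based), one decay step per epoch."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    return initial_lr * decay ** epoch


# Learning rate for each of the 30 training epochs mentioned in the quote.
schedule = [exponential_lr(epoch) for epoch in range(30)]
```

Under these assumptions the rate starts at 10^−4 and shrinks by 10% each epoch; a framework-level scheduler would normally replace this hand-written function in practice.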

“…improved dense trajectories [25]. ii) C2D: Techniques that utilize deep convolutional neural networks (DCNNs) with 2D convolutional kernels to extract the static event-related information at frame-level, and subsequently utilize an appropriate technique to capture the temporal dynamics of the event [26,22,15,32,30,31,17,19]. iii) C3D: DCNNs that use 3D convolutional kernels to encode simultaneously the spatiotemporal event information in videos [24,28,8].…”

Section: Introduction (mentioning)

confidence: 99%


ObjectGraphs: Using Objects and a Graph Convolutional Network for the Bottom-up Recognition and Explanation of Events in Video

Gkalelis, Goulas, Galanopoulos, et al., 2021

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)


In this paper a novel bottom-up video event recognition approach is proposed, ObjectGraphs, which utilizes a rich frame representation and the relations between objects within each frame. Following the application of an object detector (OD) on the frames, graphs are used to model the object relations and a graph convolutional network (GCN) is utilized to perform reasoning on the graphs. The resulting object-based frame-level features are then forwarded to a long short-term memory (LSTM) network for video event recognition. Moreover, the weighted in-degrees (WiDs) derived from the graph's adjacency matrix at frame level are used for identifying the objects that were considered most (or least) salient for event recognition and contributed the most (or least) to the final event recognition decision, thus providing an explanation for the latter. The experimental results show that the proposed method achieves state-of-the-art performance on the publicly available FCVID and YLI-MED datasets.
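The weighted in-degree idea mentioned in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes `adjacency[i][j]` holds the weight of the edge from object i to object j, so the weighted in-degree of object j is the sum of column j. The function names, object labels, and weights are hypothetical.

```python
def weighted_in_degrees(adjacency):
    """Column sums of a square adjacency matrix (assumed edge direction: i -> j)."""
    n = len(adjacency)
    if any(len(row) != n for row in adjacency):
        raise ValueError("adjacency matrix must be square")
    return [sum(adjacency[i][j] for i in range(n)) for j in range(n)]


def rank_by_salience(adjacency, labels):
    """Object labels paired with their weighted in-degree, most salient first."""
    wids = weighted_in_degrees(adjacency)
    return sorted(zip(labels, wids), key=lambda pair: pair[1], reverse=True)


# Toy frame graph with three detected objects (hypothetical labels and weights).
labels = ["person", "ball", "tree"]
adjacency = [
    [0.0, 0.5, 0.2],
    [0.1, 0.0, 0.3],
    [0.3, 0.4, 0.0],
]
ranking = rank_by_salience(adjacency, labels)
```

With these toy weights "ball" receives the largest incoming weight and would be reported as the most salient object for the frame; if the opposite edge convention is used, row sums would take the place of column sums.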


“…We can find that our model achieves higher event recognition performance compared with some Appearance-based methods on both datasets. Besides, our model is also better than Pivot CorrNN [19], which uses seven types of pre-extracted features to perform event recognition.…”

Section: Comparison To State Of the Art (mentioning)

confidence: 97%

Modeling Temporal Concept Receptive Field Dynamically for Untrimmed Video Analysis

Wang et al., 2020

Proceedings of the 28th ACM International Conference on Multimedia


Event analysis in untrimmed videos has attracted increasing attention due to the application of cutting-edge techniques such as CNN. As a well studied property for CNN-based models, the receptive field is a measurement for measuring the spatial range covered by a single feature response, which is crucial in improving the image categorization accuracy. In video domain, video event semantics are actually described by complex interaction among different concepts, while their behaviors vary drastically from one video to another, leading to the difficulty in concept-based analytics for accurate event categorization. To model the concept behavior, we study temporal concept receptive field of concept-based event representation, which encodes the temporal occurrence pattern of different mid-level concepts. Accordingly, we introduce temporal dynamic convolution (TDC) to give stronger flexibility to concept-based event analytics. TDC can adjust the temporal concept receptive field size dynamically according to different inputs. Notably, a set of coefficients are learned to fuse the results of multiple convolutions with different kernel widths that provide various temporal concept receptive field sizes. Different coefficients can generate appropriate and accurate temporal concept receptive field size according to input videos and highlight crucial concepts. Based on TDC, we propose the temporal dynamic concept modeling network (TDCMN) to learn an accurate and complete concept representation for efficient untrimmed video analysis. Experiment results on FCVID and ActivityNet show that TDCMN demonstrates adaptive event recognition ability conditioned on different inputs, and improve the event recognition performance of Concept-based methods by a large margin. Code is available at https://github.com/qzhb/TDCMN.

CCS Concepts: Computing methodologies → Activity recognition and understanding; Knowledge representation and reasoning.


An Adaptive Framework for Anomaly Detection in Time-Series Audio-Visual Data

Kumari, Saini, 2022

IEEE Access


Anomaly detection is an integral part of a number of surveillance applications. However, most of the existing anomaly detection models are statically trained on pre-recorded data from a single source, thus making multiple assumptions about the surrounding environment. As a result, their usefulness is limited to controlled scenarios. In this paper, we fuse information from live streams of audio and video data to detect anomalies in the captured environment. We train a deep learning-based teacher-student network using video, image, and audio information. The pre-trained visual network in the teacher model distills its information to the image and audio networks in the student model. Features from image and audio networks are combined and compressed using principal component analysis. Thus, the teacher-student network produces an image-audio-based light-weight joint representation of the data. The data dynamics are learned in a multivariate Adaptive Gaussian Mixture Model. Empirical results from two audio-visual datasets demonstrate the effectiveness of joint representation over single modalities in the adaptive anomaly detection framework. The proposed framework outperforms the state-of-the-art methods by an average of 15.00% and 14.52% for dataset 1 and dataset 2, respectively.

Index Terms: Adaptive learning, concept drift, long term surveillance, multimodal anomaly detection, unsupervised model.
