Proceedings of the 2nd ACM International Conference on Multimedia Retrieval 2012
DOI: 10.1145/2324796.2324843

Joint audio-visual bi-modal codewords for video event detection

Abstract: Joint audio-visual patterns often exist in videos and provide strong multi-modal cues for detecting multimedia events. However, conventional methods generally fuse the visual and audio information only at a superficial level, without adequately exploring deep intrinsic joint patterns. In this paper, we propose a joint audio-visual bi-modal representation, called bi-modal words. We first build a bipartite graph to model the relations across the quantized words extracted from the visual and audio modalities. Partitio…

Cited by 25 publications (20 citation statements); References 13 publications.

Citation statements, ordered by relevance:
“…Ye et al. [168] proposed a simple and efficient method called bi-modal audio-visual codewords. The bi-modal words were generated using normalized cut on a bipartite graph of visual and audio words, and they capture the co-occurrence relations between audio and visual words within the same time window.…”
Section: Audio-visual Joint Representations
confidence: 99%
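
As a rough illustration of the technique this statement describes, the sketch below builds a toy visual-audio co-occurrence matrix and partitions the corresponding bipartite graph with scikit-learn's SpectralCoclustering, a spectral relaxation in the spirit of normalized cut. The matrix sizes, cluster count, and synthetic counts are illustrative assumptions, not values from the paper.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

rng = np.random.default_rng(0)
n_visual, n_audio, n_bimodal = 200, 100, 16  # assumed sizes, not from the paper

# C[v, a]: how often visual word v and audio word a appear in the same
# time window; filled here with toy counts instead of real statistics.
C = rng.poisson(0.5, size=(n_visual, n_audio)).astype(float) + 1e-6

# Spectral co-clustering partitions the bipartite co-occurrence graph so
# that each cluster groups visual and audio words jointly: one cluster
# per "bi-modal word".
model = SpectralCoclustering(n_clusters=n_bimodal, random_state=0)
model.fit(C)
visual_to_bimodal = model.row_labels_     # visual word id -> bi-modal word id
audio_to_bimodal = model.column_labels_   # audio word id  -> bi-modal word id
```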
“…Similar to [58], a bag-of-words representation was used to convert each of the feature sets into a fixed-dimensional vector. A joint audio-visual bi-modal representation [168] was also explored, which encodes local patterns across the two modalities. Different fusion strategies were used: a fast kernel-based method for early fusion, a Bayesian model combination for optimizing performance at a specific operating point, and weighted average fusion for optimal performance over the entire performance curve.…”
Section: Forums and Recent Approaches
confidence: 99%
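
The weighted average fusion mentioned in this statement is straightforward to sketch; the snippet below combines per-modality classifier scores with fixed weights. The score values, weights, and function name are illustrative placeholders, not taken from either paper.

```python
import numpy as np

def weighted_average_fusion(modality_scores, weights):
    """Convex combination of per-modality event scores (late fusion)."""
    w = np.asarray(weights, dtype=float)
    w /= w.sum()                          # normalize weights to sum to 1
    return w @ np.stack(modality_scores)  # (n_modalities,) @ (n_modalities, n_videos)

visual_scores = np.array([0.9, 0.2, 0.6])  # toy per-video event scores
audio_scores = np.array([0.7, 0.4, 0.1])
fused = weighted_average_fusion([visual_scores, audio_scores], weights=[0.6, 0.4])
```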
“…In [9], the authors performed modified probabilistic Latent Semantic Analysis (pLSA) based violence detection from audio cues and visual information by exploiting different concepts (including explosion, motion, blood, and flame). Many other methods have been proposed that merge the two modalities of audio and visual information for VSD, e.g., [10][11][12][13]. Other than audio-visual features, some authors also exploited the use of textual information [14,15].…”
Section: Introduction
confidence: 99%
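
This statement names pLSA without detail. For readers unfamiliar with it, below is a minimal, generic pLSA sketch fitted with EM on a toy count matrix; it is plain pLSA, not the modified variant of [9], and all sizes and data are synthetic assumptions.

```python
import numpy as np

def plsa(N, n_topics, n_iter=50, seed=0):
    """Basic pLSA via EM on a document-word count matrix N of shape (n_docs, n_words)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = N.shape
    p_z_d = rng.dirichlet(np.ones(n_topics), size=n_docs)   # P(z|d)
    p_w_z = rng.dirichlet(np.ones(n_words), size=n_topics)  # P(w|z)
    eps = 1e-12
    for _ in range(n_iter):
        # E-step: P(z|d,w) proportional to P(z|d) * P(w|z)
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]       # shape (d, z, w)
        post = joint / (joint.sum(axis=1, keepdims=True) + eps)
        # M-step: re-estimate P(w|z) and P(z|d) from expected counts
        weighted = N[:, None, :] * post                     # shape (d, z, w)
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps
    return p_z_d, p_w_z

# Toy usage: 20 "documents" (e.g., clips) over a 50-word vocabulary.
N = np.random.default_rng(1).poisson(1.0, size=(20, 50)).astype(float)
p_z_d, p_w_z = plsa(N, n_topics=4)
```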
“…is built from visual and audio modalities, which is later partitioned into bi-modal words that can also be considered as joint patterns across modalities. Consequently, the joint patterns are transformed into bi-modal Bag-of-Words representations and used as input to the classifiers [218]. Similarity scores between queries and the database images are also proposed in [18] …”
Section: What Should Be Fused?
confidence: 99%
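
To make this last step concrete, here is a hedged sketch of turning bi-modal word assignments into a bag-of-words histogram per video and feeding it to a standard classifier. The word-to-bi-modal maps, dataset sizes, and labels are synthetic stand-ins, and LinearSVC is my choice of classifier, not necessarily the one used in [218].

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
k = 16                                       # number of bi-modal words (assumed)
visual_to_bimodal = rng.integers(0, k, 200)  # toy maps; in practice these come
audio_to_bimodal = rng.integers(0, k, 100)   # from the bipartite partitioning

def bimodal_bow(vis_ids, aud_ids):
    """L1-normalized histogram over bi-modal words for one video."""
    hist = np.zeros(k)
    np.add.at(hist, visual_to_bimodal[vis_ids], 1.0)
    np.add.at(hist, audio_to_bimodal[aud_ids], 1.0)
    return hist / max(hist.sum(), 1.0)

# Toy training set: 40 videos with random visual/audio word occurrences.
X = np.stack([bimodal_bow(rng.integers(0, 200, 30), rng.integers(0, 100, 15))
              for _ in range(40)])
y = rng.integers(0, 2, 40)                   # toy binary event labels
clf = LinearSVC().fit(X, y)
```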