Proceedings of the 2nd ACM International Conference on Multimedia Retrieval 2012
DOI: 10.1145/2324796.2324843

Joint audio-visual bi-modal codewords for video event detection

Abstract: Joint audio-visual patterns often exist in videos and provide strong multi-modal cues for detecting multimedia events. However, conventional methods generally fuse the visual and audio information only at a superficial level, without adequately exploring deep intrinsic joint patterns. In this paper, we propose a joint audio-visual bi-modal representation, called bi-modal words. We first build a bipartite graph to model the relations across the quantized words extracted from the visual and audio modalities. Partitio…

Cited by 25 publications (20 citation statements); References 13 publications.

Citation statements, ordered by relevance:
“…Ye et al. [168] proposed a simple and efficient method called bi-modal audio-visual codewords. The bi-modal words were generated using normalized cut on a bipartite graph of visual and audio words, and they capture the co-occurrence relations between audio and visual words within the same time window.…”
Section: Audio-visual Joint Representations
confidence: 99%
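
As a rough illustration of the technique this statement describes, the sketch below builds a toy visual-audio co-occurrence matrix and partitions the corresponding bipartite graph with scikit-learn's SpectralCoclustering, a spectral relaxation in the spirit of normalized cut. The matrix sizes, cluster count, and synthetic counts are illustrative assumptions, not values from the paper.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

rng = np.random.default_rng(0)
n_visual, n_audio, n_bimodal = 200, 100, 16  # assumed sizes, not from the paper

# C[v, a]: how often visual word v and audio word a appear in the same
# time window; filled here with toy counts instead of real statistics.
C = rng.poisson(0.5, size=(n_visual, n_audio)).astype(float) + 1e-6

# Spectral co-clustering partitions the bipartite co-occurrence graph so
# that each cluster groups visual and audio words jointly: one cluster
# per "bi-modal word".
model = SpectralCoclustering(n_clusters=n_bimodal, random_state=0)
model.fit(C)
visual_to_bimodal = model.row_labels_     # visual word id -> bi-modal word id
audio_to_bimodal = model.column_labels_   # audio word id  -> bi-modal word id
```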
“…Similar to [58], a bag-of-words representation was used to convert each of the feature sets into a fixed-dimensional vector. A joint audio-visual bi-modal representation [168] was also explored, which encodes local patterns across the two modalities. Different fusion strategies were used: a fast kernel-based method for early fusion, a Bayesian model combination for optimizing performance at a specific operating point, and weighted average fusion for optimal performance over the entire performance curve.…”
Section: Forums and Recent Approaches
confidence: 99%
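
The weighted average fusion mentioned in this statement is straightforward to sketch; the snippet below combines per-modality classifier scores with fixed weights. The score values, weights, and function name are illustrative placeholders, not taken from either paper.

```python
import numpy as np

def weighted_average_fusion(modality_scores, weights):
    """Convex combination of per-modality event scores (late fusion)."""
    w = np.asarray(weights, dtype=float)
    w /= w.sum()                          # normalize weights to sum to 1
    return w @ np.stack(modality_scores)  # (n_modalities,) @ (n_modalities, n_videos)

visual_scores = np.array([0.9, 0.2, 0.6])  # toy per-video event scores
audio_scores = np.array([0.7, 0.4, 0.1])
fused = weighted_average_fusion([visual_scores, audio_scores], weights=[0.6, 0.4])
```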
“…In [9], the authors performed modified probabilistic Latent Semantic Analysis (pLSA) based violence detection from audio cues and visual information by exploiting different concepts (including explosion, motion, blood, and flame). Many other methods have been proposed that merge the two modalities of audio and visual information for VSD, e.g., [10][11][12][13]. Other than audio-visual features, some authors also exploited the use of textual information [14,15].…”
Section: Introduction
confidence: 99%
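
This statement names pLSA without detail. For readers unfamiliar with it, below is a minimal, generic pLSA sketch fitted with EM on a toy count matrix; it is plain pLSA, not the modified variant of [9], and all sizes and data are synthetic assumptions.

```python
import numpy as np

def plsa(N, n_topics, n_iter=50, seed=0):
    """Basic pLSA via EM on a document-word count matrix N of shape (n_docs, n_words)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = N.shape
    p_z_d = rng.dirichlet(np.ones(n_topics), size=n_docs)   # P(z|d)
    p_w_z = rng.dirichlet(np.ones(n_words), size=n_topics)  # P(w|z)
    eps = 1e-12
    for _ in range(n_iter):
        # E-step: P(z|d,w) proportional to P(z|d) * P(w|z)
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]       # shape (d, z, w)
        post = joint / (joint.sum(axis=1, keepdims=True) + eps)
        # M-step: re-estimate P(w|z) and P(z|d) from expected counts
        weighted = N[:, None, :] * post                     # shape (d, z, w)
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps
    return p_z_d, p_w_z

# Toy usage: 20 "documents" (e.g., clips) over a 50-word vocabulary.
N = np.random.default_rng(1).poisson(1.0, size=(20, 50)).astype(float)
p_z_d, p_w_z = plsa(N, n_topics=4)
```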
“…is built from visual and audio modalities, which is later partitioned into bi-modal words that can also be considered as joint patterns across modalities. Consequently, the joint patterns are transformed into bi-modal Bag-of-Words representations and used as input to the classifiers [218]. Similarity scores between queries and the database images are also proposed in [18] …”
Section: What Should Be Fused?
confidence: 99%
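
To make this last step concrete, here is a hedged sketch of turning bi-modal word assignments into a bag-of-words histogram per video and feeding it to a standard classifier. The word-to-bi-modal maps, dataset sizes, and labels are synthetic stand-ins, and LinearSVC is my choice of classifier, not necessarily the one used in [218].

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
k = 16                                       # number of bi-modal words (assumed)
visual_to_bimodal = rng.integers(0, k, 200)  # toy maps; in practice these come
audio_to_bimodal = rng.integers(0, k, 100)   # from the bipartite partitioning

def bimodal_bow(vis_ids, aud_ids):
    """L1-normalized histogram over bi-modal words for one video."""
    hist = np.zeros(k)
    np.add.at(hist, visual_to_bimodal[vis_ids], 1.0)
    np.add.at(hist, audio_to_bimodal[aud_ids], 1.0)
    return hist / max(hist.sum(), 1.0)

# Toy training set: 40 videos with random visual/audio word occurrences.
X = np.stack([bimodal_bow(rng.integers(0, 200, 30), rng.integers(0, 100, 15))
              for _ in range(40)])
y = rng.integers(0, 2, 40)                   # toy binary event labels
clf = LinearSVC().fit(X, y)
```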