Proceedings of the International Workshop on Multimedia Information Retrieval 2007
DOI: 10.1145/1290082.1290118

Large-scale multimodal semantic concept detection for consumer video

Abstract: In this paper we present a systematic study of automatic classification of consumer videos into a large set of diverse semantic concept classes, which have been carefully selected based on user studies and extensively annotated over 1300+ videos from real users. Our goals are to assess the state of the art of multimedia analytics (including both audio and visual analysis) in consumer video classification and to discover new research opportunities. We investigated several statistical approaches built upon globa…

Cited by 105 publications (76 citation statements)
References 10 publications (15 reference statements)
“…A few studies also included audio features [25], [26], [91] and found significant advantage in doing so. Multimodal multisource event detection is likely to receive more attention in the near future due to many emerging applications and the availability of large multimodal collections.…”
Section: Discussion (mentioning)
confidence: 99%
“…We also use the challenging Kodak's consumer video data set provided in [13], [21] for evaluation. Unlike the Caltech images, content in this raw video source involves more variations in imaging conditions (view, scale, lighting) and scene complexity (background and number of objects).…”
Section: Performance Over Consumer Videos (mentioning)
confidence: 99%
“…To explore complementary features from both audio and visual channels, we extract similar features as [21]: visual features, e.g., grid color moments, Gabor texture, edge direction histogram, from keyframes, resulting in 346-dimension visual feature vectors; Mel-Frequency Cepstral Coefficients (MFCCs) from each audio frame (10ms) and delta MFCCs from neighboring frames. Over the video interval associated with each keyframe, the mean and covariance of the audio frame features are computed to generate a 2550-dimension audio feature vector.…”
Section: Performance Over Consumer Videos (mentioning)
confidence: 99%
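
The audio pooling step quoted above can be sketched as follows. This is a minimal illustration, not the cited authors' implementation. One reading consistent with the quoted 2550 dimensions is a 50-dimensional frame feature (50 means plus a 50 × 50 flattened covariance gives 50 + 2500 = 2550). The split into 25 static MFCCs and 25 deltas, the central-difference delta, the random placeholder data, and the function names `add_deltas` and `pool_interval` are all assumptions made for this example.

```python
import numpy as np


def add_deltas(static):
    """Append simple delta features: central difference of neighboring frames.

    The delta formula is an assumption; the quoted text only says deltas
    come "from neighboring frames".
    """
    padded = np.pad(static, ((1, 1), (0, 0)), mode="edge")
    deltas = (padded[2:] - padded[:-2]) / 2.0
    return np.hstack([static, deltas])


def pool_interval(frames):
    """Summarize a (T, d) block of frame features as mean plus flattened covariance."""
    mean = frames.mean(axis=0)
    cov = np.cov(frames, rowvar=False)  # (d, d) covariance across frames
    return np.concatenate([mean, cov.ravel()])


# Placeholder data standing in for 25 MFCCs per 10 ms frame (assumed count).
rng = np.random.default_rng(0)
static = rng.normal(size=(500, 25))

frames = add_deltas(static)      # shape (500, 50)
vector = pool_interval(frames)   # 50 means + 2500 covariance entries
print(vector.shape)              # (2550,)
```

Flattening the full covariance keeps the example short; because the matrix is symmetric, nearly half of those entries are redundant, so a more compact variant could keep only the upper triangle.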
“…Previous work has explored detecting semantic concepts in multiple modalities to capture the embedded semantics in the multimedia stream [1] [2] [3]. In this paper, we limit ourselves to discuss the audio semantic concepts.…”
Section: Introduction (mentioning)
confidence: 99%