2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2011
DOI: 10.1109/icassp.2011.5946962

Automatic video annotation via Hierarchical Topic Trajectory Model considering cross-modal correlations

Abstract: We propose a new statistical model, the Hierarchical Topic Trajectory Model (HTTM), for acquiring a dynamically changing topic model that represents the relationship between video frames and associated text labels. Model parameter estimation, annotation, and retrieval can be executed within a unified framework with little computation. It is also easy to add new modalities such as audio signals and geotags. Preliminary experiments on a video annotation task with a manually annotated video dataset indicate that our propo…

Cited by 3 publications
(1 citation statement)
References 23 publications
“…Thus, it is important to have precise spectrum representation with maximum information gain and no redundancy. For example, in the case of image and text, images can be represented in spatial or spectral while the text is symbolic and dependent upon grammar rules and cultural norms [2].…”
Table entries interleaved in the extracted statement: Image retrieval in different categories, 2014, 2019 [38], [39]; 24 distinct category image-text retrieval, 2012 [40]; Disaster and emergency management, 2016 [25]; image-text retrieval in various categories, 2017, 2018 [41], [42]; 10 I, T, V: Image-text and video-text retrieval in multiple categories, 2015 [43]; Video, image and text retrieval in video lectures, 2014 [44]; 11 T, V: Multiple concepts' video annotation, 2011 [45]; cooking activities' video annotation, videos' temporal activity localization evaluation, personal videos' annotation, 2019 [46]; Cooking recipe retrieval, 2019 [28]
Section: Challenges (mentioning)
confidence: 99%