2019
DOI: 10.1109/tcsvt.2018.2848458

Implicit and Explicit Concept Relations in Deep Neural Networks for Multi-Label Video/Image Annotation

Abstract: In this work we propose a DCNN (Deep Convolutional Neural Network) architecture that addresses the problem of video/image concept annotation by exploiting concept relations at two different levels. At the first level, we build on ideas from multi-task learning and propose an approach to learn concept-specific representations that are sparse, linear combinations of representations of latent concepts. By enforcing the sharing of the latent concept representations, we exploit the implicit relations between the ta…
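As a rough illustration of the first-level idea described in the abstract (concept-specific representations formed as sparse linear combinations of shared latent-concept representations), here is a minimal PyTorch sketch. All dimensions, the initialization, and the L1 penalty weight are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class LatentConceptMixing(nn.Module):
    """Sketch: each of C target concepts gets a representation that is a
    linear combination of K shared latent-concept representations; an L1
    penalty on the mixing weights encourages sparse combinations."""

    def __init__(self, feat_dim=2048, num_latent=64, latent_dim=512, num_concepts=345):
        super().__init__()
        # shared projection that produces the K latent-concept representations
        self.latent_proj = nn.Linear(feat_dim, num_latent * latent_dim)
        self.num_latent, self.latent_dim = num_latent, latent_dim
        # mixing weights: row c holds concept c's combination of the K latents
        self.mix = nn.Parameter(0.01 * torch.randn(num_concepts, num_latent))
        # one linear classifier per target concept
        self.cls_w = nn.Parameter(0.01 * torch.randn(num_concepts, latent_dim))
        self.cls_b = nn.Parameter(torch.zeros(num_concepts))

    def forward(self, features):                        # features: (B, feat_dim)
        b = features.size(0)
        h = self.latent_proj(features).view(b, self.num_latent, self.latent_dim)
        r = torch.einsum('ck,bkd->bcd', self.mix, h)    # concept-specific reps (B, C, D)
        return (r * self.cls_w).sum(dim=-1) + self.cls_b    # logits (B, C)

    def l1_penalty(self):
        # added to the multi-label loss to push each concept toward few latents
        return self.mix.abs().mean()
```

A training step would then combine a multi-label loss with the sparsity term, e.g. `loss = nn.BCEWithLogitsLoss()(model(x), y) + 1e-4 * model.l1_penalty()`.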

Cited by 45 publications (35 citation statements). References 47 publications (118 reference statements).
“…To obtain the annotation scores for the 1000 ImageNet concepts, we used an ensemble method, averaging the concept scores from four pre-trained models that employ different DCNN architectures, namely the VGG16, InceptionV3, InceptionResNetV2, as well as a hybrid model that combines the ImageNet and Places365 concept pools [6]. To obtain scores for the 345 TRECVID SIN concepts, we used the deep learning framework of [7]. For the event-related concepts we used the pre-trained model of EventNet [8] while for the action-related concepts we used a model trained on the AVA dataset [9].…”
Section: Concept-based Retrieval
Citation type: mentioning (confidence: 99%)
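For the ensemble described in the snippet above, the following is a minimal sketch of uniform score averaging over the three publicly available Keras models; the hybrid ImageNet+Places365 model of [6] is omitted because it is not a stock Keras application, and uniform weights are an assumption.

```python
import numpy as np
from tensorflow.keras.applications import vgg16, inception_v3, inception_resnet_v2
from tensorflow.keras.preprocessing import image

# (pre-trained model, per-model preprocessing function, expected input size)
MODELS = [
    (vgg16.VGG16(weights="imagenet"), vgg16.preprocess_input, (224, 224)),
    (inception_v3.InceptionV3(weights="imagenet"), inception_v3.preprocess_input, (299, 299)),
    (inception_resnet_v2.InceptionResNetV2(weights="imagenet"),
     inception_resnet_v2.preprocess_input, (299, 299)),
]

def ensemble_scores(img_path):
    """Average the 1000-class ImageNet concept scores of the three models."""
    scores = []
    for model, preprocess, size in MODELS:
        x = image.img_to_array(image.load_img(img_path, target_size=size))
        x = preprocess(np.expand_dims(x, axis=0))
        scores.append(model.predict(x, verbose=0)[0])   # softmax over 1000 concepts
    return np.mean(scores, axis=0)
```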
“…The analysis is over 30 times faster than real-time processing and results in a rich set of video fragments that can be used for fine-grained video annotation. The concept-based annotation of the defined video fragments is performed using a combination of deep learning methods (presented in [8] and [4]), which evaluate the appearance of 150 high-level concepts from the TRECVID SIN task [6] in the visual content of the corresponding keyframes. Two pre-trained ImageNet [10] deep convolutional neural networks (DCNNs) have been fine-tuned (FT) using the extension strategy of [8].…”
Section: News Video Annotation
Citation type: mentioning (confidence: 99%)
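The "extension strategy" fine-tuning mentioned above can be sketched roughly as follows: keep a pre-trained ImageNet network, insert one new fully connected "extension" layer, and attach a new multi-label output layer before fine-tuning. This is a generic torchvision approximation under stated assumptions (layer sizes follow the values quoted in the next snippet); the exact strategy of [8], for example which layers are frozen, may differ.

```python
import torch.nn as nn
from torchvision import models

def resnet50_with_extension(num_concepts=150, extension_size=4096):
    """Hypothetical sketch: ImageNet-pretrained ResNet-50 whose classifier is
    replaced by one 'extension' FC layer plus a new multi-label output layer;
    the resulting network would then be fine-tuned on the target concepts."""
    net = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    in_features = net.fc.in_features                  # 2048 for ResNet-50
    net.fc = nn.Sequential(
        nn.Linear(in_features, extension_size),       # the extension layer
        nn.ReLU(inplace=True),
        nn.Linear(extension_size, num_concepts),      # one logit per target concept
    )
    return net
```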
“…Two pre-trained ImageNet [10] deep convolutional neural networks (DCNNs) have been fine-tuned (FT) using the extension strategy of [8]. Similar to [4], the networks' loss function has been extended with an additional concept correlation cost term, giving a higher penalty to pairs of concepts that present positive correlations but have been assigned different scores, and the same penalty to pairs of concepts that present negative correlations but have not been assigned opposite scores. The exact instantiation of the approach used is as follows: Resnet1k-50 [3] extended with one extension FC layer of size 4096, and GoogLeNet [13] trained on 5055 ImageNet concepts [10], extended with one extension FC layer of size 1024.…”
Section: News Video Annotation
Citation type: mentioning (confidence: 99%)
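One plausible reading of the correlation cost term described in the snippet above is sketched below. This is a hedged approximation, not the exact formulation of [4] or [8]: positively correlated concept pairs are penalized for receiving different scores, negatively correlated pairs for not receiving roughly opposite scores; `corr` is assumed to be a (C, C) label-correlation matrix estimated from the training annotations.

```python
import torch

def concept_correlation_cost(scores, corr):
    """scores: (B, C) sigmoid outputs in [0, 1];
    corr:   (C, C) concept-label correlations in [-1, 1].
    Returns a scalar cost to add to the usual multi-label loss."""
    p_i = scores.unsqueeze(2)               # (B, C, 1)
    p_j = scores.unsqueeze(1)               # (B, 1, C)
    pos = torch.clamp(corr, min=0.0)        # strength of positive correlations
    neg = torch.clamp(-corr, min=0.0)       # strength of negative correlations
    # positively correlated concepts should receive similar scores ...
    cost_pos = pos * (p_i - p_j) ** 2
    # ... negatively correlated concepts roughly opposite scores (p_i ≈ 1 - p_j)
    cost_neg = neg * (p_i + p_j - 1.0) ** 2
    return (cost_pos + cost_neg).mean()
```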
“…Non-lecture videos are fragmented into video shots using the method of Apostolidis and Mezaris [2], which detects both abrupt and gradual transitions by appropriately assessing the visual similarity of neighboring frames of the video. Then, a representative keyframe is extracted from each shot and is annotated with concepts from a pre-specified concept pool [7]. This method is based on a deep learning architecture that exploits concept relations at two different levels to learn to detect the concepts in the video more accurately.…”
Section: Main Features of the Moving Platform
Citation type: mentioning (confidence: 99%)
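As a toy illustration of shot segmentation driven by the visual similarity of neighboring frames, here is a minimal OpenCV sketch that flags abrupt transitions only. The actual detector of [2] also handles gradual transitions and uses a more elaborate similarity assessment, so the histogram settings and threshold below are purely illustrative.

```python
import cv2

def abrupt_shot_boundaries(video_path, threshold=0.5):
    """Flag an abrupt transition when the HSV-histogram correlation between
    consecutive frames drops below a threshold; returns frame indices at
    which a new shot is assumed to start."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            sim = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
            if sim < threshold:
                boundaries.append(idx)   # shot change between frames idx-1 and idx
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries
```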