2019
DOI: 10.1007/s42452-019-1870-9
Semantic concept based video retrieval using convolutional neural network

Abstract: Retrieving videos efficiently and effectively has become a challenging problem, and handling multi-concept videos is a central focus. The work presented here proposes an improved semantic concept-based video retrieval method using a novel ranked intersection filtering technique and a foreground-driven concept co-occurrence matrix. In the proposed ranked intersection filtering technique, an intersection of ranked concept probability scores is taken from key-frames associated with…
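The abstract only outlines the ranked intersection filtering step, so the sketch below is a rough illustration rather than the authors' implementation: it assumes each key-frame yields a vector of concept probability scores, ranks the concepts per key-frame, and keeps only the concepts appearing in every key-frame's top-ranked set. The function name `rank_intersection_filter`, the `top_k` parameter, and the toy scores are all hypothetical.

```python
import numpy as np

def rank_intersection_filter(keyframe_scores, top_k=5):
    """Illustrative sketch: keep only concepts that appear among the
    top_k ranked concept-probability scores of every key-frame.

    keyframe_scores: array of shape (num_keyframes, num_concepts),
    each row holding per-concept probability scores for one key-frame.
    Returns the surviving concept indices, ranked by mean score.
    """
    keyframe_scores = np.asarray(keyframe_scores)
    # Rank concepts within each key-frame and keep the top_k indices.
    top_sets = [set(np.argsort(row)[::-1][:top_k]) for row in keyframe_scores]
    # Intersect the ranked sets across all key-frames.
    surviving = set.intersection(*top_sets) if top_sets else set()
    # Order the surviving concepts by mean probability for retrieval ranking.
    mean_scores = keyframe_scores.mean(axis=0)
    return [int(c) for c in sorted(surviving, key=lambda c: mean_scores[c], reverse=True)]

# Toy example: 3 key-frames, 6 candidate semantic concepts.
scores = np.array([
    [0.9, 0.1, 0.7, 0.2, 0.6, 0.05],
    [0.8, 0.3, 0.6, 0.1, 0.7, 0.20],
    [0.7, 0.2, 0.8, 0.3, 0.5, 0.10],
])
print(rank_intersection_filter(scores, top_k=3))  # [0, 2, 4]
```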

Cited by 7 publications (3 citation statements); References: 27 publications
“…Paper [15] proposed a nested invariant pooling (NIP) method to obtain compact and discriminative convolutional neural network video retrieval descriptors; it mainly acts on the intermediate feature maps of the CNN in a nested manner, reducing the CNN feature size, boosting the geometric invariance of the deep descriptors, and significantly improving video retrieval performance. Reference [16] uses a classifier fusing asymmetrically trained deep CNNs to deal with data imbalance and achieve efficient video retrieval. The dual encoding network proposed in reference [17] encodes inputs in a similar manner.…”
Section: Introduction (mentioning; confidence: 99%)
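The citing passage above only summarizes the idea of collapsing intermediate CNN feature maps into compact descriptors. Purely as an assumption-laden sketch of that general idea (not the NIP method of [15] itself), one might nest two pooling stages over the spatial axes of a convolutional feature map:

```python
import numpy as np

def compact_descriptor(feature_map):
    """Hypothetical sketch (not the NIP method from [15]): collapse an
    intermediate CNN feature map of shape (channels, H, W) into a small
    descriptor by nesting two pooling stages over the spatial axes.
    """
    # Stage 1: max-pool over rows, keeping one value per (channel, column).
    stage1 = feature_map.max(axis=1)            # shape: (channels, W)
    # Stage 2: average-pool over the remaining spatial axis.
    descriptor = stage1.mean(axis=1)            # shape: (channels,)
    # L2-normalise so descriptors are comparable across videos.
    return descriptor / (np.linalg.norm(descriptor) + 1e-12)

fmap = np.random.rand(256, 14, 14)              # e.g. one conv-layer output
print(compact_descriptor(fmap).shape)           # (256,)
```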
“…Finally, the classification scores of each shot are fused to obtain the final class score. Similar studies are based on shot boundary detection and keyframe extraction techniques [11,12]. In such studies, video semantics are considered only at the frame level.…”
Section: Introduction (mentioning; confidence: 99%)
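The citing passage above does not specify how the per-shot classification scores are fused; a minimal sketch, assuming simple average fusion of per-shot class-score vectors into one video-level score (the actual fusion rule in the citing work may differ):

```python
import numpy as np

def fuse_shot_scores(shot_scores):
    """Minimal sketch: average per-shot classification scores into a
    single video-level class score and pick the top class.

    shot_scores: array of shape (num_shots, num_classes).
    """
    shot_scores = np.asarray(shot_scores)
    video_score = shot_scores.mean(axis=0)      # simple average fusion
    return video_score, int(video_score.argmax())

shots = np.array([[0.2, 0.7, 0.1],
                  [0.3, 0.6, 0.1],
                  [0.1, 0.8, 0.1]])
print(fuse_shot_scores(shots))                  # class 1 wins
```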
“…Similar studies are based on shot boundary detection and keyframe extraction techniques [11,12]. In such studies, video semantics are considered only at the frame level. There have been various studies considering spatio-temporal characteristics.…”
(mentioning; confidence: 99%)