Indexing broadcast TV archives is an active problem in multimedia research. As these databases grow continuously, meaningful features are needed to describe and connect their elements efficiently, such as the identification of speaking faces. In this context, this paper focuses on two approaches to unsupervised person discovery. Initial tags for speaking faces are provided by an OCR-based method and then propagated through a graph model built from audiovisual relations between speaking faces. Two propagation methods are proposed, one based on random walks and the other on a hierarchical approach. To better evaluate their performance, these methods were compared with two graph-clustering baselines. We also study the impact of different modality fusions on the graph-based tag propagation scenario. A quantitative analysis shows that the graph propagation techniques always outperform the baselines. Among all compared strategies, hierarchical propagation with late fusion and random walk with score fusion obtained the highest MAP values. Finally, even though these two methods produce highly equivalent results according to the Kappa coefficient, the random walk method performs better according to a paired t-test, while the computing time of the hierarchical propagation is more than 4 times lower than that of the random walk propagation.
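The random-walk tag propagation sketched above can be illustrated as a random walk with restart over the speaking-face graph: OCR-provided seed tags diffuse along audiovisual similarity edges, with a restart pulling scores back toward the seeds. This is a minimal illustrative sketch; the function name, the restart parameter `alpha`, and the iteration count are assumptions, not the paper's exact formulation.

```python
import numpy as np

def propagate_tags(adjacency, seed_labels, alpha=0.85, iters=50):
    """Propagate OCR seed tags over a speaking-face graph via random walks.

    adjacency   : (n, n) symmetric matrix of audiovisual similarities
    seed_labels : (n, k) one-hot rows for faces tagged by OCR, zeros otherwise
    alpha       : diffusion weight; (1 - alpha) restarts toward the seeds
    """
    # Row-normalize the adjacency matrix to obtain a transition matrix.
    row_sums = adjacency.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0
    P = adjacency / row_sums

    scores = seed_labels.astype(float).copy()
    for _ in range(iters):
        # Diffuse scores one step, then pull back to the seed distribution.
        scores = alpha * (P @ scores) + (1 - alpha) * seed_labels
    return scores.argmax(axis=1)  # predicted tag index per speaking face
```

On a chain graph with seeds at both ends, each untagged node inherits the tag of its nearer seed, which mirrors how names spread between adjacent speaking faces.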
The rapid growth of multimedia databases, together with human interest in the people who appear in them, makes indices representing the location and identity of people in audio-visual documents essential for searching archives. Person discovery in the absence of prior identity knowledge requires accurate association of audio-visual cues with detected names. To this end, we present three different strategies to approach this problem: clustering-based naming, verification-based naming, and graph-based naming. Each of these strategies draws on recent advances in unsupervised face/speech representation, verification, and optimization. To provide a better understanding of the approaches, this paper also offers a quantitative and qualitative comparative study using the associated corpus of the Person Discovery challenge at MediaEval 2016. The results of our experiments reveal the pros and cons of each approach, paving the way for promising future research directions.
TV archives are growing in size so fast that manual indexing becomes infeasible. Automatic indexing techniques can overcome this issue, and this work proposes an unsupervised technique for multimodal person discovery. To achieve this goal, we propose a hierarchical label propagation technique based on quasi-flat zones theory, which learns from labeled and unlabeled data and propagates names through a multimodal graph representation. In this representation, we combine audio, video, and text processing techniques to model the data as a graph of speaking faces. In the proposed modeling, we extract names via optical character recognition and propagate them through the graph using audiovisual relationships between speaking faces. We also use a random walk label propagation method and two graph clustering strategies as baselines. The proposed label propagation techniques always outperform the clustering baselines in the quantitative assessments. Our approach also outperforms all literature methods tested on the same dataset except one, which uses a different preprocessing step. The proposed hierarchical label propagation and the random walk baseline produce highly equivalent results according to the Kappa coefficient, but the hierarchical propagation is parameter-free and over 9 times faster than the random walk under the same configurations.
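The quasi-flat zones idea underlying the hierarchical propagation can be sketched with a union-find structure: nodes joined by edges whose dissimilarity does not exceed a threshold form a zone, and every unlabeled node in a zone inherits a name known inside it. Note the hedge: the paper's method is hierarchical and parameter-free, whereas this sketch fixes a single illustrative threshold `lam`; all names below are assumptions.

```python
class UnionFind:
    """Minimal union-find for grouping graph nodes into zones."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def quasi_flat_zone_labels(n, edges, seeds, lam):
    """Propagate seed names within quasi-flat zones at one threshold.

    n     : number of speaking-face nodes
    edges : iterable of (u, v, w) with w an audiovisual dissimilarity
    seeds : dict mapping node index -> OCR-extracted name
    lam   : dissimilarity threshold defining the zones (illustrative)
    """
    uf = UnionFind(n)
    for u, v, w in edges:
        if w <= lam:  # low-dissimilarity edges merge nodes into one zone
            uf.union(u, v)
    zone_name = {uf.find(node): name for node, name in seeds.items()}
    # Unlabeled nodes whose zone holds no seed stay unnamed (None).
    return [zone_name.get(uf.find(i)) for i in range(n)]
```

Sweeping `lam` over all edge weights would recover the hierarchy of zones that the parameter-free method exploits.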
The number of applications using unstructured data, such as videos, has increased, and research on multimedia retrieval has attracted great attention. Efficiently indexing and retrieving this kind of data is a major concern, because common keyword-based search approaches are not adequate for large video databases. Similarity search is a content-based approach that has been successfully used in retrieval systems. Accordingly, a major challenge is to provide an accurate and compact video representation that achieves good performance with fast answers in this type of search. In this work, we propose a compact video representation using Min-Hash and the k-nearest GIST descriptors. Furthermore, we also present the first use of the BossaNova Video Descriptor (BNVD) for video similarity search. Both compact video representations achieved more than 88% mean average precision on video similarity search. The experimental results indicate the high efficiency of our proposed representations in the video retrieval task.
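The Min-Hash component of the representation above can be illustrated as follows: a video is reduced to a set of quantized descriptor IDs, and a short signature of per-hash minima estimates the Jaccard similarity between two videos. This is a generic Min-Hash sketch under assumed parameters (`num_hashes`, the salting scheme), not the paper's exact pipeline.

```python
import random

def minhash_signature(id_set, num_hashes=64, seed=42):
    """Compact Min-Hash signature of a set of quantized descriptor IDs
    (e.g. visual-word indices derived from GIST descriptors)."""
    rng = random.Random(seed)
    # One salt per signature row; hashing (salt, id) simulates num_hashes
    # independent hash functions over the same ID set.
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((salt, i)) for i in id_set) for salt in salts]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of agreeing rows estimates the Jaccard similarity
    of the two underlying ID sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two videos with identical descriptor sets yield identical signatures, while unrelated videos almost never agree on a row, so comparing fixed-length signatures replaces comparing full descriptor sets.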
Image segmentation is an ill-posed problem by definition, as it is not always possible to automatically select which object appearing in an image is the object of interest. To deal with this issue, prior knowledge in the form of human-given markers can be included in the segmentation pipeline. Even though user interaction can drastically improve segmentation results, it is an expensive resource, and finding ways to reduce human effort in an interactive segmentation loop is of great interest. In this work, we propose a new segmentation layer to be used with deep neural networks, which allows us to create and train, in an end-to-end fashion, a marker creation network. To train the network, we propose a loss function composed of a segmentation loss, computed through the proposed differentiable segmentation layer, and a set of regularization functions that enforce the desired characteristics on the produced markers. We show that, by using the proposed layer and loss function, we can train the network to automatically generate markers that recover a good segmentation and have desirable shape characteristics. This behavior is observed on the training dataset as well as on four unseen datasets.