We propose a weakly supervised framework for domain adaptation in a multi-modal context for multi-label classification. The framework is applied to annotate objects such as animals in a target video with subtitles, in the absence of visual demarcators. We start from classifiers trained on external data (the source; in our setting, ImageNet) and iteratively adapt them to the target dataset using textual cues from the subtitles. Experiments on a challenging dataset of wildlife documentaries validate the framework, with a final F1 measure of approximately 70%, which significantly improves over a state-of-the-art baseline, namely applying classifiers trained on ImageNet without adaptation. The methods proposed here take us a step closer to object recognition in the wild and to automatic video indexing.
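To make the iterative adaptation idea concrete, the following is a minimal self-training sketch under assumed inputs, not the paper's implementation: a source-trained classifier is retrained on target frames whose confident predictions agree with animal names mentioned in the co-occurring subtitles. All names such as `frame_features` and `subtitle_labels` are hypothetical.

```python
# Illustrative sketch (not the authors' code): iterative adaptation of a
# source-trained classifier using textual cues from subtitles as weak labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

def adapt_classifier(source_clf, frame_features, subtitle_labels,
                     n_iters=5, thresh=0.8):
    """Self-training: keep target frames whose confident prediction agrees
    with a label mentioned in the co-occurring subtitle, then retrain.

    frame_features: [n x d] array of visual features for target key frames.
    subtitle_labels: list of sets of class ids mentioned near each frame.
    """
    clf = source_clf
    for _ in range(n_iters):
        probs = clf.predict_proba(frame_features)
        preds = probs.argmax(axis=1)
        conf = probs.max(axis=1)
        # Keep confident frames whose prediction is supported by the subtitles.
        keep = [i for i in range(len(preds))
                if conf[i] >= thresh and preds[i] in subtitle_labels[i]]
        if not keep:
            break
        clf = LogisticRegression(max_iter=1000).fit(
            frame_features[keep], preds[keep])
    return clf
```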
We propose an unsupervised framework for recognizing animals in videos using subtitles. In this framework, the alignment between animals and their names is performed using an Expectation Maximization algorithm adapted to two very different circumstances: 1) when bounding boxes are available, and 2) when the frame as a whole is used instead of bounding boxes. The experiments compare a range of natural language processing approaches and visual features for associating animal names in the subtitles with visual patterns, with the goal of maximizing precision, recall and F-measure. The proposed unsupervised methods obtain 83.1% F1 using bounding boxes and 65.7% F1 without bounding boxes in a fully automated pipeline.
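As a rough illustration of the bounding-box setting, here is a minimal EM-style alignment sketch: each animal name keeps a visual prototype, the E-step softly assigns each box to one of the names mentioned in its frame's subtitle, and the M-step re-estimates the prototypes. The data layout (`frames` as pairs of box features and candidate name ids) and the Gaussian-like responsibilities are assumptions, not the paper's exact model.

```python
# Hedged sketch of EM alignment between animal names and bounding boxes.
import numpy as np

def em_align(frames, n_names, dim, n_iters=20):
    """frames: list of (box_features [k x dim], candidate_name_ids [list])."""
    rng = np.random.default_rng(0)
    prototypes = rng.normal(size=(n_names, dim))
    for _ in range(n_iters):
        num = np.zeros((n_names, dim))
        den = np.zeros(n_names)
        for boxes, names in frames:
            # E-step: responsibility of each candidate name for each box.
            d = ((boxes[:, None, :] - prototypes[names][None, :, :]) ** 2).sum(-1)
            resp = np.exp(-d)
            resp /= resp.sum(axis=1, keepdims=True)
            # Accumulate sufficient statistics for the M-step.
            for j, name in enumerate(names):
                num[name] += resp[:, j] @ boxes
                den[name] += resp[:, j].sum()
        # M-step: prototypes become responsibility-weighted means of boxes.
        nz = den > 0
        prototypes[nz] = num[nz] / den[nz, None]
    return prototypes
```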
Multimodal question answering in the cultural heritage domain allows visitors to museums, landmarks or other sites to ask questions in a more natural way, which in turn provides a better user experience. In this paper, we propose the construction of a gold standard dataset dedicated to aiding research into multimodal question answering in the cultural heritage domain. The dataset, soon to be released to the public, contains multimodal content about the fascinating ancient Egyptian Amarna period, including images of typical artworks, documents about these artworks (containing images) and over 800 multimodal queries integrating visual and textual questions. The multimodal questions and related documents are all in English. The multimodal questions are linked to relevant paragraphs in the related documents that contain the answer to the multimodal query.
We propose a novel weakly supervised framework that jointly tackles entity analysis tasks in vision and language. Given a video with subtitles, we jointly address two questions: a) What do the textual entity mentions refer to? and b) What or who appears in the video key frames? We use a Markov Random Field (MRF) to encode the dependencies within and across the two modalities. The MRF model incorporates beliefs computed by independent methods for the textual and visual entities. These beliefs are propagated across the modalities to jointly derive the entity labels. We apply the framework to a challenging dataset of wildlife documentaries with subtitles and show that this integrated modelling yields significantly better performance than text-based and vision-based approaches. We show that textual mentions that cannot be resolved using text-only methods are resolved correctly with our method. The approaches described here bring us closer to automated multimedia indexing.
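To illustrate the cross-modal propagation idea, the toy sketch below lets textual mentions and visual detections hold label distributions and exchange messages along their links, weighted by a label-compatibility matrix. The names and the simplified update rule are assumptions for illustration, not the paper's MRF formulation.

```python
# Toy sketch of cross-modal belief propagation between text and vision nodes.
import numpy as np

def propagate_beliefs(text_beliefs, vis_beliefs, links, compat, n_iters=10):
    """text_beliefs: [m x L], vis_beliefs: [v x L] unary label distributions.
    links: list of (mention_idx, detection_idx) cross-modal edges.
    compat: [L x L] label compatibility between the two modalities."""
    t, v = text_beliefs.copy(), vis_beliefs.copy()
    for _ in range(n_iters):
        new_t, new_v = text_beliefs.copy(), vis_beliefs.copy()
        for i, j in links:
            # Each node rescales its belief by the message from its neighbour.
            new_t[i] *= compat @ v[j]
            new_v[j] *= compat.T @ t[i]
        t = new_t / new_t.sum(axis=1, keepdims=True)
        v = new_v / new_v.sum(axis=1, keepdims=True)
    # Return the jointly derived entity labels for both modalities.
    return t.argmax(axis=1), v.argmax(axis=1)
```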
Multimedia data grow day by day, which makes it necessary to index them automatically and efficiently for fast retrieval, and more precisely to index them automatically with key events. In this paper, we present preliminary work on key event detection in British royal wedding videos using automatic speech recognition (ASR) and visual data. The system first automatically acquires key events of royal weddings from an external corpus such as Wikipedia, and then identifies those events in the ASR data. The system also models name and face alignment to identify the persons involved in the wedding events. We compare the results obtained with the ASR output with those obtained with subtitles. When detecting key events and their participants in the wedding videos, the error with ASR output is only slightly higher than with subtitles.
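As a rough illustration of the event-spotting step (assumed fuzzy keyword matching; the paper's actual matching may differ), the snippet below locates key events mined from an external corpus in time-stamped ASR segments.

```python
# Hedged sketch: spotting externally acquired key events in ASR segments.
from difflib import SequenceMatcher

def spot_events(key_events, asr_segments, min_ratio=0.5):
    """key_events: short event descriptions (e.g. 'exchange of rings').
    asr_segments: list of (start_time, end_time, text) from the ASR output."""
    hits = []
    for event in key_events:
        for start, end, text in asr_segments:
            ratio = SequenceMatcher(None, event.lower(), text.lower()).ratio()
            # Accept either a good fuzzy match or a literal substring match.
            if ratio >= min_ratio or event.lower() in text.lower():
                hits.append((event, start, end, ratio))
    return hits
```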