Cross-modal alignment for wildlife recognition

Pattern Recognition Letters

Tuytelaars

Moens

2016

Self Cite

We propose a weakly supervised framework for domain adaptation in a multi-modal context for multi-label classification. This framework is applied to annotate objects such as animals in a target video with subtitles, in the absence of visual demarcators. We start from classifiers trained on external data (the source; in our setting, ImageNet) and iteratively adapt them to the target dataset using textual cues from the subtitles. Experiments on a challenging dataset of wildlife documentaries validate the framework, with a final F1 measure of approximately 70%, which significantly improves over the results of a state-of-the-art approach, that is, applying classifiers trained on ImageNet without adaptation. The methods proposed here take us a step closer to object recognition in the wild and automatic video indexing.
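As a rough illustration of the F1 measure reported above for multi-label annotation, the sketch below scores predicted label sets against gold label sets, one set per key frame. The abstract does not say which averaging scheme was used, so micro-averaging is an assumption here, and the function name `micro_f1` and the toy labels are hypothetical.

```python
from typing import Iterable, Set, Tuple


def micro_f1(
    gold: Iterable[Set[str]], predicted: Iterable[Set[str]]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1), micro-averaged over all frames and labels.

    Assumption: micro-averaging; the source abstract does not specify this.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)  # labels predicted and actually present
        fp += len(p - g)  # labels predicted but not present
        fn += len(g - p)  # labels present but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy data: one label set per key frame.
gold = [{"lion"}, {"zebra", "lion"}, set(), {"elephant"}]
pred = [{"lion"}, {"zebra"}, {"zebra"}, {"elephant", "lion"}]
precision, recall, f1 = micro_f1(gold, pred)
```

Representing each frame as a set of labels keeps the multi-label case simple: a frame with no animal is the empty set, and any label predicted for it counts as a false positive.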

Section: Introduction (mentioning)

confidence: 99%

“…Acquiring these bounding boxes by hand is tedious. Therefore, unlike [4], we are interested in a more realistic scenario where the bounding boxes are not available. In the absence of bounding boxes, the problem becomes much more challenging due to the following key issues: First, the presence of an animal is not known.…”

Section: Introduction (mentioning)

confidence: 99%

Wildlife recognition in nature documentaries with weak supervision from subtitles and external data

Pattern Recognition Letters

Tuytelaars

Moens

2016

Self Cite

“…The dataset used in our experiments is that of (Dusart et al., 2013). This is a wildlife documentary named 'Great Wildlife Moments' with subtitles from the BBC.…”

Section: Experiments and Results (mentioning)

confidence: 99%

“…The problem of aligning animals from videos with their mentions in subtitles has been studied in (Dusart et al., 2013) and (Venkitasubramanian et al., 2016). The former relies on hand-annotated bounding boxes to localize the animals in a frame, which are difficult to acquire.…”

Section: Related Work (mentioning)

confidence: 99%

Learning to Recognize Animals by Watching Documentaries: Using Subtitles as Weak Supervision

Proceedings of the Sixth Workshop on Vision and Language

Tuytelaars

Moens

2017

Self Cite

We investigate animal recognition models learned from wildlife video documentaries by using the weak supervision of the textual subtitles. This is a challenging setting, since i) the animals occur in their natural habitat and are often largely occluded and ii) subtitles are to a great degree complementary to the visual content, providing a very weak supervisory signal. This is in contrast to most work on integrated vision and language in the literature, where textual descriptions are tightly linked to the image content, and often generated in a curated fashion for the task at hand. We investigate different image representations and models, in particular a support vector machine on top of activations of a pretrained convolutional neural network, as well as a Naive Bayes framework on a 'bag-of-activations' image representation, where each element of the bag is considered separately. This representation allows key components in the image to be isolated, in spite of vastly varying backgrounds and image clutter, without an object detection or image segmentation step. The methods are evaluated based on how well they transfer to unseen camera-trap images captured across diverse topographical regions under different environmental conditions and illumination settings, involving a large domain shift.

Entity linking across vision and language

Tuytelaars

Moens

2017

Multimed Tools Appl

Self Cite

We propose a novel weakly supervised framework that jointly tackles entity analysis tasks in vision and language. Given a video with subtitles, we jointly address two questions: a) What do the textual entity mentions refer to? and b) What or who is in the video key frames? We use a Markov Random Field (MRF) to encode the dependencies within and across the two modalities. This MRF model incorporates beliefs obtained from independent methods for the textual and visual entities. These beliefs are propagated across the modalities to jointly derive the entity labels. We apply the framework to a challenging dataset of wildlife documentaries with subtitles and show that this integrated modelling yields significantly better performance than text-based and vision-based approaches. We show that textual mentions that cannot be resolved using text-only methods are resolved correctly using our method. The approaches described here bring us closer to automated multimedia indexing.