2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2017.124
Discover and Learn New Objects from Documentaries

Abstract: Despite the remarkable progress in recent years, detecting objects in a new context remains a challenging task. Detectors learned from a public dataset can only work with a fixed list of categories, while training from scratch usually requires a large amount of training data with detailed annotations. This work aims to explore a novel approach: learning object detectors from documentary films in a weakly supervised manner. This is inspired by the observation that documentaries often provide dedicated exposition…

Cited by 23 publications (15 citation statements)
References 41 publications (45 reference statements)
“…In contrast, in this work no manually annotated visual data is involved at any stage of our approach. To avoid labelling visual data, several approaches have leveraged audio transcripts obtained from narrated videos using automatic speech recognition (ASR) as a way to supervise video models for object detection [3,15,54], captioning [33,69], classification [2,42,47,86], summarization [57] or retrieval [50] using large-scale narrated video datasets such as How2 [65] or HowTo100M [50]. Others [10,30] have investigated learning from narrated videos by directly using the raw speech waveform instead of generating transcriptions.…”
Section: Related Work
Confidence: 99%
“…Multiple instance learning (MIL) [7,33] methods have been used for weakly supervised tasks such as object localization (WSOL) [25,8,53,41]. In a standard MIL framework, instance labels in each positive bag are treated as hidden variables, with the constraint that at least one of them must be positive.…”
Section: Related Work
Confidence: 99%
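The "at least one positive instance" constraint is often realized with max-pooling over instance scores. Below is a minimal sketch of such a bag-level MIL loss in PyTorch; it is an illustrative simplification, not the exact formulation of any cited paper, and the function name and tensor shapes are assumptions made here for clarity.

```python
import torch
import torch.nn.functional as F

def mil_bag_loss(instance_logits: torch.Tensor, bag_labels: torch.Tensor) -> torch.Tensor:
    """Minimal MIL objective with max-pooling (illustrative sketch).

    instance_logits: (num_bags, num_instances) raw instance scores,
                     e.g. detector scores for region proposals in an image.
    bag_labels:      (num_bags,) 1.0 for positive bags, 0.0 for negative.

    A bag counts as positive iff at least one of its instances is positive,
    so the bag-level score is the max over its instance scores.
    """
    bag_logits = instance_logits.max(dim=1).values
    return F.binary_cross_entropy_with_logits(bag_logits, bag_labels)

# Toy usage: 2 bags of 5 instances each; the first bag is positive.
logits = torch.randn(2, 5, requires_grad=True)
labels = torch.tensor([1.0, 0.0])
mil_bag_loss(logits, labels).backward()
```

Max-pooling is only one choice of bag aggregator; smoother alternatives such as noisy-OR or log-sum-exp pooling spread gradient across more instances and are common in the WSOL literature.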
“…[3,13] focus on separating distinguishable audio and video objects simultaneously. [6] learn to associate tracklets with words in documentary subtitles. Most of these multi-modal methods primarily focus on captioning or retrieval tasks, while our main focus is localization.…”
Section: Related Work
Confidence: 99%
“…We believe an essential step to scale up to millions of object classes is to use abundant and labor-free web data. One pioneering work is from Chen et al. [6], which learns to discover and localize new objects from documentary videos by associating subtitles to video tracklets. There is also work associating phrases in a caption with the visually depicted objects in the image [33,20].…”
Section: Introduction
Confidence: 99%
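To make "associating subtitles to video tracklets" concrete, here is a toy co-occurrence sketch: words from temporally aligned subtitles are scored against tracklet appearance clusters that overlap the same time window. This is an assumption-laden simplification written for this report, not Chen et al.'s actual model, and the names (`Tracklet`, `associate`) are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Tracklet:
    cluster_id: int   # appearance cluster the tracklet was grouped into
    start: float      # start time in seconds
    end: float        # end time in seconds

def associate(subtitles, tracklets):
    """Score (word, cluster) pairs by temporal co-occurrence.

    subtitles: list of (start, end, [nouns]) tuples.
    Returns a Counter mapping (word, cluster_id) to co-occurrence counts;
    the highest-scoring clusters for a word are candidate visual referents.
    """
    counts = Counter()
    for s_start, s_end, nouns in subtitles:
        for t in tracklets:
            if t.start < s_end and t.end > s_start:  # time windows overlap
                for word in nouns:
                    counts[(word, t.cluster_id)] += 1
    return counts

# Toy usage: each subtitle window overlaps one tracklet cluster.
subs = [(0.0, 4.0, ["lion"]), (10.0, 14.0, ["zebra"])]
trks = [Tracklet(0, 1.0, 3.5), Tracklet(1, 11.0, 13.0)]
print(associate(subs, trks).most_common(2))
```

In practice such raw co-occurrence counts are noisy, which is precisely why weakly supervised formulations like the MIL objective sketched above are used to resolve which tracklet a mentioned word actually refers to.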