Proceedings of the British Machine Vision Conference 2013
DOI: 10.5244/c.27.53
Unsupervised Object Discovery and Segmentation in Videos

Abstract: Unsupervised object discovery is the task of finding recurring objects in an unsorted set of images without any human supervision, a task that becomes increasingly important as the amount of visual data grows exponentially. Existing approaches typically build on still images and rely on various forms of prior knowledge to yield accurate results. In contrast, we propose a novel video-based approach that also allows exploiting motion information, a strong and physically valid indicator for foreground objects, t…

Cited by 12 publications (7 citation statements)
References 35 publications
“…At one end of the spectrum, fully supervised methods require careful annotation of object locations in the form of bounding boxes [16,34,32], segmentations [37] or even object part locations [33,38], which is costly and can frequently introduce inconsistency and ambiguity. On the other hand, unsupervised learning methods that do not require any supervision aim at finding similar objects in a set of unlabelled images [7,39] or videos [40]. They are, however, often limited to frequently occurring and visually consistent objects and are easily susceptible to background clutter.…”
Section: Related Work
confidence: 99%
“…On the other hand, labelled videos involving human activity, like pouring milk or eating cereal, are abundantly available. Such data, however, violates the principal assumption, since the prevalent themes of the video are now human body parts and background clutter instead of objects of interest, thus resulting in the failure of contemporary methods, as demonstrated in our experiments.…”
Section: Introduction
confidence: 99%
“…Therefore, there has been a surge in exploring unsupervised and weakly-supervised approaches for object detection. However, fully unsupervised approaches [30,17] without any annotations currently give considerably inferior performance on similar tasks, while conventional weakly-supervised methods [2,16,42] use static images to learn the detectors. These object detectors, however, fail to generalize to videos due to the shift in domain.…”
Section: Introduction
confidence: 99%
“…Like Gall (2014, 2017), we codetect small and medium-sized objects, but do so without a depth map or heavy dependence on human pose data. Like Schulter et al. (2013), we codetect both moving and stationary objects, but do so with a larger set of object classes and a larger video corpus. Also, like Ramanathan et al. (2014), we use sentences, but do so for a vocabulary that goes beyond the pronouns, nominals, and names that are used to codetect only human face tracks.…”
Section: Related Work
confidence: 99%
“…Schulter et al. (2013) construct a Conditional Random Field (CRF) in each input video frame with segmented superpixels as vertices. They use both motion and appearance information as unary potentials, and place binary edges between spatially and temporally neighboring superpixels.…”
Section: Related Work
confidence: 99%
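The CRF structure described in the last citation statement can be sketched in a few lines of code: superpixels become graph vertices, motion and appearance cues feed the unary potentials, and pairwise edges link spatially and temporally neighboring superpixels. The sketch below is a minimal, hypothetical illustration of that graph construction and its energy; the field names, the 0.5/0.5 cue weighting, the index-based temporal links, and the Potts pairwise term are all assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a superpixel CRF: vertices are (frame, superpixel)
# pairs, unaries combine motion and appearance cues, and edges connect
# spatial and temporal neighbors. All names and weights are illustrative.

def build_crf(frames):
    """frames: list of frames; each frame is a list of superpixel dicts with
    'motion' and 'appearance' scores in [0, 1] and 'spatial_neighbors'
    (indices of neighboring superpixels in the same frame)."""
    unaries = {}  # (t, i) -> cost of labelling the superpixel foreground
    edges = []    # pairwise edges as ((t, i), (t', j)) tuples

    for t, frame in enumerate(frames):
        for i, sp in enumerate(frame):
            # Unary: foreground is cheap when motion/appearance evidence is strong.
            unaries[(t, i)] = 0.5 * (1.0 - sp["motion"]) + 0.5 * (1.0 - sp["appearance"])
            # Spatial edges within the frame (i < j avoids duplicates).
            for j in sp["spatial_neighbors"]:
                if i < j:
                    edges.append(((t, i), (t, j)))
            # Temporal edge to the same superpixel index in the next frame --
            # a stand-in for correspondences that would come from optical flow.
            if t + 1 < len(frames) and i < len(frames[t + 1]):
                edges.append(((t, i), (t + 1, i)))
    return unaries, edges


def energy(labels, unaries, edges, pairwise_weight=1.0):
    """Total CRF energy: unary costs plus a Potts penalty for each pair of
    neighboring superpixels that disagree on their label (1 = foreground)."""
    e = sum(unaries[v] if labels[v] == 1 else 1.0 - unaries[v] for v in unaries)
    e += pairwise_weight * sum(labels[u] != labels[v] for u, v in edges)
    return e
```

Minimizing this energy (in practice with graph cuts or a similar inference method, rather than by enumeration) yields a foreground/background segmentation that is consistent both within each frame and across time.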