Cap2Det: Learning to Amplify Weak Caption Supervision for Object Detection

Ye, Keren; Zhang, Mingda; Kovashka, Adriana; Li, Wei; Qin, Danfeng; Berent, Jesse

doi:10.1109/iccv.2019.00978

Cited by 49 publications

(48 citation statements)

References 37 publications

“…Note that the motivation of our proposed pseudo-supervised learning is conceptually different from the weakly supervised learning methods [31]-[35]. Weakly supervised learning methods are mostly used in object detection where manually annotated labels are heavily required and costly.…”

Section: Learning Framework (mentioning)

confidence: 99%

Pseudo-Supervised Learning for Semantic Multi-Style Transfer

Kim

2021

IEEE Access

Numerous methods for style transfer have been developed using unsupervised learning and have achieved impressive results. However, optimal style transfer cannot be conducted in a global fashion in certain style domains, mainly when a single target-style domain contains semantic objects that have their own distinct and unique styles, e.g., objects in the anime-style domain. Previous methods are inadequate because unsupervised learning cannot provide the semantic mappings between the multi-style objects according to their unique styles. Thus, in this paper, we propose a pseudo-supervised learning framework for semantic multi-style transfer (SMST), which consists of (i) a pseudo ground truth (pGT) generation phase and (ii) an SMST learning phase. In the pGT generation phase, multiple semantic objects of the photo images are separately transferred to the target-domain object styles in an object-oriented fashion. Then the transferred objects are composed back into an image, which is the pGT. In the SMST learning phase, an SMST network (SMSTnet) is trained in a supervised manner with pairs of photo images and their respective pGTs. From this, our framework can provide the semantic mappings of multi-style objects. Moreover, to embrace the multiple styles of various objects in a single generator, we design the SMSTnet with channel attention in conjunction with a discriminator dedicated to our pseudo-supervised learning. Our method has been applied and intensively tested for anime-style transfer learning. The experimental results demonstrate the effectiveness of our method and show its superiority compared to the state-of-the-art methods.

“…Recent studies have also explored the related task of weakly supervised object detection (WSOD) using only image captions as supervision [44], [45]. Similar in spirit to our proposed caption processing module, these studies have explored methods for extracting useful visual information explicitly from captions as a structured set of labels.…”

Section: Weakly Supervised Object Detection Using Image Captions (mentioning)

confidence: 99%

Extracting Structured Supervision From Captions for Weakly Supervised Semantic Segmentation

Vilar

Perez

2021

IEEE Access

Weakly supervised semantic segmentation (WSSS) methods have received significant attention in recent years, since they can dramatically reduce the annotation costs of fully supervised alternatives. While most previous studies focused on leveraging classification labels, we explore instead the use of image captions, which can be obtained easily from the web and contain richer visual information. Existing methods for this task assigned text snippets to relevant semantic labels by simply matching class names, and then employed a model trained to localize arbitrary text in images to generate pseudo-ground-truth segmentation masks. Instead, we propose a dedicated caption processing module to extract structured supervision from captions, consisting of improved relevant object labels, their visual attributes, and additional background categories, all of which are useful for improving segmentation quality. This module uses syntactic structures learned from text data, and semantic relations retrieved from a knowledge database, without requiring additional annotations on the specific image domain, and consequently can be extended immediately to new object categories. We then present a novel localization network, which is trained to localize only these structured labels. This strategy simplifies model design, while focusing training signals on relevant visual information. Finally, we describe a method for leveraging all types of localization maps to obtain high-quality segmentation masks, which are used to train a supervised model. On the challenging MS-COCO dataset, our method moves the state-of-the-art forward significantly for WSSS with image-level supervision by a margin of 7.6% absolute (26.7% relative) mean Intersection-over-Union, achieving 54.5% precision and 50.9% recall.

“…TAM-NET [40] utilizes text to generate text activation maps, which can be used for augmenting class activation map in segmentation task. Cap2Det [52] leverages the signal that captions provide for weakly supervised detection. However, caption-enhanced image segmentation models are still inadequately explored in the literature.…”

Section: Related Work (mentioning)

confidence: 99%

“…Let us first briefly summarize the manipulation of image captions in most relevant works Cap2Det [52] or TAM [40] for image segmentation or detection. The input to caption processor is obtained by encoding each word with a word2vec model and average pooling over words, often intertwined with fully-connected layers for fine-tuning.…”

Section: Visual Occurrence Estimation By Contextual Entailment (mentioning)

confidence: 99%
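
The caption encoding described in the statement above (one vector per word, then average pooling over the words) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in Cap2Det, TAM, or Cap2Seg: the `TOY_VECTORS` table and the `encode_caption` helper are hypothetical names, the 3-dimensional vectors are made-up values standing in for a trained word2vec model, and the fully-connected fine-tuning layers mentioned in the statement are omitted.

```python
# Hypothetical toy "word vectors"; a real system would look these up
# in a trained word2vec model instead of a hand-written table.
TOY_VECTORS = {
    "a": [0.0, 0.0, 0.0],
    "dog": [1.0, 0.0, 2.0],
    "runs": [0.0, 3.0, 1.0],
}


def encode_caption(caption, vectors=TOY_VECTORS):
    """Encode a caption as the mean of its known word vectors."""
    # Lowercase, split on whitespace, and drop out-of-vocabulary words.
    words = [w for w in caption.lower().split() if w in vectors]
    if not words:
        raise ValueError("caption contains no known words")

    dim = len(next(iter(vectors.values())))
    pooled = [0.0] * dim
    # Sum the word vectors component by component.
    for word in words:
        for i, value in enumerate(vectors[word]):
            pooled[i] += value
    # Divide by the word count to obtain the average-pooled vector.
    return [value / len(words) for value in pooled]


caption_vector = encode_caption("A dog runs")
```

The pooled vector has the same dimension as a single word vector regardless of caption length, which is what lets a fixed-size layer consume captions of varying length.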

Cap2Seg: Inferring Semantic and Spatial Context from Captions for Zero-Shot Image Segmentation

Tian

Wang

Feng

et al. 2020

Proceedings of the 28th ACM International Conference on Multimedia

Zero-shot image segmentation refers to the task of segmenting pixels from specific unseen semantic classes. Previous methods mainly rely on historic segmentation tasks, such as using semantic embedding or word embedding of class names to infer a new segmentation model. In this work we describe Cap2Seg, a novel solution for zero-shot image segmentation that harnesses accompanying image captions to infer spatial and semantic context for the zero-shot image segmentation task. As our main insight, image captions often implicitly entail the occurrence of a new class in an image and its most-confident spatial distribution. We define a contextual entailment question (CEQ) that tailors BERT-like text models. Specifically, the proposed network for inferring unseen classes consists of three branches (global / local / semi-global), which infer labels of unseen classes at the image level, image-stripe level, or pixel level, respectively. Comprehensive experiments and ablation studies are conducted on two image benchmarks, COCO-stuff and Pascal VOC. All clearly demonstrate the effectiveness of the proposed Cap2Seg, including on a set of the hardest unseen classes (i.e., image captions do not literally contain the class names and direct matching for inference fails).
