Multi-Task Collaborative Network for Joint Referring Expression Comprehension and Segmentation

Luo, Gen; Zhou, Yiyi; Sun, Xiaoshuai; Cao, Liujuan; Wu, Chenglin; Deng, Cheng; Ji, Rongrong

doi:10.1109/cvpr42600.2020.01005

Cited by 202 publications

(137 citation statements)

References 31 publications

Supporting

Mentioning

132

Contrasting

“…Finally, the cross-modal features are used to generate the final prediction masks. Unlike existing RES methods [13,14], which segment objects according to the query text, we input text in parallel with the input image to extract information. By combining cross-modal features from both image and text, we accurately segment fluorescein leakage.…”

Section: Methods (mentioning)

confidence: 99%

Let’s Find Fluorescein: Cross-Modal Dual Attention Learning For Fluorescein Leakage Segmentation In Fundus Fluorescein Angiography

Yang, Chen, Qiao et al. 2021

2021 IEEE International Conference on Multimedia and Expo (ICME)

Automatic segmentation of fluorescein leakage in fundus fluorescein angiography images is important in the clinical diagnosis of advanced diabetic retinopathy. Despite the recent success of deep-learning-based models in improving medical image segmentation, segmentation of fluorescein leakage has been ignored owing to (1) a lack of publicly available data with sufficient annotations for training a segmentation network and (2) incapability of supervised models to accurately localize fluorescein leakage at different imaging angles. To address these issues, we studied the automatic segmentation of fluorescein leakage in fundus fluorescein angiography images and devised a method involving (1) a cross-modal learning framework for fluorescein leakage segmentation using both image and text data, (2) a dual attention learning module for identifying important linguistic and visual features, and (3) fluorescein-related-keyword classification for identifying meaningful textual expressions pertaining to the location and type of fluorescein leakage. We demonstrate the effectiveness of the proposed method for an in-house fundus fluorescein angiography image data set.

“…1c). The recent success of reference expression segmentation (RES), which involves the use of natural language expressions to locate objects [13,14], suggests the possibility of using cross-modal data to build a robust and effective framework for fluorescein leakage segmentation.…”

Section: Introduction (mentioning)

confidence: 99%

Let’s Find Fluorescein: Cross-Modal Dual Attention Learning For Fluorescein Leakage Segmentation In Fundus Fluorescein Angiography

Yang, Chen, Qiao et al. 2021

2021 IEEE International Conference on Multimedia and Expo (ICME)

“…Recurrent network in [25,30] and pyramid feature map in [22] are utilized to excavate more semantic context for fusion. Luo et al [28] proposes to learn referring segmentation and comprehension in a unified manner for better aligned representation. Inspired by the prevalence of attention mechanism in computer vision field, researchers resort to attention mechanism for an effective fusion of multi-modal representations.…”

Section: Referring Image Segmentation (mentioning)

confidence: 99%

“…(3) Incomplete utilization of instance-level features: visual embeddings are always treated equally in terms of every location without highlighting in instances. Most of the previous methods [2,27,28,37] in this area directly use the global image representations without considering the instance-level features. However, for referring segmentation, the instance-level features should be highlighted since the referent in expression is often prone to describe instances.…”

Section: Introduction (mentioning)

confidence: 99%

MaIL: A Unified Mask-Image-Language Trimodal Network for Referring Image Segmentation

Li, Wang, Mei et al. 2021

Preprint

Referring image segmentation is a typical multi-modal task, which aims at generating a binary mask for referent described in given language expressions. Prior arts adopt a bimodal solution, taking images and languages as two modalities within an encoder-fusion-decoder pipeline. However, this pipeline is sub-optimal for the target task for two reasons. First, they only fuse high-level features produced by uni-modal encoders separately, which hinders sufficient cross-modal learning. Second, the uni-modal encoders are pre-trained independently, which brings inconsistency between pre-trained uni-modal tasks and the target multi-modal task. Besides, this pipeline often ignores or makes little use of intuitively beneficial instance-level features. To relieve these problems, we propose MaIL, which is a more concise encoder-decoder pipeline with a Mask-Image-Language trimodal encoder. Specifically, MaIL unifies uni-modal feature extractors and their fusion model into a deep modality interaction encoder, facilitating sufficient feature interaction across different modalities. Meanwhile, MaIL directly avoids the second limitation since no unimodal encoders are needed anymore. Moreover, for the first time, we propose to introduce instance masks as an additional modality, which explicitly intensifies instance-level features and promotes finer segmentation results. The proposed MaIL set a new state-of-the-art on all frequently-used referring image segmentation datasets, including RefCOCO, RefCOCO+, and G-Ref, with significant gains, 3%-10% against previous best methods. Code will be released soon.

“…As for consensus constraints, multi-task learning [14,15,20] enhances the model's generalization and performance by adding multiple related tasks to the main task. However, it is not usable for a single task.…”

Section: Introduction (mentioning)

confidence: 99%

Knowledge-Supervised Learning: Knowledge Consensus Constraints for Person Re-Identification

Wang, Fan, Guo et al. 2021

Proceedings of the 29th ACM International Conference on Multimedia

The consensus of multiple views on the same data will provide extra regularization, thereby improving accuracy. Based on this idea, we proposed a novel Knowledge-Supervised Learning (KSL) method for person re-identification (Re-ID), which can improve the performance without introducing extra inference cost. Firstly, we introduce isomorphic auxiliary training strategy to conduct basic multiple views that simultaneously train multiple classifier heads of the same network on the same training data. The consensus constraints aim to maximize the agreement among multiple views. To introduce this regular constraint, inspired by knowledge distillation that paired branches can be trained collaboratively through mutual imitation learning. Three novel constraints losses are proposed to distill the knowledge that needs to be transferred across different branches: similarity of predicted classification probability for cosine space constraints, distance of embedding features for euclidean space constraints, hard sample mutual mining for hard sample space constraints. From different perspectives, these losses complement each other. Experiments on four mainstream Re-ID datasets show that a standard model with KSL method trained from scratch outperforms its ImageNet pre-training results by a clear margin. With KSL method, a lightweight model without ImageNet pre-training outperforms most large models. We expect that these discoveries can attract some attention from the current de facto paradigm of "pre-training and fine-tuning" in Re-ID task to the knowledge discovery during model training.

CCS CONCEPTS • Computing methodologies → Image representations; Object identification.
