2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.01525
Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation

Cited by 111 publications (45 citation statements)
References 35 publications

“…On the most challenging G-Ref dataset (which contains significantly longer expressions), LAVT surpasses the respective second-best method on the validation and test subsets from the UMD partition by absolute margins of 6.84% and 5.44%, respectively. Similarly on the validation set from the Google partition, LAVT outperforms the second-best method EFN [14] by an absolute margin of 8.57%. This performance is achieved without using Ref-COCO as additional training data in contrast to EFN.…”
Section: Comparison With Others
Citation type: mentioning; confidence: 96%
“…The methods most related to ours are VLT [12] and EFN [14], where the former designs a Transformer decoder for fusing linguistic and visual features, and the latter adopts a convolutional vision backbone network for encoding language information. Differently from [12], we propose an early fusion scheme which effectively exploits the Transformer encoder for modeling multi-modal context.…”
Section: Related Work
Citation type: mentioning; confidence: 99%
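The early-fusion scheme referenced in the quote above can be illustrated with a short sketch: linguistic token features and flattened visual features are concatenated into a single sequence before a Transformer encoder, so that self-attention models multi-modal context jointly rather than fusing only in a decoder. This is a minimal illustration in PyTorch under assumed dimensions; the module name EarlyFusionEncoder and all hyperparameters are hypothetical and are not code from LAVT, VLT, or EFN.

```python
# Hedged sketch of early fusion: concatenate visual and language tokens
# before a Transformer encoder so attention spans both modalities.
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):
    def __init__(self, dim=256, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, visual_feats, word_feats):
        # visual_feats: (B, H*W, dim) flattened image feature map
        # word_feats:   (B, L, dim) linguistic token features
        # Early fusion: one joint token sequence for both modalities.
        tokens = torch.cat([visual_feats, word_feats], dim=1)
        fused = self.encoder(tokens)
        # Keep only the visual tokens for downstream mask prediction.
        return fused[:, : visual_feats.size(1)]

# Usage: a 16x16 feature map (256 tokens) fused with a 12-word expression.
model = EarlyFusionEncoder()
out = model(torch.randn(2, 256, 256), torch.randn(2, 12, 256))
print(out.shape)  # torch.Size([2, 256, 256])
```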