2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.01525
Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation

Cited by 111 publications (45 citation statements)
References 35 publications

“…On the most challenging G-Ref dataset (which contains significantly longer expressions), LAVT surpasses the respective second-best method on the validation and test subsets from the UMD partition by absolute margins of 6.84% and 5.44%, respectively. Similarly on the validation set from the Google partition, LAVT outperforms the second-best method EFN [14] by an absolute margin of 8.57%. This performance is achieved without using Ref-COCO as additional training data in contrast to EFN.…”
Section: Comparison With Others
Citation type: mentioning; confidence: 96%
“…The methods most related to ours are VLT [12] and EFN [14], where the former designs a Transformer decoder for fusing linguistic and visual features, and the latter adopts a convolutional vision backbone network for encoding language information. Differently from [12], we propose an early fusion scheme which effectively exploits the Transformer encoder for modeling multi-modal context.…”
Section: Related Work
Citation type: mentioning; confidence: 99%
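The early-fusion scheme referenced in the quote above can be illustrated with a short sketch: linguistic token features and flattened visual features are concatenated into a single sequence before a Transformer encoder, so that self-attention models multi-modal context jointly rather than fusing only in a decoder. This is a minimal illustration in PyTorch under assumed dimensions; the module name EarlyFusionEncoder and all hyperparameters are hypothetical and are not code from LAVT, VLT, or EFN.

```python
# Hedged sketch of early fusion: concatenate visual and language tokens
# before a Transformer encoder so attention spans both modalities.
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):
    def __init__(self, dim=256, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, visual_feats, word_feats):
        # visual_feats: (B, H*W, dim) flattened image feature map
        # word_feats:   (B, L, dim) linguistic token features
        # Early fusion: one joint token sequence for both modalities.
        tokens = torch.cat([visual_feats, word_feats], dim=1)
        fused = self.encoder(tokens)
        # Keep only the visual tokens for downstream mask prediction.
        return fused[:, : visual_feats.size(1)]

# Usage: a 16x16 feature map (256 tokens) fused with a 12-word expression.
model = EarlyFusionEncoder()
out = model(torch.randn(2, 256, 256), torch.randn(2, 12, 256))
print(out.shape)  # torch.Size([2, 256, 256])
```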