2021
DOI: 10.1109/tpami.2021.3079993
Cross-Modal Progressive Comprehension for Referring Segmentation

Abstract: Given a natural language expression and an image/video, the goal of referring segmentation is to produce the pixel-level masks of the entities described by the subject of the expression. Previous approaches tackle this problem by implicit feature interaction and fusion between the visual and linguistic modalities in a one-stage manner. However, humans tend to solve the referring problem in a progressive manner based on informative words in the expression, i.e., first roughly locating candidate entities and then dis…
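The abstract describes a two-stage scheme: first roughly locate candidate entities from entity/attribute words, then single out the target using relationship cues. Below is a minimal PyTorch sketch of that idea; it is not the authors' implementation, and every module name, shape, and the sigmoid-gating choice is an illustrative assumption.

```python
import torch
import torch.nn as nn

class ProgressiveComprehension(nn.Module):
    """Hypothetical two-stage sketch: entity perception, then relational inference."""

    def __init__(self, vis_dim=512, lang_dim=512):
        super().__init__()
        self.entity_proj = nn.Linear(lang_dim, vis_dim)
        self.relation_proj = nn.Linear(lang_dim, vis_dim)
        self.refine = nn.Conv2d(vis_dim, vis_dim, kernel_size=3, padding=1)
        self.mask_head = nn.Conv2d(vis_dim, 1, kernel_size=1)

    def forward(self, visual_feat, entity_word, relation_word):
        # visual_feat: (B, C, H, W); entity_word, relation_word: (B, lang_dim)
        # Stage 1: entity perception. Score every spatial location against
        # the entity/attribute embedding to roughly locate candidates.
        e = self.entity_proj(entity_word)[:, :, None, None]   # (B, C, 1, 1)
        entity_heat = (visual_feat * e).sum(1, keepdim=True)  # (B, 1, H, W)
        candidates = visual_feat * entity_heat.sigmoid()      # gate candidate regions

        # Stage 2: relational inference. Modulate the candidates with the
        # relationship embedding and refine to isolate the target entity.
        r = self.relation_proj(relation_word)[:, :, None, None]
        refined = self.refine(candidates * r.sigmoid())
        return self.mask_head(refined)                        # mask logits (B, 1, H, W)
```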

Cited by 59 publications (29 citation statements)
References 58 publications
“…Yu et al. [34] proposed a modular network that decomposes the input natural language description into subject, location, and relationship attributes to improve localization performance. Liu et al. [35] adopted graph models with an attention mechanism to capture the relationships between object regions in the given image. In association with visual affordance, Mi et al. [36], [37] investigated the use of natural language to guide visual affordance detection.…”
Section: B. Referring Expression Grounding
Citation type: mentioning (confidence: 99%)
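The modular decomposition attributed to Yu et al. [34] can be pictured as learning one soft attention distribution over the words of the expression per module, then pooling the word features into one embedding per module. The sketch below is a hedged illustration, not the authors' published code; the bidirectional LSTM encoder, the dimensions, and the single three-way attention head are all assumptions.

```python
import torch
import torch.nn as nn

class ModularLanguageEncoder(nn.Module):
    """Pool word features into subject / location / relationship embeddings."""

    def __init__(self, word_dim=300, hidden_dim=512, num_modules=3):
        super().__init__()
        self.rnn = nn.LSTM(word_dim, hidden_dim, batch_first=True,
                           bidirectional=True)
        # One attention scorer per module: subject, location, relationship.
        self.attn = nn.Linear(2 * hidden_dim, num_modules)

    def forward(self, word_embeddings):            # (B, T, word_dim)
        ctx, _ = self.rnn(word_embeddings)         # (B, T, 2*hidden_dim)
        weights = self.attn(ctx).softmax(dim=1)    # per-module attention over words
        # Attention-weighted pooling: one phrase embedding per module.
        module_embs = torch.einsum('btm,btd->bmd', weights, ctx)  # (B, 3, 2*hidden_dim)
        subject, location, relation = module_embs.unbind(dim=1)
        return subject, location, relation
```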
“…Hu et al. [22] designed a bi-directional relationship inferring network to model the relationship between linguistic and visual features. Liu et al. [35] proposed a model that first perceives all the entities in the image according to the entity and attribute words in the expression, then infers the location of the target object from the words that express relationships. Jing et al. [23] first obtain a position prior for the referred object based on the language and image, then generate the segmentation mask based on that position prior.…”
Section: (Inherited Hypernym)
Citation type: mentioning (confidence: 99%)
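The bi-directional relationship inferring of Hu et al. [22] can be approximated with two cross-attention passes, one in each direction, so that each modality is refined by the other before segmentation. The sketch below uses standard multi-head attention as a stand-in; the residual update and all dimensions are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """Hypothetical sketch: vision attends to language and language attends to vision."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.vis_from_lang = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.lang_from_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vis, lang):
        # vis: (B, H*W, dim) flattened pixel features; lang: (B, T, dim) word features.
        # Direction 1: each pixel gathers the word cues relevant to it.
        vis2, _ = self.vis_from_lang(query=vis, key=lang, value=lang)
        # Direction 2: each word gathers the pixel evidence relevant to it.
        lang2, _ = self.lang_from_vis(query=lang, key=vis, value=vis)
        # Residual refinement of both modalities.
        return vis + vis2, lang + lang2
```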