2022
DOI: 10.1007/978-3-031-19818-2_7

Cost Aggregation with 4D Convolutional Swin Transformer for Few-Shot Segmentation

Cited by 43 publications (24 citation statements)
References 59 publications
“…The cosine similarity between each point in the query feature map and the support features is defined as:

$$CS(x_q, x_s) = \frac{\sum_{i=1}^{|F_s^{b,l,o}|} \mathrm{ReLU}\!\left( \dfrac{x_q^{T} \cdot x_{s,i}}{\left\| x_q \right\| \left\| x_{s,i} \right\|} \right)}{\left| F_s^{b,l,o} \right|}$$

where $x_q$ is the vector of a point in $F_q^{b,l}$, $x_{s,i}$ is the $i$-th vector in $x_s = F_s^{b,l,o}$, and $|F_s^{b,l,o}|$ is the number of elements in $F_s^{b,l,o}$. As done in [30, 46], we use the ReLU() function to make the network focus only on how similar the query point is to the specific class object in the support image, rather than on how they differ. CS() thus describes the similarity between each point in the query feature map and all feature points in the support set.…”
Section: Our Methods Overview (mentioning)
confidence: 99%
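Read as stated, CS() averages rectified cosine similarities between one query vector and all support foreground vectors. A minimal PyTorch sketch of that computation follows; the function name and shapes are illustrative assumptions (not the cited paper's code), with $F_s^{b,l,o}$ taken to be a stack of N foreground feature vectors:

```python
import torch
import torch.nn.functional as F

def cs_map(F_q, F_s_fg):
    """Sketch of the CS() term above: for every query position x_q,
    average the ReLU-rectified cosine similarity to all support
    foreground vectors x_{s,i}.
    F_q:     (C, Hq, Wq) query feature map F_q^{b,l}
    F_s_fg:  (N, C)      the N foreground vectors F_s^{b,l,o} (assumed layout)."""
    C, Hq, Wq = F_q.shape
    q = F.normalize(F_q.flatten(1).t(), dim=1)   # (Hq*Wq, C), unit-norm query vectors
    s = F.normalize(F_s_fg, dim=1)               # (N, C), unit-norm support vectors
    sim = torch.relu(q @ s.t())                  # (Hq*Wq, N) rectified cosine similarity
    return sim.mean(dim=1).view(Hq, Wq)          # average over the N support vectors
```

The ReLU before averaging matches the quoted rationale: negative cosines (dissimilarity) are zeroed so only positive evidence of the class contributes.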
“…Recently, refs. [30, 46] used 4D convolution to establish hyper-relations between multi-layer features, but 4D convolution has high space and time complexity. In this paper, we use multi-similarity to build a more robust semantic relationship between support and query images.…”
Section: Related Work (mentioning)
confidence: 99%
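For context on the complexity claim: a dense k×k×k×k 4D kernel carries k⁴ weights per channel pair, which is why 4D convolutions over cost volumes are usually factorized. The sketch below is an assumption-laden illustration following the separable form popularized by HSNet's center-pivot 4D convolution, not the cited papers' exact code; it replaces the dense kernel with the sum of two 2D convolutions (2·k² weights per channel pair):

```python
import torch
import torch.nn as nn

class Separable4DConv(nn.Module):
    """Factorized 4D convolution over a cost volume (B, C, Hq, Wq, Hs, Ws):
    one 2D conv over the query dims plus one 2D conv over the support dims,
    instead of a dense k^4 kernel."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.conv_q = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        self.conv_s = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)

    def forward(self, corr: torch.Tensor) -> torch.Tensor:
        B, C, Hq, Wq, Hs, Ws = corr.shape
        # Branch 1: convolve over (Hq, Wq); fold support positions into the batch.
        q = corr.permute(0, 4, 5, 1, 2, 3).reshape(B * Hs * Ws, C, Hq, Wq)
        q = self.conv_q(q)
        q = q.reshape(B, Hs, Ws, -1, Hq, Wq).permute(0, 3, 4, 5, 1, 2)
        # Branch 2: convolve over (Hs, Ws); fold query positions into the batch.
        s = corr.permute(0, 2, 3, 1, 4, 5).reshape(B * Hq * Wq, C, Hs, Ws)
        s = self.conv_s(s)
        s = s.reshape(B, Hq, Wq, -1, Hs, Ws).permute(0, 3, 1, 2, 4, 5)
        return q + s  # (B, out_ch, Hq, Wq, Hs, Ws)

if __name__ == "__main__":
    corr = torch.randn(2, 16, 13, 13, 13, 13)      # toy cost volume
    out = Separable4DConv(16, 32)(corr)
    print(out.shape)                               # torch.Size([2, 32, 13, 13, 13, 13])
```

Even factorized, the activations still scale with Hq·Wq·Hs·Ws, which is the space-complexity concern the quoted passage raises.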
“…Existing FSS approaches follow the metric-learning framework and include parameter-based, prototype-based, and hybrid methods. Parameter-based methods [47]-[51] compare the pairwise distance between query and support with a parametric model, such as a linear classifier [47], 4D convolution [50], or Gaussian processes [49]. CWT [47] designed a classifier weight transformer that tunes the transformer weights online with a linear classifier trained on the support set, which simplifies the meta-learning task.…”
Section: Few-shot Semantic Segmentation (mentioning)
confidence: 99%
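As an illustration of the parameter-based idea only (a hedged sketch; the function name, shapes, and optimizer settings are assumptions, and CWT's actual weight-transformer mechanism is more involved), one can fit a per-episode linear classifier on masked support features and apply it pixel-wise to the query:

```python
import torch
import torch.nn.functional as F

def support_trained_linear_classifier(feat_s, mask_s, feat_q, iters=100, lr=0.1):
    """Fit a per-episode linear classifier on support features with the
    given foreground mask, then apply it to query features.
    feat_s: (B, C, H, W)   support features
    mask_s: (B, 1, H, W)   foreground mask in {0, 1} (float)
    feat_q: (B, C, Hq, Wq) query features."""
    B, C, H, W = feat_s.shape
    _, _, Hq, Wq = feat_q.shape
    w = torch.zeros(B, 1, C, requires_grad=True)      # per-episode weights
    b = torch.zeros(B, 1, 1, requires_grad=True)      # per-episode bias
    opt = torch.optim.SGD([w, b], lr=lr)
    x = feat_s.flatten(2).transpose(1, 2)             # (B, H*W, C)
    y = mask_s.flatten(2).transpose(1, 2)             # (B, H*W, 1)
    for _ in range(iters):
        logits = torch.einsum('bnc,boc->bno', x, w) + b
        loss = F.binary_cross_entropy_with_logits(logits, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    xq = feat_q.flatten(2).transpose(1, 2)            # (B, Hq*Wq, C)
    pred = torch.einsum('bnc,boc->bno', xq, w) + b
    return pred.transpose(1, 2).reshape(B, 1, Hq, Wq).sigmoid()
```

The design point this illustrates: the comparison between query and support is carried by learned parameters (here w, b) rather than by a fixed prototype distance.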
“…Following a similar idea, HSNet [11] computes the pixel-wise correlation between support-query pairs and enhances the correlation matrix with 4D convolutional operations. VAT [12] extends this correlation-enhancement module from a 4D convolutional network to a 4D Swin Transformer [41]. Although pixel-wise correlation retains the most abundant category information, these approaches can incur unnecessary information loss because they ignore the support background.…”
Section: B. Few-shot Segmentation (mentioning)
confidence: 99%
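The correlation tensor that those 4D modules operate on can be written down compactly. Below is a minimal, assumption-laden sketch (illustrative names and shapes, not HSNet's or VAT's code) of a cosine cost volume between query features and foreground-masked support features:

```python
import torch
import torch.nn.functional as F

def pixelwise_correlation(feat_q, feat_s, mask_s):
    """Cosine similarity between every query position and every
    (foreground-masked) support position, the raw input that
    HSNet/VAT-style 4D modules then refine.
    feat_q: (B, C, Hq, Wq); feat_s: (B, C, Hs, Ws); mask_s: (B, 1, Hs, Ws)."""
    B, C, Hq, Wq = feat_q.shape
    Hs, Ws = feat_s.shape[-2:]
    q = F.normalize(feat_q.flatten(2), dim=1)              # (B, C, Hq*Wq)
    s = F.normalize((feat_s * mask_s).flatten(2), dim=1)   # zero out background
    corr = torch.einsum('bcq,bcs->bqs', q, s)              # (B, Hq*Wq, Hs*Ws)
    corr = corr.clamp(min=0)                               # keep only similarities
    return corr.view(B, Hq, Wq, Hs, Ws)
```

Note the masking step: zeroing background support positions is exactly what the quoted passage flags as a potential source of information loss.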
“…To address this issue, some recent works [11], [12] explore the pixel-wise correlations between the query images and the foreground support features, and have shown advantages over prototype-based approaches. However, these approaches ignore the backgrounds of the support images.…”
Section: Introduction (mentioning)
confidence: 99%