2021
DOI: 10.1016/j.patcog.2021.107956
Dual self-attention with co-attention networks for visual question answering

Cited by 50 publications (12 citation statements)
References 8 publications
“…On the other hand, research has demonstrated that attentional mechanisms are not entirely reliable and can even have counterproductive effects [39]. When making inferences, neural networks tend to incorporate target‐related contextual information as an integral part of the target itself.…”
Section: Analysis and Discussion (mentioning)
confidence: 99%
“…[PA23] offer an extensive review of efficient vision transformers. Through the advancement of effective token mixing strategies and efficient MLP layers, vision transformers can be significantly accelerated [LWZ*22, GHW*22, YPL*22]. For example, both CMT [GHW*22] and WaveViT [YPL*22] outperform EfficientNet [TL19] while maintaining a lower computational complexity.…”
Section: Limitations and Future Work (mentioning)
confidence: 99%
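The excerpt above points to token-mixing strategies and efficient MLP layers as a route to faster vision transformers. As an illustration only (not the CMT or WaveViT designs from the cited works), the minimal PyTorch sketch below shows a generic MLP-style token-mixing block of the kind such approaches build on; all layer sizes and names are hypothetical assumptions.

```python
import torch
import torch.nn as nn

class TokenMixingBlock(nn.Module):
    """Generic sketch of an MLP-style token-mixing block (illustrative only)."""
    def __init__(self, num_tokens, dim, hidden):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Token mixing: an MLP applied across the token axis.
        self.token_mlp = nn.Sequential(
            nn.Linear(num_tokens, hidden), nn.GELU(), nn.Linear(hidden, num_tokens)
        )
        self.norm2 = nn.LayerNorm(dim)
        # Channel MLP: applied per token across feature channels.
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, x):                          # x: (B, num_tokens, dim)
        y = self.norm1(x).transpose(1, 2)          # (B, dim, num_tokens)
        x = x + self.token_mlp(y).transpose(1, 2)  # mix information across tokens
        x = x + self.channel_mlp(self.norm2(x))    # mix information across channels
        return x
```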
“…In recent years, convolutional neural networks (CNNs) have been used to extract image features. In order to obtain representative and targeted image features, attention mechanisms are used to highlight important image regions related to the corresponding issues [5,6]. To obtain more accurate image feature maps, a stacked attention network (SAN) is proposed in which the output of the first attention is used as the query for the second attention [7].…”
Section: Related Work (mentioning)
confidence: 99%
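The excerpt above summarizes the stacked attention network (SAN) idea, in which the output of the first attention hop serves as the query for the second. The following is a minimal PyTorch sketch of that two-hop pattern, assuming the question vector and the image-region features share one dimension; layer names and sizes are illustrative assumptions, not the published SAN implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHopStackedAttention(nn.Module):
    """Sketch of two stacked attention hops over image regions.

    The refined query u1 produced by the first hop is reused as the
    query for the second hop, as described in the excerpt above.
    """
    def __init__(self, dim, hidden):
        super().__init__()
        self.hops = nn.ModuleList(
            nn.ModuleDict({
                "q": nn.Linear(dim, hidden),
                "v": nn.Linear(dim, hidden),
                "score": nn.Linear(hidden, 1),
            })
            for _ in range(2)
        )

    def _attend(self, hop, query, regions):
        # query: (B, dim), regions: (B, N, dim)
        h = torch.tanh(hop["v"](regions) + hop["q"](query).unsqueeze(1))  # (B, N, hidden)
        attn = F.softmax(hop["score"](h).squeeze(-1), dim=1)              # (B, N)
        return torch.bmm(attn.unsqueeze(1), regions).squeeze(1)           # (B, dim)

    def forward(self, question, regions):
        u1 = question + self._attend(self.hops[0], question, regions)  # first hop
        u2 = u1 + self._attend(self.hops[1], u1, regions)              # u1 is the new query
        return u2

# Usage sketch: a batch of 8 questions attending over 36 region features of dimension 512.
# model = TwoHopStackedAttention(dim=512, hidden=256)
# out = model(torch.randn(8, 512), torch.randn(8, 36, 512))
```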