2021
DOI: 10.2298/csis200515038g

Double-layer affective visual question answering network

Abstract: Visual Question Answering (VQA) has attracted much attention recently in both the natural language processing and computer vision communities, as it offers insight into the relationships between two relevant sources of information. Tremendous advances have been made in the field of VQA due to the success of deep learning. Building on these advances and improvements, the Affective Visual Question Answering Network (AVQAN) enriches the understanding and analysis of VQA models by making use of the emotional info…

Cited by 4 publications (2 citation statements)
References 18 publications
“…In contrast to single-modal tasks, multimodal tasks require not only extracting and understanding information from each single modality but also combining information from two different modalities for reasoning. Although this is challenging, researchers have already tackled many multimodal tasks, for instance image-text matching [1], [2], image captioning [3], [4], and VQA [5], [6], [26]. As a typical representative of multimodal tasks, VQA requires understanding both visual information and question information and, what's more, combining the two to reason about the answer.…”
Section: Introduction
Mentioning (confidence: 99%)
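
The fusion the citing passage describes, combining a visual representation and a question representation to reason about an answer, can be illustrated with a minimal sketch. This is not the AVQAN architecture from the paper; the feature dimensions, the element-wise-product fusion, and the fixed answer vocabulary below are illustrative assumptions only.

# Minimal late-fusion VQA sketch (illustrative; not the paper's AVQAN model).
import torch
import torch.nn as nn

class LateFusionVQA(nn.Module):
    def __init__(self, img_dim=2048, q_dim=768, hidden=512, n_answers=1000):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)      # project pooled image features
        self.q_proj = nn.Linear(q_dim, hidden)          # project pooled question features
        self.classifier = nn.Linear(hidden, n_answers)  # scores over a fixed answer vocabulary

    def forward(self, img_feat, q_feat):
        # Fuse the two modalities with an element-wise product, a common VQA baseline.
        fused = torch.tanh(self.img_proj(img_feat)) * torch.tanh(self.q_proj(q_feat))
        return self.classifier(fused)

model = LateFusionVQA()
img_feat = torch.randn(1, 2048)   # stand-in for CNN image features
q_feat = torch.randn(1, 768)      # stand-in for a question embedding
logits = model(img_feat, q_feat)  # shape (1, 1000): one score per candidate answer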
“…The result is usually represented by a grayscale image, in which the grayscale value of each pixel indicates the probability that the pixel belongs to a salient object. Salient object detection has become an important preprocessing step in many computer vision applications, including image and video compression [2], image relocation [3], video tracking [4], robot navigation [5], etc.…”
Section: Introduction
Mentioning (confidence: 99%)
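
To make the preprocessing step in that passage concrete: a predicted saliency map is a grayscale probability image, and downstream applications often binarize it before use. The fixed 0.5 threshold and the random stand-in map below are illustrative assumptions, not a method from the cited work.

# Illustrative only: binarizing a grayscale saliency map for downstream use.
import numpy as np

saliency = np.random.rand(64, 64)         # stand-in for a predicted map; values in [0, 1]
mask = (saliency > 0.5).astype(np.uint8)  # fixed threshold; adaptive thresholds are also common
print(int(mask.sum()), "pixels marked as salient")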