Gi-Cheon Kang scite author profile

Visual dialog (VisDial) is a task which requires a dialog agent to answer a series of questions grounded in an image. Unlike in visual question answering (VQA), the series of questions should be able to capture a temporal context from a dialog history and utilize visually-grounded information. Visual reference resolution is a problem that addresses these challenges, requiring the agent to resolve ambiguous references in a given question and to find the references in a given image. In this paper, we propose Dual Attention Networks (DAN) for visual reference resolution in VisDial. DAN consists of two kinds of attention modules, REFER and FIND. Specifically, the REFER module learns latent relationships between a given question and a dialog history by employing a multi-head attention mechanism. The FIND module takes image features and reference-aware representations (i.e., the output of the REFER module) as input, and performs visual grounding via a bottom-up attention mechanism. We qualitatively and quantitatively evaluate our model on the VisDial v1.0 and v0.9 datasets, showing that DAN outperforms the previous state-of-the-art model by a significant margin.


Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering

Seo, Kang, Park et al. 2021


Video Question Answering is a task which requires an AI agent to answer questions grounded in video. This task entails three key challenges: (1) understanding the intention of various questions, (2) capturing various elements of the input video (e.g., object, action, causality), and (3) cross-modal grounding between language and vision information. We propose Motion-Appearance Synergistic Networks (MASN), which embed two cross-modal features grounded on motion and appearance information and selectively utilize them depending on the question's intentions. MASN consists of a motion module, an appearance module, and a motion-appearance fusion module. The motion module computes the action-oriented cross-modal joint representations, while the appearance module focuses on the appearance aspect of the input video. Finally, the motion-appearance fusion module takes each output of the motion module and the appearance module as input, and performs question-guided fusion. As a result, MASN achieves new state-of-the-art performance on the TGIF-QA and MSVD-QA datasets. We also conduct qualitative analysis by visualizing the inference results of MASN. The code is available at https://github.com/ahjeongseo/MASN-pytorch.


Dual Attention Networks for Visual Reference Resolution in Visual Dialog

Kang, Lim, Zhang 2019

Preprint


Label Propagation Adaptive Resonance Theory for Semi-Supervised Continuous Learning

Kim, Hwang, Kang et al. 2020


Semi-supervised learning and continuous learning are fundamental paradigms for human-level intelligence. To deal with real-world problems where labels are rarely given and the opportunity to access the same data is limited, it is necessary to apply these two paradigms jointly. In this paper, we propose Label Propagation Adaptive Resonance Theory (LPART) for semi-supervised continuous learning. LPART uses an online label propagation mechanism to perform classification and gradually improves its accuracy as the observed data accumulates. We evaluated the proposed model on visual (MNIST, SVHN, CIFAR-10) and audio (NSynth) datasets by adjusting the ratio of labeled to unlabeled data. The accuracies are much higher when both labeled and unlabeled data are used, demonstrating the significant advantage of LPART in environments where data labels are scarce.


Reasoning Visual Dialog with Sparse Graph Learning and Knowledge Transfer

Kang, Park, Lee et al. 2021


Visual dialog is a task of answering a sequence of questions grounded in an image using the previous dialog history as context. In this paper, we study how to address two fundamental challenges for this task: (1) reasoning over underlying semantic structures among dialog rounds and (2) identifying several appropriate answers to the given question. To address these challenges, we propose a Sparse Graph Learning (SGL) method to formulate visual dialog as a graph structure learning task. SGL infers inherently sparse dialog structures by incorporating binary and score edges and leveraging a new structural loss function. Next, we introduce a Knowledge Transfer (KT) method that extracts the answer predictions from the teacher model and uses them as pseudo labels. We propose KT to remedy the shortcomings of single ground-truth labels, which severely limit the ability of a model to obtain multiple reasonable answers. As a result, our proposed model significantly improves reasoning capability compared to baseline methods and outperforms the state-of-the-art approaches on the VisDial v1.0 dataset. The source code is available at https://github.com/gicheonkang/SGLKT-VisDial.
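The pseudo-label idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how teacher predictions are turned into labels, so the softmax conversion, the cross-entropy loss, and all scores and names below are assumptions chosen for illustration only.

```python
import math

def softmax(scores):
    """Convert raw scores into a probability distribution."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(student_scores, targets):
    """Cross-entropy of the student's distribution against (soft) targets."""
    probs = softmax(student_scores)
    return -sum(t * math.log(p) for t, p in zip(targets, probs))

# Hypothetical teacher scores over four candidate answers to one question.
teacher_scores = [2.0, 1.8, -1.0, -3.0]

# Assumption: the teacher's predictions become soft pseudo labels via softmax,
# so two plausible answers (indices 0 and 1) both receive substantial weight.
pseudo_labels = softmax(teacher_scores)

# A single ground-truth label keeps only one annotated answer (index 0 here).
one_hot = [1.0, 0.0, 0.0, 0.0]

# Hypothetical student scores; the student rates answers 0 and 1 equally.
student_scores = [1.0, 1.0, -2.0, -2.0]
loss_soft = soft_cross_entropy(student_scores, pseudo_labels)
loss_hard = soft_cross_entropy(student_scores, one_hot)
```

With the one-hot target, any probability the student assigns to the second plausible answer is treated as an error; with the soft pseudo labels, the target itself spreads weight across both plausible answers.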


Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering

Seo, Kang, Park et al. 2021

Preprint


Improving Robustness to Texture Bias via Shape-focused Augmentation

Lee, Hwang, Kang et al. 2022


Development of Indoor Guide Robot based on RF Sensor for the Visually Impaired

Kim, Kang, Park et al. 2023

JKIIS

