2017
DOI: 10.1109/tip.2017.2746267
Unifying the Video and Question Attentions for Open-Ended Video Question Answering

Abstract: Video question answering is an important task toward scene understanding and visual data retrieval. However, current visual question answering works mainly focus on a single static image, which is distinct from the dynamic and sequential visual data in the real world. Their approaches cannot utilize the temporal information in videos. In this paper, we introduce the task of free-form open-ended video question answering. The open-ended answers enable wider applications compared with the common multiple-choice t…


Cited by 57 publications (22 citation statements)
References 17 publications
“…Lee et al [22] propose the Stacked Cross Attention Network (SCAN), which discovers cross-modal alignments via a fine-grained attention scheme over regions in an image and words in a sentence. Beyond fundamental image-text matching, there are more emerging and attractive applications related to visual-semantic embedding, such as image captioning [42], [33], [27], [2] and visual question answering [3], [28], [41], [38], [44]. Anderson et al […] Unlike them, the fine-grained problem is the major difficulty in distinguishing different people in description-based person Re-id, which needs to be carefully addressed.…”
Section: Related Work a Visual-semantic Embeddingmentioning
confidence: 99%
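The fine-grained region-word attention attributed to SCAN above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the feature dimensions, the cosine-normalised similarity, and the softmax temperature are all illustrative assumptions.

```python
import numpy as np

def cross_attention(regions, words, temperature=4.0):
    """Attend each word over all image regions (SCAN-style sketch).

    regions: (R, d) region features; words: (W, d) word features.
    Returns a (W, d) region context per word and the (W, R) attention map.
    """
    # Cosine-normalise both modalities before computing similarities.
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    w = words / np.linalg.norm(words, axis=1, keepdims=True)
    sim = w @ r.T                          # (W, R) word-region similarities
    att = np.exp(temperature * sim)
    att /= att.sum(axis=1, keepdims=True)  # softmax over regions, per word
    return att @ regions, att              # weighted region context per word

rng = np.random.default_rng(0)
ctx, att = cross_attention(rng.normal(size=(36, 8)), rng.normal(size=(5, 8)))
```

Each word thus receives its own weighted summary of the image regions, which is what makes the alignment "fine-grained" rather than a single global image-sentence score.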
“…Xue et al [83] also created a new dataset using the TGIF video captioning dataset, but theirs is designed to capture open-ended question answers.…”
Section: ) Encoder-decoder Based Methodsmentioning
confidence: 99%
“…The MSVD-QA dataset [83] is based on the MSVD video captioning dataset and utilizes the video captions to automatically generate questions of the type "what, who, how, when and where".…”
Section: Datasetmentioning
confidence: 99%
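The caption-to-question idea behind MSVD-QA can be illustrated with a toy template rule. This is a hypothetical sketch only: the function name, the subject list, and the single "is"-based template are my assumptions; the actual dataset was generated with a far more capable automatic QA-generation tool.

```python
def caption_to_question(caption):
    """Toy rule: split a caption on its copula and build a wh-question.

    Returns (question, answer) or None if the template does not apply.
    """
    subject, _, rest = caption.partition(" is ")
    if not rest:
        return None  # template only handles "<subject> is <predicate>"
    predicate = rest.rstrip(".")
    # Person-like subjects yield "who" questions; everything else "what".
    if subject.lower() in {"a man", "a woman", "someone", "a person"}:
        return "who is " + predicate + "?", subject
    return "what is " + predicate + "?", subject

q, a = caption_to_question("a man is playing a guitar.")
# q == "who is playing a guitar?", a == "a man"
```

The caption's subject becomes the ground-truth answer, which is why such datasets can be built without manual annotation.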
“…Generally, successes and advances in video/image captioning and attention mechanisms provide new research directions for the VideoQA task. An encoder-decoder based approach is proposed in [11], where the attentions are unified by considering both the question sentence and the video. Frame-based visual attributes and question-sentence-based textual attributes are jointly learned in the approach proposed in [12].…”
Section: Related Workmentioning
confidence: 99%
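The question-conditioned video attention described above can be sketched in a few lines. Assumptions are mine throughout: the frame/question feature dimensions, the dot-product scoring, and the scaling factor are illustrative, not the encoder-decoder design of [11].

```python
import numpy as np

def question_guided_frame_attention(frames, question_vec):
    """Attend over video frames conditioned on a question encoding.

    frames: (T, d) per-frame features; question_vec: (d,) question encoding.
    Returns a (d,) question-aware video summary and the (T,) weights.
    """
    # Scaled dot-product scores between each frame and the question.
    scores = frames @ question_vec / np.sqrt(frames.shape[1])
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()                 # normalise over frames
    return weights @ frames, weights         # question-aware video summary

rng = np.random.default_rng(1)
summary, w = question_guided_frame_attention(
    rng.normal(size=(20, 16)), rng.normal(size=16))
```

Because the weights depend on the question, different questions about the same clip pick out different frames — the temporal selectivity that single-image VQA models lack.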