2020
DOI: 10.3390/s20174761
Indoor Scene Change Captioning Based on Multimodality Data

Abstract: This study proposes a framework for describing a scene change using natural language text based on indoor scene observations conducted before and after a scene change. The recognition of scene changes plays an essential role in a variety of real-world applications, such as scene anomaly detection. Most scene understanding research has focused on static scenes. Most existing scene change captioning methods detect scene changes from single-view RGB images, neglecting the underlying three-dimensional structures. …

Cited by 18 publications (5 citation statements)
References 52 publications
“…[19] tackle the change captioning problem in a 3D setting by assuming multi-view images are available both before and after the change. [20] further propose an end-to-end framework for describing scene changes from various input modalities, namely RGB images, depth images, and point cloud data. Recently, [22] proposed a task to explicitly localise changes in the form of 3D bounding boxes from two point clouds and to describe detailed scene changes for a fixed set of object classes.…”
Section: D
Citation type: mentioning
Confidence: 99%
“…Then the images before and after the change, as well as information capturing their differences, are input into the decoder. Qiu et al. [34,35] described changes based on multi-view image information. Hosseinzadeh and Wang [36] formulated a training scheme that uses an auxiliary task to improve the training of the change captioning network.…”
Section: Related Work
Citation type: mentioning
Confidence: 99%
“…Ref. [3] presents an end-to-end scene change understanding framework that observes the variation between two time points using different types of input data (i.e., depth images, RGB images, and point clouds). Meanwhile, Ref.…”
Section: Introduction
Citation type: mentioning
Confidence: 99%