Kodai Nakashima scite author profile

Kodai Nakashima

5 Publications

22 Citation Statements Received

115 Citation Statements Given

Affiliations

National Institute of Advanced Industrial Science and Technology

Publications

Describing and Localizing Multiple Changes with Transformers

Qiu, Yamamoto, Nakashima et al. 2021

Change captioning tasks aim to detect changes in image pairs observed before and after a scene change and generate a natural language description of the changes. Existing change captioning studies have mainly focused on scenes with a single change. However, detecting and describing multiple changed parts in image pairs is essential for enhancing adaptability to complex scenarios. We address the above issues from three aspects: (i) we propose a CG-based multi-change captioning dataset; (ii) we benchmark existing state-of-the-art methods of single change captioning on multi-change captioning; (iii) we further propose Multi-Change Captioning transformers (MCCFormers) that identify change regions by densely correlating different regions in image pairs and dynamically determine the related change regions with words in sentences. The proposed method obtained the highest scores on four conventional change captioning evaluation metrics for multi-change captioning. In addition, existing methods generate a single attention map for multiple changes and lack the ability to distinguish change regions. In contrast, our proposed method can separate attention maps for each change and performs well with respect to change localization. Moreover, the proposed framework outperformed the previous state-of-the-art methods on an existing change captioning benchmark, CLEVR-Change, by a large margin (+6.1 on BLEU-4 and +9.7 on CIDEr scores), indicating its general ability in change captioning tasks. Our code and dataset will be publicly available through the project page.

[Figure: before and after images with change captions.] Caption 1: The large gray rubber sphere has disappeared. (delete) Caption 2: There is no longer a large cyan metal cube. (delete) Caption 3: The large brown metal sphere was moved from its original location. (move) Caption 4: The small yellow rubber cylinder was replaced by a small red rubber sphere. (replace)

Can Vision Transformers Learn without Natural Images?

Nakashima, Kataoka, Matsumoto et al. 2021. Preprint

Can we complete pre-training of Vision Transformers (ViT) without natural images and human-annotated labels? Although a pre-trained ViT seems to rely heavily on a large-scale dataset and human-annotated labels, recent large-scale datasets contain several problems in terms of privacy violations, inadequate fairness protection, and labor-intensive annotation. In the present paper, we pre-train ViT without any image collections and annotation labor. We experimentally verify that our proposed framework partially outperforms sophisticated Self-Supervised Learning (SSL) methods like SimCLRv2 and MoCov2 without using any natural images in the pre-training phase. Moreover, although the ViT pre-trained without natural images produces some different visualizations from the ImageNet pre-trained ViT, it can interpret natural image datasets to a large extent. For example, the performance rates on the CIFAR-10 dataset are as follows: our proposal 97.6 vs. SimCLRv2 97.4 vs. ImageNet 98.0. The codes, datasets, and pre-trained models will be publicly available at https://hirokatsukataoka16.github.io/Vision-Transformers-without-Natural-Images/.

Replacing Labeled Real-image Datasets with Auto-generated Contours

Kataoka, Hayamizu, Yamada et al. 2022

Can Vision Transformers Learn without Natural Images?

Nakashima, Kataoka, Matsumoto et al. 2022. AAAI

Is it possible to complete Vision Transformer (ViT) pre-training without natural images and human-annotated labels? This question has become increasingly relevant in recent months because, while current ViT pre-training tends to rely heavily on a large number of natural images and human-annotated labels, the recent use of natural images has resulted in problems related to privacy violation, inadequate fairness protection, and the need for labor-intensive annotations. In this paper, we experimentally verify that the results of the formula-driven supervised learning (FDSL) framework are comparable with, and can even partially outperform, sophisticated self-supervised learning (SSL) methods like SimCLRv2 and MoCov2 without using any natural images in the pre-training phase. We also consider ways to reorganize FractalDB generation based on our tentative conclusion that there is room for configuration improvements in the iterated function system (IFS) parameter settings of such databases. Moreover, we show that while ViTs pre-trained without natural images produce visualizations that are somewhat different from ImageNet pre-trained ViTs, they can still interpret natural image datasets to a large extent. Finally, in experiments using the CIFAR-10 dataset, we show that our model achieved a performance rate of 97.8, which is comparable to the rate of 97.4 achieved with SimCLRv2 and 98.0 achieved with ImageNet.

Joint Pedestrian Detection and Risk-level Prediction with Motion-Representation-by-Detection

Kataoka, Suzuki, Nakashima et al. 2020
