Yutaka Satoh scite author profile

The purpose of this study is to determine whether current video datasets have sufficient data for training very deep convolutional neural networks (CNNs) with spatio-temporal three-dimensional (3D) kernels. Recently, the performance levels of 3D CNNs in the field of action recognition have improved significantly. However, to date, conventional research has only explored relatively shallow 3D architectures. We examine the architectures of various 3D CNNs, from relatively shallow to very deep ones, on current video datasets. Based on the results of those experiments, the following conclusions could be obtained: (i) Training resulted in significant overfitting for UCF-101, HMDB-51, and ActivityNet but not for Kinetics. (ii) The Kinetics dataset has sufficient data for training deep 3D CNNs and enables training of ResNets with up to 152 layers, interestingly similar to 2D ResNets on ImageNet. ResNeXt-101 achieved 78.4% average accuracy on the Kinetics test set. (iii) Kinetics-pretrained simple 3D architectures outperform complex 2D architectures, and the pretrained ResNeXt-101 achieved 94.5% and 70.2% on UCF-101 and HMDB-51, respectively. The use of 2D CNNs trained on ImageNet has produced significant progress in various image tasks. We believe that using deep 3D CNNs together with Kinetics will retrace the successful history of 2D CNNs and ImageNet, and stimulate advances in computer vision for videos. The code and pretrained models used in this study are publicly available.

Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition

Hara 2017

Convolutional neural networks with spatio-temporal 3D kernels (3D CNNs) can directly extract spatio-temporal features from videos for action recognition. Although 3D kernels tend to overfit because of their large number of parameters, 3D CNNs are greatly improved by using recent huge video databases. However, the architectures of 3D CNNs remain relatively shallow compared with the very deep neural networks that have succeeded among 2D CNNs, such as residual networks (ResNets). In this paper, we propose 3D CNNs based on ResNets toward a better action representation. We describe the training procedure of our 3D ResNets in detail. We experimentally evaluate the 3D ResNets on the ActivityNet and Kinetics datasets. The 3D ResNets trained on Kinetics did not suffer from overfitting despite the large number of parameters of the model, and achieved better performance than relatively shallow networks, such as C3D. Our code and pretrained models (e.g. Kinetics and ActivityNet) are publicly available.
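
The overfitting concern in this abstract comes down to weight counts. As a rough illustration that is not taken from the paper, the sketch below counts the weights of a single convolutional layer with 2D and with 3D kernels. The channel width of 64, the 3-wide kernels, and the helper name `conv_weight_count` are assumptions chosen for the example.

```python
from math import prod


def conv_weight_count(in_channels, out_channels, kernel_size):
    """Weights of one conv layer: one kernel per (in, out) channel pair, plus one bias per output channel."""
    return in_channels * out_channels * prod(kernel_size) + out_channels


# Same channel widths; only the kernel gains a temporal dimension.
params_2d = conv_weight_count(64, 64, (3, 3))      # spatial kernel
params_3d = conv_weight_count(64, 64, (3, 3, 3))   # spatio-temporal kernel

print(params_2d)   # 36928
print(params_3d)   # 110656
```

Under these assumptions the 3D layer carries roughly three times the weights of its 2D counterpart, which is one way to see why a larger video dataset helps.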

Anticipating Traffic Accidents with Adaptive Loss and Large-Scale Incident DB

Suzuki, Kataoka, Aoki et al. 2018

Pre-Training Without Natural Images

et al. 2022

Is it possible to use convolutional neural networks pre-trained without any natural images to assist natural image understanding? The paper proposes a novel concept, Formula-driven Supervised Learning (FDSL). We automatically generate image patterns and their category labels by assigning fractals, which are based on a natural law. Theoretically, the use of automatically generated images instead of natural images in the pre-training phase allows us to generate an infinitely large dataset of labeled images. The proposed framework is similar to, yet different from, Self-Supervised Learning because the FDSL framework enables the creation of image patterns based on any mathematical formulas in addition to self-generated labels. Further, unlike pre-training with a synthetic image dataset, a dataset under the framework of FDSL is not required to define object categories, surface texture, lighting conditions, and camera viewpoint. In the experimental section, we find a better dataset configuration through an exploratory study, e.g., increasing #category/#instance, patch rendering, image coloring, and the number of training epochs. Although models pre-trained with the proposed Fractal DataBase (FractalDB), a database without natural images, do not necessarily outperform models pre-trained with human-annotated datasets in all settings, we are able to partially surpass the accuracy of ImageNet/Places pre-trained models. The FractalDB pre-trained CNN also outperforms other pre-trained models on auto-generated datasets based on FDSL such as Bezier curves and Perlin noise. This is reasonable since natural objects and scenes existing around us are constructed according to fractal geometry. Image representation with the proposed FractalDB captures a unique feature in the visualization of convolutional layers and attentions.
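
The idea of generating images and labels from a formula can be sketched in a few lines. The example below assumes an iterated function system (IFS) rendered with the chaos game, which the abstract itself does not spell out; the coefficient ranges, image size, and the function names `make_ifs`, `render_fractal`, and `make_dataset` are hypothetical choices for illustration, not the FractalDB implementation.

```python
import random


def make_ifs(rng, n_maps=3):
    """Draw random affine maps; small linear coefficients keep every map contractive."""
    maps = []
    for _ in range(n_maps):
        a, b, c, d = (rng.uniform(-0.4, 0.4) for _ in range(4))  # linear part
        e, f = (rng.uniform(-0.3, 0.3) for _ in range(2))        # translation
        maps.append((a, b, c, d, e, f))
    return maps


def render_fractal(ifs, rng, n_points=2000, size=32):
    """Chaos game: apply a randomly chosen map at each step and mark the visited pixel."""
    image = [[0] * size for _ in range(size)]
    x, y = 0.0, 0.0
    for _ in range(n_points):
        a, b, c, d, e, f = rng.choice(ifs)
        x, y = a * x + b * y + e, c * x + d * y + f
        # With the ranges above the orbit stays within [-1.5, 1.5]; clamp anyway.
        col = min(size - 1, max(0, int((x + 1.5) / 3.0 * size)))
        row = min(size - 1, max(0, int((y + 1.5) / 3.0 * size)))
        image[row][col] = 1
    return image


def make_dataset(n_categories=4, instances_per_category=2, seed=0):
    """Each category is one parameter set; its index serves as the label, so no human annotation is needed."""
    rng = random.Random(seed)
    dataset = []
    for label in range(n_categories):
        ifs = make_ifs(rng)
        for _ in range(instances_per_category):
            dataset.append((render_fractal(ifs, rng), label))
    return dataset


dataset = make_dataset()
print(len(dataset), [label for _, label in dataset])
```

Because the label is just the index of the parameter set that produced the image, the dataset can be made as large as desired by drawing more parameter sets.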

Robust object detection using a Radial Reach Filter (RRF)

Satoh, Kaneko, Niwa et al. 2004. Systems & Computers in Japan

In this paper the authors report on a new algorithm used to separate an object from its background using a background image. In the past, simple background subtraction has been used because of its low processing costs and ease of implementation. However, because this method depends solely on brightness patterns in the object and shadows, it has problems such as an inability to deal with poor lighting conditions and an inability to detect regions in which the brightness levels of the object and shadows are similar. In order to resolve these problems, in this paper the authors propose a new filter process called a Radial Reach Filter (RRF). The authors define a new statistic called a Radial Reach Correlation (RRC) used to determine on a pixel‐by‐pixel basis the similar and dissimilar areas between a background image and a current scene. They then evaluate the local texture at pixel‐level resolution while reducing the effects of variations in lighting. In addition, by introducing a mechanism to adjust the defined region adaptively based on local characteristics of the background image, the authors are able to work with various shadows and objects in the scene. The authors perform a theoretical evaluation and experiments using real images to demonstrate the validity of their proposed method. © 2004 Wiley Periodicals, Inc. Syst Comp Jpn, 35(10): 63–73, 2004; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/scj.10590
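
A simplified reading of the pixel-wise comparison described in this abstract can be sketched as follows. This is an assumption-laden illustration, not the authors' exact definition: it assumes eight radial directions, a reach point defined as the first pixel whose background brightness differs from the centre by at least `threshold`, and a score that counts how many centre-to-reach-point brightness orderings agree between the two images. The parameter names `threshold` and `max_reach` and both function names are hypothetical.

```python
# Eight radial directions as (dx, dy) steps.
DIRECTIONS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]


def reach_points(background, x, y, threshold, max_reach):
    """Per direction, walk outward until the background brightness differs enough from the centre."""
    height, width = len(background), len(background[0])
    points = []
    for dx, dy in DIRECTIONS:
        found = None
        for r in range(1, max_reach + 1):
            nx, ny = x + r * dx, y + r * dy
            if not (0 <= nx < width and 0 <= ny < height):
                break
            if abs(background[ny][nx] - background[y][x]) >= threshold:
                found = (nx, ny)
                break
        points.append(found)  # None when no reach point exists in this direction
    return points


def radial_reach_correlation(background, current, x, y, threshold=10, max_reach=8):
    """Count reach pairs whose brightness ordering is the same in both images."""
    matches = 0
    for point in reach_points(background, x, y, threshold, max_reach):
        if point is None:
            continue  # assumption: directions without a reach point are skipped
        qx, qy = point
        sign_bg = background[qy][qx] >= background[y][x]
        sign_cur = current[qy][qx] >= current[y][x]
        matches += sign_bg == sign_cur
    return matches


background = [[20 * (x + y) for x in range(5)] for y in range(5)]
darker = [[v // 2 for v in row] for row in background]      # global lighting change
inverted = [[255 - v for v in row] for row in background]   # different local texture

print(radial_reach_correlation(background, darker, 2, 2))    # 6
print(radial_reach_correlation(background, inverted, 2, 2))  # 0
```

In this toy case a uniform darkening leaves every ordering intact, while a change of local texture breaks all of them, which is the lighting robustness the abstract describes. A pixel would then be labelled background when the count exceeds some chosen cutoff.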

Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?