Quanfu Fan scite author profile

A Unified Multi-scale Deep Convolutional Neural Network for Fast Object Detection

Abstract. A unified deep neural network, denoted the multi-scale CNN (MS-CNN), is proposed for fast multi-scale object detection. The MS-CNN consists of a proposal sub-network and a detection sub-network. In the proposal sub-network, detection is performed at multiple output layers, so that receptive fields match objects of different scales. These complementary scale-specific detectors are combined to produce a strong multi-scale object detector. The unified network is learned end-to-end, by optimizing a multi-task loss. Feature upsampling by deconvolution is also explored, as an alternative to input upsampling, to reduce the memory and computation costs. State-of-the-art object detection performance, at up to 15 fps, is reported on datasets, such as KITTI and Caltech, containing a substantial number of small objects.

Moments in Time Dataset: One Million Videos for Event Understanding

Monfort, Vondrick, Oliva, et al. 2020. IEEE Trans. Pattern Anal. Mach. Intell.

384

346

We present the Moments in Time Dataset, a large-scale human-annotated collection of one million short videos corresponding to dynamic events unfolding within three seconds. Modeling the spatial-audio-temporal dynamics even for actions occurring in 3 second videos poses many challenges: meaningful events do not include only people, but also objects, animals, and natural phenomena; visual and auditory events can be symmetrical in time ("opening" is "closing" in reverse), and either transient or sustained. We describe the annotation process of our dataset (each video is tagged with one action or activity label among 339 different classes), analyze its scale and diversity in comparison to other large-scale video datasets for action recognition, and report results of several baseline models addressing separately, and jointly, three modalities: spatial, temporal and auditory. The Moments in Time dataset, designed to have a large coverage and diversity of events in both visual and auditory modalities, can serve as a new challenge to develop models that scale to the level of complexity and abstract reasoning that a human processes on a daily basis.

Adversarial T-Shirt! Evading Person Detectors in a Physical World

et al. 2020

A closer look at Faster R-CNN for vehicle detection

2016

Curve Matching, Time Warping, and Light Fields: New Algorithms for Computing Similarity between Curves

2006

CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

Chen, Fan, Panda. 2021. Preprint.

The recently developed vision transformer (ViT) has achieved promising results on image classification compared to convolutional neural networks. Inspired by this, in this paper, we study how to learn multi-scale feature representations in transformer models for image classification. To this end, we propose a dual-branch transformer to combine image patches (i.e., tokens in a transformer) of different sizes to produce stronger image features. Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity and these tokens are then fused purely by attention multiple times to complement each other. Furthermore, to reduce computation, we develop a simple yet effective token fusion module based on cross attention, which uses a single token for each branch as a query to exchange information with other branches. Our proposed cross-attention only requires linear time for both computational and memory complexity instead of quadratic time otherwise. Extensive experiments demonstrate that the proposed approach performs better than or on par with several concurrent works on vision transformer, in addition to efficient CNN models. For example, on the ImageNet1K dataset, with some architectural changes, our approach outperforms the recent DeiT by a large margin of 2% with a small to moderate increase in FLOPs and model parameters. Our source codes and models will be publicly available.
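The single-query idea in the abstract above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `single_query_cross_attention`, the single attention head, the shared token dimension across branches, and the residual connection are all assumptions made for illustration. The point it shows is that with one query token and n key/value tokens from the other branch, the attention map has shape (1, n), so its cost grows linearly in n rather than quadratically.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def single_query_cross_attention(query_token, other_tokens, w_q, w_k, w_v):
    """Cross-attention with ONE query token (hypothetical helper).

    query_token:  (1, d) token from one branch, used as the only query.
    other_tokens: (n, d) tokens from the other branch (keys and values).
    w_q, w_k, w_v: (d, d) projection matrices.

    Returns the updated query token (1, d) and the attention weights (1, n).
    """
    q = query_token @ w_q            # (1, d)
    k = other_tokens @ w_k           # (n, d)
    v = other_tokens @ w_v           # (n, d)
    # One row of scores: O(n) work and memory instead of O(n^2).
    scores = (q @ k.T) / np.sqrt(q.shape[-1])   # (1, n)
    attn = softmax(scores, axis=-1)             # (1, n), rows sum to 1
    # Residual connection is an assumption of this sketch.
    updated = query_token + attn @ v            # (1, d)
    return updated, attn


rng = np.random.default_rng(0)
d, n = 8, 16
query_token = rng.normal(size=(1, d))
other_tokens = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

updated, attn = single_query_cross_attention(
    query_token, other_tokens, w_q, w_k, w_v
)
print(updated.shape, attn.shape)
```

In a full self-attention layer over the same n tokens the score matrix would be (n, n); restricting the query to a single token is what keeps this step linear in the number of tokens.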

Temporal Sequence Modeling for Video Event Detection

Cheng, Fan, Pankanti, et al. 2014.
