Exploiting Image-trained CNN Architectures for Unconstrained Video Classification

Zha, Shengxin; Luisier, Florian; Andrews, W.D.; Srivastava, Nitish; Salakhutdinov, Ruslan

doi:10.48550/arxiv.1503.04144

Cited by 29 publications

(29 citation statements)

References 26 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…While the two-stream model also has the advantage of being trained specifically on a video dataset, we observe that the learned representations do not transfer favorably to the MED11 dataset in contrast to fc7 and fc6 features trained on ImageNet. A similar observation was made in [38,41], where simple CNN features trained from ImageNet consistently provided the best results.…”

Section: Event Retrievalsupporting

confidence: 73%

“…Deep network features learned from spatial data [8,12,30] and temporal flow [30] have also shown comparable results. However, recent works in complex event recognition [38,41] have shown that spatial Convolutional Neural Network (CNN) features learned from ImageNet [2] without fine-tuning on video, accompanied by suitable pooling and encoding strategies achieves state-of-the-art performance. In contrast to these methods which either propose handcrafted features or learn feature representations with a fully supervised objective from images or videos, we try to learn an embedding in an unsupervised fashion.…”

Section: Related Workmentioning

confidence: 99%

“…In practice, the embedding function can be a CNN built from the frame pixels, or any underlying image or video representation. However, following the recent success of ImageNet trained CNN features for complex event videos [38,41], we choose to learn an embedding on top of the fully connected fc6 layer feature representation obtained by passing the frame through a standard CNN [16] architecture. We use a simple model with a fully connected layer followed by a rectified linear unit (ReLU) and local response normalization (LRN) layer, with dropout regularization.…”

Section: Embedding Functionmentioning

confidence: 99%

See 2 more Smart Citations

Learning Temporal Embeddings for Complex Video Analysis

Ramanathan

Tang

Mori

et al. 2015

2015 IEEE International Conference on Computer Vision (ICCV)

View full text Add to dashboard Cite

In this paper, we propose to learn temporal embeddings of video frames for complex video analysis. Large quantities of unlabeled video data can be easily obtained from the Internet. These videos possess the implicit weak label that they are sequences of temporally and semantically coherent images. We leverage this information to learn temporal embeddings for video frames by associating frames with the temporal context that they appear in. To do this, we propose a scheme for incorporating temporal context based on past and future frames in videos, and compare this to other contextual representations. In addition, we show how data augmentation using multi-resolution samples and hard negatives helps to significantly improve the quality of the learned embeddings. We evaluate various design decisions for learning temporal embeddings, and show that our embeddings can improve performance for multiple video tasks such as retrieval, classification, and temporal order recovery in unconstrained Internet video.

show abstract

Section: Event Retrievalsupporting

confidence: 73%

Section: Related Workmentioning

confidence: 99%

Section: Embedding Functionmentioning

confidence: 99%

See 1 more Smart Citation

Learning Temporal Embeddings for Complex Video Analysis

Ramanathan

Tang

Mori

et al. 2015

2015 IEEE International Conference on Computer Vision (ICCV)

View full text Add to dashboard Cite

show abstract

“…In conjunction with other types of lay- These recent advancements in machine learning have led to "Deep learning", an extension dealing with deeper neural networks (DNNs), particularly Deep CNNs. For these models, image classification remains one of the most popular and robust tasks [35,15], tasking the DNN with recognizing patterns in computer vision and speech. Classification of images stands frequently as a benchmark for newly developed architectures and data augmentation methods [32,26,12,7].…”

Section: Introductionmentioning

confidence: 99%

Architectural Resilience to Foreground-and-Background Adversarial Noise

Cheng,

2020

Preprint

View full text Add to dashboard Cite

Adversarial attacks in the form of imperceptible perturbations of normal images have been extensively studied, and for every new defense methodology created, multiple adversarial attacks are found to counteract it. In particular, a popular style of attack, exemplified in recent years by DeepFool and Carlini-Wagner, relies solely on white-box scenarios in which full access to the predictive model and its weights are required. In this work, we instead propose distinct model-agnostic benchmark perturbations of images in order to investigate the resilience and robustness of different network architectures. Results empirically determine that increasing depth within most types of Convolutional Neural Networks typically improves model resilience towards general attacks, with improvement steadily decreasing as the model becomes deeper. Additionally, we find that a notable difference in adversarial robustness exists between residual architectures with skip connections and non-residual architectures of similar complexity. Our findings provide direction for future understanding of residual connections and depth on network robustness.

show abstract

“…PN uses an element-wise power operation to discount large values and increase small values of video representations. As one of the most significant improvements in the past few years, this simple algorithm essentially makes Fisher Vectors and VLADs useful in practice, and has been widely adopted by the research community to both handcrafted [17,27] and deeply-learned features [28,29]. However, PN can only alleviate the sparse and bursty distribution problems.…”

Section: Introductionmentioning

confidence: 99%

Improving Human Activity Recognition Through Ranking and Re-ranking

Lan¹,

Yu²,

Hauptmann³

2015

Preprint

View full text Add to dashboard Cite

We propose two well-motivated ranking-based methods to enhance the performance of current state-of-the-art human activity recognition systems. First, as an improvement over the classic power normalization method, we propose a parameter-free ranking technique called rank normalization (RaN). RaN normalizes each dimension of the video features to address the sparse and bursty distribution problems of Fisher Vectors and VLAD. Second, inspired by curriculum learning, we introduce a training-free re-ranking technique called multi-class iterative re-ranking (MIR). MIR captures relationships among action classes by separating easy and typical videos from difficult ones and re-ranking the prediction scores of classifiers accordingly. We demonstrate that our methods significantly improve the performance of state-of-the-art motion features on six realworld datasets.

show abstract

Exploiting Image-trained CNN Architectures for Unconstrained Video Classification

Cited by 29 publications

References 26 publications

Learning Temporal Embeddings for Complex Video Analysis

Learning Temporal Embeddings for Complex Video Analysis

Architectural Resilience to Foreground-and-Background Adversarial Noise

Improving Human Activity Recognition Through Ranking and Re-ranking

Contact Info

Product

Resources

About