2015 IEEE International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2015.510

Learning Spatiotemporal Features with 3D Convolutional Networks

Abstract: We propose a simple, yet effective approach for spatiotemporal feature learning using deep 3-dimensional convolutional networks (3D ConvNets) trained on a large scale supervised video dataset. Our findings are three-fold: 1) 3D ConvNets are more suitable for spatiotemporal feature learning compared to 2D ConvNets; 2) A homogeneous architecture with small 3 × 3 × 3 convolution kernels in all layers is among the best performing architectures for 3D ConvNets; and 3) Our learned features, namely C3D (Convolutional…
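To make the abstract's second finding concrete, here is a minimal sketch of a homogeneous 3D ConvNet in which every convolutional layer uses 3 × 3 × 3 kernels. The framework (PyTorch), layer widths, pooling schedule, and the 101-class output are illustrative assumptions, not the paper's exact C3D configuration.

```python
# Minimal sketch (not the authors' release): a homogeneous 3D ConvNet in the
# spirit of C3D, where every convolutional layer uses 3x3x3 kernels.
# Framework (PyTorch), layer widths, and class count are assumptions.
import torch
import torch.nn as nn

class Tiny3DConvNet(nn.Module):
    def __init__(self, num_classes=101):  # 101 classes is illustrative
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1),   # 3x3x3 everywhere
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),           # keep early temporal extent
            nn.Conv3d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2),
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):               # x: (batch, 3, frames, height, width)
        x = self.features(x)
        x = x.mean(dim=(2, 3, 4))       # global spatiotemporal average pooling
        return self.classifier(x)

clip = torch.randn(2, 3, 16, 112, 112)   # 16-frame RGB clips, as used by C3D
logits = Tiny3DConvNet()(clip)
print(logits.shape)                       # torch.Size([2, 101])
```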

Cited by 7,556 publications (5,749 citation statements)
References 38 publications

Citation statements (ordered by relevance):
“…By jointly encoding spatio-temporal information in the learning process, 3D convolutional networks [50] have achieved good performance in semantic video short classification. Other works use 2D CNN structures to perform recognition and detection tasks in video by fine-tuning the networks using video frames [51], or by combining video frames and optical flow maps as the input layer [42], [52].…”
Section: Related Work (mentioning)
confidence: 99%
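As an illustration of the two input strategies this statement contrasts, the sketch below applies a 3D convolution directly to a clip (joint spatio-temporal encoding) and, separately, a 2D convolution to a single frame with optical-flow maps stacked as extra input channels. Shapes and channel counts are assumptions made for the example.

```python
# Hedged sketch of the two input strategies contrasted above:
# (a) a 3D convolution over a clip (joint spatio-temporal encoding),
# (b) a 2D convolution over a frame with optical-flow maps stacked as channels.
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)            # (batch, RGB, frames, H, W)
conv3d = nn.Conv3d(3, 32, kernel_size=3, padding=1)
spatiotemporal = conv3d(clip)                      # (1, 32, 16, 112, 112)

frame = torch.randn(1, 3, 112, 112)                # one RGB frame
flow = torch.randn(1, 2, 112, 112)                 # horizontal/vertical flow maps
stacked = torch.cat([frame, flow], dim=1)          # 5-channel 2D input
conv2d = nn.Conv2d(5, 32, kernel_size=3, padding=1)
spatial_only = conv2d(stacked)                     # (1, 32, 112, 112)
```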
“…Tran et al. [50] proposed a 3D convolutional network for video classification. However, this structure does not produce the pixel-level labeling that is required in our task.…”
Section: Pixel-level CNN (mentioning)
confidence: 99%
“…Here, we adopt the recent approach of Xu et al. [38], which encodes features learned by a conv-net model using VLAD. Here, we use the activations from the fc7 layer of a 3D conv-net [34] as our features. We first learn a codebook using k-means with k = 256.…”
Section: DAPs for Action Detection (mentioning)
confidence: 99%
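A minimal sketch of the VLAD encoding step this statement describes is given below: a k-means codebook is learned over descriptors and each set of fc7 activations is aggregated into residuals against the nearest centroids. The feature dimensionality (128-D instead of fc7's 4096-D), the reduced codebook size (16 instead of the quoted k = 256), and the signed-sqrt/L2 normalization are assumptions made to keep the toy example fast; they are not taken from the cited work.

```python
# Hedged sketch of VLAD encoding over per-clip conv-net activations.
# Dimensions and k are shrunk for a quick toy run; normalization follows
# common VLAD practice and is an assumption, not the cited recipe.
import numpy as np
from sklearn.cluster import KMeans

def vlad_encode(descriptors, kmeans):
    """Aggregate local descriptors into a single VLAD vector."""
    centers = kmeans.cluster_centers_                      # (k, d)
    assignments = kmeans.predict(descriptors)              # nearest centroid per descriptor
    k, d = centers.shape
    vlad = np.zeros((k, d), dtype=np.float64)
    for cluster_id in range(k):
        members = descriptors[assignments == cluster_id]
        if len(members):
            vlad[cluster_id] = (members - centers[cluster_id]).sum(axis=0)
    vlad = vlad.flatten()
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))           # signed-sqrt normalization
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad

# Toy usage: random stand-ins for fc7 activations (one row per clip).
train_feats = np.random.randn(2000, 128)                   # 128-D for speed; fc7 is 4096-D
codebook = KMeans(n_clusters=16, n_init=4, random_state=0).fit(train_feats)
video_feats = np.random.randn(50, 128)
encoding = vlad_encode(video_feats, codebook)
print(encoding.shape)                                       # (2048,) = 16 clusters x 128 dims
```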
“…Our network integrates the following modules: Visual encoder: It encodes a small video volume into a meaningful low dimensional feature vector. In practice, we use activations from the top layer of a 3D convolutional network trained for action classification (C3D network [34]). Sequence encoder: It encodes the sequence of visual codes as a discriminative sequence of hidden states.…”
Section: Architecture (mentioning)
confidence: 99%
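The statement names two modules; the sketch below covers the sequence-encoder stage, assuming the visual encoder's output is a precomputed 4096-D C3D activation per 16-frame clip. The recurrent cell type (GRU) and the hidden size are assumptions; the cited architecture may use a different recurrent unit.

```python
# Hedged sketch: encode a sequence of per-clip visual codes (assumed to be
# precomputed C3D activations) into a sequence of hidden states.
# GRU choice and dimensions are assumptions for illustration.
import torch
import torch.nn as nn

class SequenceEncoder(nn.Module):
    def __init__(self, visual_dim=4096, hidden_dim=512):
        super().__init__()
        self.rnn = nn.GRU(input_size=visual_dim, hidden_size=hidden_dim,
                          batch_first=True)

    def forward(self, visual_codes):
        # visual_codes: (batch, num_clips, visual_dim), e.g. C3D features
        hidden_states, _ = self.rnn(visual_codes)
        return hidden_states                 # (batch, num_clips, hidden_dim)

# Toy usage: 8 clips per video, one 4096-D visual code per clip.
codes = torch.randn(2, 8, 4096)
states = SequenceEncoder()(codes)
print(states.shape)                           # torch.Size([2, 8, 512])
```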