For the first question, we propose two CNN architectures to model the local features of videos for action recognition. The first work focuses on how to design a network that effectively encodes and aggregates multiple kinds of local features of a video, while the second proposes a convolutional pooling strategy to explore the temporal information hidden within the frame-level representations. Both works yield flexible CNN architectures that are well suited to video data and lead to promising action recognition performance.

Supervised encoding and Fisher vector encoding are two representative schemes for creating image representations. Both can achieve state-of-the-art image classification performance, but through different strategies: the former extracts discriminative patterns from local features at the encoding stage, while the latter preserves rich information in high-dimensional signatures derived from a generative model of local features. For the second problem, we propose a hybrid Fisher vector encoding scheme for image classification that combines the strategies of these two encoding methods. The key idea is to leverage supervised encoding to decompose local features into a discriminative part and a residual part, and then to build a generative model based on this decomposition.

For the third problem, we study the challenging task of identifying unusual instances of known objects in images within an "open-world" setting. That is, we aim to find objects that are members of a known class but are not typical of that class. We propose to identify unusual objects by inspecting the distributions of local visual patterns at multiple image regions. Considering the promising performance of Region CNN [37], we represent an image by a set of local CNN features and then map them to scalar detection scores to suppress the distracting influence of irrelevant content.
To model the region-level score distribution, we propose to use a Gaussian Process (GP) to construct two separate generative models, one for "regular object" and the other for "other objects". We design a new covariance function to simultaneously model the detection score at a single location and the score dependencies between multiple regions. This treatment allows our method to capture the spatial dependencies between local regions, which turns out to be crucial for identifying unusual objects.
Declaration by Author
This thesis is composed of my original work, and contains no material previously published or written by another person except where due reference has been made in the text. I have clearly stated the contribution by others to jointly-authored works that I have included in my thesis.