In this paper, we present a robust method for scene recognition that leverages Convolutional Neural Network (CNN) features in a sparse coding setting to create a new representation of indoor scenes. Although CNNs have greatly benefited computer vision and pattern recognition, convolutional layers adjust their weights in a global fashion, which may discard important local details such as objects and small structures. Our proposed scene representation relies on both global features, which mostly describe the environment's structure, and local features, which are sparsely combined to capture characteristics of objects commonly found in a given scene. This new representation is built from fragments of the scene and leverages features extracted by CNNs. Experimental evaluation shows that the resulting representation outperforms previous scene recognition methods on the Scene15 and MIT67 datasets and performs competitively on SUN397, while being highly robust to perturbations of the input image such as noise and occlusion.
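The abstract does not specify the sparse coding algorithm, so the following is only a minimal sketch of the general idea: encoding a (here, simulated) CNN fragment feature over a dictionary via ISTA, a standard iterative soft-thresholding solver for the lasso objective. The dictionary, dimensions, and regularization weight are all illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def sparse_code_ista(x, D, lam=0.05, n_iter=300):
    """Sparse-code feature vector x over dictionary D (d x k) via ISTA:
    minimize 0.5*||x - D a||^2 + lam*||a||_1."""
    L = np.linalg.norm(D, 2) ** 2              # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)               # gradient of the quadratic term
        z = a - grad / L                       # gradient step
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return a

rng = np.random.default_rng(0)
D = rng.standard_normal((128, 64))
D /= np.linalg.norm(D, axis=0)                 # unit-norm dictionary atoms
# stand-in for a CNN fragment feature, built from two atoms
x = 2.0 * D[:, 3] - 1.5 * D[:, 10]
a = sparse_code_ista(x, D)
print(np.count_nonzero(a))                     # only a few atoms remain active
```

In this setup the recovered code concentrates its mass on the two generating atoms, which is the property the representation above exploits: each scene fragment is explained by a small set of object-like components.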
Automatic scene recognition is still regarded as an open challenge, even though there are reports of methods surpassing human accuracy. This is especially true for indoor scenes, since they are well represented by their constituent objects, a highly variable source of information: objects vary in angle, size, and texture, and are often partially occluded in crowded scenes. Although Convolutional Neural Networks have shown remarkable performance on most image-related problems, the top results on indoor scenes are attributed to approaches that add object-level information to the methodology and model the intricate relationships among objects. Since Recurrent Neural Networks were designed to model structure in a sequence of elements, researchers have only recently begun exploiting their advantages for scene recognition. Even though such works usually fall below state-of-the-art performance, there is still plenty of room to unravel the full potential of recurrent methodologies. This work therefore proposes representing an image as a sequence of object-level information, extracting highly semantic features from models pre-trained on an object-centric dataset to feed a bidirectional Long Short-Term Memory network trained for scene classification. We adopt a many-to-many training scheme in which each input element yields a corresponding scene prediction, allowing us to combine the individual predictions through weighted voting to boost recognition. To the best of our knowledge, our sequence representation and our late fusion of predictions have been little pursued by recurrent approaches to scene recognition in the literature.
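The many-to-many fusion step can be sketched independently of the network itself. Below, the per-step softmax outputs that a bidirectional LSTM would emit (one per object in the sequence) are simulated as plain arrays, and a weighted vote combines them into a single scene label. The weighting rule (each step weighted by its own confidence) is an assumption for illustration; the paper's actual weights are not specified in the abstract.

```python
import numpy as np

def weighted_vote(step_probs, weights=None):
    """Fuse per-step scene predictions (T x C softmax rows) into one label.

    step_probs: one probability row per element of the object sequence,
    as a many-to-many recurrent model would emit.
    weights: per-step vote weights; defaults to each step's own max
    probability, so confident steps count more (an assumed rule).
    """
    step_probs = np.asarray(step_probs, dtype=float)
    if weights is None:
        weights = step_probs.max(axis=1)
    fused = (np.asarray(weights)[:, None] * step_probs).sum(axis=0)
    return int(np.argmax(fused)), fused / fused.sum()

# three object-level steps, four candidate scene classes
steps = [[0.7, 0.1, 0.1, 0.1],   # strong vote for class 0
         [0.3, 0.4, 0.2, 0.1],   # weak vote for class 1
         [0.6, 0.2, 0.1, 0.1]]   # vote for class 0
label, probs = weighted_vote(steps)
print(label)
```

Because the two confident steps agree, class 0 wins the fused vote even though one step preferred class 1, which is the point of aggregating every per-step prediction rather than keeping only the last one.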
We evaluated our proposal on three widely known scene recognition datasets: Scene15, MIT67, and SUN397. Our method outperforms recurrent-based methods on MIT67, a dataset entirely dedicated to indoor scenes, while the other two, which mix indoor and outdoor environments, posed a greater challenge for our approach. Nevertheless, by pairing our method with a few of the most successful methods in the literature in an ensemble of classifiers, we improved performance on all three datasets, showing that a joint strategy with our method is beneficial for the task of scene classification.
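The ensemble described above is a late fusion of classifier outputs; a minimal sketch, assuming each method exposes a class-probability vector and that methods are combined by a (possibly weighted) average, follows. The method names and numbers are purely hypothetical placeholders.

```python
import numpy as np

def ensemble_fuse(prob_maps, alphas=None):
    """Late-fuse class probabilities from several scene classifiers.

    prob_maps: list of (C,) probability vectors, one per method.
    alphas: optional per-method weights (e.g. validation accuracy);
    defaults to a uniform average.
    """
    P = np.asarray(prob_maps, dtype=float)
    if alphas is None:
        alphas = np.full(len(P), 1.0 / len(P))
    fused = np.asarray(alphas) @ P             # weighted average of rows
    return int(np.argmax(fused))

# hypothetical outputs for one image: our recurrent model plus two baselines
ours  = [0.20, 0.50, 0.30]
cnn_a = [0.40, 0.35, 0.25]
cnn_b = [0.30, 0.40, 0.30]
print(ensemble_fuse([ours, cnn_a, cnn_b]))
```

Here the fused prediction picks class 1, which only one method ranked first on its own; averaging lets complementary classifiers correct each other's mistakes, matching the improvement reported above.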