Abstract. This paper proposes a novel approach and a new benchmark for video summarization. Thereby we focus on user videos, which are raw videos containing a set of interesting events. Our method starts by segmenting the video by using a novel "superframe" segmentation, tailored to raw videos. Then, we estimate visual interestingness per superframe using a set of low-, mid-and high-level features. Based on this scoring, we select an optimal subset of superframes to create an informative and interesting summary. The introduced benchmark comes with multiple human created summaries, which were acquired in a controlled psychological experiment. This data paves the way to evaluate summarization methods objectively and to get new insights in video summarization. When evaluating our method, we find that it generates high-quality results, comparable to manual, human-created summaries.
We investigate human interest in photos. Based on our own and others' psychological experiments, we identify various cues for "interestingness", namely aesthetics, unusualness and general preferences. For the ranking of retrieved images, interestingness is more appropriate than cues proposed earlier. Interestingness is, for example, correlated with what people believe they will remember. This is opposed to actual memorability, which is uncorrelated to both of them. We introduce a set of features computationally capturing the three main aspects of visual interestingness that we propose and build an interestingness predictor from them. Its performance is shown on three datasets with varying context, reflecting diverse levels of prior knowledge of the viewers.
High-quality urban reconstruction requires more than multi-view reconstruction and local optimization. The structure of facades depends on the general layout, which has to be optimized globally. Shape grammars are an established method to express hierarchical spatial relationships, and are therefore suited as representing constraints for semantic facade interpretation. Usually inference uses numerical approximations, or hard-coded grammar schemes. Existing methods inspired by classical grammar parsing are not applicable on real-world images due to their prohibitively high complexity. This work provides feasible generic facade reconstruction by combining low-level classifiers with mid-level object detectors to infer an irregular lattice. The irregular lattice preserves the logical structure of the facade while reducing the search space to a manageable size. We introduce a novel method for handling symmetry and repetition within the generic grammar. We show competitive results on two datasets, namely the Paris2010 and the Graz50. The former includes only Hausmannian, while the latter includes Classicism, Biedermeier, Historicism, Art Nouveau and post-modern architectural styles
We propose a new approach for semantic segmentation of 3D city models. Starting from an SfM reconstruction of a street-side scene, we perform classification and facade splitting purely in 3D, obviating the need for slow imagebased semantic segmentation methods. We show that a properly trained pure-3D approach produces high quality labelings, with significant speed benefits (20x faster) allowing us to analyze entire streets in a matter of minutes. Additionally, if speed is not of the essence, the 3D labeling can be combined with the results of a state-of-the-art 2D classifier, further boosting the performance. Further, we propose a novel facade separation based on semantic nuances between facades. Finally, inspired by the use of architectural principles for 2D facade labeling, we propose new 3D-specific principles and an efficient optimization scheme based on an integer quadratic programming formulation.
There is an increasing interest in semantically annotated 3D models, e.g. of cities. The typical approaches start with the semantic labelling of all the images used for the 3D model. Such labelling tends to be very time consuming though. The inherent redundancy among the overlapping images calls for more efficient solutions. This paper proposes an alternative approach that exploits the geometry of a 3D mesh model obtained from multi-view reconstruction. Instead of clustering similar views, we predict the best view before the actual labelling. For this we find the single image part that bests supports the correct semantic labelling of each face of the underlying 3D mesh. Moreover, our singleimage approach may surprise because it tends to increase the accuracy of the model labelling when compared to approaches that fuse the labels from multiple images. As a matter of fact, we even go a step further, and only explicitly label a subset of faces (e.g. 10%), to subsequently fill in the labels of the remaining faces. This leads to a further reduction of computation time, again combined with a gain in accuracy. Compared to a process that starts from the semantic labelling of the images, our method to semantically label 3D models yields accelerations of about 2 orders of magnitude. We tested our multi-view semantic labelling on a variety of street scenes.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.