The automatic organization of video databases according to the semantic content of the data is a key aspect of efficient indexing and fast retrieval of audiovisual material. In order to generate indices that can be used to access a video database, a description of each video sequence is necessary. The identification of objects present in a frame, and the tracking of their motion and interaction in space and time, is attractive but not yet very robust. For this reason, since the early 1990s, attempts have been made to segment a video into shots. For each shot, a representative frame, called a k-frame, is usually chosen, and the video can then be analyzed through its k-frames. Even if abrupt scene changes are relatively easy to detect, it is more difficult to identify special effects, such as dissolves, that were applied at the editing stage to merge two shots. Unfortunately, these special effects are normally used to stress the importance of the scene change from a content point of view, so they are extremely relevant and should not be missed. Besides, in the case of dissolves and fades, it is very important to determine precisely the beginning and the end of the transition. In this work, two new parameters are proposed. These characterize the precision of the boundaries of special effects when the scene change involves more than two frames. They are combined with the common recall and precision parameters. Three types of algorithms for cut detection are considered: histogram-based, motion-based, and contour-based. These algorithms are tested and compared on several video sequences. Results show that the best performance is achieved by the global histogram-based method, which uses color information.
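The abstract does not detail the histogram-based method it evaluates, but a minimal sketch of generic global color-histogram cut detection may clarify the idea: each frame is reduced to a normalized joint color histogram, and an abrupt cut is flagged wherever the distance between consecutive histograms exceeds a threshold. The `bins` and `threshold` values below are illustrative tuning parameters, not taken from the paper.

```python
import numpy as np

def color_histogram(frame, bins=8):
    """Quantize an RGB frame (H x W x 3, uint8) into a joint color histogram."""
    # Reduce each channel to `bins` levels, then index a single joint histogram.
    q = (frame.astype(np.uint32) * bins) // 256
    idx = (q[..., 0] * bins + q[..., 1]) * bins + q[..., 2]
    hist = np.bincount(idx.ravel(), minlength=bins ** 3).astype(np.float64)
    return hist / hist.sum()  # normalize so frame size does not matter

def detect_cuts(frames, threshold=0.5):
    """Flag a cut between frames i and i+1 when the normalized L1 histogram
    distance (in [0, 1]) exceeds `threshold`. Returns the indices i."""
    hists = [color_histogram(f) for f in frames]
    cuts = []
    for i in range(len(hists) - 1):
        d = 0.5 * np.abs(hists[i] - hists[i + 1]).sum()
        if d > threshold:
            cuts.append(i)
    return cuts
```

Note that this simple per-frame comparison only handles abrupt cuts; gradual transitions such as dissolves spread the histogram change over many frames, which is exactly why the paper argues that transition boundaries need dedicated evaluation parameters.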
This work deals with the representation of audiovisual information, organizing its content for future tasks such as retrieval and information browsing. Some indications are provided to demonstrate that a cross‐modal analysis of simple visual and audio information is sufficient to organize an audiovisual sequence into semantically meaningful segments. Each segment defines a scene which is coherent from some semantic point of view. Depending on the sophistication of the cross‐modal analysis, the scene may represent either a generic story unit or more complex situations such as dialogues or actions. The results shown in this work indicate that audio classification is key in establishing relationships among consecutive shots, allowing us to reach a scene‐level description. A higher abstraction level can be reached when a correlation exists among nonconsecutive shots, defining what are called “video idioms.” Accordingly, a generic audio model is proposed: a linear combination of four classes of audio signals. For semantic purposes, it is meaningful to select the classes so that they can serve any subsequent scene characterization. When several audio sources are combined simultaneously, it is assumed that only one is linked to the semantics of the scene, and that it corresponds to the dominant class of audio (in energy terms). The different classes that identify each type of audio are selected to facilitate any decision related to a semantic characterization of the audiovisual information. The problem therefore amounts to a source separation task. The proposed scheme classifies the audio signal into the following four component types: speech, music, silence, and miscellaneous other sounds. Its performance is quite satisfactory (∼90%) and was tested extensively using various types of source material. Considering a generic audiovisual sequence, video shots are merged according to this audio classification.
Depending on the type of source material (broadcast news, commercials, documentaries, and movies), different types of scenes can be identified, e.g., a single advertisement in the case of commercials; a dialogue situation in a movie. The article describes some experimental simulations in these different fields. © 1998 John Wiley & Sons, Inc. Int J Imaging Syst Technol, 9, 320–331, 1998
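The abstract does not specify how shots are merged once each carries an audio label, but the grouping step it describes can be sketched as follows: consecutive shots whose dominant audio class is the same are fused into one scene-level segment. The shot representation and function name below are hypothetical; the four class names are the ones given in the abstract.

```python
from itertools import groupby

def merge_shots_by_audio(shot_labels):
    """Group consecutive shots sharing the same dominant audio class
    ("speech", "music", "silence", or "other") into scene-level segments.
    Returns a list of (label, [shot indices]) pairs, one per scene."""
    scenes = []
    # groupby only merges *adjacent* equal labels, which matches the idea of
    # scenes as runs of consecutive, audio-coherent shots.
    for label, group in groupby(enumerate(shot_labels), key=lambda t: t[1]):
        scenes.append((label, [i for i, _ in group]))
    return scenes
```

Detecting the “video idioms” mentioned above would require a further step relating nonconsecutive scenes (e.g., alternating speech segments in a dialogue), which this sketch deliberately omits.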
This paper presents the ToCAI (Table of