Optical Music Recognition (OMR) is the challenge of understanding the content of musical scores. Accurate detection of individual music objects is a critical step in processing musical documents, because a failure at this stage corrupts any further processing. So far, all proposed methods have either been limited to typeset music scores or have been built to detect only a subset of the available classes of music symbols. In this work, we propose an end-to-end trainable object detector for music symbols that is capable of detecting almost the full vocabulary of modern music notation in handwritten music scores. By training deep convolutional neural networks on the recently released MUSCIMA++ dataset, which has symbol-level annotations, we show that a machine learning approach can be used to accurately detect music objects with a mean average precision of up to 80%.
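The mean average precision reported above is the standard object-detection metric: detections are matched to ground-truth boxes by intersection-over-union, and average precision is the area under the resulting precision-recall curve, averaged over symbol classes. A minimal sketch of the two building blocks (the function names and the simple matching scheme are illustrative, not taken from the paper):

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def average_precision(scores, matches, n_gt):
    """Area under the precision-recall curve for one class.

    scores:  confidence of each detection
    matches: 1 if the detection was matched to a ground-truth box, else 0
    n_gt:    number of ground-truth objects of this class
    """
    order = np.argsort(scores)[::-1]          # rank detections by confidence
    tp = np.cumsum(np.asarray(matches)[order])
    precision = tp / np.arange(1, len(matches) + 1)
    recall = tp / n_gt
    # Accumulate precision weighted by each increase in recall.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap
```

The mean average precision is then simply the mean of `average_precision` over all symbol classes.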
The study presented in this paper analyses the visual MPEG-7 descriptors from a statistical point of view. A statistical analysis is able to reveal the properties and qualities of the descriptors used: redundancies, sensitivity to media content, etc. These aspects were not considered in the MPEG-7 design process, where the major goal was optimising the retrieval rate. For the statistical analysis, eight basic visual descriptors were applied to three media collections: the Brodatz dataset, a selection of the Corel photo dataset and a set of coats-of-arms images. The resulting feature vectors were analysed with four statistical methods: mean and variance of description elements, distribution of elements, cluster analysis (hierarchical and topological) and factor analysis. The analysis revealed that, for example, most MPEG-7 descriptions are highly redundant and sensitive to the presence of colour shades.
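One simple way to expose the redundancy the abstract mentions is to compute pairwise Pearson correlations between descriptor elements across a media collection: highly correlated element pairs carry largely the same information. A minimal sketch (the function name and threshold are illustrative assumptions, not the paper's method):

```python
import numpy as np

def redundant_pairs(features, threshold=0.9):
    """Index pairs of descriptor elements whose absolute Pearson
    correlation exceeds the threshold, i.e. redundancy candidates.

    features: (n_images, n_elements) matrix of extracted descriptions.
    """
    corr = np.corrcoef(features, rowvar=False)
    d = corr.shape[0]
    pairs = []
    for i in range(d):
        for j in range(i + 1, d):
            if abs(corr[i, j]) > threshold:
                pairs.append((i, j))
    return pairs
```

On a toy matrix where the second element is an exact multiple of the first, only that pair is flagged.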
Visual information retrieval (VIR) is a research area with more than 300 scientific publications every year. Technological progress causes surveys to become outdated within a short time. This paper intends to briefly describe selected important advances in VIR in recent years and point out promising directions for future research. A software architecture for visual media handling is proposed that treats image and video content equally, making it possible to integrate both types of media in a single system. The major advances in feature design are sketched and new methods for semantic enrichment are proposed. Guidelines are formulated for further development of feature extraction methods. The most relevant retrieval processes are described and an interactive method for visual mining is suggested that really puts "the human in the loop". For evaluation, the classic recall- and precision-based approach is discussed as well as a new procedure based on MPEG-7 and statistical data analysis. Finally, an "ideal" architecture for VIR systems is outlined. The selection of VIR topics is subjective and represents the author's point of view. The intention is to provide a short but substantial introduction to the field of VIR.
The study presented in this paper analyses descriptions extracted from visual content with MPEG-7 descriptors from a statistical point of view. Good descriptors should generate descriptions with high variance, a well-balanced cluster structure and high discriminative power, so that they can distinguish different media content. Statistical analysis reveals the quality of the description extraction algorithms used. This was not considered in the MPEG-7 design process, where optimising the recall was the major goal. For the analysis, eight basic visual descriptors were applied to three media collections: the Brodatz dataset (monochrome textures), a selection of the Corel dataset (colour photos) and a set of coats-of-arms images (artificial colour images with few colour gradations). The results were analysed with four statistical methods: mean and variance of descriptor elements, distribution of elements, cluster analysis (hierarchical and topological) and factor analysis. The main results are: the best descriptors for combination are Color Layout, Dominant Color, Edge Histogram and Texture Browsing; the others are highly dependent on these. The colour histograms (Color Structure and Scalable Color) perform badly on monochrome input. Generally, all descriptors are highly redundant, and the application of complexity reduction transformations could save up to 80% of storage and transmission capacity.
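The claimed storage saving follows from the redundancy: if the descriptor elements are highly correlated, a variance-preserving transformation such as principal component analysis can represent the descriptions with far fewer components. A minimal sketch of estimating how many components suffice (the function name and the 95% retention threshold are illustrative assumptions; the paper used factor analysis, a related technique):

```python
import numpy as np

def components_for_variance(features, keep=0.95):
    """Number of principal components needed to retain `keep` of the
    total variance of the descriptions; fewer components mean
    proportionally smaller stored descriptions.

    features: (n_images, n_elements) matrix of extracted descriptions.
    """
    centered = features - features.mean(axis=0)
    # Squared singular values are proportional to per-component variance.
    s = np.linalg.svd(centered, compute_uv=False)
    var = s ** 2
    ratio = np.cumsum(var) / var.sum()
    return int(np.searchsorted(ratio, keep) + 1)
```

For a descriptor with `d` elements reduced to `k` components, the saving is `1 - k/d`; an 80% saving corresponds to keeping one component in five.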