In k-means clustering we are given a set of n data points in d-dimensional space ℝ^d and an integer k, and the problem is to determine a set of k points in ℝ^d, called centers, to minimize the mean squared distance from each data point to its nearest center. No exact polynomial-time algorithms are known for this problem. Although asymptotically efficient approximation algorithms exist, these algorithms are not practical due to the very high constant factors involved. There are many heuristics that are used in practice, but we know of no bounds on their performance. We consider the question of whether there exists a simple and practical approximation algorithm for k-means clustering. We present a local improvement heuristic based on swapping centers in and out. We prove that this yields a (9 + ε)-approximation algorithm. We present an example showing that any approach based on performing a fixed number of swaps achieves an approximation factor of at least (9 − ε) in all sufficiently high dimensions. Thus, our approximation factor is almost tight for algorithms based on performing a fixed number of swaps. To establish the practical value of the heuristic, we present an empirical study that shows that, when combined with Lloyd's algorithm, this heuristic performs quite well in practice.
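As a rough, hypothetical sketch of the kind of swap-based local search described above (not the authors' exact algorithm or its candidate-set construction), the following Python repeatedly tries replacing one current center with one candidate point and keeps any swap that lowers the k-means cost; all function and variable names are illustrative.

```python
import numpy as np

def kmeans_cost(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    # dists[i, j] = squared distance from point i to center j
    dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.min(axis=1).sum()

def single_swap_heuristic(points, k, candidates=None, max_rounds=100, seed=0):
    """Local search: repeatedly try swapping one current center for one
    candidate point, keeping any swap that lowers the k-means cost."""
    rng = np.random.default_rng(seed)
    if candidates is None:
        candidates = points                      # candidate centers = data points
    centers = candidates[rng.choice(len(candidates), k, replace=False)]
    best = kmeans_cost(points, centers)
    for _ in range(max_rounds):
        improved = False
        for i in range(k):
            for c in candidates:
                trial = centers.copy()
                trial[i] = c                     # swap center i out, candidate c in
                cost = kmeans_cost(points, trial)
                if cost < best:
                    centers, best, improved = trial, cost, True
        if not improved:                         # local optimum w.r.t. single swaps
            break
    return centers, best
```

This sketch only conveys the single-swap improvement step; as the abstract notes, in practice the heuristic is combined with Lloyd's algorithm.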
Document structure analysis can be regarded as a syntactic analysis problem. The order and containment relations among the physical or logical components of a document page can be described by an ordered tree structure and can be modeled by a tree grammar which describes the page at the component level in terms of regions or blocks. This paper provides a detailed survey of past work on document structure analysis algorithms and summarizes the limitations of past approaches. In particular, we survey past work on document physical layout representations and algorithms, document logical structure representations and algorithms, and performance evaluation of document structure analysis algorithms. In the last section, we summarize this work and point out its limitations.
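As an illustrative sketch only (no system from the survey is implied), the ordered-tree view of a page's physical components can be represented with a simple recursive node type; the class and field names below are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Region:
    """One physical component of a page (block, line, word, ...).

    Children are kept in reading order, so the tree encodes both the
    containment relation and the ordering relation among components."""
    label: str                                  # e.g. "page", "column", "paragraph"
    bbox: Tuple[int, int, int, int]             # (x0, y0, x1, y1) in pixels
    children: List["Region"] = field(default_factory=list)

# A toy page: one column containing two text blocks.
page = Region("page", (0, 0, 2480, 3508), [
    Region("column", (200, 300, 2280, 3300), [
        Region("text-block", (200, 300, 2280, 1500)),
        Region("text-block", (200, 1600, 2280, 3300)),
    ]),
])
```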
Two sources of document degradation are modeled: i) perspective distortion that occurs while photocopying or scanning thick, bound documents, and ii) degradation due to perturbations in the optical scanning and digitization process: speckle, blur, jitter, threshold, etc. Perspective distortion is modeled by studying the underlying perspective geometry of the optical system of photocopiers and scanners. An illumination model is described to account for the non-linear intensity change occurring across a page in a perspective-distorted document. The optical distortion process is modeled morphologically. First, a distance transform is performed on the foreground, followed by a random inversion of binary pixels, where the probability of a flip is a function of the distance of the pixel from the boundary of the foreground. The correlation among flipped pixels is modeled by a morphological closing operation.

1 Introduction

There are many reasons for modeling document degradation. First, in order to study the performance of any OCR algorithm, it is necessary to characterize the perturbation in the output performance as a function of the perturbation in the input. This is possible only if we have a perturbation/degradation model for the input document. Second, a degradation model permits the evaluation of an algorithm over a continuum of degradation levels, from low to high. This in turn allows us to locate the 'break-down' point or the 'knee' of the algorithm, which is not available from commonly used evaluation methods, e.g., confusion matrices. Third, knowledge of the degradation model can enable us to design algorithms for restoring degraded documents. Furthermore, OCR algorithm designers can make use of these degradation models explicitly rather than implicitly, as is usually done in the current literature.

In this paper we model two sources of document degradation: i) perspective distortion that occurs while photocopying or scanning thick, bound documents, and ii) degradation due to perturbations in the optical process: speckle, blur, jitter, threshold, etc. Perspective distortion is modeled by studying the underlying perspective geometry of the optical system of photocopiers and scanners. An illumination model is proposed to account for the non-linear intensity change occurring across a page in a perspective-distorted document. The local optical distortion process is modeled morphologically. First, a distance transform is performed on the foreground, followed by a random inversion of binary pixels, where the probability of a flip is a function of the distance of the pixel from the boundary of the foreground. The correlation among flipped pixels is modeled by a morphological closing operation.

Baird [Bai90] discusses a model for character degradation. His model does not account for the nonlinear distortions produced by perspective distortion. Loce [LL90] models the perturbation introduced by mechanical disturbances in high-end Xerox photocopiers. Our paper models the distortions in geometry and illumin...
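A minimal, hypothetical sketch of the local optical-distortion step described above, using SciPy's distance transform and binary closing; the exponential form of the flip probability and its parameter `alpha` are illustrative assumptions, not the paper's calibrated model.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt, binary_closing

def degrade(binary_img, alpha=1.0, rng=None):
    """Toy local degradation: flip pixels with a probability that decays with
    distance from the foreground boundary, then correlate flips by closing.

    binary_img: 2-D array, 1 = foreground (ink), 0 = background.
    alpha: assumed decay rate of the flip probability (illustrative only).
    """
    rng = rng or np.random.default_rng(0)
    # Distance to the nearest background (resp. foreground) pixel approximates
    # each pixel's distance to the boundary of the foreground.
    d_fg = distance_transform_edt(binary_img)        # inside the foreground
    d_bg = distance_transform_edt(1 - binary_img)    # inside the background
    dist = np.where(binary_img == 1, d_fg, d_bg)
    flip_prob = np.exp(-alpha * dist)                # assumed form of the decay
    flips = rng.random(binary_img.shape) < flip_prob
    noisy = np.where(flips, 1 - binary_img, binary_img)
    # Morphological closing correlates the flipped pixels spatially.
    return binary_closing(noisy, structure=np.ones((2, 2))).astype(np.uint8)
```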
Readability is a crucial presentation attribute that web summarization algorithms consider while generating a query-biased web summary. Readability quality also forms an important component in real-time monitoring of commercial search-engine results, since the readability of web summaries impacts clickthrough behavior, as shown in recent studies, and thus impacts user satisfaction and advertising revenue. The standard approach to computing readability is to first collect a corpus of random queries and their corresponding search result summaries; each summary is then judged by a human for its readability quality, and an average readability score is reported. This process is time consuming and expensive. Moreover, the manual evaluation process cannot be used in the real-time summary generation process. In this paper we propose a machine learning approach to the problem. We use the corpus described above and extract summary features that we believe may characterize readability. We then estimate a model (a gradient boosted decision tree) that predicts human judgments given the features. This model can then be used in real time to estimate the readability of new (unseen) web search summaries, and can also be used in the summary generation process. We present results on approximately 5000 editorial judgments collected over the course of a year and show examples where the model predicts the quality well and where it disagrees with human judgments. We compare the results of the model to previous models of readability, most notably Collins-Thompson-Callan, Fog, and Flesch-Kincaid, and see that our model shows substantially better correlation with editorial judgments as measured by Pearson's correlation coefficient. The learning algorithm also provides us with the relative importance of the features used.
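A hedged sketch of the modeling step using scikit-learn's gradient boosted trees; the features, data, and hyperparameters below are placeholders and are not the features or settings used in the study.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from scipy.stats import pearsonr

# Placeholder data: each row is a summary's feature vector (e.g. average word
# length, fraction of truncated sentences, query-term density, ...), and y is
# the human readability judgment for that summary.
X = np.random.rand(5000, 10)          # hypothetical features
y = np.random.rand(5000)              # hypothetical editorial scores

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = GradientBoostingRegressor(n_estimators=200, max_depth=3, learning_rate=0.05)
model.fit(X_tr, y_tr)

pred = model.predict(X_te)
print("Pearson r:", pearsonr(y_te, pred)[0])          # correlation with judgments
print("Feature importances:", model.feature_importances_)
```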
We present a methodology for the quantitative performance evaluation of detection algorithms in computer vision. A common method is to generate a variety of input images by varying the image parameters and to evaluate the performance of the algorithm as the algorithm parameters vary. Operating curves that relate the probability of misdetection and false alarm are generated for each parameter setting. Such an analysis does not integrate the performance of the numerous operating curves. We outline a methodology for summarizing many operating curves into a few performance curves. This methodology is adapted from the human psychophysics literature and is general to any detection algorithm. The central concept is to measure the effect of variables in terms of the equivalent effect of a critical signal variable, which in turn facilitates the determination of the breakdown point of the algorithm. We demonstrate the methodology by comparing the performance of two line-detection algorithms.
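To make the notion of an operating curve concrete, the following hypothetical sketch sweeps a detector's decision threshold at a single parameter setting and records (false alarm, misdetection) pairs; the score distributions are made up, and this is not the authors' equivalent-effect methodology itself.

```python
import numpy as np

def operating_curve(scores_signal, scores_noise, thresholds):
    """For each threshold, return (P_false_alarm, P_misdetection).

    scores_signal: detector scores on images that contain the target.
    scores_noise:  detector scores on images that do not.
    """
    curve = []
    for t in thresholds:
        p_fa = np.mean(scores_noise >= t)     # noise-only images flagged as detections
        p_miss = np.mean(scores_signal < t)   # target images not flagged
        curve.append((p_fa, p_miss))
    return curve

# Hypothetical scores at one (image-parameter, algorithm-parameter) setting.
rng = np.random.default_rng(0)
scores_signal = rng.normal(2.0, 1.0, 500)
scores_noise = rng.normal(0.0, 1.0, 500)
print(operating_curve(scores_signal, scores_noise, thresholds=np.linspace(-2, 4, 7)))
```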