This paper describes a communication theory approach to document image recognition, patterned after the use of hidden Markov models in speech recognition. In general, a document recognition problem is viewed as consisting of three elements: an image generator, a noisy channel and an image decoder. A document image generator is a Markov source (stochastic finite-state automaton) that combines a message source with an imager. The message source produces a string of symbols, or text, that contains the information to be transmitted. The imager is modeled as a finite-state transducer that converts the one-dimensional message string into an ideal two-dimensional bitmap. The channel transforms the ideal image into a noisy observed image. The decoder estimates the message, given the observed image, by finding the a posteriori most probable path through the combined source and channel models using a Viterbi-like dynamic programming algorithm. The proposed approach is illustrated on the problem of decoding scanned telephone yellow pages to extract names and numbers from the listings. A finite-state model for yellow page columns was constructed and used to decode a database of scanned column images containing about 1100 individual listings. Overall, 99.5% of the listings were correctly recognized, with character classification rates of 98% and 99.6%, respectively, for the names and numbers.
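The decoding step described above can be illustrated with a minimal sketch of Viterbi decoding for a one-dimensional hidden Markov model. This is a simplification and not the paper's decoder, which searches paths through a two-dimensional image model. The function name `viterbi`, the array layout, and the two-state example are assumptions made for illustration only.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit, obs):
    """Return the most probable state path and its log probability.

    log_init[i]     : log P(first state = i)            (hypothetical layout)
    log_trans[i, j] : log P(next state = j | state = i)
    log_emit[i, k]  : log P(observation = k | state = i), i.e. the channel
    obs             : sequence of observation indices
    """
    n_states = log_init.shape[0]
    T = len(obs)
    score = np.empty((T, n_states))             # best log score ending in each state
    back = np.zeros((T, n_states), dtype=int)   # best predecessor of each state

    score[0] = log_init + log_emit[:, obs[0]]
    for t in range(1, T):
        # cand[i, j]: best score of a path that is in state i at t-1 and moves to j
        cand = score[t - 1][:, None] + log_trans
        back[t] = np.argmax(cand, axis=0)
        score[t] = cand[back[t], np.arange(n_states)] + log_emit[:, obs[t]]

    # Trace the best path backwards from the best final state.
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(np.max(score[-1]))

# Toy source and channel: two "sticky" states, each observed correctly 90% of the time.
log_init = np.log(np.array([0.5, 0.5]))
log_trans = np.log(np.array([[0.8, 0.2], [0.2, 0.8]]))
log_emit = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))

path, logp = viterbi(log_init, log_trans, log_emit, [0, 1, 0, 0])
```

In this toy setting the isolated observation `1` is explained more cheaply as channel noise than as two state changes, so the decoded path stays in state 0 throughout. Working in log probabilities avoids numerical underflow on long sequences.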
Several approaches have previously been taken for identifying document image skew. At issue are efficiency, accuracy, and robustness. We work directly with the image, maximizing a function of the number of ON pixels in a scanline. Image rotation is simulated by either vertical shear or accumulation of pixel counts along sloped lines. Pixel sum differences on adjacent scanlines reduce isotropic background noise from non-text regions. To find the skew angle, this function is evaluated at a succession of angles. Angles are chosen hierarchically, typically with both a coarse sweep and a fine angular bifurcation. To increase efficiency, measurements are made on subsampled images that have been pre-filtered to maximize sensitivity to image skew. Results are given for a large set of images, including multiple and unaligned text columns, graphics and large area halftones. The measured intrinsic angular error is inversely proportional to the number of sampling points on a scanline. This method does not indicate when text is upside-down, and it also requires sampling the function at 90 degrees of rotation to measure text skew in landscape mode. However, such text orientation can be determined (as one of four directions) by noting that roman characters in all languages have many more ascenders than descenders, and using morphological operations to identify such pixels. Only a small amount of text is required for accurate statistical determination of orientation, and images without text are identified as such.
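The skew search described above might be sketched as follows. This is an illustrative reading of the abstract, not the authors' implementation: the score (sum of squared differences of ON-pixel counts on adjacent scanlines), the function names `skew_score` and `find_skew`, and all default parameters are assumptions. The subsampling and pre-filtering steps are omitted.

```python
import numpy as np

def skew_score(img, angle_deg):
    """Score a candidate skew angle on a binary image (1 = ON pixel).

    Rotation is approximated by a vertical shear: column x is shifted by
    x * tan(angle). The score is the sum of squared differences between
    ON-pixel counts on adjacent sheared scanlines (an assumed choice).
    """
    h, w = img.shape
    shifts = np.round(np.arange(w) * np.tan(np.radians(angle_deg))).astype(int)
    rows = np.arange(h)[:, None] - shifts[None, :]   # sheared row of each pixel
    valid = (rows >= 0) & (rows < h)                 # drop pixels sheared off the image
    sums = np.bincount(rows[valid], weights=img[valid].astype(float), minlength=h)
    return float(np.sum(np.diff(sums) ** 2))

def find_skew(img, max_angle=5.0, coarse_step=1.0, tol=0.01):
    """Coarse sweep followed by repeated halving of the angular step."""
    coarse = np.arange(-max_angle, max_angle + 1e-9, coarse_step)
    best = max(coarse, key=lambda a: skew_score(img, a))
    step = coarse_step / 2
    while step >= tol:
        candidates = (best - step, best, best + step)
        best = max(candidates, key=lambda a: skew_score(img, a))
        step /= 2
    return float(best)
```

With this convention, a positive angle means text lines descend toward the right in array coordinates (row index grows with column index). The score peaks when the shear lines up the text baselines, because aligned text produces sharp jumps in the scanline sums, while broad non-text regions change slowly from one scanline to the next and contribute little.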
This paper describes a communication theory approach to document image recognition, patterned after the use of hidden Markov models in speech recognition. A document recognition problem is viewed as consisting of three elements: an image generator, a noisy channel and an image decoder. A document image generator is a Markov source which combines a message source with an imager. The message source produces a string of symbols which contains the information to be transmitted. The imager is modeled as a finite-state transducer which converts the message into an ideal bitmap. The channel transforms the ideal image into a noisy observed image. The decoder estimates the message from the observed image by finding the a posteriori most probable path through the combined source and channel models using a Viterbi-like algorithm. Application of the proposed method to decoding telephone yellow pages is described.
An approach to supervised training of character templates from page images and unaligned transcriptions is proposed. The template training problem is formulated as one of constrained maximum likelihood parameter estimation within the document image decoding framework. This leads to a three-phase iterative training algorithm consisting of transcription alignment, aligned template estimation (ATE) and channel estimation steps. The maximum likelihood ATE problem is shown to be NP-complete and thus an approximate solution approach is developed. An evaluation of the training procedure in a document-specific decoding task using the Univ. of Washington UW-II database of scanned technical journal articles is described.