This paper proposes a document segmentation, classification and recognition system for automatically reading office documents received daily that have complex layout structures, such as multiple columns and mixed-mode contents of text, graphics and half-tone pictures. First, the block segmentation employs a two-step run-length smoothing algorithm to decompose any document into single-mode blocks. Next, based on clustering rules, the block classification assigns each block to one of the following classes: text, horizontal or vertical lines, graphics, and pictures. Each text block is separated into isolated characters using projection profiles, and the characters are translated into ASCII codes by a font- and size-independent character recognition subsystem. Logo pictures are discriminated from half-tone pictures, identified, and converted into symbolic words. The experimental results show that the proposed system is capable of correctly reading different styles of mixed-mode printed documents.
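To illustrate the block segmentation step, the following is a minimal sketch of the classic run-length smoothing idea: short white runs between black pixels are filled along rows and along columns, and the two results are combined with a logical AND. The abstract does not specify the details of the two-step variant, so the function names `smooth_runs` and `rlsa`, the thresholds, and the AND combination are assumptions made here for illustration and may differ from the system described.

```python
def smooth_runs(row, threshold):
    """Fill white runs (0s) of length <= threshold that lie between two black pixels (1s)."""
    out = list(row)
    last_black = None  # index of the most recent black pixel seen
    for i, pixel in enumerate(row):
        if pixel == 1:
            if last_black is not None:
                gap = i - last_black - 1
                if 0 < gap <= threshold:
                    for j in range(last_black + 1, i):
                        out[j] = 1
            last_black = i
    return out


def rlsa(image, h_threshold, v_threshold):
    """Assumed combination: horizontal smoothing AND vertical smoothing."""
    horizontal = [smooth_runs(row, h_threshold) for row in image]
    # Smooth columns by transposing, smoothing rows, and transposing back.
    columns = [list(col) for col in zip(*image)]
    vertical_t = [smooth_runs(col, v_threshold) for col in columns]
    vertical = [list(row) for row in zip(*vertical_t)]
    return [[h & v for h, v in zip(h_row, v_row)]
            for h_row, v_row in zip(horizontal, vertical)]


if __name__ == "__main__":
    # A gap of 2 is filled at threshold 2; the gap of 4 is left open.
    print(smooth_runs([1, 0, 0, 1, 0, 0, 0, 0, 1], 2))
```

In a sketch like this, connected regions of the smoothed image would then serve as candidate single-mode blocks for the later classification step.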