Abstract. In the area of Information Retrieval, the task of automatic text summarization usually assumes a static underlying collection of documents, disregarding the temporal dimension of each document. However, in real-world settings, collections and individual documents rarely stay unchanged over time. The World Wide Web is a prime example of a collection where information changes both frequently and significantly over time, with documents being added, modified or deleted at different times. In this context, previous work addressing the summarization of web documents has simply discarded the dynamic nature of the web, considering only the latest published version of each individual document. This paper proposes and addresses a new challenge: the automatic summarization of changes in dynamic text collections. In standard text summarization, retrieval techniques present a summary to the user by capturing, in condensed form, the major points expressed in the most recent version of an entire document. In this new task, the goal is to obtain a summary that describes the most significant changes made to a document during a given period. In other words, the idea is to have a summary of the revisions made to a document over a specific period of time. This paper proposes different approaches to generate summaries using extractive summarization techniques. First, individual terms are scored, and then this information is used to rank and select sentences to produce the final summary. A system based on the Latent Dirichlet Allocation (LDA) model is used to find the hidden topic structures of changes. The purpose of using the LDA model is to identify separate topics where the changed terms from each topic are likely to carry at least one significant change. The different approaches are then compared with the previous work in this area. A collection of articles from Wikipedia, including their revision history, is used to evaluate the proposed system. 
For each article, a temporal interval and a reference summary from the article's content are selected manually. The articles and intervals are carefully chosen so that a significant event occurred in each interval. The summaries produced by each of the approaches are compared against the manual summaries using ROUGE metrics. It is observed that the approach using the LDA model outperforms all the other approaches. Statistical tests reveal that the differences in ROUGE scores between the LDA-based approach and the baseline are statistically significant at the 99% level.
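The two-step extractive idea described above (score individual terms, then rank and select sentences) can be illustrated with a minimal sketch. This is not the method evaluated in the paper: the term score used here (absolute change in term frequency between two revisions), the length-normalized sentence score, and all function names are assumptions chosen only for illustration, and no LDA topic modeling is involved.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def score_changed_terms(old_text, new_text):
    """Assumed term score: absolute change in frequency between the two revisions."""
    old_counts = Counter(tokenize(old_text))
    new_counts = Counter(tokenize(new_text))
    scores = {}
    for term in set(old_counts) | set(new_counts):
        delta = abs(new_counts[term] - old_counts[term])
        if delta:
            scores[term] = delta
    return scores


def summarize_changes(old_text, new_text, max_sentences=1):
    """Rank sentences that are new in the later revision by their average term score."""
    term_scores = score_changed_terms(old_text, new_text)
    old_sentences = set(split_sentences(old_text))
    candidates = [s for s in split_sentences(new_text) if s not in old_sentences]

    def sentence_score(sentence):
        tokens = tokenize(sentence)
        return sum(term_scores.get(t, 0) for t in tokens) / max(len(tokens), 1)

    ranked = sorted(candidates, key=sentence_score, reverse=True)
    return ranked[:max_sentences]


if __name__ == "__main__":
    old = "The bridge opened in 1990. It carries road traffic."
    new = old + " A storm damaged the bridge in 2012."
    print(summarize_changes(old, new))
```

In this sketch only sentences that appear in the later revision but not in the earlier one are candidates, so deletions are reflected in the term scores but never surface as summary sentences. A fuller treatment would need to decide how removed content should be reported.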
Abstract. In this paper, we explore the utility of Local Binary Pattern (LBP) descriptors and a variance measure for developing efficient techniques to segment a large collection of historical machine-printed document pages. The result of segmentation will help us to organize the document pages in a structural format, which is useful in many applications such as historical document access. In our experiments, three basic reference models, namely background, text and image models, are used to segment various non-text information together with the text. The method is tested on an archive of Portuguese historical documents and shows promising results.
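As a rough illustration of the two features named above, the following sketch computes the basic 8-neighbour LBP code of a pixel and the variance of its 8 neighbours on a grayscale image stored as a list of lists. The neighbour ordering, the `>=` threshold convention, and the function names are assumptions made for this example; the abstract does not specify which LBP variant or parameters the authors use.

```python
# Offsets of the 8 neighbours, clockwise starting at the top-left pixel.
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                     (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image, r, c):
    """Basic LBP: set bit k when the k-th neighbour is >= the centre pixel."""
    center = image[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(NEIGHBOUR_OFFSETS):
        if image[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def neighbour_variance(image, r, c):
    """Variance of the 8 neighbours, a simple local contrast measure."""
    values = [image[r + dr][c + dc] for dr, dc in NEIGHBOUR_OFFSETS]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels of a region."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            hist[lbp_code(image, r, c)] += 1
    return hist


if __name__ == "__main__":
    patch = [[9, 0, 0],
             [0, 5, 0],
             [0, 0, 0]]
    print(lbp_code(patch, 1, 1), neighbour_variance(patch, 1, 1))
```

One plausible use, stated here as an assumption rather than as the authors' procedure, is to compare the LBP histogram and variance of each page block against histograms collected for the background, text and image reference models and assign the block to the closest model.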
This paper presents a recognition system for isolated handwritten Bangla words, with a fixed lexicon, using a Hidden Markov Model (HMM). A stochastic search method, namely a Genetic Algorithm (GA), is used to train the HMM, which has a left-right topology. For feature extraction, the image boundary is traced in both the anticlockwise and clockwise directions, and the significant changes in direction along the boundary are noted. Certain features defined on the basis of these changes are used in the proposed model.
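To make the left-right constraint concrete, the sketch below builds a transition matrix in which each state can only stay where it is or move to the next state, and evaluates a discrete observation sequence with the forward algorithm. The state count, the `stay_prob` parameter, the discrete emission table, and the function names are illustrative assumptions; the abstract does not give the model's actual parameters, and the GA training step is not shown.

```python
def left_right_transitions(n_states, stay_prob=0.6):
    """Transition matrix where state i may only stay in i or advance to i + 1."""
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        if i == n_states - 1:
            A[i][i] = 1.0  # the last state can only loop on itself
        else:
            A[i][i] = stay_prob
            A[i][i + 1] = 1.0 - stay_prob
    return A


def forward_likelihood(A, B, observations):
    """Forward algorithm: P(observations | model), starting in state 0.

    B[j][o] is the probability that state j emits discrete symbol o.
    """
    n = len(A)
    alpha = [0.0] * n
    alpha[0] = B[0][observations[0]]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs]
            for j in range(n)
        ]
    return sum(alpha)


if __name__ == "__main__":
    A = left_right_transitions(2, stay_prob=0.5)
    B = [[1.0, 0.0],   # state 0 always emits symbol 0
         [0.0, 1.0]]   # state 1 always emits symbol 1
    print(forward_likelihood(A, B, [0, 1]))
```

A likelihood of this kind is a natural candidate for the fitness function of a GA that searches over the entries of `A` and `B`, although whether the paper uses exactly this fitness is not stated in the abstract.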
Information Retrieval is the Informatics field primarily focused on the problems and challenges related to information storage and access. The large majority of works in this area are based on static collections of documents. However, many of these collections are dynamic and have evolved over time, with documents being added, edited or simply removed at different times. Even in highly dynamic environments such as the World Wide Web, research tends to be centered on the most recent version of the documents, and all past information is normally discarded. Recognizing these changes in dynamic text collections and exploiting them for document retrieval and presentation purposes introduces new and relevant research challenges. This paper addresses an opportunity that gains relevance in this context: the summarization of changes in dynamic text collections. We first define the problem as producing a summary that describes textual changes to an entire document or a set of related documents over a user-defined time period. We then present an extensive overview of relevant approaches from the literature that address similar problems, and close with a discussion that includes future directions.