Assigning a new, unclassified document to the correct category (class) is an interesting and difficult problem with a wide range of applications. Our methodology for narrative text classification is based on two techniques: we calculate the distance (similarity) between the new document and all the pre-classified documents of each class, and we also calculate the similarity of the new document to the ‘average class document’ of each class. In both cases we use key phrases (text phrases or key terms) as the distinctive features, so the proposed method ultimately rests on the automatic extraction of an authority list of key phrases that is appropriate for discriminating between classes. In this paper, we apply this methodology to Greek text and present the key concepts, the algorithms, and some critical decisions. A number of parameters of the mining algorithm are also fine-tuned. The actual text classification system, the adopted (embedded) ideas, and the alternative parameter values are evaluated using two training sets (test collections).
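The two techniques named in this abstract can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, the weighting of key phrases, or how per-document similarities are combined, so the choices below (raw key-phrase counts, cosine similarity, and the mean over a class's documents) are assumptions made only for illustration, as are all function names and the tiny English example data.

```python
import math
from collections import Counter


def phrase_vector(text, key_phrases):
    """Count how often each key phrase occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return Counter({p: lowered.count(p) for p in key_phrases if p in lowered})


def cosine(a, b):
    """Cosine similarity between two sparse vectors (assumed measure)."""
    dot = sum(value * b.get(key, 0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """The 'average class document': the mean of a class's phrase vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {key: count / len(vectors) for key, count in total.items()}


def classify_by_documents(new_vec, classes):
    """Technique 1: compare against every pre-classified document of each class.

    Assumption: per-class score is the mean similarity over its documents.
    """
    scores = {
        label: sum(cosine(new_vec, doc) for doc in docs) / len(docs)
        for label, docs in classes.items()
    }
    return max(scores, key=scores.get)


def classify_by_centroid(new_vec, classes):
    """Technique 2: compare against the average class document of each class."""
    scores = {label: cosine(new_vec, centroid(docs)) for label, docs in classes.items()}
    return max(scores, key=scores.get)


# Hypothetical authority list and training data, for illustration only.
KEY_PHRASES = ["stock market", "interest rate", "penalty kick", "league title"]
TRAINING = {
    "finance": [
        "The stock market fell as the interest rate rose",
        "Interest rate cuts lifted the stock market",
    ],
    "sports": [
        "A penalty kick decided the league title",
        "The league title race had another penalty kick",
    ],
}
CLASSES = {
    label: [phrase_vector(text, KEY_PHRASES) for text in texts]
    for label, texts in TRAINING.items()
}
NEW_DOC = phrase_vector(
    "Analysts expect the interest rate to move the stock market", KEY_PHRASES
)
```

Calling `classify_by_documents(NEW_DOC, CLASSES)` and `classify_by_centroid(NEW_DOC, CLASSES)` shows how the same key-phrase vectors feed both techniques; only the reference point inside each class differs.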
In this paper we focus on the design and implementation of low-cost, cross-language, and cross-platform information retrieval and documentation tools suitable for the collection, organization, and administration of unstructured and semi-structured information imported from various sources. A modular Computer-Assisted Information Resources Navigation (CAIRN) software architecture is proposed and the requirements of each module are presented. The discussion of the implementation is based on experiments conducted with a prototype of such a software tool. The technologies incorporated into modern operating systems, and the opportunities they offer for implementing the modules of the CAIRN architecture, are also examined and evaluated. Some of these technologies are common to, or independent of, the operating systems, while others are specific to a single one. In the latter case we face barriers (restrictions) to a straightforward implementation of CAIRN software systems across the whole range of desktop operating systems (e.g. Windows, Mac OS, Linux, Solaris). Some alternative technologies are presented to avoid this serious constraint. The implementation effort is also evaluated, and finally some conclusions and future plans for further improvement of the CAIRN architecture are given.
Text classification is a hard problem with various aspects and potential solutions. In this paper, two main research directions for the classification of narrative documents are considered. The first is based on data mining and rule induction techniques, while the second combines traditional Text Retrieval techniques (use of the vector space model, index terms, and similarity measures), Natural Language Processing, and Instance-based Learning techniques. Key-phrases can be used as attributes for mining rules or as a basis for measuring the similarity of new (unclassified) documents to existing (classified) ones. Hence, we ultimately focus on the problem of extracting key-phrases from text collections in order to use them as attributes for text classification. A new algorithm for the discovery of key-phrases is described. Candidate key-phrases are built from smaller frequent ones, and special emphasis is given to reducing the complexity of the algorithm.
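The idea of building candidate key-phrases from smaller frequent ones can be sketched in a level-wise, Apriori-style form. This is not the algorithm described in the paper, whose details the abstract does not give; it is a minimal illustration under stated assumptions: whitespace tokenization, support measured as the number of documents containing a phrase, and hypothetical names and thresholds (`frequent_phrases`, `min_support`, `max_len`).

```python
from collections import Counter


def frequent_phrases(docs, min_support=2, max_len=3):
    """Level-wise discovery of frequent word sequences.

    A phrase of n words is counted only if both of its (n-1)-word
    sub-phrases (prefix and suffix) were frequent at the previous level,
    which prunes most candidates before any counting happens.
    Support = number of documents containing the phrase (an assumption).
    """
    tokenized = [doc.lower().split() for doc in docs]
    frequent = {}
    previous = None  # frequent phrases of length n - 1

    for n in range(1, max_len + 1):
        counts = Counter()
        for tokens in tokenized:
            seen = set()  # count each phrase once per document
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                if previous is not None and (
                    gram[:-1] not in previous or gram[1:] not in previous
                ):
                    continue  # pruned: a sub-phrase is not frequent
                seen.add(gram)
            counts.update(seen)

        current = {g: c for g, c in counts.items() if c >= min_support}
        if not current:
            break  # no longer phrase can be frequent
        frequent.update(current)
        previous = current

    return frequent


# Hypothetical mini-collection, for illustration only.
DOCS = [
    "text classification with key phrases",
    "key phrases improve text classification",
    "rule induction for text mining",
]
RESULT = frequent_phrases(DOCS, min_support=2, max_len=3)
```

The pruning step is where the reduction in complexity comes from in this sketch: a longer phrase is never counted unless its shorter parts already passed the support threshold, so the number of candidates shrinks at every level.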