Abstract: The principal aim of this paper is to review the main statistical methods for classifying documents that can be easily adapted to the context of Web document retrieval. After presenting the most popular classification methods, we define the most accurate indicators for assessing classifier performance: recall, precision, F-score, sensitivity, and specificity. We also describe how these indicators can be calculated in the context of Web documents.
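The indicators named above have standard confusion-matrix definitions; a minimal sketch (the TP/FP/FN/TN counts below are illustrative, not from the paper):

```python
# Standard definitions of the evaluation indicators from a confusion
# matrix: tp/fp/fn/tn = true/false positives and negatives.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):  # recall is the same quantity as sensitivity
    return tp / (tp + fn)

def f_score(p, r):
    # Harmonic mean of precision and recall
    return 2 * p * r / (p + r)

def specificity(tn, fp):
    return tn / (tn + fp)

# Illustrative counts for a classifier retrieving Web documents
tp, fp, fn, tn = 40, 10, 20, 30
p = precision(tp, fp)    # 0.8
r = recall(tp, fn)       # ~0.667
f = f_score(p, r)        # ~0.727
s = specificity(tn, fp)  # 0.75
```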
In this paper we propose, for an image compression system based on the Karhunen-Loeve Transform implemented by neural networks, taking into consideration the 8 square isometries of an image block. The proper isometry puts the 8*8 image block into a standard position before the block is applied as input to the neural network architecture. The standard position is defined by the variances of the block's four 4*4 sub-blocks (quad partition): it brings the sub-block with the greatest variance into a specific corner and the sub-block with the second greatest variance into a specific adjoining corner (if this is not possible, the third is considered). This "preprocessing" phase was expected to improve the learning and representation ability of the network and, therefore, the compression results. Experimental results have confirmed these expectations, so the isometries are, from now on, worth taking into consideration.
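The normalization step can be sketched as follows. This is an assumption-laden illustration: the 8 isometries are the dihedral group of the square, and we arbitrarily fix "specific corner" as top-left and the adjoining corner as top-right (the paper fixes its own choice); a lexicographic maximum over (top-left variance, top-right variance) then realizes the fallback to the next sub-block when the second-greatest cannot be placed there.

```python
import numpy as np

def isometries(block):
    # The 8 square isometries: 4 rotations, plus their mirror images.
    rots = [np.rot90(block, k) for k in range(4)]
    return rots + [np.fliplr(r) for r in rots]

def quadrant_variances(block):
    # Variances of the four 4*4 sub-blocks of an 8*8 block:
    # (top-left, top-right, bottom-left, bottom-right).
    h = block.shape[0] // 2
    return (block[:h, :h].var(), block[:h, h:].var(),
            block[h:, :h].var(), block[h:, h:].var())

def standardize(block):
    # Pick the isometry that puts the greatest-variance sub-block
    # top-left and, among those candidates, the greatest possible
    # variance top-right (corner choice is our assumption).
    best_key, best_iso = None, None
    for iso in isometries(block):
        tl, tr, _, _ = quadrant_variances(iso)
        if best_key is None or (tl, tr) > best_key:
            best_key, best_iso = (tl, tr), iso
    return best_iso
```

The standardized block, rather than the raw one, would then be fed to the KLT network; the chosen isometry index must be stored so the decoder can invert it.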
Document clustering is the problem of automatically grouping similar documents into categories based on some similarity metric. Almost all available data, especially on the Web, are unclassified, so we need powerful clustering algorithms that work with such data. All common search engines return a list of pages relevant to the user query; this list must be generated quickly and as accurately as possible. Because the Web pages are unclassified, this again requires powerful clustering algorithms. In this paper we present a clustering algorithm called DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and its limitations for document (or Web page) clustering. Documents are represented using the "bag-of-words" representation (word occurrence frequencies), on which many algorithms usually fail. We use Information Gain as the feature selection method and evaluate the DBSCAN algorithm by its capacity to integrate into the clusters all the samples from the dataset.
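A minimal sketch of this pipeline, using scikit-learn in place of the paper's own implementation; the toy corpus and the eps/min_samples values are illustrative assumptions, and the Information Gain feature-selection step is omitted:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import DBSCAN

docs = [
    "web search engine query ranking",
    "search engine web page ranking",
    "neural network image compression",
    "image compression neural network transform",
]

# Bag-of-words: each document becomes a word-occurrence-frequency vector.
X = CountVectorizer().fit_transform(docs).toarray()

# DBSCAN groups density-reachable points; label -1 marks noise, i.e.
# samples the algorithm failed to integrate into any cluster.
labels = DBSCAN(eps=2.0, min_samples=2).fit_predict(X)

# The evaluation criterion from the abstract: the fraction of samples
# integrated into some cluster (non-noise points).
integrated = (labels != -1).mean()
```

On sparse, high-dimensional bag-of-words vectors the Euclidean eps threshold becomes hard to tune, which is one source of the limitations the paper examines; feature selection by Information Gain shrinks the dimensionality before clustering.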