Wikipedia-Based Document Categorization

Ciesielski, Krzysztof; Borkowski, Piotr; Kłopotek, Mieczysław A.; Trojanowski, Krzysztof; Wysocki, Kamil

doi:10.1007/978-3-642-25261-7_21

Cited by 8 publications

(8 citation statements)

References 4 publications

“…Several methods of automatic Polish text categorization and clustering have been implemented and examined over the past years. Ciesielski et al [2] presented a novel method of text categorization based on the Polish Wikipedia resources. Kuta and Kitowski [3] use clustering algorithms applied to two different corpora of the Polish language.…”

Section: Background and Related Work (mentioning)

confidence: 99%

Experiment on Methods for Clustering and Categorization of Polish Text

Wielgosz, Frączek, Russek et al. 2017, cai

The main goal of this work was to experimentally verify methods for the challenging task of categorizing and clustering Polish text. Supervised learning was employed for categorization and unsupervised learning for clustering. The methods were examined in depth on a custom-built corpus of Polish texts, which the authors assembled from Internet resources. The corpus data was acquired from a news portal and was therefore already sorted by type by journalists according to their specialization. The presented algorithms employ the Vector Space Model (VSM) and the TF-IDF (Term Frequency-Inverse Document Frequency) weighting scheme. A series of experiments revealed certain properties of the algorithms and their accuracy. Accuracy was evaluated in terms of the algorithms' ability to match the human arrangement of the documents by topic. For both categorization and clustering, the authors used the F-measure to assess the quality of allocation.
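
As a rough illustration of the TF-IDF weighting and the F-measure named in the abstract above, a minimal Python sketch follows. It is not the authors' implementation: the function names `tf_idf` and `f_measure`, the toy token lists, and the particular variants used (relative term frequency, natural-log IDF without smoothing, F1 for a single class) are assumptions chosen for brevity.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return weights


def f_measure(true_labels, predicted_labels, positive):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy corpus of already-tokenized Polish documents.
docs = [["sejm", "ustawa", "rząd"], ["mecz", "gol", "rząd"], ["mecz", "liga"]]
weights = tf_idf(docs)
```

Terms that occur in every document receive weight zero under this IDF variant, which is why practical systems often add smoothing.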

“…Here, of course, it is necessary to use a shallow analysis of natural language, identifying named entities and the use of appropriate semantic resources (lists of individuals or organizations or types of organizations that are trustworthy). Also one needs methods for appropriate classification of the content of the page [3] to match it against the list of experts. [8] proposes a number of methods for assessing the quality of Web pages edited by communities, in which the method of time series analysis of changes and of the list of readers / writers is exploited.…”

Section: Measuring Information Quality (mentioning)

confidence: 99%

“…Within the system NEKST the following types of semantic transformations have been implemented: -user suggestions [22], -substitution with synonyms, hypernyms, hyponyms and other related concepts, -concept disambiguation [3], -document categorization [3], -personalized PageRank [15], -cluster analysis and assignment of cluster keywords to documents [2], -explicit separation of document cluster and document search, -extraction of named entities and relations between them [23], -diversification of responses to queries, -dynamic summarizing [13], and -identification and classification of harmful contents.…”

Section: Measuring Utility (mentioning)

confidence: 99%

What is the Value of Information - Search Engine’s Point of View

Kłopotek 2013, Computer Information Systems and Industrial Management

Self Cite

Abstract. Within the domain of Information Retrieval, and in particular in the area of Web Search Engines, it has long been obvious that there is a deep discrepancy between how information is understood within computer science and by the man in the street. We want to give an overview of how this apparent gap can be closed using tools that are technologically available today. The key to success probably lies in approximating (by means of artificial intelligence) the way people judge the value of information.

“…Via this component the traditional notion of document similarity (based on angles between vectors in term space) is amended to include the concept of semantic similarity. The notion of semantic similarity, as used in this paper, was described in [1]. Both methods introduced in the paper are based on our SemCat (Semantic Categorizer) algorithm, that has also been introduced in [1].…”

Section: Introduction (mentioning)

confidence: 99%
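
The "traditional notion of document similarity (based on angles between vectors in term space)" mentioned in the statement above is usually computed as cosine similarity. The sketch below shows only that baseline, under the assumption that documents are represented as sparse `{term: weight}` dictionaries; the function name `cosine_similarity` and the sample vectors are hypothetical, and the semantic similarity of SemCat is not reproduced here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for empty vectors
    return dot / (norm_a * norm_b)


# Hypothetical term-weight vectors for two short documents.
doc_a = {"wikipedia": 0.5, "kategoria": 0.3}
doc_b = {"wikipedia": 0.4, "dokument": 0.6}
similarity = cosine_similarity(doc_a, doc_b)
```

Documents with no shared terms score zero under this measure even when they are about the same topic, which is the gap a semantic similarity component is meant to address.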

Semantic classifier approach to document classification

Borkowski, Ciesielski, Kłopotek 2017, Preprint

Self Cite

In this paper we propose a new document classification method, bridging discrepancies (the so-called semantic gap) between the training set and the application sets of textual data. We demonstrate its superiority over classical text classification approaches, including traditional classifier ensembles. The method consists in combining a document categorization technique with a single classifier or a classifier ensemble (the SemCom algorithm: Committee with Semantic Categorizer).
