2014
DOI: 10.4301/s1807-17752014000200011

Automated Text Clustering of Newspaper and Scientific Texts in Brazilian Portuguese: Analysis and Comparison of Methods

Abstract: This article reports the findings of an empirical study of Automated Text Clustering applied to scientific articles and newspaper texts in Brazilian Portuguese. The objective was to find the most effective computational method for clustering the input texts into their original groups. The study covered four experiments, each with four procedures: 1. Corpus Selections (a set of texts is selected for clustering), 2. Word Class Selections (Nouns, Verbs and Adjectives are chosen from each text by us…

Cited by 13 publications (10 citation statements)
References 16 publications
“…Furthermore, text-based approaches are considered superior to citation-based ones for document categorization [3]. The used approaches differ in three aspects: (1) text sections (i.e., abstract, keywords, full text), (2) objective (e.g., classification, recommendation, content extraction, clustering), and (3) used techniques (e.g., bag-of-words, vectorization, Bayesian classifier, topic models, keyword extraction) [1,14,20,41].…”
Section: Related Research (mentioning)
confidence: 99%
“…In [7] it is reported a study aimed at verifying whether an automated clustering process could create the correct clusters for two text corpuses: a scientific corpus having five knowledge fields (Pharmacy, Physical Education, Linguistics, Geography, and History) and a newspaper corpus having five knowledge fields (Human Sciences, Biological Sciences, Social Sciences, Religion and Thought, Exact Sciences). Therefore, the authors had two corpuses already classified by humans and they wanted to measure the effectiveness of the clustering process.…”
Section: Automated Text Classification (mentioning)
confidence: 99%
“…A numerical approach to calculate the fractal dimension of a time series is by counting the number of circles of a given fixed diameter that are needed to cover the entire time series [23]. That number is related to the diameter of the circle according to Equation (7).…”
mentioning
confidence: 99%
“…A density-based kmeans algorithm is suggested to improve the performance of DBSCAN and K-means algorithms. They utilized a dataset of 250 documents and observed that DBK-means has outperforms the k-means and DBSCAN algorithms [17]. Clustering algorithm founded on density and distance is also utilized, which calculates the distance and the density of every data points and combined those data objects which have minimum distance and highest density, using a decision graph [18].…”
Section: Related Work (mentioning)
confidence: 99%
“…Various studies regarding document clustering, exploiting English language documents as input have been presented [16]. However, each language can generate distinct levels of exactness, depending on each natural language shapes and characteristics, like morphological and syntax peculiarities, use of antonyms and synonyms, and utilization of native expressions etc [17,18]. Structure of this paper is organized as: section 2 highlights the importance and challenges of Urdu, section 3 describes Atta Ur Rahman et al…”
Section: Introduction (mentioning)
confidence: 99%