Automated Text Clustering of Newspaper and Scientific Texts in Brazilian Portuguese: Analysis and Comparison of Methods

Afonso, Alexandre Ribeiro; Duque, Cláudio Gottschalg

doi:10.4301/s1807-17752014000200011

Cited by 12 publications

(9 citation statements)

References 13 publications

“…Furthermore, text-based approaches are considered superior to citation-based ones for document categorization [3]. The used approaches differ in three aspects: (1) text sections (i.e., abstract, keywords, full text), (2) objective (e.g., classification, recommendation, content extraction, clustering), and (3) used techniques (e.g., bag-of-words, vectorization, Bayesian classifier, topic models, keyword extraction) [1,14,20,41].…”

Section: Related Research (mentioning)

confidence: 99%

Towards an Integrative Approach for Automated Literature Reviews Using Machine Learning

Tauchert, Bender, Mesbah et al. 2020

Proceedings of the Annual Hawaii International Conference on System Sciences

Due to a huge amount of scientific publications which are mostly stored as unstructured data, complexity and workload of the fundamental process of literature reviews increase constantly. Based on previous literature, we develop an artifact that partially automates the literature review process from collecting articles up to their evaluation. This artifact uses a custom crawler, the word2vec algorithm, LDA topic modeling, rapid automatic keyword extraction, and agglomerative hierarchical clustering to enable the automatic acquisition, processing, and clustering of relevant literature and subsequent graphical presentation of the results using illustrations such as dendrograms. Moreover, the artifact provides information on which topics each cluster addresses and which keywords they contain. We evaluate our artifact based on an exemplary set of 308 publications. Our findings indicate that the developed artifact delivers better results than known previous approaches and can be a helpful tool to support researchers in conducting literature reviews.

“…In [7] it is reported a study aimed at verifying whether an automated clustering process could create the correct clusters for two text corpuses: a scientific corpus having five knowledge fields (Pharmacy, Physical Education, Linguistics, Geography, and History) and a newspaper corpus having five knowledge fields (Human Sciences, Biological Sciences, Social Sciences, Religion and Thought, Exact Sciences). Therefore, the authors had two corpuses already classified by humans and they wanted to measure the effectiveness of the clustering process.…”

Section: Automated Text Classification (mentioning)

confidence: 99%

“…A numerical approach to calculate the fractal dimension of a time series is by counting the number of circles of a given fixed diameter that are needed to cover the entire time series [23]. That number is related to the diameter of the circle according to Equation (7).…”

mentioning

confidence: 99%

Written Documents Analyzed as Nature-Inspired Processes: Persistence, Anti-Persistence, and Random Walks—We Remember, as Along Came Writing—T. Holopainen

López-Ortega, Pérez-Cortés, Castillejos-Fernández et al. 2020

Applied Sciences

Written communication is pivotal for societies to develop. However, lexicon and depth of information vary greatly among texts according to their purpose. Scientific texts, diffusion of science reports, general and area-specific news are all written differently. Thus, we explore the characterization of different text categories through a nature-inspired feature known as the Hurst parameter. We contend that the Hurst exponent is useful to unveil the rhetorical structure within written documents. We collected and processed texts in five categories: scientific articles, diffusion of science reports, business news, entertainment news, and random texts. Each category contains 350 documents. We found that the median for scientific texts has the highest value of the Hurst parameter (0.575), followed by business news (0.54); the median for randomly-generated texts is 0.48, which lies in the region associated with random walks. The median value for diffusion texts is 0.49, and for entertainment texts is 0.53. However, these two categories present high dispersion. We conclude that the Hurst parameter is a measure that quantifies the structure of communication in the selected categories of texts. Application of our finding in the field of e-research is discussed.

“…A density-based kmeans algorithm is suggested to improve the performance of DBSCAN and K-means algorithms. They utilized a dataset of 250 documents and observed that DBK-means has outperforms the k-means and DBSCAN algorithms [17]. Clustering algorithm founded on density and distance is also utilized, which calculates the distance and the density of every data points and combined those data objects which have minimum distance and highest density, using a decision graph [18].…”

Section: Related Work (mentioning)

confidence: 99%

“…Various studies regarding document clustering, exploiting English language documents as input have been presented [16]. However, each language can generate distinct levels of exactness, depending on each natural language shapes and characteristics, like morphological and syntax peculiarities, use of antonyms and synonyms, and utilization of native expressions etc [17,18]. Structure of this paper is organized as: section 2 highlights the importance and challenges of Urdu, section 3 describes Atta Ur Rahman et al…”

Section: Introduction (mentioning)

confidence: 99%

Unsupervised Machine Learning based Documents Clustering in Urdu

Rahman, Khan et al. 2018

ICST Transactions on Scalable Information Systems

The volume of data on the web is growing rapidly, due to the proliferation of news sources, content, blogs, journals, etc. Like other languages, the Urdu language has also seen tremendous growth on the internet. As the volume of data expands, information retrieval (IR) is becoming more complicated. Document clustering is an unsupervised ML approach employed to group a huge number of dispersed documents into a small number of significant and consistent clusters, thus providing a base for indexing, IR, and browsing mechanisms. Document clustering has a long tradition in English and in English-like Western languages, but Urdu lags behind in terms of sophisticated natural language processing (NLP) tools and resources for document clustering. Document clustering is a challenging task in Urdu, a language with a rich morphology, particular structure, syntax peculiarities, and cursive nature. In this study, we have developed a framework for document clustering and analysed various similarity measures for Urdu documents. We have also checked the effect of stop-word removal on Urdu document clustering.
