Incremental Document Clustering Based on Graph Model

Nguyen-Hoang, Tu-Anh; Hoang, Kiem; Bui-Thi, Danh; Nguyen, Anh-Thy

doi:10.1007/978-3-642-03348-3_58

Cited by 6 publications

(3 citation statements)

References 5 publications

“…In this paper we extend our initial results [20] and present an incremental Vietnamese document clustering approach that overcomes not only the limitations associated with vector space…”

Section: Introduction (mentioning)

confidence: 79%

Efficient approach for incremental Vietnamese document clustering

Hoang

2009

Proceedings of the Eleventh International Workshop on Web Information and Data Management

Self Cite

In this paper, we present how to use a graph model for clustering Vietnamese documents incrementally. A graph-based model allows us to model completely the structure of not only each document but also the whole collection of documents. The graph structure is easily updated when a new document arrives. When building the graph incrementally, we can identify representative subgraph features, which are later used for calculating hybrid pair-wise document similarity. These subgraph features make the clustering process less sensitive to the Vietnamese word segmentation step. Based on the hybrid similarity measure, the documents are grouped into clusters on the fly, without any assumptions on the number of clusters and without retrieving previous documents.
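The on-the-fly grouping described in this abstract can be illustrated with a minimal single-pass sketch. It is not the authors' graph-based method: word bigrams stand in for the subgraph (phrase) features, the hybrid score is assumed to be a weighted blend of phrase and term cosine similarity, and the names `IncrementalClusterer`, `threshold`, and `alpha` are hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def bigrams(tokens: list) -> Counter:
    """Word bigrams, used here as a stand-in for shared-phrase features."""
    return Counter(zip(tokens, tokens[1:]))


class IncrementalClusterer:
    """Single-pass clustering; the number of clusters is not fixed in advance."""

    def __init__(self, threshold: float = 0.3, alpha: float = 0.5):
        self.threshold = threshold  # assumed cut-off for joining a cluster
        self.alpha = alpha          # assumed weight of the phrase component
        # Each cluster keeps only aggregate counts, so earlier documents
        # never need to be retrieved again.
        self.clusters = []          # list of (term_counts, bigram_counts)

    def add(self, tokens: list) -> int:
        """Assign one new document and return its cluster index."""
        terms, phrases = Counter(tokens), bigrams(tokens)
        best_idx, best_sim = -1, -1.0
        for i, (c_terms, c_phrases) in enumerate(self.clusters):
            sim = (self.alpha * cosine(phrases, c_phrases)
                   + (1 - self.alpha) * cosine(terms, c_terms))
            if sim > best_sim:
                best_idx, best_sim = i, sim
        if best_idx >= 0 and best_sim >= self.threshold:
            self.clusters[best_idx][0].update(terms)
            self.clusters[best_idx][1].update(phrases)
            return best_idx
        self.clusters.append((terms, phrases))
        return len(self.clusters) - 1


clusterer = IncrementalClusterer(threshold=0.3)
first = clusterer.add("graph model document clustering".split())
second = clusterer.add("graph model document clustering incremental".split())
third = clusterer.add("fuzzy possibilistic means noise".split())
```

In this sketch the first two documents share most terms and bigrams and fall into the same cluster, while the third shares nothing with them and opens a new one.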

“…The phrase-based similarity is calculated from the shared phrases and using cosine and phrase, the hybrid similarity is calculated. In paper (Bakr et al, 2012; Hammouda & Kamel, 2004b; Momin et al, 2006; Nguyen-Hoang et al, 2009) authors have also used hybrid similarity measure to cluster the documents. K-Nearest Neighbour, Single-pass, and Hierarchical Agglomerative clustering algorithm are applied to the dataset using the hybrid similarity as a distance measure, and the quality of clusters is evaluated (Hammouda & Kamel, 2002, 2004a.…”

Section: Document Representation Models (mentioning)

confidence: 99%

Trend analysis and forecasting of publication activities by Indian computer science researchers during the period of 2010–23

Kathiria

Arolkar

2022

Expert Systems

Huge collections of published research documents are available in various repositories in Indian universities and research organizations. Efficient retrieval, estimation of research trends, and identification of research gaps relative to the global trend in different research areas can be a great guiding tool for steering research in the appropriate direction. This research attempts to analyse the trend of research activities carried out by Indian researchers in the discipline of Computer Science from 2010 to 2019 and to forecast the upcoming research trend for the years 2020-23. A repository of abstracts published by Indian researchers, organized by the domains given in the Computer Science Ontology (CSO), is developed from the Scopus database. The Document Index Graph document representation model is used to store the repository, shared phrases across the documents are extracted, and phrase-based similarity is computed. Combining the single-term and phrase-based similarity, the hybrid similarity is generated and similar documents are clustered using the DBSCAN clustering technique. Topics are identified for each cluster using the Latent Dirichlet Allocation algorithm and are automatically labelled using CSO. For each topic, the trend analysis and forecasting have been done using the Auto-Regressive Integrated Moving Average model. For the assessment of the forecasting performance, the dataset from 2010 to 2017 is used as the training dataset and 2018-19 as the testing dataset. The average forecasting for the year 2018 for all CSO domains belongs to the Good forecasting category with a Mean Absolute Percentage Error (MAPE) of 18.34, and 2019 shows Reasonable forecasting with a MAPE of 30.20, as per the MAPE interpretation given by Lewis. For each topic, the average forecasting for the years 2018-19 shows either Highly accurate, Good, or Reasonable forecasting. As a result, the top four domains for the years 2020-23 are also identified, which can help new researchers identify a relevant topic for research and exploration.
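The MAPE figures and category labels quoted in this abstract can be reproduced in spirit with a short sketch. The function names `mape` and `lewis_category` are hypothetical, and the cut-offs (below 10 Highly accurate, 10 to 20 Good, 20 to 50 Reasonable, above 50 Inaccurate) are the commonly cited Lewis scale, assumed here rather than taken from the paper.

```python
def mape(actual: list, forecast: list) -> float:
    """Mean Absolute Percentage Error, in percent."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


def lewis_category(mape_value: float) -> str:
    """Map a MAPE value to a label (assumed Lewis-scale cut-offs)."""
    if mape_value < 10:
        return "Highly accurate"
    if mape_value < 20:
        return "Good"
    if mape_value < 50:
        return "Reasonable"
    return "Inaccurate"


# The two averages reported in the abstract.
label_2018 = lewis_category(18.34)
label_2019 = lewis_category(30.20)

# Hypothetical publication counts, for illustration only.
example = mape([100, 200], [110, 180])
```

Under these assumed cut-offs, 18.34 falls in the Good band and 30.20 in the Reasonable band, which matches the labels given in the abstract.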

“…The sampling approaches (Aggarwal et al, 2009; Cheng et al, 1998; Guha et al, 1998; Kranen et al, 2011; Lee et al, 2009; Ng et al, 2002; Pal et al, 2002; Sakai et al, 2009; Yildizli et al, 2011) usually choose the samples by a certain rule such as chi-square or divergence hypothesis (Hathaway et al, 2006). The incremental approaches (Bradley et al, 1998; Farnstrom et al, 2000; Gupta et al, 2004; Karkkainen et al, 2007; Luhr et al, 2009; Nguyen-Hoang et al, 2009; Ning et al, 2009; O'Callaghan et al, 2002; Ramakrishnan et al, 1996; Siddiqui et al, 2009; Wan et al, 2010, 2011) generally maintain past knowledge from the previous runs of a clustering algorithm to produce or improve the future clustering model. Nevertheless, as Hore et al (2007) pointed out, many existing algorithms for large and very large data sets are used for the crisp case, rarely for the fuzzy case.…”

Section: Introduction (mentioning)

confidence: 99%

Weighted Fuzzy-Possibilistic C-Means Over Large Data Sets

Wan

Gao

2012

International Journal of Data Warehousing and Mining

Up to now, several algorithms for clustering large data sets have been presented. Most clustering approaches for such data sets are crisp ones, which are not well suited to the fuzzy case. In this paper, the authors explore a single-pass approach to fuzzy-possibilistic clustering over large data sets. The basic idea of the proposed approach (weighted fuzzy-possibilistic c-means, WFPCM) is to use a modified possibilistic c-means (PCM) algorithm to cluster the weighted data points and centroids with one data segment as a unit. Experimental results on both synthetic and real data sets show that WFPCM uses significantly less memory than the fuzzy c-means (FCM) algorithm and the possibilistic c-means (PCM) algorithm. Furthermore, the proposed algorithm is highly robust to noise, avoids splitting or merging the exact clusters into inaccurate clusters, and preserves the integrity and purity of the natural classes.
