Document representation methods for clustering bilingual documents

Ma, Shutian; Zhang, Chengzhi; He, Daqing

doi:10.1002/pra2.2016.14505301065

Cited by 10 publications

(13 citation statements)

References 49 publications


“…Experimental results show that the method performs better than tf-idf with/without stops words and word2vec with/without stop words. A major drawback in their work is that, stops words increase the dimensionality of the feature vectors which impacts badly on the classification accuracy and computational burden [8]. Also the classification algorithm used was a linear SVM, other kernels such as string and RBF kernels could produce better results [20].…”

Section: Related Work 2.1 Web Page Classification (mentioning)

confidence: 99%

“…To achieve high classification result of the Web Page Classification (WPC) system, an excellent representation of textual data (Preprocessing/DR) should contain as much information as possible from the original document [8]. Also, the accuracy of most classification algorithms depends on the quality and size of training data which is inherently dependent on the document representation technique [9].…”

Section: Introduction (mentioning)

confidence: 99%

“…Several researchers have contributed to the document representation stage of the web page classification system because irrelevant and redundant features often degrade the performance of the classification algorithms both in speed and classification accuracy and also its tendency to reduce overfitting [10]. Drawing from literature, state-of-the-art DR technique's used in WPC systems are: bag of words model, Term-Frequency Inverse Document Frequency (TF-IDF), Latent Semantic Indexing (LSI), Probabilistic Latent Semantic Indexing (PLSI), Latent Dirichlet Allocation (LDA), LSI and TF-IDF, N-Gram and TF-IDF, Word2Vec and TF-IDF, TF-IDF and firefly Algorithm, Word2Vec and LDA [8], [11], [12], [13], [14], [15]. Each of these technique are fraught with one challenge or the other such as semantic mismatch and multiple meanings of word and so on.…”

Section: Introduction (mentioning)

confidence: 99%


A Neural Network Language Document Representation Technique for Web-Page Classification

Quadri; Ajose-Ismail

2020

IJCA



“…In this paper, we therefore treat the topics as clusters, and apply the Silhouette Coefficient instead. This method has been previously used for finding the optimal number of topics (Panichella et al, 2013; Ma et al, 2016), and is suitable for our LDA approach, since LDA is fully unsupervised. Nevertheless, in future work, it may be worth evaluating some probability measures such as loglikelihood and perplexity, and comparing the performance using these methods.…”

Section: LDA Model (mentioning)

confidence: 99%

“…In the silhouette analysis (Ma et al, 2016), silhouette coefficients close to +1 indicate that the samples in the cluster are far away from the neighbouring clusters. In contrast, a negative silhouette coefficient means that the samples might have been assigned to the wrong cluster.…”

Section: LDA Model (mentioning)

confidence: 99%

Comparing Attitudes to Climate Change in the Media using sentiment analysis based on Latent Dirichlet Allocation

Jiang; Song; Harrison; et al.

2017

Proceedings of the 2017 EMNLP Workshop: Natural Language Processing Meets Journalism


News media typically present biased accounts of news stories, and different publications present different angles on the same event. In this research, we investigate how different publications differ in their approach to stories about climate change, by examining the sentiment and topics presented. To understand these attitudes, we find sentiment targets by combining Latent Dirichlet Allocation (LDA) with SentiWordNet, a general sentiment lexicon. Using LDA, we generate topics containing keywords which represent the sentiment targets, and then annotate the data using SentiWordNet before regrouping the articles based on topic similarity. Preliminary analysis identifies clearly different attitudes on the same issue presented in different news sources. Ongoing work is investigating how systematic these attitudes are between different publications, and how these may change over time.
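
The silhouette analysis mentioned in the citation statements for this paper can be illustrated with a minimal sketch. It is not the cited authors' implementation: the Euclidean distance, the function names, and the toy points are all assumptions chosen for illustration. For each sample, a is the mean distance to the other members of its own cluster and b is the lowest mean distance to any other cluster; the coefficient is (b - a) / max(a, b).

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length points (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def silhouette_samples(points, labels):
    """Return one silhouette coefficient per point."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Common convention: a singleton cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: lowest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return scores


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points."""
    samples = silhouette_samples(points, labels)
    return sum(samples) / len(samples)


# Hypothetical toy data: two well-separated groups of 2-D points.
toy_points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(silhouette_score(toy_points, [0, 0, 1, 1]))  # sensible assignment
print(silhouette_score(toy_points, [0, 1, 0, 1]))  # mixed-up assignment
```

When topics are treated as clusters, one plausible use is to compute this mean score for several candidate topic counts and prefer the count with the highest value.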


Document representation and clustering models for bilingual documents clustering

Zhang

2017

Proc. Assoc. Info. Sci. Tech.

Self Cite


The internet now hosts many documents in languages other than English, and people face challenges when seeking and using this information; for example, non-native English-speaking students tend to have problems when using libraries in North American universities. Bilingual document clustering can help people organize information efficiently: it divides documents into groups that share a topic, and it needs no training dataset. Document representation and clustering models are two important parts of clustering. This paper compares four popular representation methods, the vector space model (VSM), latent semantic indexing (LSI), latent Dirichlet allocation (LDA) and doc2vec (D2V), together with four types of clustering algorithms, K-means++, BIRCH, DBSCAN and affinity propagation (AP), to identify appropriate combinations for bilingual document clustering. Both parallel and comparable corpora are used as bilingual datasets. Experimental results show that clustering performance varies when different representation methods are combined with different clustering algorithms, so choosing the models well matters for document organization.
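
As a rough illustration of the vector space model (VSM) named in this abstract, the sketch below builds sparse term vectors and compares them with cosine similarity. The TF-IDF weighting, the whitespace tokenizer, the function names, and the sample sentences are all assumptions; the abstract does not state how the paper weights terms or aligns the two languages, and this monolingual toy does not attempt either.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each document to a sparse {term: weight} TF-IDF vector."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample documents, for illustration only.
sample_docs = [
    "topic model clustering",
    "clustering topic model documents",
    "football match score",
]
vecs = tfidf_vectors(sample_docs)
print(cosine(vecs[0], vecs[1]))  # documents sharing vocabulary
print(cosine(vecs[0], vecs[2]))  # documents with no shared terms
```

A clustering algorithm such as those listed in the abstract would then group documents using these vectors or the pairwise similarities derived from them.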
