Proceedings of the 2017 ACM on Conference on Information and Knowledge Management
DOI: 10.1145/3132847.3133104
Language Modeling by Clustering with Word Embeddings for Text Readability Assessment

Abstract: We present a clustering-based language model using word embeddings for text readability prediction. Presumably, a Euclidean semantic space hypothesis holds true for word embeddings whose training is done by observing word co-occurrences. We argue that clustering with word embeddings in the metric space should yield feature representations in a higher semantic space appropriate for text regression. Also, by representing features in terms of histograms, our approach can naturally address documents of varying lengths.
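
The core idea in the abstract lends itself to a short illustration. Below is a minimal sketch, assuming a pretrained skip-gram embedding matrix aligned with a `word_to_index` vocabulary (both placeholder names, not taken from the paper): words are clustered with k-means, and each document becomes a fixed-length histogram of its words' cluster memberships, which is what makes documents of varying lengths directly comparable.

```python
# Minimal sketch of the clustering-based features described in the abstract:
# cluster pretrained word embeddings with k-means, then represent each document
# as a normalized histogram of its words' cluster memberships.
# Function names and parameters are illustrative, not taken from the paper.
import numpy as np
from sklearn.cluster import KMeans

def fit_word_clusters(embedding_matrix, n_clusters=100, seed=0):
    """Cluster the rows of an (n_words, dim) word-embedding matrix."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    km.fit(embedding_matrix)
    return km  # km.labels_[i] is the cluster of the i-th vocabulary word

def document_histogram(doc_tokens, word_to_index, km):
    """Represent a tokenized document as a normalized cluster histogram."""
    counts = np.zeros(km.n_clusters)
    for token in doc_tokens:
        idx = word_to_index.get(token)
        if idx is not None:               # skip out-of-vocabulary words
            counts[km.labels_[idx]] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts
```

Because every document maps to the same fixed number of cluster bins, short and long texts end up in a common feature space suitable for the downstream regression the abstract describes.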

Cited by 32 publications (15 citation statements)
References 14 publications
“…Here, the colors represent the author of the text. We observe that clustering doc2vec embeddings has been used extensively in language analysis (see, e.g., [8]). (ii) victorian 5 .…”
Section: Datasets
confidence: 93%
“…The classification model used is the regularized neural network with one hidden layer. • SG-KM-SVM is a word embedding-based readability assessment method proposed by Cha et al (2017). In SG-KM-SVM, the representation of a document is generated by applying average pooling on the word embedding and cluster membership of all words in the document.…”
Section: Comparisons to the State-of-the-art Methods
confidence: 99%
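
The description of SG-KM-SVM above suggests the following rough sketch: each word contributes its embedding concatenated with a one-hot encoding of its cluster membership, the document vector is the average of these per-word vectors, and an SVM is trained on top. The helper names, the concatenation scheme, and the SVM settings are assumptions made for illustration, not details taken from Cha et al. (2017) or the citing paper.

```python
# Sketch of an SG-KM-SVM-style document representation: average-pool each
# word's embedding concatenated with a one-hot encoding of its k-means cluster,
# then feed the pooled vector to an SVM classifier.
import numpy as np
from sklearn.svm import SVC

def sg_km_features(doc_tokens, word_to_index, embedding_matrix,
                   cluster_labels, n_clusters):
    """Average pooling over [embedding ; one-hot cluster] for all words."""
    vecs = []
    for token in doc_tokens:
        idx = word_to_index.get(token)
        if idx is None:
            continue
        one_hot = np.zeros(n_clusters)
        one_hot[cluster_labels[idx]] = 1.0
        vecs.append(np.concatenate([embedding_matrix[idx], one_hot]))
    if not vecs:
        return np.zeros(embedding_matrix.shape[1] + n_clusters)
    return np.mean(vecs, axis=0)  # average pooling over the document

# Hypothetical usage, assuming docs, vocab, E, labels, and k exist:
# X = np.stack([sg_km_features(d, vocab, E, labels, k) for d in docs])
# clf = SVC(kernel="rbf").fit(X, readability_levels)
```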
“…Then, inspired by some works in Natural Language Processing (NLP), we first use the K-Means clustering algorithm to classify the words into groups according to pretrained word embeddings (Cha, Gwon, and Kung 2017). For example, verbs and nouns will be classified into two different groups.…”
Section: Methods
confidence: 99%
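
As a quick way to see the kind of grouping the quoted statement refers to (e.g., verbs and nouns landing in different clusters), one could list the vocabulary words closest to each k-means centroid. This assumes the fitted KMeans model and an `index_to_word` list as in the sketches above; it is an inspection aid, not part of either paper's method.

```python
# Inspect k-means clusters of pretrained word embeddings by listing the words
# nearest to each centroid. Assumes a fitted KMeans model and a vocabulary
# ordering consistent with the embedding matrix.
import numpy as np

def words_per_cluster(km, embedding_matrix, index_to_word, top_n=10):
    """For each cluster, return the words closest to its centroid."""
    summary = {}
    for c, centroid in enumerate(km.cluster_centers_):
        dists = np.linalg.norm(embedding_matrix - centroid, axis=1)
        nearest = np.argsort(dists)[:top_n]
        summary[c] = [index_to_word[i] for i in nearest]
    return summary
```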