Authorship Clustering using TF-IDF weighted Word-Embeddings

Agarwal, Lucky; Thakral, Kartik; Bhatt, Gaurav; Mittal, Ankush

doi:10.1145/3368567.3368572

Cited by 13 publications

(9 citation statements)

References 4 publications


“…They showed that the former was beneficial for multi-topic texts but it was also more computationally demanding without achieving substantially better performance [20]. Agarwal et al utilized word embedding with tf-idf weights and employed hierarchical clustering algorithms to perform authorship clustering [1]. Kocher and Savoy adopted a simple set of features of the most frequent terms (words and punctuation) to represent the authorship and writing styles [14].…”

Section: Related Work (mentioning)

confidence: 99%
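
The approach summarized in the statement above (word embeddings weighted by TF-IDF, followed by hierarchical clustering) can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration and is not taken from Agarwal et al.: the two-dimensional toy embeddings, the smoothed IDF formula, the cosine distance, the average linkage, and all function names are hypothetical choices.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {token: tf-idf} dict per tokenized document.
    Assumption: smoothed IDF, idf = ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            tok: (count / len(doc)) * (math.log((1 + n) / (1 + df[tok])) + 1.0)
            for tok, count in tf.items()
        })
    return weights

def doc_vector(weights, embeddings, dim):
    """TF-IDF weighted average of the word vectors of one document."""
    vec = [0.0] * dim
    total = 0.0
    for tok, w in weights.items():
        emb = embeddings.get(tok)
        if emb is None:  # skip out-of-vocabulary tokens
            continue
        for i in range(dim):
            vec[i] += w * emb[i]
        total += w
    return [v / total for v in vec] if total else vec

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return 1.0 - dot / (na * nb)

def agglomerative(vectors, n_clusters):
    """Naive average-linkage agglomerative clustering (assumed linkage)."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between the two clusters.
                d = sum(cosine_distance(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical toy embeddings; a real setup would load pretrained vectors.
embeddings = {
    "sea": [1.0, 0.0], "ship": [0.9, 0.1], "storm": [0.8, 0.2],
    "court": [0.0, 1.0], "judge": [0.1, 0.9], "law": [0.2, 0.8],
}
docs = [
    ["sea", "ship", "storm"],
    ["ship", "sea", "sea"],
    ["court", "judge", "law"],
    ["judge", "law", "law"],
]

weights = tfidf_weights(docs)
vectors = [doc_vector(w, embeddings, dim=2) for w in weights]
clusters = agglomerative(vectors, n_clusters=2)
print(clusters)  # [[0, 1], [2, 3]]
```

In this toy data the documents separate by vocabulary, which is enough to show the mechanics; whether such representations capture authorial style rather than topic is the question the citing papers discuss. The number of clusters is passed in here, whereas an authorship clustering setting would typically have to estimate it.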

“…Many authorial clustering approaches invest on advanced machine learning methods, like recurrent neural networks [4], word embedding [1] and sophisticated document representations [4,10,20] with thousands of dimensions. High-dimensional feature spaces, however, tend to get sparser as the texts get shorter and suffer from consequences like the curse of dimensionality.…”

Section: Introduction (mentioning)

confidence: 99%


A Framework for Authorial Clustering of Shorter Texts in Latent Semantic Spaces

Trad; Spiliopoulou

2021

Advances in Intelligent Data Analysis XIX

Authorial clustering involves the grouping of documents written by the same author or team of authors without any prior positive examples of an author's writing style or thematic preferences. For authorial clustering on shorter texts (paragraph-length texts that are typically shorter than conventional documents), the document representation is particularly important. We propose a high-level framework which utilizes a compact data representation in a latent feature space derived with nonparametric topic modeling. Authorial clusters are identified thereafter in two scenarios: (a) fully unsupervised and (b) semi-supervised where a small number of shorter texts are known to belong to the same author (must-link constraints) or not (cannot-link constraints). We report on experiments with 120 collections in three languages and two genres and show that the topic-based latent feature space provides a promising level of performance while reducing the dimensionality by a factor of 1500 compared to state-of-the-art. We also demonstrate that little knowledge on constraints in authorial clusters memberships leads to auspicious improvements in front of this difficult task.


“…TF-IDF algorithm is widely used in the following applications (Agarwal et al., 2019; Chang et al., 2020; Dong et al., 2019; Feng et al., 2019; Forman, 2008; Gebre et al., 2013; Huang et al., 2011; Kumar and Subba, 2020; Matsuo and Ishizuka, 2004; Saihanqiqige, 2020; Trstenjak et al., 2014; Yahav et al., 2018; Yunchun, 2019; Yun-Tao et al., 2005; Park et al., 2020): text categorization; keywords extraction; and new word recognition. …”

Section: Related Work (mentioning)

confidence: 99%

Web-based methodology for extracting technology words in Chinese process patents

Yang; Ren

2020

IJWIS

Purpose: The purpose of constructing the technology/function matrix is to analyze the patents in the target domain. The extraction of technology words is an important part of the construction of technology/function matrix. This algorithm is used to solve the problem of low efficiency of traditional Chinese process patents technology words extraction.

Design/methodology/approach: The authors propose a Chinese process patents technology words extraction method based on the improved term frequency–inverse document frequency (TF-IDF) algorithm to help technicians obtain the technology words in the target domain. According to the characteristics of Chinese process patents technology words, the TF value of candidate technology words is divided into four parts, and the corpus of IDF value calculation of candidate technology words is selected.

Findings: Through the test of Chinese process patents in the domain of path planning, this study shows that the method is feasible and practical. It can help users quickly and accurately obtain the technology words of Chinese process patents in the target domain.

Practical implications: With the increasing number of patents on the network-based patent information platform, patent analysis of massive Chinese process patents has become a research focus. The method proposed in this paper can facilitate users to extract technology words from massive Chinese process patents for patent analysis.

Originality/value: This paper aims to improve the efficiency of Chinese process patents technology words extraction. The authors hope that the proposed method can reduce the labor and time cost of Chinese process patents technology words extraction.


“…Authorship verification takes as input a set of authors and a set of documents and assigns each document to an author, while authorial clustering assumes that information on authors of documents is unavailable or unreliable. Authorial clustering seeks to partition the set of documents into clusters such that each cluster corresponds to one author [25].…”

Section: Introduction (mentioning)

confidence: 99%

“…Many authorial clustering approaches invest on advanced machine learning methods, like recurrent neural networks [3], word embeddings [1] and sophisticated document representations [3,10,23] in a space with thousands of dimensions. High-dimensional feature spaces, however, tend to get sparser as the texts get shorter and suffer from consequences like the curse of dimensionality.…”

Section: Introduction (mentioning)

confidence: 99%

A Framework for Authorial Clustering of Shorter Texts in Latent Semantic Spaces

Trad; Spiliopoulou

2020

Preprint

Authorial clustering involves the grouping of documents written by the same author or team of authors without any prior positive examples of an author's writing style or thematic preferences. For authorial clustering on shorter texts (paragraph-length texts that are typically shorter than conventional documents), the document representation is particularly important: very high-dimensional feature spaces lead to data sparsity and suffer from serious consequences like the curse of dimensionality, while feature selection may lead to information loss. We propose a high-level framework which utilizes a compact data representation in a latent feature space derived with non-parametric topic modeling. Authorial clusters are identified thereafter in two scenarios: (a) fully unsupervised and (b) semi-supervised where a small number of shorter texts are known to belong to the same author (must-link constraints) or not (cannot-link constraints). We report on experiments with 120 collections in three languages and two genres and show that the topic-based latent feature space provides a promising level of performance while reducing the dimensionality by a factor of 1500 compared to state-of-the-art. We also demonstrate that, while prior knowledge on the precise number of authors (i.e. authorial clusters) does not contribute much to additional quality, little knowledge on constraints in authorial clusters memberships leads to clear performance improvements in front of this difficult task. Thorough experimentation with standard metrics indicates that there still remains ample room for improvement for authorial clustering, especially with shorter texts.
