Author disambiguation by hierarchical agglomerative clustering with adaptive stopping criterion

Cen, Lei; Dragut, Eduard C.; Si, Luo; Ouzzani, Mourad

doi:10.1145/2484028.2484157

Cited by 41 publications

(25 citation statements)

References 15 publications


“…For each name dataset, they calculate a Gram matrix representing similarities between different citations and apply K-way spectral clustering algorithm on the Gram matrix to obtain the desired clusters of the citations. In another unsupervised approach, Cen et al [5] compute pairwise similarity for publication events that share the same author name string (ANS) and then use a novel hierarchical agglomerative clustering with adaptive stopping criterion (HACASC) to partition the publications in different author clusters. Malin [17] proposes another cluster-based method that uses social network structure.…”

Section: Related Work

mentioning

confidence: 99%
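The citation statement above describes clustering publications that share an author name string from their pairwise similarities. A minimal sketch of that general idea follows, assuming Jaccard similarity over coauthor sets, average linkage, and a fixed similarity threshold as the stopping rule. It is not the adaptive stopping criterion of HACASC, which the statement does not specify; the names `jaccard`, `average_linkage`, `hac_with_threshold`, and the toy coauthor data are all hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_linkage(c1, c2, sim):
    """Mean pairwise similarity between the members of two clusters."""
    return sum(sim[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))


def hac_with_threshold(sim, threshold):
    """Agglomerative clustering over a similarity matrix.

    Repeatedly merges the most similar pair of clusters and stops once
    the best available pair scores below `threshold`. The fixed threshold
    is an assumption standing in for an adaptive stopping criterion.
    """
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        best_score, best_pair = max(
            (average_linkage(clusters[a], clusters[b], sim), (a, b))
            for a, b in combinations(range(len(clusters)), 2)
        )
        if best_score < threshold:
            break
        a, b = best_pair  # a < b, so deleting index b leaves index a valid
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical toy data: coauthor sets of four papers sharing one author name string.
coauthors = [
    {"A. Smith", "B. Jones"},
    {"A. Smith", "C. Lee"},
    {"D. Kim", "E. Park"},
    {"D. Kim", "F. Choi"},
]
sim = [[jaccard(x, y) for y in coauthors] for x in coauthors]
clusters = hac_with_threshold(sim, threshold=0.2)
```

With this toy input, papers 0 and 1 share a coauthor, as do papers 2 and 3, so `sorted(clusters)` is `[[0, 1], [2, 3]]`. Raising the threshold to 0.5 leaves every paper in its own cluster, which illustrates why the choice of stopping point determines how many distinct authors the method reports.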

“…Existing works mostly use biographical features, such as name, address, institutional affiliation, email address, and homepage; contextual features, such as coauthor/collaborator, and research keywords; and external data such as Wikipedia [7]. From methodological point of view, some of the works follow a supervised learning approach [8,10], while others use unsupervised clustering [5,9,17,25]. There exist quite a few solutions that use graphical models [3,23,26,31].…”

mentioning

confidence: 99%


Name disambiguation from link data in a collaboration graph using temporal and topological features

Saha

Zhang

Hasan

2015

Soc. Netw. Anal. Min.


In a social community, multiple persons may share the same name, phone number or some other identifying attributes. This, along with other phenomena, such as name abbreviation, name misspelling, and human error leads to erroneous aggregation of records of multiple persons under a single reference. Such mistakes affect the performance of document retrieval, web search, database integration, and more importantly, improper attribution of credit (or blame). The task of entity disambiguation partitions the records belonging to multiple persons with the objective that each decomposed partition is composed of records of a unique person. Existing solutions to this task use either biographical attributes, or auxiliary features that are collected from external sources, such as Wikipedia. However, for many scenarios, such auxiliary features are not available, or they are costly to obtain. Besides, the attempt of collecting biographical or external data sustains the risk of privacy violation. In this work, we propose a method for solving entity disambiguation task from link information obtained from a collaboration network. Our method is non-intrusive of privacy as it uses only the time-stamped graph topology of an anonymized network. Experimental results on two real-life academic collaboration networks show that the proposed method has satisfactory performance.


“…We use LDA (Blei, Ng, and Jordan 2003), HC (Chang, Pei, and Chen 2014) and STM (Wang et al 2015) as baselines. We do not compare with non-text feature-based models (Tang et al 2012; Cen et al 2013) because our goal is to compare sense topic models on a task where the sense granularities are more varied. For STM and AutoSense, the title, publication venue and the author names are used as local contexts while the abstract is used as the global context.…”

Section: Methods

mentioning

confidence: 99%

AutoSense Model for Word Sense Induction

Amplayo

Hwang

Song

2019

AAAI


Word sense induction (WSI), or the task of automatically discovering multiple senses or meanings of a word, has three main challenges: domain adaptability, novel sense detection, and sense granularity flexibility. While current latent variable models are known to solve the first two challenges, they are not flexible to different word sense granularities, which differ very much among words, from aardvark with one sense, to play with over 50 senses. Current models either require hyperparameter tuning or nonparametric induction of the number of senses, which we find both to be ineffective. Thus, we aim to eliminate these requirements and solve the sense granularity problem by proposing AutoSense, a latent variable model based on two observations: (1) senses are represented as a distribution over topics, and (2) senses generate pairings between the target word and its neighboring word. These observations alleviate the problem by (a) throwing garbage senses and (b) additionally inducing fine-grained word senses. Results show great improvements over the state-of-the-art models on popular WSI datasets. We also show that AutoSense is able to learn the appropriate sense granularity of a word. Finally, we apply AutoSense to the unsupervised author name disambiguation task where the sense granularity problem is more evident and show that AutoSense is evidently better than competing models. We share our data and code here: https://github.com/rktamplayo/AutoSense.


“…Due to its importance, the name disambiguation task has attracted substantial attention from information retrieval and data mining communities. However, the majority of existing solutions [1,3,12,15] for this task use biographical features such as name, address, institutional affiliation, email address, and homepage. Also, contextual features such as collaborator, community affiliation, and external data source such as Wikipedia are used in some works [13,15].…”

Section: Introduction

mentioning

confidence: 99%

Name Disambiguation in Anonymized Graphs using Network Embedding

Zhang

Hasan

2017

Proceedings of the 2017 ACM on Conference on Information and Knowledge Management



In real-world, our DNA is unique but many people share names. This phenomenon often causes erroneous aggregation of documents of multiple persons who are namesake of one another. Such mistakes deteriorate the performance of document retrieval, web search, and more seriously, cause improper attribution of credit or blame in digital forensic. To resolve this issue, the name disambiguation task is designed which aims to partition the documents associated with a name reference such that each partition contains documents pertaining to a unique real-life person. Existing solutions to this task substantially rely on feature engineering, such as biographical feature extraction, or construction of auxiliary features from Wikipedia. However, for many scenarios, such features may be costly to obtain or unavailable due to the risk of privacy violation. In this work, we propose a novel name disambiguation method. Our proposed method is non-intrusive of privacy because instead of using attributes pertaining to a real-life person, our method leverages only relational data in the form of anonymized graphs. In the methodological aspect, the proposed method uses a novel representation learning model to embed each document in a low dimensional vector space where name disambiguation can be solved by a hierarchical agglomerative clustering algorithm. Our experimental results demonstrate that the proposed method is significantly better than the existing name disambiguation methods working in a similar setting.
