In the real world, our DNA is unique but many people share names. This phenomenon often causes erroneous aggregation of documents of multiple persons who are namesakes of one another. Such mistakes degrade the performance of document retrieval and web search and, more seriously, cause improper attribution of credit or blame in digital forensics. To resolve this issue, the name disambiguation task is designed, which aims to partition the documents associated with a name reference such that each partition contains documents pertaining to a unique real-life person. Existing solutions to this task rely substantially on feature engineering, such as biographical feature extraction or construction of auxiliary features from Wikipedia. However, in many scenarios such features may be costly to obtain or unavailable due to the risk of privacy violation. In this work, we propose a novel name disambiguation method. Our proposed method is non-intrusive of privacy because, instead of using attributes pertaining to a real-life person, it leverages only relational data in the form of anonymized graphs. In the methodological aspect, the proposed method uses a novel representation learning model to embed each document in a low-dimensional vector space, where name disambiguation can be solved by a hierarchical agglomerative clustering algorithm. Our experimental results demonstrate that the proposed method is significantly better than existing name disambiguation methods working in a similar setting.
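The clustering step this abstract describes can be illustrated with a minimal sketch. The helper below is an assumption of mine, not the paper's actual model: it runs average-linkage hierarchical agglomerative clustering over toy "document" embeddings with a hypothetical distance threshold, so that each resulting partition stands in for one real-life person.

```python
import numpy as np

def hac_partition(embeddings, threshold):
    """Average-linkage hierarchical agglomerative clustering.
    Repeatedly merges the two closest clusters until the smallest
    inter-cluster distance exceeds `threshold` (a tuning choice)."""
    X = np.asarray(embeddings, dtype=float)
    clusters = [[i] for i in range(len(X))]

    def avg_dist(a, b):
        # mean pairwise Euclidean distance between two clusters
        return np.mean([np.linalg.norm(X[i] - X[j]) for i in a for j in b])

    while len(clusters) > 1:
        pairs = [(avg_dist(clusters[p], clusters[q]), p, q)
                 for p in range(len(clusters))
                 for q in range(p + 1, len(clusters))]
        d, p, q = min(pairs)
        if d > threshold:
            break                      # no pair is close enough to merge
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters

# toy example: two tight groups of document vectors, i.e. two "persons"
docs = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
parts = hac_partition(docs, threshold=1.0)
print(sorted(sorted(c) for c in parts))  # [[0, 1], [2, 3]]
```

In practice the embeddings would come from the learned representation model rather than raw coordinates, and the threshold (or the number of clusters) would be chosen on validation data.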
In a social community, multiple persons may share the same name, phone number, or some other identifying attribute. This, along with other phenomena such as name abbreviation, name misspelling, and human error, leads to erroneous aggregation of records of multiple persons under a single reference. Such mistakes degrade the performance of document retrieval, web search, and database integration, and, more importantly, cause improper attribution of credit (or blame). The entity disambiguation task partitions the records belonging to multiple persons with the objective that each decomposed partition is composed of records of a unique person. Existing solutions to this task use either biographical attributes or auxiliary features collected from external sources, such as Wikipedia. However, in many scenarios such auxiliary features are not available, or they are costly to obtain. Besides, the attempt to collect biographical or external data carries the risk of privacy violation. In this work, we propose a method for solving the entity disambiguation task from link information obtained from a collaboration network. Our method is non-intrusive of privacy as it uses only the time-stamped graph topology of an anonymized network. Experimental results on two real-life academic collaboration networks show that the proposed method has satisfactory performance.
Job recommendation is an important task for the modern recruitment industry. An excellent job recommender system not only recommends a higher-paying job that is maximally aligned with the skill set of the current job, but also suggests a few additional skills required to assume the new position. In this work, we created three types of information networks from historical job data: (i) a job transition network, (ii) a job-skill network, and (iii) a skill co-occurrence network. We provide a representation learning model that can utilize the information from all three networks to jointly learn the representations of jobs and skills in a shared k-dimensional latent space. In our experiments, we show that by jointly learning the representations of jobs and skills, our model provides better recommendations for both jobs and skills. Additionally, we present case studies that validate our claims.
The ability to construct domain-specific knowledge graphs (KGs) and perform question answering or hypothesis generation is a transformative capability. Despite their value, automated construction of knowledge graphs remains an expensive technical challenge beyond the reach of most enterprises and academic institutions. We propose an end-to-end framework for developing custom knowledge-graph-driven analytics for arbitrary application domains. The uniqueness of our system lies in A) its combination of curated KGs with knowledge extracted from unstructured text, B) its support for advanced trending and explanatory questions on a dynamic KG, and C) its ability to answer queries where the answer is embedded across multiple data sources.
Abstract-We study a bio-detection application as a case study to demonstrate that K-means-based unsupervised feature learning can be a simple yet effective alternative to deep learning techniques for small data sets with limited intra- as well as inter-class diversity. We investigate the effect on classifier performance of data augmentation as well as of feature extraction with multiple patch sizes and at different image scales. Our data set includes 1833 images from four different classes of bacteria, with each bacterial culture captured at three different wavelengths and the overall data collected during a three-day period. The limited number and diversity of images, potential random effects across multiple days, and the multi-modal nature of the class distributions pose a challenging setting for representation learning. Using images collected on the first day for training, on the second day for validation, and on the third day for testing, K-means-based representation learning achieves 97% classification accuracy on the test data. This compares very favorably to the 56% accuracy achieved by deep learning and the 74% accuracy achieved by handcrafted features. Our results suggest that data augmentation or dropping connections between units offers little help for deep-learning algorithms, whereas a significant boost can be achieved by K-means-based representation learning by augmenting data and by concatenating features obtained at multiple patch sizes or image scales.
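The pipeline this abstract refers to follows the well-known Coates-Ng recipe: learn a dictionary of patch centroids with K-means, then encode each image by its (pooled) "triangle" activations against that dictionary. The sketch below is my own minimal, numpy-only illustration under assumed toy data; the patch size, dictionary size, and helper names are illustrative, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20):
    """Plain Lloyd's K-means: returns k centroids learned from rows of X."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each patch to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):    # skip empty clusters
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

def extract_patches(img, patch, n):
    """Sample n random patch-by-patch windows, flattened to vectors."""
    h, w = img.shape
    out = []
    for _ in range(n):
        r = rng.integers(0, h - patch + 1)
        c = rng.integers(0, w - patch + 1)
        out.append(img[r:r + patch, c:c + patch].ravel())
    return np.array(out, dtype=float)

def triangle_encode(patches, centroids):
    """Coates-Ng 'triangle' activation max(0, mean_dist - dist_k),
    average-pooled over all patches into one feature vector per image."""
    d = np.linalg.norm(patches[:, None, :] - centroids[None, :, :], axis=2)
    act = np.maximum(0.0, d.mean(axis=1, keepdims=True) - d)
    return act.mean(axis=0)

# toy "images": one dark, one bright
imgs = [np.zeros((16, 16)), np.ones((16, 16))]
bank = kmeans(np.vstack([extract_patches(im, 4, 50) for im in imgs]), k=8)
features = np.array([triangle_encode(extract_patches(im, 4, 50), bank)
                     for im in imgs])
print(features.shape)  # (2, 8)
```

Concatenating such feature vectors computed at several patch sizes or image scales, as the abstract suggests, simply stacks the per-scale encodings before feeding them to a standard classifier.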
2018. Neural-Brane: Neural Bayesian Personalized Ranking for Attributed Network Embedding. 1, 1 (August 2018), 15 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

INTRODUCTION

The past few years have witnessed a surge in research on embedding the vertices of a network into a low-dimensional, dense vector space. The embedded vector representation of the vertices in such a vector space enables effortless invocation of off-the-shelf machine learning algorithms, thereby facilitating several downstream network mining tasks, including node classification [19], link prediction [8], community detection [21], job recommendation [5], and entity disambiguation [24]. Most existing network embedding methods, including DeepWalk [14], LINE [17], Node2Vec [8], and SDNE [20], utilize the topological information of a network with the rationale that nodes with similar topological roles should be distributed closely in the learned low-dimensional vector space. While this suffices for node embedding of a bare-bones network, it is inadequate for most of today's network datasets, which include useful information beyond link connectivity. Specifically, for most social and communication networks, a rich set of nodal attributes is typically available, and more importantly, the similarity between a pair of nodes is dictated significantly by the similarity of their attribute values. Yet the existing embedding models do not provide a principled approach for incorporating nodal attributes into network embedding and thus fail to achieve the performance boost that may be obtained through modeling attribute-based nodal similarity. Intuitively, a joint network embedding that considers both attributional and relational information could capture complementary information and further enrich the learned vector representations. We provide a few examples from real-life networks to highlight the importance of vertex attributes for understanding the role of the vertices and for predicting their interactions.
For example, users on social websites have biographical profiles containing attributes like age, gender, and textual comments, which dictate whom they befriend and what their common interests are. In a citation network, each scientific paper is associated with a title, an abstract, and a publication venue, which largely dictate its future citation patterns. In fact, nodal attributes are especially important when the network topology fails to capture the similarity between a pair of nodes. For example, in the academic domain, two researchers who write scientific papers related to "machine learning" and "information retrieval" are not considered similar by existing embedding methods (say, DeepWalk or LINE) unless they are co-authors or they share common collaborators. In such a scenario, node attributes of the researchers (e.g., research keywords) are crucial for compensating for the lack of topological similarity between the researchers. In summary, by jointly considering attribute homophily and network topology, more informative node representations can be learned.
The smallest eigenvalues and the associated eigenvectors (i.e., eigenpairs) of a graph Laplacian matrix have been widely used in spectral clustering and community detection. However, in real-life applications the number of clusters or communities (say, K) is generally unknown a priori. Consequently, the majority of existing methods either choose K heuristically or repeat the clustering method with different choices of K and accept the best clustering result. The first option more often than not yields a suboptimal result, while the second option is computationally expensive. In this work, we propose an incremental method for constructing the eigenspectrum of the graph Laplacian matrix. This method leverages the eigenstructure of the graph Laplacian matrix to obtain the K-th smallest eigenpair of the Laplacian, given the collection of all previously computed K − 1 smallest eigenpairs. Our proposed method adapts the Laplacian matrix such that the batch eigenvalue decomposition problem transforms into an efficient sequential leading-eigenpair computation problem. As a practical application, we consider user-guided spectral clustering. Specifically, we demonstrate that users can utilize the proposed incremental method for effective eigenpair computation and for determining the desired number of clusters based on multiple clustering metrics.
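The sequential computation described above can be illustrated with a toy deflation scheme. This is a hedged sketch of the general idea, not the authors' exact construction: previously found eigendirections are shifted upward so that the next smallest eigenpair of the Laplacian becomes the dominant eigenpair of a spectrally flipped matrix, recoverable by plain power iteration.

```python
import numpy as np

def next_smallest_eigenpair(L, known_pairs, iters=2000):
    """Given the k-1 smallest eigenpairs of a symmetric Laplacian L
    (as (value, unit_vector) tuples), compute the k-th smallest one.
    Known eigendirections are deflated by shifting them up by sigma,
    then the smallest eigenpair of the adapted matrix is found by
    power iteration on (sigma*I - L')."""
    n = L.shape[0]
    sigma = np.trace(L) + 1.0          # exceeds the largest eigenvalue of L
    Lp = L.astype(float).copy()
    for _, v in known_pairs:           # eigenvalues unneeded for this shift
        Lp += sigma * np.outer(v, v)   # push known eigenvalues above sigma
    M = sigma * np.eye(n) - Lp         # flip: smallest of Lp -> largest of M
    x = np.random.default_rng(1).standard_normal(n)
    for _ in range(iters):
        x = M @ x
        x /= np.linalg.norm(x)
    lam = x @ L @ x                    # Rayleigh quotient w.r.t. original L
    return lam, x

# toy example: Laplacian of the path graph P4 (eigenvalues 0, 2-sqrt(2), 2, 2+sqrt(2))
L = np.array([[ 1., -1.,  0.,  0.],
              [-1.,  2., -1.,  0.],
              [ 0., -1.,  2., -1.],
              [ 0.,  0., -1.,  1.]])
v0 = np.ones(4) / 2.0                  # known: (0, 1/sqrt(n)) for any Laplacian
lam1, v1 = next_smallest_eigenpair(L, [(0.0, v0)])
print(round(lam1, 3))  # 0.586, i.e. 2 - sqrt(2)
```

The second-smallest eigenvector recovered here is the Fiedler vector of the path graph; repeating the call with the growing list of known pairs yields the eigenpairs one at a time, which is what lets a user stop as soon as the clustering metrics stop improving.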