Ted Pedersen scite author profile

Measures of semantic similarity between concepts are widely used in Natural Language Processing. In this article, we show how six existing domain-independent measures can be adapted to the biomedical domain. These measures were originally based on WordNet, an English lexical database of concepts and relations. In this research, we adapt these measures to the SNOMED-CT ontology of medical concepts. The measures include two path-based measures and three measures that augment path-based measures with information content statistics from corpora. We also derive a context vector measure based on medical corpora that can be used as a measure of semantic relatedness. These six measures are evaluated against a newly created test bed of 30 medical concept pairs scored by three physicians and nine medical coders. We find that the medical coders and physicians differ in their ratings, and that the context vector measure correlates most closely with the physicians, while the path-based measures and one of the information content measures correlate most closely with the medical coders. We conclude that there is a role both for more flexible measures of relatedness based on information derived from corpora and for measures that rely on existing ontological structures.
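As a rough illustration of what a path-based measure computes, the sketch below scores two concepts by the reciprocal of the number of nodes on the shortest is-a path between them. The abstract does not say which path-based formulations were used, so this particular formula is an assumption, and the tiny hierarchy and the names `PARENT`, `ancestors`, and `path_similarity` are hypothetical and not taken from SNOMED-CT.

```python
# Toy is-a hierarchy (hypothetical; not taken from SNOMED-CT).
# Each concept maps to its single parent; the root maps to None.
PARENT = {
    "disease": None,
    "heart disease": "disease",
    "lung disease": "disease",
    "myocardial infarction": "heart disease",
    "angina": "heart disease",
    "pneumonia": "lung disease",
}


def ancestors(concept):
    """Return the chain from concept up to the root, concept first."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def path_similarity(a, b):
    """Return 1 / (number of nodes on the shortest is-a path from a to b)."""
    up_a = ancestors(a)
    up_b = ancestors(b)
    steps_from_b = {c: i for i, c in enumerate(up_b)}
    # Walking up from a, the first concept also above b is their
    # lowest common ancestor in a tree-shaped hierarchy.
    for steps_from_a, c in enumerate(up_a):
        if c in steps_from_b:
            edges = steps_from_a + steps_from_b[c]
            return 1.0 / (edges + 1)
    raise ValueError("concepts share no ancestor")
```

In this toy hierarchy, siblings such as "myocardial infarction" and "angina" score higher than concepts that meet only at the root. The information content measures mentioned in the abstract additionally weight concepts by corpus statistics, which this sketch leaves out.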

Using Measures of Semantic Relatedness for Word Sense Disambiguation

Patwardhan

2003

An evaluation exercise for word alignment

2003

The Design, Implementation, and Use of the Ngram Statistics Package

Banerjee

Pedersen

2003

142

Name Discrimination by Clustering Similar Contexts

Pedersen

Purandare

Kulkarni

2005

It is relatively common for different people or organizations to share the same name. Given the increasing amount of information available online, this results in the ever-growing possibility of finding misleading or incorrect information due to confusion caused by an ambiguous name. This paper presents an unsupervised approach that resolves name ambiguity by clustering the instances of a given name into groups, each of which is associated with a distinct underlying entity. The features we employ to represent the context of an ambiguous name are statistically significant bigrams that occur in the same context as the ambiguous name. From these features we create a co-occurrence matrix where the rows and columns represent the first and second words in bigrams, and the cells contain their log-likelihood scores. Then we represent each of the contexts in which an ambiguous name appears with a second order context vector. This is created by taking the average of the vectors from the co-occurrence matrix associated with the words that make up each context. This creates a high-dimensional "instance by word" matrix that is reduced to its most significant dimensions by Singular Value Decomposition (SVD). The different "meanings" of a name are discriminated by clustering these second order context vectors with the method of Repeated Bisections. We evaluate this approach by conflating pairs of names found in a large corpus of text to create ambiguous pseudo-names. We find that our method is significantly more accurate than the majority classifier, and that the best results are obtained by having a small amount of local context to represent the instance, along with a larger amount of context for identifying features, or vice versa.
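The representation step described in this abstract can be sketched in a few lines of NumPy: each context is turned into a second order context vector by averaging co-occurrence rows, and the resulting instance-by-word matrix is reduced with SVD. This is only a minimal sketch under stated assumptions. The vocabulary, the scores in `cooc`, the example contexts, and the helper names are all made up for illustration, the scores are not real log-likelihood values, and the Repeated Bisections clustering step is omitted.

```python
import numpy as np

# Hypothetical vocabulary and co-occurrence scores, made up for illustration.
# Rows are first words of bigrams and columns are second words; in the paper
# the cells would hold log-likelihood scores computed from a corpus.
vocab = ["bank", "river", "money", "water", "loan"]
index = {w: i for i, w in enumerate(vocab)}
cooc = np.array([
    [0.0, 4.0, 6.0, 2.0, 7.0],  # bank
    [4.0, 0.0, 0.0, 8.0, 0.0],  # river
    [6.0, 0.0, 0.0, 0.0, 9.0],  # money
    [2.0, 8.0, 0.0, 0.0, 0.0],  # water
    [7.0, 0.0, 9.0, 0.0, 0.0],  # loan
])


def second_order_vector(context_words):
    """Average the co-occurrence rows of the known words in one context."""
    rows = [cooc[index[w]] for w in context_words if w in index]
    if not rows:
        return np.zeros(cooc.shape[1])
    return np.mean(rows, axis=0)


def reduce_with_svd(matrix, k):
    """Project the rows of matrix onto its top-k singular directions."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]


# Each inner list stands for the words around one occurrence of a name.
contexts = [
    ["river", "water"],
    ["water", "river", "bank"],
    ["money", "loan"],
    ["loan", "bank", "money"],
]
instance_by_word = np.vstack([second_order_vector(c) for c in contexts])
reduced = reduce_with_svd(instance_by_word, k=2)
```

The rows of `reduced` are what a clustering algorithm would then group, one row per occurrence of the ambiguous name. Averaging rows means two contexts can look similar even when they share no words, as long as their words co-occur with similar vocabulary.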

Screening Twitter Users for Depression and PTSD with Lexical Decision Lists

Pedersen

2015

Towards a framework for developing semantic relatedness reference standards

Pakhomov

Pedersen

McInnes

et al. 2011

Journal of Biomedical Informatics

Our objective is to develop a framework for creating reference standards for functional testing of computerized measures of semantic relatedness. Currently, research on computerized approaches to semantic relatedness between biomedical concepts relies on reference standards created for specific purposes using a variety of methods for their analysis. In most cases, these reference standards are not publicly available and the published information provided in manuscripts that evaluate computerized semantic relatedness measurement approaches is not sufficient to reproduce the results. Our proposed framework is based on the experiences of medical informatics and computational linguistics communities and addresses practical and theoretical issues with creating reference standards for semantic relatedness. We demonstrate the use of the framework on a pilot set of 101 medical term pairs rated for semantic relatedness by 13 medical coding experts. While the reliability of this particular reference standard is in the “moderate” range, we show that using clustering and factor analyses offers a data-driven approach to finding systematic differences among raters and identifying groups of potential outliers. We test two ontology-based measures of relatedness and provide both the reference standard containing individual ratings and the R program used to analyze the ratings as open-source. Currently, these resources are intended to be used to reproduce and compare results of studies involving computerized measures of semantic relatedness. Our framework may be extended to the development of reference standards in other research areas in medical informatics including automatic classification, information retrieval from medical records and vocabulary/ontology development.

Ted Pedersen

An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet

Measures of semantic similarity and relatedness in the biomedical domain
