“…Motivation. Clustering biomedical terms into concepts in the UMLS Metathesaurus was formalized as a vocabulary alignment problem, identified as the UMLS Vocabulary Alignment (UVA) or synonymy prediction task, by Nguyen et al. (2021). UVA differs from other biomedical ontology alignment efforts, such as those of the Ontology Alignment Evaluation Initiative (OAEI), in its extremely large problem size: it requires comparing 8.7M biomedical terms pairwise (as opposed to tens of thousands of pairs in OAEI datasets).…”
Section: Introduction
confidence: 99%
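The scale claim in the snippet above can be checked with a quick back-of-the-envelope computation: comparing n terms pairwise requires n(n−1)/2 comparisons. The OAEI figure below is an illustrative assumption (the snippet only says "tens of thousands of pairs").

```python
# Back-of-the-envelope estimate of the UVA problem size:
# comparing n items pairwise requires n * (n - 1) / 2 comparisons.
def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2

uva_pairs = pairwise_comparisons(8_700_000)  # 8.7M UMLS atom strings
oaei_pairs = pairwise_comparisons(50_000)    # illustrative OAEI-scale dataset

print(f"UVA:  ~{uva_pairs:.2e} candidate pairs")   # on the order of 10^13
print(f"OAEI: ~{oaei_pairs:.2e} candidate pairs")  # on the order of 10^9
```

This is why exhaustive pairwise comparison is infeasible for UVA and a scalable learned model is needed.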
“…UVA differs from other biomedical ontology alignment efforts, such as those of the Ontology Alignment Evaluation Initiative (OAEI), in its extremely large problem size: it requires comparing 8.7M biomedical terms pairwise (as opposed to tens of thousands of pairs in OAEI datasets). Nguyen et al. (2021) also introduced a scalable supervised learning approach based on a Siamese neural architecture that leverages the lexical information present in the terms. Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) is a language model (LM) based on the multi-layer, bidirectional Transformer architecture (Vaswani et al., 2017).…”
Section: Introduction
confidence: 99%
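To illustrate the Siamese idea mentioned above: a single shared encoder maps both terms into the same vector space, and a similarity score on the two embeddings drives the synonymy decision. The sketch below is not the authors' LexLM; it substitutes a toy character-trigram bag encoder for the learned encoder, purely to show the shared-encoder-plus-similarity structure.

```python
import math
from collections import Counter

def encode(term: str) -> Counter:
    """Toy shared encoder: bag of character trigrams.
    Stands in for the learned encoder of a real Siamese model."""
    s = f"##{term.lower()}##"
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def siamese_score(t1: str, t2: str) -> float:
    # Siamese setup: the identical encoder is applied to both inputs,
    # then similarity is computed on the two embeddings.
    return cosine(encode(t1), encode(t2))

print(siamese_score("myocardial infarction", "myocardial infarct"))
print(siamese_score("myocardial infarction", "fracture of femur"))
```

Because the encoder is shared, lexically close synonym candidates score higher than unrelated term pairs; a real system learns the encoder so that non-lexical synonyms also score highly.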
“…We identify BERT-based models (in this work, BERT-based models refer to BioBERT, BLUEBERT, SapBERT, and UmlsBERT) and use them as baselines without further pretraining or fine-tuning on the UVA task. Another baseline in our work is the LexLM provided by Nguyen et al. (2021). We then design experiments to pretrain UBERT from scratch (without using any trained weights from other biomedical or clinical BERT-based models), resulting in three variants of UBERT.…”
Section: Introduction
confidence: 99%
“…Nguyen et al. (2021) have described the background knowledge required to understand the UVA task; in this section we briefly summarize it.…”
Section: Introduction
confidence: 99%
“…The UMLS Metathesaurus contains approximately ten million English atom strings, each of which is linked to a concept. Since Nguyen et al. (2021) focus on assessing whether two atoms are synonymous and should be associated with the same concept, the problem is formulated as a similarity task. We retain this problem definition from Nguyen et al. (2021).…”
The UMLS Metathesaurus integrates more than 200 biomedical source vocabularies. During the Metathesaurus construction process, synonymous terms are clustered into concepts by human editors, assisted by lexical similarity algorithms. This process is error-prone and time-consuming. Recently, a deep learning model (LexLM) has been developed for the UMLS Vocabulary Alignment (UVA) task. This work introduces UBERT, a BERT-based language model pretrained on UMLS terms via a supervised Synonymy Prediction (SP) task that replaces the original Next Sentence Prediction (NSP) task. The effectiveness of UBERT for the UMLS Metathesaurus construction process is evaluated on the UVA task. We show that UBERT outperforms LexLM as well as biomedical BERT-based models. Key to the performance of UBERT are the synonymy prediction task specifically developed for UBERT, the tight alignment of training data to the UVA task, and the similarity of the models used for pretraining UBERT.
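The abstract above describes replacing BERT's NSP objective with a supervised Synonymy Prediction task. A plausible sketch of how such SP training pairs could be assembled in BERT's sentence-pair format is below; the function name and the concept identifiers (CUIs) shown are illustrative placeholders, not the authors' actual pipeline.

```python
# Hypothetical sketch: build a Synonymy Prediction (SP) training example in
# BERT's sentence-pair format, [CLS] atom1 [SEP] atom2 [SEP]. The label is 1
# when both atom strings map to the same UMLS concept (CUI), else 0.
def make_sp_example(atom1: str, atom2: str, cui1: str, cui2: str):
    text = f"[CLS] {atom1} [SEP] {atom2} [SEP]"
    label = 1 if cui1 == cui2 else 0
    return text, label

# Placeholder CUIs for illustration only.
ex_pos = make_sp_example("Headache", "Cephalalgia", "C0000001", "C0000001")
ex_neg = make_sp_example("Headache", "Fever", "C0000001", "C0000002")

print(ex_pos)  # positive pair: same concept
print(ex_neg)  # negative pair: different concepts
```

Because each example keeps the standard pair-input shape, the SP head can slot in where the NSP classification head normally sits, which is what makes the substitution natural.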
Ontology Matching (OM) plays an important role in many domains such as bioinformatics and the Semantic Web, and its research is becoming increasingly popular, especially with the application of machine learning (ML) techniques. Although the Ontology Alignment Evaluation Initiative (OAEI) represents an impressive effort for the systematic evaluation of OM systems, it still suffers from several limitations including limited evaluation of subsumption mappings, suboptimal reference mappings, and limited support for the evaluation of ML-based systems. To tackle these limitations, we introduce five new biomedical OM tasks involving ontologies extracted from Mondo and UMLS. Each task includes both equivalence and subsumption matching; the quality of reference mappings is ensured by human curation, ontology pruning, etc.; and a comprehensive evaluation framework is proposed to measure OM performance from various perspectives for both ML-based and non-ML-based OM systems. We report evaluation results for OM systems of different types to demonstrate the usage of these resources, all of which are publicly available as part of the new Bio-ML track at OAEI 2022.