**Rationale:** Word embeddings are used to create vector representations of text data, but not all embeddings appropriately capture clinical information, are free of protected health information, or are computationally accessible to most researchers.
**Methods:** We trained word embeddings on published case reports because their language mimics that of clinical notes, the manuscripts are already de-identified by virtue of being published, and the corpus is much smaller than those drawn from large, publicly available datasets. We tested the performance of these embeddings on five clinically relevant tasks and compared the results to embeddings trained on a large Wikipedia corpus, on all publicly available manuscripts, and on notes from the MIMIC-III database, using fastText, GloVe, and word2vec at several embedding dimensions. Tasks included clinical applications of lexicographic coverage, semantic similarity, clustering purity, linguistic regularity, and mortality prediction.
**Results:** The embeddings trained on the published case reports performed as well as, if not better than, those trained on other corpora for most tasks. The embeddings trained on all published manuscripts had the most consistent performance across tasks but required a corpus with 100 times as many tokens as the corpus comprising only case reports. Embeddings trained on the MIMIC-III dataset had marginally better scores on the clustering task, which was itself based on clinical notes from MIMIC-III. Embeddings trained on the Wikipedia corpus, although it contained almost twice as many tokens as all available published manuscripts, performed poorly compared to those trained on medical and clinical corpora.
**Conclusion:** Word embeddings trained on freely available published case reports performed well on most clinical tasks, are free of protected health information, and are small compared to commonly used embeddings trained on larger clinical and non-clinical corpora. Choosing the optimal corpus, embedding dimension, and embedding model for a given task involves tradeoffs among privacy, reproducibility, performance, and computational resources.