Abstract: Scientific data are being generated at an ever-increasing rate. The Biomedical and Healthcare Data Discovery Index Ecosystem (bioCADDIE) is an NIH-funded Data Discovery Index that aims to provide a platform for researchers to locate, retrieve, and share research datasets. The bioCADDIE 2016 Dataset Retrieval Challenge was held to identify the most effective dataset retrieval methods. We aimed to assess the value of Medical Subject Heading (MeSH) term-based query expansion to improve retrieval. Our system, base…
“…In the biomedical area, UMLS, MeSH (22), SNOMED-CT, ICD-10, WordNet and Wikipedia are used (30). Generally, the result of lexicon-type expansion is positive (in the bioCADDIE contest see, for example, (19, 20)). We did not use this method in our work because we lacked access to the MeSH medical text indexer service.…”
Section: Results (mentioning)
confidence: 99%
“…Additional runs determined the optimal number of MeSH terms and weighting. Their best overall score used five MeSH terms with a 1:5 terms:words weighting ratio (19). This is the same ratio we used in our best run, in which the query expansion terms are derived from word2vec.…”
Information retrieval from biomedical repositories has become a challenging task because of their increasing size and complexity. To facilitate research aimed at improving the search for relevant documents, various information retrieval challenges have been launched. In this article, we present the improved medical information retrieval systems designed by Poznan University of Technology and Poznan University of Medical Sciences as a contribution to the bioCADDIE 2016 challenge, a task focusing on information retrieval from a collection of 794 992 datasets generated from 20 biomedical repositories. The system developed by our team utilizes the Terrier 4.2 search platform enhanced by a query expansion method using word embeddings. This approach, after post-challenge modifications and improvements (with particular regard to assigning proper weights for original and expanded terms), allowed us to achieve the second-best infNDCG measure (0.4539) compared with the challenge results, and an infAP of 0.3978. This demonstrates that proper utilization of word embeddings can be a valuable addition to the information retrieval process. Some analysis is provided on related work involving other bioCADDIE contributions. We discuss the possibility of improving our results by using better word embedding schemes to find candidates for query expansion.
Database URL: https://biocaddie.org/benchmark-data
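The weighted query expansion described in the statements above can be sketched in a few lines of Python. This is only an illustration, not the authors' Terrier-based system: the embedding table `EMBEDDINGS`, its values, and the function `expand_query` are hypothetical, and reading the 1:5 ratio as "weight 1 for each expansion term, weight 5 for each original word" is an assumption.

```python
import math

# Hypothetical toy embeddings; a real system would load word2vec vectors.
EMBEDDINGS = {
    "cancer": [0.9, 0.1, 0.0],
    "tumor": [0.85, 0.15, 0.05],
    "neoplasm": [0.8, 0.2, 0.1],
    "gene": [0.1, 0.9, 0.0],
    "expression": [0.2, 0.8, 0.1],
    "mouse": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_query(words, k=1, original_weight=5.0, expansion_weight=1.0):
    """Return {term: weight}: original words plus their k nearest neighbours.

    The default weights follow the assumed reading of the 1:5 ratio
    (expansion terms : original words).
    """
    weights = {w: original_weight for w in words}
    for w in words:
        if w not in EMBEDDINGS:
            continue  # no vector available, so nothing to expand
        candidates = [c for c in EMBEDDINGS if c not in words]
        candidates.sort(
            key=lambda c: cosine(EMBEDDINGS[w], EMBEDDINGS[c]), reverse=True
        )
        for c in candidates[:k]:
            # Never overwrite the weight of an original query word.
            weights.setdefault(c, expansion_weight)
    return weights


weighted_query = expand_query(["cancer", "gene"])
print(weighted_query)
```

The resulting term weights would then be passed to whatever weighted-query mechanism the search platform offers; that step is deliberately left out here because it depends on the platform.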
“…Therefore, the query expansion method is introduced into the QA model; it bridges the semantic gap between questions and answers by adding words related to the answers to the original query. In the medical field, external medical knowledge resources such as MeSH [9], UMLS [10], and several medical ontology databases [11] are employed as the source of expansion words. However, query expansion based only on synonyms is incapable of accurately capturing the semantic information in the corpus.…”
Section: Question Answering Based On Query Expansion (mentioning)
With the development of the Internet of Things, intelligent medical devices and intelligent consultation platforms have been rapidly popularized, providing great convenience for patients seeking treatment and for doctors giving consultations. Given the large volume of electronic medical data, automatically and accurately learning professional knowledge and applying it is very important. Existing intelligent medical question answering models typically use query expansion to improve answer-matching accuracy but ignore the entity associations between questions and answers, and randomly generated negative samples do not train the model to capture more semantic information. To solve these problems, a question answering method based on dual-dimensional entity association for intelligent medicine is proposed. This method learns semantics from the question dimension and the answer dimension separately. In the question dimension, query expansion words with strong relevance to query intention are obtained through entity association in the medical knowledge graph. In the answer dimension, answer sentences are segmented and sampled using a variety of similarity distances to generate negative samples in different ranges, which provide different levels of correlation information between entities for model training; the trained models are then integrated to improve the accuracy and robustness of the question answering model. The experimental results show that the proposed question answering model improves accuracy.
“…In addition, the biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE) dataset retrieval challenge was organized in 2016 to evaluate the effectiveness of information retrieval (IR) techniques in identifying relevant biomedical datasets in DataMed (3). Among the teams that participated in this shared task, the prevalent approaches for searching datasets using a query were probabilistic or machine learning based IR (4), medical subject headings (MeSH) term based query expansion (5), word embeddings and named entity identification (6), and re-ranking (7). Similarly, a specialized search engine named Omicseq was developed for retrieving omics data (8).…”
It is a growing trend among researchers to make their data publicly available for experimental reproducibility and data reusability. Sharing data with fellow researchers helps in increasing the visibility of the work. On the other hand, there are researchers who are inhibited by the lack of data resources. To overcome this challenge, many repositories and knowledge bases have been established to date to ease data sharing. Further, in the past two decades, there has been an exponential increase in the number of datasets added to these dataset repositories. However, most of these repositories are domain-specific, and none of them can recommend datasets to researchers/users. Naturally, it is challenging for a researcher to keep track of all the relevant repositories for potential use. Thus, a dataset recommender system that recommends datasets to a researcher based on previous publications can enhance their productivity and expedite further research. This work adopts an information retrieval (IR) paradigm for dataset recommendation. We hypothesize that two fundamental differences exist between dataset recommendation and PubMed-style biomedical IR beyond the corpus. First, instead of keywords, the query is the researcher, embodied by his or her publications. Second, to filter the relevant datasets from non-relevant ones, researchers are better represented by a set of interests, as opposed to the entire body of their research. This second approach is implemented using a non-parametric clustering technique. These clusters are used to recommend datasets for each researcher using the cosine similarity between the vector representations of publication clusters and datasets. The maximum normalized discounted cumulative gain at 10 (NDCG@10), precision at 10 (p@10) partial and p@10 strict of 0.89, 0.78 and 0.61, respectively, were obtained using the proposed method after manual evaluation by five researchers. 
To the best of our knowledge, this is the first study of its kind on content-based dataset recommendation. We hope that this system will further promote data sharing, offset the researchers’ workload in identifying the right dataset and increase the reusability of biomedical datasets.
Database URL: http://genestudy.org/recommends/#/
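The abstract above ranks datasets by cosine similarity to a publication-cluster vector and reports NDCG@10. A minimal sketch of both steps follows; it is not the authors' implementation. The vectors, dataset names, and relevance grades are made up for illustration, the linear-gain form of DCG is an assumption (the abstract does not say which variant was used), and the ideal ranking is computed only from the judged list passed in.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_datasets(cluster_vec, dataset_vecs):
    """Order dataset names by descending similarity to one interest cluster."""
    return sorted(
        dataset_vecs,
        key=lambda name: cosine(cluster_vec, dataset_vecs[name]),
        reverse=True,
    )


def dcg(relevances):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k, normalised by the best ordering of the same judged items."""
    ideal = dcg(sorted(ranked_relevances, reverse=True)[:k])
    return dcg(ranked_relevances[:k]) / ideal if ideal else 0.0


# Hypothetical two-dimensional vectors for one cluster and three datasets.
cluster = [1.0, 0.0]
datasets = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ranking = rank_datasets(cluster, datasets)

# Hypothetical graded judgements (2 = relevant, 1 = partial, 0 = not relevant).
judgements = {"a": 2, "b": 0, "c": 1}
score = ndcg_at_k([judgements[name] for name in ranking], k=10)
print(ranking, score)
```

A researcher with several interest clusters would get one such ranking per cluster; how those rankings are merged is not specified in the abstract and is left out of the sketch.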