Probabilistic and machine learning-based retrieval approaches for biomedical dataset retrieval

Karisani, Payam; Qin, Zhaohui S.; Agichtein, Eugene

doi:10.1093/database/bax104

Cited by 7 publications

(4 citation statements)

References 18 publications


“…While messy, unstructured data makes up the majority of data created on a daily basis, estimated to be 95% of all "big data" [35]. Alongside IR's ability to read unstructured, semistructured data - data that possess attributes of both structured and unstructured - and structured data, it is the ability to consume the sheer quantity of data [36], the ability to filter data [37], as well as the ability to process and categorize data that has provided its ultimate strength [31]. Categorizing data into various classes, the act of grouping alike data, can be accomplished in a variety of ways, one of which is clustering.…”

Section: Importance Of Information Retrieval Systems (mentioning)

confidence: 99%

Discovering Related Terms and Detecting Trends in Software Engineering Using Word Embeddings

Baskararajah

2024

Preprint


The Software Engineering (SE) community is prolific, making it challenging for experts to keep up with the flood of new papers and for neophytes to enter the field. One solution that has been proposed to ease the burden of entry on the community would be automatic summarization of papers. While there exist term and trend summarization and analysis tools, the unique language utilized in SE requires bespoke solutions. Therefore, we posit that the community may benefit from a tool extracting terms and their interrelations from the SE community's text corpus and showing terms' trends. In this paper, we build a prototyping tool using the word embedding technique. We train the embeddings on the SE Body of Knowledge handbook and 15,233 research papers' titles and abstracts. We create test cases necessary for validation of the training of the embeddings. Upon gathering the trends of interrelated SE terms, we also use cluster analysis to investigate the trends, to help discover underlying patterns in the way trends in SE rise and fall in popularity. We provide representative examples showing that the embeddings may aid in summarizing terms and uncovering trends in the knowledge base, as well as showing examples of patterns that may lie underneath trends in interrelated terms in software engineering.


“…This classification is available (in tabular form) at shorturl.at/D1234. [24], [15] [18], [17], [20], [22], [25], Techniques [27], [28], [43], [64], [30], [32], [34], [38], [40], [45], [46], [47], [49], [54], [55], [56], [57], [59], [60], [61], [66], [68], [69], [70], [73], [74], [75], [77], [26], [42], [44], [11], [13], [36]. [12], [14], [16], [19], [21], [23], [29], [31], [35], [39], [41], …”

Section: Data Extraction and Classification (mentioning)

confidence: 99%

Data Curation and Optimization Techniques: A Systematic Mapping Study

Azevedo; Musicante; Costa

2022

Preprint


We develop a Systematic Mapping Study to observe trends and research opportunities around the concepts and techniques used in Data Curation for Big Data. Our work investigates scientific publications with the aim of identifying how data curation has been used recently, to organize and publish big data corpora. We are interested in browsing, identifying the mathematical and computational tools used in data curation. We focus on identifying how data curation has been modeled in different scenarios and which computational/mathematical techniques have contributed to improve data curation, with the aim of answering the following questions: (i) What mathematical fields have most contributed in the context of Data Curation? (ii) Which classes of optimization algorithms are used in the context of Data Curation? (iii) What application domains have benefited the most from Data Curation? While our main focus is on the definition of new methods and algorithms, we identified a large number of papers that concentrates just on applying known techniques to specific domains. Our study may be useful to identify challenges and opportunities for further theoretical studies, as well as to show the use of some formal techniques in real-life applications.


“…In addition to that biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE) dataset retrieval challenge was organized in 2016 to evaluate the effectiveness of information retrieval (IR) techniques in identifying relevant biomedical datasets in DataMed (3). Among the teams participated in this shared task, use of probabilistic or machine learning based IR (4), medical subject headings (MeSH) term based query expansion (5), word embeddings and identifying named entity (6), and re-ranking (7) for searching datasets using a query were the prevalent approaches. Similarly, a specialized search engine named Omicseq was developed for retrieving omics data (8).…”

Section: Introduction (mentioning)

confidence: 99%

A content-based dataset recommendation system for researchers—a case study on Gene Expression Omnibus (GEO) repository

Patra; Roberts

2020

Database


It is a growing trend among researchers to make their data publicly available for experimental reproducibility and data reusability. Sharing data with fellow researchers helps in increasing the visibility of the work. On the other hand, there are researchers who are inhibited by the lack of data resources. To overcome this challenge, many repositories and knowledge bases have been established to date to ease data sharing. Further, in the past two decades, there has been an exponential increase in the number of datasets added to these dataset repositories. However, most of these repositories are domain-specific, and none of them can recommend datasets to researchers/users. Naturally, it is challenging for a researcher to keep track of all the relevant repositories for potential use. Thus, a dataset recommender system that recommends datasets to a researcher based on previous publications can enhance their productivity and expedite further research. This work adopts an information retrieval (IR) paradigm for dataset recommendation. We hypothesize that two fundamental differences exist between dataset recommendation and PubMed-style biomedical IR beyond the corpus. First, instead of keywords, the query is the researcher, embodied by his or her publications. Second, to filter the relevant datasets from non-relevant ones, researchers are better represented by a set of interests, as opposed to the entire body of their research. This second approach is implemented using a non-parametric clustering technique. These clusters are used to recommend datasets for each researcher using the cosine similarity between the vector representations of publication clusters and datasets. The maximum normalized discounted cumulative gain at 10 (NDCG@10), precision at 10 (p@10) partial and p@10 strict of 0.89, 0.78 and 0.61, respectively, were obtained using the proposed method after manual evaluation by five researchers. 
As per the best of our knowledge, this is the first study of its kind on content-based dataset recommendation. We hope that this system will further promote data sharing, offset the researchers’ workload in identifying the right dataset and increase the reusability of biomedical datasets. Database URL: http://genestudy.org/recommends/#/
