Background
The COVID-19 global health crisis has led to an exponential surge in published scientific literature. In an attempt to tackle the pandemic, extremely large COVID-19–related corpora are being created, sometimes containing inaccurate information, at a scale that is no longer amenable to human analysis.
Objective
In the context of searching for scientific evidence in the deluge of COVID-19–related literature, we present an information retrieval methodology for effective identification of relevant sources to answer biomedical queries posed using natural language.
Methods
Our multistage retrieval methodology combines probabilistic weighting models and reranking algorithms based on deep neural architectures to boost the ranking of relevant documents. The similarity between COVID-19 queries and documents is computed, and a series of postprocessing methods is applied to the initial ranking list to improve the match between the query and the biomedical information source and boost the position of relevant documents.
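The two-stage idea described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the first stage uses a standard Okapi BM25 formulation (with the non-negative IDF variant and common default values for `k1` and `b`), and the second stage reranks only the top candidates with a pluggable scorer. The `overlap_scorer` function is a hypothetical stand-in for a deep neural reranker, and all function names are assumptions introduced for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (a deliberate simplification)."""
    return text.lower().split()


def bm25_rank(query, docs, k1=1.2, b=0.75):
    """First stage: score every document with Okapi BM25, best first."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for idx, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append((idx, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def overlap_scorer(query, doc):
    """Hypothetical stand-in for a neural reranker: Jaccard token overlap."""
    q, d = set(tokenize(query)), set(tokenize(doc))
    return len(q & d) / len(q | d) if q | d else 0.0


def rerank(query, docs, candidates, scorer, top_k=2):
    """Second stage: rescore only the top_k candidates, keep the tail order."""
    head, tail = candidates[:top_k], candidates[top_k:]
    rescored = sorted(
        ((idx, scorer(query, docs[idx])) for idx, _ in head),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [idx for idx, _ in rescored] + [idx for idx, _ in tail]


docs = [
    "coronavirus transmission in hospitals",
    "weather forecast for tomorrow",
    "masks reduce coronavirus transmission",
]
query = "coronavirus transmission masks"
first_stage = bm25_rank(query, docs)
final_order = rerank(query, docs, first_stage, overlap_scorer)
```

Restricting the second stage to the head of the list reflects the usual motivation for multistage retrieval: the expensive model only sees a small candidate set produced by the cheap one.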
Results
The methodology was evaluated in the context of the TREC-COVID challenge, achieving results competitive with the top-ranking teams participating in the competition. In particular, the combination of bag-of-words and deep neural language models significantly outperformed an Okapi Best Match 25–based baseline, retrieving, on average, 83% of relevant documents in the top 20.
Conclusions
These results indicate that multistage retrieval supported by deep learning could enhance identification of literature for COVID-19–related questions posed using natural language.
A recent trend in health-related machine learning proposes the use of Graph Neural Networks (GNNs) to model biomedical data. This is justified by the complexity of healthcare data and the modelling power of graph abstractions. Thus, GNNs emerge as the natural choice to learn from increasing amounts of healthcare data. While formulating the problem, however, there are usually multiple design choices and decisions that can affect the final performance. In this work, we focus on Clinical Trial (CT) protocols consisting of hierarchical documents, containing free text as well as medical codes and terms, and design a classifier to predict each CT protocol's termination risk as "low" or "high". We show that while using GNNs to solve this classification task is very successful, the way the graph is constructed is also important, and one can benefit from making useful a priori information more explicit. While a natural choice is to consider each CT protocol as an independent graph and pose the problem as graph classification, consistent performance improvements can be achieved by considering the protocols as super-nodes in one unified graph, connecting them according to metadata such as a similar medical condition or intervention, and approaching the problem as a node classification task rather than graph classification. We validate this hypothesis experimentally on a large-scale manually labeled CT database. This provides useful insights into the flexibility of graph-based modeling for machine learning in the healthcare domain.
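The unified-graph construction described above can be illustrated with a small sketch. It assumes each protocol is summarized by a dictionary of metadata, and it connects two protocols whenever they share a value for one of the chosen fields. The field names `condition` and `intervention`, the protocol identifiers, and the function name are hypothetical; the abstract does not specify how similarity between metadata values is determined, so exact matching is used here purely for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_protocol_graph(protocols, keys=("condition", "intervention")):
    """Return undirected edges between protocols that share a metadata value.

    Each protocol becomes one super-node; an edge (a, b) with a < b is added
    when both protocols have the same value for any field listed in `keys`.
    """
    buckets = defaultdict(set)
    for protocol_id, metadata in protocols.items():
        for key in keys:
            value = metadata.get(key)
            if value is not None:
                buckets[(key, value)].add(protocol_id)

    edges = set()
    for members in buckets.values():
        for a, b in combinations(sorted(members), 2):
            edges.add((a, b))
    return edges


# Toy, made-up protocols used only to exercise the function.
protocols = {
    "ct1": {"condition": "diabetes", "intervention": "metformin"},
    "ct2": {"condition": "diabetes", "intervention": "insulin"},
    "ct3": {"condition": "asthma", "intervention": "insulin"},
    "ct4": {"condition": "migraine", "intervention": "placebo"},
}
edges = build_protocol_graph(protocols)
```

With such an edge set, each protocol's "low" or "high" risk label would attach to its node, and a GNN could be trained for node classification over the whole graph; that training step is outside the scope of this sketch.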
In the context of searching for COVID-19 related scientific literature, we present an information retrieval methodology for effectively finding relevant publications for different information needs. We discuss the different components of our architecture, which consist of traditional information retrieval models as well as modern neural natural language processing algorithms. We present recipes to better adapt these components to the case of an infodemic, where, on the one hand, the number of publications grows exponentially and, on the other hand, the topics of interest evolve as the pandemic progresses. The methodology was evaluated in the TREC-COVID challenge, achieving results competitive with the top-ranking teams participating in the competition. Reflecting on this challenge, we provide additional insights that may prove useful beyond it.
As the world's population continues to expand, maritime transport is critical to ensure economic growth. To improve security and safety of maritime transportation, the Automatic Identification System (AIS) collects real-time data about vessels and their positions. While a large portion of the AIS data is provided via an automatic tracking system, some key fields, such as destination and draught, are entered manually by the ship navigator and are thus prone to errors. To support decision making in maritime industries, in this paper we propose a data-driven vessel destination prediction algorithm based on heterogeneous graph and machine learning models. We frame the task as a multi-class classification problem, where the destination port is the category to be predicted given the vessel and origin information. Then, we use a link prediction model in a weighted heterogeneous graph to predict the vessel destination. Experimental comparison against baseline methods, such as logistic regression and k-nearest neighbors, showed that our model provides robust performance, outperforming the baseline algorithms by 9% and 33% in terms of accuracy and F1-score, respectively. Thus, heterogeneous graph models provide a powerful alternative for predicting port destination, and could support enhancing AIS data quality and better decision making in maritime transportation industries.
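To make the framing above concrete, the sketch below treats destination prediction as scoring candidate links in a small weighted heterogeneous graph. It is an illustrative heuristic, not the model proposed in the paper: nodes are typed as vessels, origin ports, or destination ports, edge weights count past voyages, and a candidate destination is scored by summing the weights of its vessel and origin edges. The class name, method names, and the toy voyages are all assumptions made for this example.

```python
from collections import defaultdict


class VoyageGraph:
    """Weighted heterogeneous graph with typed source nodes and port targets."""

    def __init__(self):
        # Maps ((node_type, node_id), destination_port) to an edge weight.
        self.weights = defaultdict(float)
        self.ports = set()

    def add_voyage(self, vessel, origin, destination):
        """Increase the vessel->destination and origin->destination weights."""
        self.weights[(("vessel", vessel), destination)] += 1.0
        self.weights[(("origin", origin), destination)] += 1.0
        self.ports.add(destination)

    def score(self, vessel, origin, port):
        """Link score for a candidate destination: sum of both edge weights."""
        return (
            self.weights.get((("vessel", vessel), port), 0.0)
            + self.weights.get((("origin", origin), port), 0.0)
        )

    def predict(self, vessel, origin):
        """Return the highest-scoring known port, or None for an empty graph."""
        if not self.ports:
            return None
        # Sorting first makes tie-breaking deterministic (alphabetical).
        return max(sorted(self.ports), key=lambda p: self.score(vessel, origin, p))


graph = VoyageGraph()
graph.add_voyage("v1", "Rotterdam", "Hamburg")
graph.add_voyage("v1", "Rotterdam", "Hamburg")
graph.add_voyage("v1", "Antwerp", "Oslo")
graph.add_voyage("v2", "Rotterdam", "Oslo")

prediction = graph.predict("v1", "Rotterdam")
```

Because every known port is a candidate class, the prediction is effectively a multi-class decision, which matches the framing in the text; a learned link prediction model would replace the hand-written `score` method.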