We introduce the first meta-service for information extraction in molecular biology, the BioCreative MetaServer (BCMS; http://bcms.bioinfo.cnio.es/). This prototype platform is a joint effort of 13 research groups and provides automatically generated annotations for PubMed/Medline abstracts. Annotation types cover gene names, gene IDs, species, and protein-protein interactions. The meta-server distributes the annotations in both human-readable and machine-readable formats (HTML/XML). This service is intended to be used by
Background: Experimentally verified protein-protein interactions (PPI) cannot be easily retrieved by researchers unless they are stored in PPI databases. The curation of such databases can be accelerated by ranking newly published articles by their relevance to PPI, a task which we approach here by designing a machine-learning-based PPI classifier. All classifiers require labeled data, and the more labeled data available, the more reliable they become. Although many PPI databases with large numbers of labeled articles are available, incorporating these databases into the base training data may actually reduce classification performance, since the supplementary databases may not annotate exactly the same PPI types as the base training data. Our first goal in this paper is to find a method of selecting likely positive data from such supplementary databases. Extracting only likely positive data, however, will bias the classification model unless sufficient negative data is also added. Unfortunately, negative data is very hard to obtain because no resources compile such information. Therefore, our second aim is to select such negative data from unlabeled PubMed data. Thirdly, we explore how to exploit these likely positive and negative data. Lastly, we look at the somewhat unrelated question of which term-weighting scheme is most effective for identifying PPI-related articles.
Biomedical researchers rely on keyword-based search engines to retrieve superficially relevant documents, from which they must filter out irrelevant information manually. Hence, there is an urgent need for a more efficient system to help them rapidly locate specific molecular events and the participants involved in these events. In this paper, we propose a novel search system with a new search interface and answer ranking scheme. Because the number of query types in biomedical-specific searches is limited, we employ a form-based interface with various query templates for specifying the required information. This can ascertain a user's intentions more accurately than a conventional keyword-based interface. Ranking is another key issue in this type of search. We propose a linear ranking model, trained by a supervised learning algorithm, which combines different features. Two semantic features, named entity types and semantic roles, are incorporated into the model to help match a query with entities in relevant documents. After employing all effective semantic features, our system achieves a Top-1 accuracy of 43.1% and a Top-5 MRR of 47.1%. In comparison with the baseline system, Top-1 accuracy and Top-5 MRR increase by 9.5% and 7.1%, respectively.
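To make the evaluation terms concrete, the following is a minimal sketch of a linear ranking score and of the two reported metrics, Top-1 accuracy and Top-5 MRR (mean reciprocal rank). It is not the authors' implementation: the function names, the feature vectors, and the toy data are hypothetical, and the sketch assumes one gold answer per query, which the abstract does not state.

```python
from typing import Dict, List, Sequence


def linear_score(weights: Sequence[float], features: Sequence[float]) -> float:
    """Weighted sum of feature values (a generic linear ranking model)."""
    return sum(w * x for w, x in zip(weights, features))


def rank_candidates(weights: Sequence[float],
                    candidates: Dict[str, Sequence[float]]) -> List[str]:
    """Return candidate names ordered from highest to lowest score."""
    return sorted(candidates,
                  key=lambda name: linear_score(weights, candidates[name]),
                  reverse=True)


def top1_accuracy(rankings: List[List[str]], gold: List[str]) -> float:
    """Fraction of queries whose first-ranked answer is the gold answer."""
    hits = sum(1 for ranked, answer in zip(rankings, gold)
               if ranked and ranked[0] == answer)
    return hits / len(gold)


def mrr_at_k(rankings: List[List[str]], gold: List[str], k: int = 5) -> float:
    """Mean reciprocal rank, counting only answers within the top k."""
    total = 0.0
    for ranked, answer in zip(rankings, gold):
        for position, item in enumerate(ranked[:k], start=1):
            if item == answer:
                total += 1.0 / position
                break  # only the first correct hit counts
    return total / len(gold)


# Hypothetical toy data: three queries with ranked answers and gold labels.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
gold = ["a", "a", "x"]
print(top1_accuracy(rankings, gold))  # one of three queries is correct at rank 1
print(mrr_at_k(rankings, gold, k=5))  # (1 + 1/2 + 0) / 3
```

In this sketch a query whose gold answer does not appear in the top k contributes zero to the MRR, which is one common convention; the paper may define the metric differently.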