BackgroundThis article describes a high-recall, high-precision approach for the extraction of biomedical entities from scientific articles.MethodThe approach uses a two-stage pipeline, combining a dictionary-based entity recognizer with a machine-learning classifier. First, the OGER entity recognizer, which has a bias towards high recall, annotates the terms that appear in selected domain ontologies. Subsequently, the Distiller framework uses this information as a feature for a machine learning algorithm to select the relevant entities only. For this step, we compare two different supervised machine-learning algorithms: Conditional Random Fields and Neural Networks.ResultsIn an in-domain evaluation using the CRAFT corpus, we test the performance of the combined systems when recognizing chemicals, cell types, cellular components, biological processes, molecular functions, organisms, proteins, and biological sequences. Our best system combines dictionary-based candidate generation with Neural-Network-based filtering. It achieves an overall precision of 86% at a recall of 60% on the named entity recognition task, and a precision of 51% at a recall of 49% on the concept recognition task.ConclusionThese results are to our knowledge the best reported so far in this particular task.
Background
The use of complementary and alternative medicine (CAM) among cancer patients is widespread and mostly self-administrated. Today, one of the most relevant topics is the nondisclosure of CAM use to doctors. This general lack of communication exposes patients to dangerous behaviors and to less reliable information channels, such as the Web. The Italian context scarcely differs from this trend. Today, we are able to mine and analyze systematically the unstructured information available in the Web, to get an insight of people’s opinions, beliefs, and rumors concerning health topics.
Objective
Our aim was to analyze Italian Web conversations about CAM, identifying the most relevant Web sources, therapies, and diseases and measure the related sentiment.
Methods
Data have been collected using the Web Intelligence tool ifMONITOR. The workflow consisted of 6 phases: (1) eligibility criteria definition for the ifMONITOR search profile; (2) creation of a CAM terminology database; (3) generic Web search and automatic filtering, the results have been manually revised to refine the search profile, and stored in the ifMONITOR database; (4) automatic classification using the CAM database terms; (5) selection of the final sample and manual sentiment analysis using a 1-5 score range; (6) manual indexing of the Web sources and CAM therapies type retrieved. Descriptive univariate statistics were computed for each item: absolute frequency, percentage, central tendency (mean sentiment score [MSS]), and variability (standard variation σ).
Results
Overall, 212 Web sources, 423 Web documents, and 868 opinions have been retrieved. The overall sentiment measured tends to a good score (3.6 of 5). Quite a high polarization in the opinions of the conversation partaking emerged from standard variation analysis (σ≥1). In total, 126 of 212 (59.4%) Web sources retrieved were nonhealth-related. Facebook (89; 21%) and Yahoo Answers (41; 9.7%) were the most relevant. In total, 94 CAM therapies have been retrieved. Most belong to the “biologically based therapies or nutrition” category: 339 of 868 opinions (39.1%), showing an MSS of 3.9 (σ=0.83). Within nutrition, “diets” collected 154 opinions (18.4%) with an MSS of 3.8 (σ=0.87); “food as CAM” overall collected 112 opinions (12.8%) with a MSS of 4 (σ=0.68). Excluding diets and food, the most discussed CAM therapy is the controversial Italian “Di Bella multitherapy” with 102 opinions (11.8%) with an MSS of 3.4 (σ=1.21). Breast cancer was the most mentioned disease: 81 opinions of 868.
Conclusions
Conversations about CAM and cancer are ubiquitous. There is a great concern about the biologically based therapies, perceived as harmless and useful, under-rating all risks related to dangerous interactions or malnutrition. Our results can be useful to doctors to be aware of the implications of these beliefs for the clinical practice. Web conversation exploitation could be a strategy to gain insights of people’s perspective for other controversial topics.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.