A keyword-based search of comprehensive databases such as PubMed may return
irrelevant papers, especially if the keywords are used in multiple fields of
study. In such cases, domain experts (curators) need to verify the results and
remove the irrelevant articles. Automating this filtering process will save
time, but it has to be done well enough to ensure few relevant papers are
rejected and few irrelevant papers are accepted. A good solution would be fast,
work with the limited amount of data freely available (full paper body may be
missing), handle ambiguous keywords and be as domain-neutral as possible. In
this paper, we evaluate a number of classification algorithms for identifying a
domain-specific set of papers about echinoderm species and show that the
resulting tool satisfies most of the abovementioned requirements. Echinoderms
consist of a number of very different organisms, including brittle stars, sea
stars (starfish), sea urchins and sea cucumbers. While their taxonomic
identifiers are specific, the common names are used in many other contexts,
creating ambiguity and making a keyword search prone to error. We train
classifiers using Linear, Naïve Bayes, Nearest Neighbor, Tree, SVM,
Bagging, AdaBoost and Neural Network learning models and compare their
performance. We show how effective the resulting classifiers are in filtering
irrelevant articles returned from PubMed. The methodology depends mainly
on a good selection of training data and is a practical solution that can be
applied to other fields of study facing similar challenges.
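
As a rough illustration of the filtering idea described above, the following minimal sketch trains one of the listed model families (Naïve Bayes) on article titles and uses it to accept or reject keyword hits. It is not the code released with this paper: the class name `NaiveBayesFilter`, the helper `tokenize`, and the toy titles are hypothetical and invented for this example. The model is a plain multinomial Naïve Bayes with Laplace smoothing, written with the Python standard library only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesFilter:
    """Multinomial Naive Bayes with Laplace smoothing (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant, an assumed default

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        # Vocabulary is the union of tokens seen in any class.
        self.vocab = set().union(*self.word_counts.values())
        return self

    def log_score(self, text, c):
        total = sum(self.word_counts[c].values())
        denom = total + self.alpha * len(self.vocab)
        # Log prior from class frequencies in the training set.
        score = math.log(self.doc_counts[c] / sum(self.doc_counts.values()))
        for tok in tokenize(text):
            if tok in self.vocab:  # tokens never seen in training are skipped
                score += math.log((self.word_counts[c][tok] + self.alpha) / denom)
        return score

    def predict(self, text):
        return max(self.classes, key=lambda c: self.log_score(text, c))


# Hypothetical training titles: 1 = about echinoderms, 0 = irrelevant hit.
train_texts = [
    "Sea urchin embryo gene regulatory network during development",
    "Starfish oocyte maturation and fertilization",
    "Brittle star arm regeneration in the echinoderm nervous system",
    "Sea cucumber intestine regeneration transcriptome",
    "Starfish sampling design for survey statistics",
    "Brittle materials fracture under star shaped indentation",
    "Sea level rise and coastal tourism economics",
    "Cucumber mosaic virus infection in plant leaves",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

clf = NaiveBayesFilter().fit(train_texts, train_labels)

# Hypothetical keyword-search results to be filtered.
hits = [
    "Regeneration of the sea urchin embryo",
    "Cucumber plant virus survey",
    "Starfish survey design statistics",
]
kept = [title for title in hits if clf.predict(title) == 1]
```

In this toy setting, titles that share an ambiguous keyword ("starfish", "cucumber") are separated by the surrounding vocabulary, which is why the choice of training examples matters so much. A realistic evaluation would use curated abstracts, held-out data, and precision and recall figures for each model family.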
Database URL: The code and data reported in this paper are freely available at
http://xenbaseturbofrog.org/pub/Text-Topic-Classifier/