Effective biomedical document classification for identifying publications relevant to the mouse Gene Expression Database (GXD)

Jiang, Xiangying; Ringwald, Martin; Blake, Judith A.; Shatkay, Haggit

doi:10.1093/database/baaa043

Cited by 7 publications

(13 citation statements)

References 1 publication

“…Captions associated with figures provide another important source of information for biomedical document classification. In order to make use of captions, we employ a standard preprocessing procedure that includes named-entity recognition (NER), stemming and stop-words removal as we have done in our earlier work (Jiang et al., 2017, 2020). For NER, we first identify all gene, disease, chemical, species, mutation and cell-line concepts using PubTator, which is widely used for annotations of biomedical concepts (Wei et al., 2019).…”

Section: Methods (mentioning)

confidence: 99%
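
The preprocessing steps named in the statement above (entity tagging, stop-word removal, stemming) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `ENTITIES` dictionary is a hypothetical stand-in for PubTator annotations, the stop-word list is deliberately tiny, and `toy_stem` is a crude suffix stripper used in place of a real stemmer.

```python
import re

# Hypothetical entity annotations, standing in for PubTator output.
ENTITIES = {"sonic hedgehog": "GENE", "mouse": "SPECIES"}

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "of", "in", "is", "a", "an", "and", "to", "was"}


def tag_entities(text, entities):
    """Replace each known entity mention with its concept-type token."""
    # Longest mentions first so multi-word names are matched whole.
    for mention in sorted(entities, key=len, reverse=True):
        pattern = r"\b" + re.escape(mention) + r"\b"
        text = re.sub(pattern, entities[mention], text, flags=re.IGNORECASE)
    return text


def toy_stem(token):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(caption):
    """Tag entities, drop stop words, and stem the remaining tokens."""
    tagged = tag_entities(caption, ENTITIES)
    tokens = re.findall(r"[A-Za-z]+", tagged)
    concept_tokens = set(ENTITIES.values())
    result = []
    for tok in tokens:
        if tok in concept_tokens:  # keep concept tokens unchanged
            result.append(tok)
            continue
        tok = tok.lower()
        if tok in STOP_WORDS:
            continue
        result.append(toy_stem(tok))
    return result


print(preprocess("Expression of Sonic hedgehog in the mouse limb buds"))
```

Replacing each mention with its concept type lets a classifier generalize across specific gene or species names, which is presumably why NER is applied before the other steps.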

“…Image captions have been shown effective for document classification in several studies (Burns et al., 2019; Jiang et al., 2017, 2020; Regev et al., 2002). For instance, Burns et al. (2019) compared classification performance under different information sources, when identifying publications containing molecular interaction information, relevant to the IntAct Molecular Interaction database (Kerrien et al., 2012).…”

Section: Introduction (mentioning)

confidence: 99%

Utilizing image and caption information for biomedical document classification

Jiang, Zhang, et al. 2021, Bioinformatics

Motivation: Biomedical research findings are typically disseminated through publications. To simplify access to domain-specific knowledge while supporting the research community, several biomedical databases devote significant effort to manual curation of the literature, a labor-intensive process. The first step toward biocuration requires identifying articles relevant to the specific area on which the database focuses. Thus, automatically identifying publications relevant to a specific topic within a large volume of publications is an important task toward expediting the biocuration process and, in turn, biomedical research. Current methods focus on textual contents, typically extracted from the title-and-abstract. Notably, images and captions are often used in publications to convey pivotal evidence about processes, experiments and results.

Results: We present a new document classification scheme, using both image and caption information, in addition to titles-and-abstracts. To use the image information, we introduce a new image representation, namely Figure-word, based on class labels of subfigures. We use word embeddings for representing captions and titles-and-abstracts. To utilize all three types of information, we introduce two information integration methods. The first combines Figure-words and textual features obtained from captions and titles-and-abstracts into a single larger vector for document representation; the second employs a meta-classification scheme. Our experiments and results demonstrate the usefulness of the newly proposed Figure-words for representing images. Moreover, the results showcase the value of Figure-words, captions and titles-and-abstracts in providing complementary information for document classification; these three sources of information, when combined, lead to an overall improved classification performance.

Availability and implementation: Source code and the list of PMIDs of the publications in our datasets are available upon request.

“…As we demonstrated in our earlier work (13), image captions in biomedical publications, which form brief summaries of the images, contain significant and useful information for determining the topic discussed in the publications. As part of future work, we plan to integrate image captions into the classification scheme.…”

Section: Discussion (mentioning)

confidence: 86%

“…Much work over the past two decades aimed to address biomedical document classification. Most of the proposed methods are trained and tested over balanced data sets, in which all classes are similar in size (13–16). However, biomedical data sets are typically highly imbalanced, where relatively few publications within a large volume of literature are actually relevant to any specific topic of interest (17).…”

Section: Introduction (mentioning)

confidence: 99%

“…In our own preliminary work (13), we presented an effective—yet relatively simple—classification scheme using readily available tools, while employing several of our statistical feature selection strategies, for identifying publications relevant to GXD among a large set of MGI documents. Our proposed method attained high performance (>0.9 on all performance measures) when trained and tested over a large balanced data set of curated GXD publications.…”

Section: Introduction (mentioning)

confidence: 99%

An effective biomedical document classification scheme in support of biocuration: addressing class imbalance

et al. 2019

Published literature is an important source of knowledge supporting biomedical research. Given the large and increasing number of publications, automated document classification plays an important role in biomedical research. Effective biomedical document classifiers are especially needed for bio-databases, in which the information stems from many thousands of biomedical publications that curators must read in detail and annotate. In addition, biomedical document classification often amounts to identifying a small subset of relevant publications within a much larger collection of available documents. As such, addressing class imbalance is essential to a practical classifier. We present here an effective classification scheme for automatically identifying papers among a large pool of biomedical publications that contain information relevant to a specific topic, which the curators are interested in annotating. The proposed scheme is based on a meta-classification framework using cluster-based under-sampling combined with named-entity recognition and statistical feature selection strategies. We examined the performance of our method over a large imbalanced data set that was originally manually curated by the Jackson Laboratory’s Gene Expression Database (GXD). The set consists of more than 90 000 PubMed abstracts, of which about 13 000 documents are labeled as relevant to GXD while the others are not relevant. Our results, 0.72 precision, 0.80 recall and 0.75 f-measure, demonstrate that our proposed classification scheme effectively categorizes such a large data set in the face of data imbalance.
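
One plausible reading of the cluster-based under-sampling mentioned in this abstract is sketched below. It assumes the majority (not relevant) class is partitioned with k-means and that each cluster is paired with the full minority (relevant) class to form a more balanced training subset for one base classifier. The abstract does not spell out these details, so the function names, the plain k-means routine and the pairing strategy are all illustrative assumptions rather than the authors' implementation.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on lists of floats; returns a cluster index per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_undersample(majority, minority, k):
    """Pair each non-empty majority-class cluster with the whole minority class."""
    labels = kmeans(majority, k)
    subsets = []
    for c in range(k):
        cluster = [p for p, lab in zip(majority, labels) if lab == c]
        if cluster:
            X = cluster + minority
            y = [0] * len(cluster) + [1] * len(minority)  # 1 = relevant
            subsets.append((X, y))
    return subsets


# Toy feature vectors (hypothetical data, for illustration only).
majority = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
minority = [[2.0, 2.0], [2.1, 2.0]]
subsets = cluster_undersample(majority, minority, k=2)
print(len(subsets), [len(X) for X, _ in subsets])
```

In a meta-classification setup, each `(X, y)` subset would train one base classifier, and a second-level classifier would combine the base outputs. That second stage is omitted here.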

Classifying domain-specific text documents containing ambiguous keywords

et al. 2021

A keyword-based search of comprehensive databases such as PubMed may return irrelevant papers, especially if the keywords are used in multiple fields of study. In such cases, domain experts (curators) need to verify the results and remove the irrelevant articles. Automating this filtering process will save time, but it has to be done well enough to ensure few relevant papers are rejected and few irrelevant papers are accepted. A good solution would be fast, work with the limited amount of data freely available (full paper body may be missing), handle ambiguous keywords and be as domain-neutral as possible. In this paper, we evaluate a number of classification algorithms for identifying a domain-specific set of papers about echinoderm species and show that the resulting tool satisfies most of the abovementioned requirements. Echinoderms consist of a number of very different organisms, including brittle stars, sea stars (starfish), sea urchins and sea cucumbers. While their taxonomic identifiers are specific, the common names are used in many other contexts, creating ambiguity and making a keyword search prone to error. We try classifiers using Linear, Naïve Bayes, Nearest Neighbor, Tree, SVM, Bagging, AdaBoost and Neural Network learning models and compare their performance. We show how effective the resulting classifiers are in filtering irrelevant articles returned from PubMed. The methodology used is more dependent on the good selection of training data and is a practical solution that can be applied to other fields of study facing similar challenges. Database URL: The code and data reported in this paper are freely available at http://xenbaseturbofrog.org/pub/Text-Topic-Classifier/
