2015
DOI: 10.1016/j.ymeth.2014.10.027

Text as data: Using text-based features for proteins representation and for computational prediction of their characteristics

Abstract: The current era of large-scale biology is characterized by a fast-paced growth in the number of sequenced genomes and, consequently, by a multitude of identified proteins whose function has yet to be determined. Simultaneously, any known or postulated information concerning genes and proteins is part of the ever-growing published scientific literature, which is expanding at a rate of over a million new publications per year. Computational tools that attempt to automatically predict and annotate protein charact…

Cited by 21 publications
(9 citation statements)
References 48 publications
“…Many text classification tasks utilize BoW and achieve very good performance while some have tried to recognize functional classes from text with BoW models with poorer results [19, 20]. Their applicability to function prediction has only begun to be studied in this work and Wong et al. [6]. One explanation for their performance could be due to their higher utilization of the biomedical literature; co-mentions only capture information when both a protein and GO term are recognized together while BoW only relies on a protein to be recognized.…”
Section: Results
Citation type: mentioning
confidence: 99%
“…In order to have enough training data for each functional class, they condensed information from all terms to those GO terms in the second level of the hierarchy, which results in only predicting 34 terms out of the thousands in the Molecular Function and Biological Process sub-ontologies. Recently, there has been more in-depth analysis into how to use text-based features to represent proteins from the literature without relying on manually annotated data or information extraction algorithms [6]. This work explored using abstracts along with unigram/bigram feature representation of proteins.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…For example, TM and NLP systems have been used to identify new candidate compounds for drug repurposing [48, 49], analyze relationships between proteostasis protein factors and cancer [50], prioritize cancer genes and pathways [51], predict protein functions [52], and extract disease-related biomarkers [53], as well as find associations between TFs [54]. Additionally, the text has been used as features to represent protein structures and subsequently predict their characteristics computationally [55]. Other useful applications of TM and NLP have been reported in the literature [56–61].…”
Section: Exploring Voluminous Information
Citation type: mentioning
confidence: 99%
“…Text mining found applications in different biomedical domains [31, 35–48], for example, dealing with problems of cancers [42], disease biomarkers [47], sickle cell disease [49], tomato species [50], medicinal herbs [35], sodium channels [51], drug repurposing [37], protein analysis [40, 52], prioritization of cancer genes and pathways [41], hepatitis C virus [53], cancer risk assessment [48], associations of mutations and human diseases [54], or association of transcription factors [55].…”
Section: Introduction
Citation type: mentioning
confidence: 99%
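
One of the citation statements above describes the cited work as representing proteins with unigram/bigram features drawn from abstracts. As a rough illustration of what such a bag-of-words representation can look like, here is a minimal sketch. It is not the cited paper's method: the tokenizer, the raw-count weighting, the function names and the toy abstracts are all assumptions made for this example, and a real pipeline would likely add stop-word filtering and term weighting.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngram_features(abstracts, max_n=2):
    """Count unigrams and bigrams over all abstracts linked to one protein.

    N-grams are built within each abstract, so no bigram spans two abstracts.
    """
    counts = Counter()
    for abstract in abstracts:
        tokens = tokenize(abstract)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Hypothetical toy input: invented abstracts for an invented protein.
toy_abstracts = [
    "Protein X binds ATP.",
    "ATP binding activates protein X.",
]
features = ngram_features(toy_abstracts)
```

The resulting `features` counter maps each unigram or bigram to its count. A classifier would consume these counts as a sparse feature vector, one per protein.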