2012
DOI: 10.1371/journal.pone.0050039

Word Decoding of Protein Amino Acid Sequences with Availability Analysis: A Linguistic Approach

Abstract: The amino acid sequences of proteins determine their three-dimensional structures and functions. However, how sequence information is related to structures and functions is still enigmatic. In this study, we show that at least a part of the sequence information can be extracted by treating amino acid sequences of proteins as a collection of English words, based on a working hypothesis that amino acid sequences of proteins are composed of short constituent amino acid sequences (SCSs) or “words”. We first confir…

Cited by 29 publications (44 citation statements)
References: 46 publications
“…There are some noteworthy recent studies that encourage this line of approach: for example, nonrandom distributions of 5-aa SCS are demonstrated in the current proteome databases [38], confirming the previous finding that biological bias occurs in protein coding [28,29]. Among these existing studies, our approach is operationally one of the simplest, and it emphasizes analogies between languages and protein sequences [32,33]. Encouragingly, linguistic aspects of proteins have been noted in other studies [48,49].…”
Section: Introduction (supporting)
confidence: 82%
“…The advantage of the alignment-free approach is that any collections of proteins can be compared quantitatively. Although various types of alignment-free approaches have been developed [24,25], including our previous attempts to use membrane topology [26] and a self-organizing map [27], the alignment-free approach in the present study is based on the "availability" (frequency bias) of short constituent sequences (SCSs) of amino acids (aa) in proteins [28][29][30][31][32][33]. The length of SCSs can be 2 aa (doublet), 3 aa (triplet), 4 aa (quartet), 5 aa (pentat), and more in a given protein.…”
Section: Introduction (mentioning)
confidence: 99%
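
The statement above describes counting short constituent sequences (SCSs) of a fixed length, such as doublets, triplets, or quartets, and examining their frequency bias. A minimal sketch of that idea follows. It is an illustration under stated assumptions, not the availability score defined in the cited paper: the function names `count_scs` and `frequency_bias`, the toy sequences, and the choice of a log2 observed-to-expected ratio based on single-residue composition are all hypothetical stand-ins.

```python
from collections import Counter
from math import log2


def count_scs(sequences, k):
    """Count every overlapping window of length k (one SCS each) across all sequences."""
    counts = Counter()
    for seq in sequences:
        # A sequence shorter than k yields an empty range and contributes nothing.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def frequency_bias(sequences, k):
    """Log2 ratio of observed SCS frequency to the frequency expected
    if residues were drawn independently from the overall composition.

    ASSUMPTION: this ratio is only a stand-in for a frequency-bias measure;
    it is not the scoring scheme of the cited study.
    """
    scs_counts = count_scs(sequences, k)
    residue_counts = count_scs(sequences, 1)
    total_scs = sum(scs_counts.values())
    total_residues = sum(residue_counts.values())

    bias = {}
    for scs, n in scs_counts.items():
        expected = 1.0
        for aa in scs:
            expected *= residue_counts[aa] / total_residues
        bias[scs] = log2((n / total_scs) / expected)
    return bias


# Hypothetical toy input; real use would read sequences from a proteome file.
toy_proteins = ["MKVLAAGIVK", "MKVLSAGKVL"]
triplet_counts = count_scs(toy_proteins, 3)
triplet_bias = frequency_bias(toy_proteins, 3)
```

Changing `k` to 2, 4, or 5 gives the doublet, quartet, or 5-aa counts in the same way. A positive value in `triplet_bias` marks an SCS that occurs more often than residue composition alone would predict in this toy set.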
“…Much like humans adopt languages to communicate, biological organisms use sophisticated languages to convey information within and between cells. Inspired by this conceptual analogy, we adopt existing methods in natural language processing (NLP) to gain a deeper understanding of the "language of life" with the ultimate goal to discover functions encoded within biological sequences [1][2][3][4].…”
Section: Introduction (mentioning)
confidence: 99%
“…They call the short consequent sequences (SCS) present in protein sequences as words and use availability scores to assess the biological usage bias of SCS. Our approach of using MDL for segmentation is interesting in that it does not require prior fixing of word length as in (Motomura et al, 2012), (Motomura et al, 2013).…”
Section: Related Work (mentioning)
confidence: 99%