2020 · DOI: 10.1038/s41598-020-71450-8

Effect of sequence padding on the performance of deep learning models in archaeal protein functional prediction

Abstract: The use of raw amino acid sequences as input for deep learning models for protein functional prediction has gained popularity in recent years. This scheme obliges one to manage proteins of different lengths, while deep learning models require same-shape input. To accomplish this, zeros are usually added to each sequence up to an established common length in a process called zero-padding. However, the effect of different padding strategies on model performance and data structure is yet unknown. We propose and impl…
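For readers unfamiliar with the operation described above, here is a minimal sketch of zero-padding, assuming one-hot encoding over the 20 standard amino acids and a fixed target length; the function and variable names are illustrative, not the paper's code.

```python
# Minimal sketch: one-hot encode protein sequences and post-pad with
# all-zero rows so that every input has the same shape.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (assumed alphabet)
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_pad(sequence: str, max_len: int) -> np.ndarray:
    """Encode a sequence as a (max_len, 20) one-hot matrix; positions past
    the end of the sequence remain all-zero rows (zero-padding)."""
    encoded = np.zeros((max_len, len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(sequence[:max_len]):  # truncate if too long
        encoded[pos, AA_INDEX[aa]] = 1.0
    return encoded

# Two proteins of different lengths become a single same-shape batch:
batch = np.stack([one_hot_pad("MKT", 8), one_hot_pad("MKTAYIA", 8)])
print(batch.shape)  # (2, 8, 20)
```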

Cited by 28 publications (15 citation statements) · References 43 publications
“…Therefore, we selected CT-padding as the default padding strategy in the experiments that followed and in the training of the final production models. Though the reason why CT-padding performs better is unclear, our finding is in agreement with the study of Lopez-del Rio et al., who compared eight different padding strategies for protein functional prediction using one-hot encoding as the feature. They confirmed that post-padding (i.e., CT-padding here) outperforms the other padding types, including pre-padding (i.e., NT-padding here), middle-padding, stratified-padding, extreme-padding, random-padding, zoom-padding, and augmented-padding, for convolutional architectures.…”
Section: Results (supporting)
confidence: 93%
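To make the CT/NT distinction concrete, the sketch below contrasts the two strategies on an integer-encoded sequence; it is an illustrative reconstruction, not the cited authors' implementation.

```python
# Post-padding (CT) appends zeros after the sequence; pre-padding (NT)
# prepends them. Both yield fixed-length inputs for a convolutional model.
import numpy as np

def pad_indices(indices, max_len, mode="post"):
    """Zero-pad an integer-encoded sequence to max_len."""
    out = np.zeros(max_len, dtype=np.int64)
    n = min(len(indices), max_len)
    if mode == "post":      # CT-padding: zeros after the sequence
        out[:n] = indices[:n]
    else:                   # NT-padding: zeros before the sequence
        out[max_len - n:] = indices[:n]
    return out

seq = [3, 11, 17, 1]  # a toy integer-encoded fragment
print(pad_indices(seq, 8, "post"))  # [ 3 11 17  1  0  0  0  0]
print(pad_indices(seq, 8, "pre"))   # [ 0  0  0  0  3 11 17  1]
```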
“…We want to highlight that the overall accuracy of the model (~93%) could have been improved by employing a larger dataset and using more materials. In addition, methodologies related to the treatment of the data, such as noise analysis (reduction or inclusion) [38,39], resampling and smoothing [40], and padding configurations [41], could have enhanced the model's accuracy.…”
Section: Influence of the Unloading Curve on the Accuracy of the CNN Model (mentioning)
confidence: 99%
“…The analogy between these fields motivates the application of Natural Language Processing (NLP) techniques to FOG detection, changing the length of the sequence with minimal impact on the neural network. When applying machine learning to NLP and protein functional prediction, the issue of variable sequence length is solved by adding pad values so that all sequences share the same length [26-31]. Zero-padding is frequently used in NLP and protein prediction; the padding operation concatenates a vector of zero values to the measured sequences [30-34].…”
Section: Related Work (mentioning)
confidence: 99%
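As a concrete example of the zero-padding operation described in this statement, the snippet below uses the pad_sequences utility from Keras; the toy integer sequences are hypothetical, not FOG sensor data.

```python
# Pad variable-length integer sequences with zeros to a common length.
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[4, 7, 2], [9, 1, 5, 3, 8]]  # variable-length toy sequences

# padding="post" appends zeros; padding="pre" (the default) prepends them.
padded = pad_sequences(sequences, maxlen=6, padding="post", value=0)
print(padded)
# [[4 7 2 0 0 0]
#  [9 1 5 3 8 0]]
```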