Protein secondary structure provides an information bridge between the primary structure and the tertiary (3D) structure. Accurate prediction of 8-state protein secondary structure (PSS) is widely used in the structural and functional analysis of proteins in bioinformatics. Recently, deep learning techniques have been applied to this problem and have raised Q8 accuracy remarkably. Nevertheless, from a theoretical standpoint, there is still considerable room for improvement, specifically in 8-state (Q8) protein secondary structure prediction. In this paper, we present two deep learning architectures, namely 1D-Inception and BD-LSTM, to improve the performance of 8-class PSS prediction. The input to both architectures is a carefully constructed feature matrix built from the sequence features and profile features of the proteins. First, 1D-Inception is a deep convolutional neural network-based approach that is inspired by the InceptionV3 model and contains three inception modules. Second, BD-LSTM is a recurrent neural network model that includes bidirectional LSTM layers. Our proposed 1D-Inception method achieved Q8 accuracies of 76.65%, 71.18%, 76.86%, and 74.07% on the benchmark CullPdb6133, CB513, CASP10, and CASP11 datasets, respectively. BD-LSTM achieved Q8 accuracies of 74.71%, 69.49%, 74.07%, and 72.37% on the CullPdb6133, CB513, CASP10, and CASP11 datasets, respectively. Both architectures efficiently process local and global interdependencies between amino acids, which is beneficial for predicting each class accurately. To the best of our knowledge, the experimental results show that the 1D-Inception model outperforms all state-of-the-art methods on the benchmark CullPdb6133, CB513, and CASP10 datasets.
Datasets and Methodology

Datasets

Here, we use five different datasets, namely CullPdb6133, CullPdb6133 filtered, CB513, CASP10, and CASP11. Of these, CullPdb6133 and CullPdb6133 filtered are used for training, while CB513, CASP10, CASP11, and 272 protein sequences from CullPdb6133 are used for testing.

CullPdb6133: The CullPdb6133 dataset [51] is a non-homologous protein dataset provided by PISCES CullPDB, with known secondary structure for each protein. It contains a total of 6128 protein sequences, of which 5600 samples ([0:5600]) form the training set, 272 samples ([5605:5877]) are used for testing, and 256 samples ([5877:6133]) form the validation set. The CullPdb6133 (non-filtered) dataset has 57 features per residue: amino acid residues (features [0:22)), N- and C-terminals (features [31:33)), relative and absolute solvent accessibility (features [33:35)), and sequence profiles (features [35:57)). We used the secondary structure notation (features [22:31)) for labeling. This CullPdb dataset is publicly available from [2].
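The index ranges above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code. It assumes the dataset has already been loaded into an array of shape (proteins, sequence positions, 57) and that every range is half-open. The function name `split_cullpdb` is hypothetical, and concatenating the residue and profile columns into one input matrix is an assumption based on the description of the input features.

```python
import numpy as np

# Feature-column ranges as stated in the text (treated as half-open).
RESIDUE = slice(0, 22)    # amino acid residues
LABEL = slice(22, 31)     # secondary structure notation (labels)
TERMINI = slice(31, 33)   # N- and C-terminals (not used in this sketch)
SOLVENT = slice(33, 35)   # relative and absolute solvent accessibility (not used)
PROFILE = slice(35, 57)   # sequence profile

# Sample-index ranges as stated in the text.
SPLITS = {
    "train": slice(0, 5600),
    "test": slice(5605, 5877),
    "valid": slice(5877, 6133),
}


def split_cullpdb(data):
    """Split an array of shape (n_proteins, seq_len, 57) into (x, y) pairs.

    Assumption: x is the residue columns followed by the profile columns,
    and y is the secondary structure label columns.
    """
    if data.ndim != 3 or data.shape[2] != 57:
        raise ValueError("expected an array of shape (n_proteins, seq_len, 57)")
    parts = {}
    for name, rows in SPLITS.items():
        block = data[rows]
        x = np.concatenate([block[..., RESIDUE], block[..., PROFILE]], axis=-1)
        y = block[..., LABEL]
        parts[name] = (x, y)
    return parts


if __name__ == "__main__":
    # Synthetic stand-in for the real data; a sequence length of 4 is arbitrary.
    fake = np.zeros((6133, 4, 57), dtype=np.float32)
    for name, (x, y) in split_cullpdb(fake).items():
        print(name, x.shape, y.shape)
```

Under these assumptions the three splits hold 5600, 272, and 256 proteins, which sums to the 6128 sequences stated above; the ranges as written leave indices 5600 to 5604 unassigned.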