This paper presents a prediction of Bacillus subtilis promoters using a Support Vector Machine system. In the literature, there is a lack of information on Gram-positive bacterial promoter sequences compared to Gram-negative bacteria. Promoter sequence identification is essential for studying gene expression. Initially, we collected the B. subtilis genome sequence from the NCBI database, and promoters were identified by their sigma factors in the DBTBS database. We then grouped the promoters according to 15 factors in 2 domains, corresponding to sigma 54 and sigma 70 of Gram-negative bacteria. Based on these data we developed a script in Python to search for promoters in the B. subtilis genome. After processing the data, we obtained 767 promoter sequences for B. subtilis, most of which were recognized by sigma SigA. To validate the data we found, we developed a software package called BacSVM+, which receives promoters as input and returns the best combination of parameters in a LibSVM library to predict promoter regions in the bacteria used in the simulation. All data gathered as well as the BacSVM+ software is available for download at http://bacpp.bioinfoucs.com/rafael/Sigmas.zip.
The transcription machinery of archaea can be roughly classified as a simplified version of eukaryotic organisms. The basal transcription factor machinery binds to the TATA box found around 28 nucleotides upstream of the transcription start site; however, some transcription units lack a clear TATA box and still have TBP/TFB binding over them. This apparent absence of conserved sequences could be a consequence of sequence divergence associated with the upstream region, operon, and gene organization. Furthermore, earlier studies have found that a structural analysis gains more information compared with a simple sequence inspection. In this work, we evaluated and coded 3630 archaeal promoter sequences of three organisms, Haloferax volcanii, Thermococcus kodakarensis, and Sulfolobus solfataricus into DNA duplex stability, enthalpy, curvature, and bendability parameters. We also split our dataset into conserved TATA and degenerated TATA promoters to identify differences among these two classes of promoters. The structural analysis reveals variations in archaeal promoter architecture, that is, a distinctive signal is observed in the TFB, TBP, and TFE binding sites independently of these being TATA‐conserved or TATA‐degenerated. In addition, the promoter encountering method was validated with upstream regions of 13 other archaea, suggesting that there might be promoter sequences among them. Therefore, we suggest a novel method for locating promoters within the genome of archaea based on DNA energetic/structural features.
Background
Archaea are a vast and unexplored domain. Bioinformatic techniques might enlighten the path to a higher quality genome annotation in varied organisms. Promoter sequences of archaea have the action of a plethora of proteins upon it. The conservation found in a structural level of the binding site of proteins such as TBP, TFB, and TFE aids RNAP-DNA stabilization and makes the archaeal promoter prone to be explored by statistical and machine learning techniques.
Results and discussions
In this study, experimentally verified promoter sequences of the organisms Haloferax volcanii, Sulfolobus solfataricus, and Thermococcus kodakarensis were converted into DNA duplex stability attributes (i.e. numerical variables) and were classified through Artificial Neural Networks and an in-house statistical method of classification, being tested with three forms of controls. The recognition of these promoters enabled its use to validate unannotated promoter sequences in other organisms. As a result, the binding site of basal transcription factors was located through a DNA duplex stability codification. Additionally, the classification presented satisfactory results (above 90%) among varied levels of control.
Concluding remarks
The classification models were employed to perform genomic annotation into the archaea Aciduliprofundum boonei and Thermofilum pendens, from which potential promoters have been identified and uploaded into public repositories.
Promoters are DNA sequences located upstream of the transcription start site of genes. In bacteria, the RNA polymerase enzyme requires additional subunits, called sigma factors (σ) to begin specific gene transcription in distinct environmental conditions. Currently, promoter prediction still poses many challenges due to the characteristics of these sequences. In this paper, the nucleotide content of Escherichia coli promoter sequences, related to five alternative σ factors, was analyzed by a machine learning technique in order to provide profiles according to the σ factor which recognizes them. For this, the clustering technique was applied since it is a viable method for finding hidden patterns on a data set. As a result, 20 groups of sequences were formed, and, aided by the Weblogo tool, it was possible to determine sequence profiles. These found patterns should be considered for implementing computational prediction tools. In addition, evidence was found of an overlap between the functions of the genes regulated by different σ factors, suggesting that DNA structural properties are also essential parameters for further studies.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.