DNA-binding proteins (DNABPs) are important for various cellular processes, such as transcriptional regulation, recombination, replication, repair, and DNA modification. So far, various bioinformatics and machine learning techniques have been applied to identify DNA-binding proteins from protein structure, but only a few methods are available for identifying them from protein sequence. In this work, we report a random forest method, DNA-Prot, to identify DNA-binding proteins from protein sequence. Training was performed on a dataset containing 146 DNA-binding proteins and 250 non-DNA-binding proteins. The algorithm was tested on a dataset containing 92 DNA-binding proteins and 100 non-DNA-binding proteins. We obtained 80.31% accuracy in training and 84.37% accuracy in testing. Benchmarking analysis on an independent dataset of 823 DNA-binding proteins and 823 non-DNA-binding proteins shows that our approach can distinguish DNA-binding proteins from non-DNA-binding proteins with more than 80% accuracy. We also compared our method with the DNAbinder method on the test dataset and on two independent datasets. The two methods performed comparably on the test dataset. On the benchmark dataset containing 823 DNA-binding proteins and 823 non-DNA-binding proteins, DNA-Prot performed significantly better, with 81.83% accuracy, whereas DNAbinder achieved only 61.42% accuracy using amino acid composition and 63.5% using the PSSM profile. Similarly, DNA-Prot performed better on the benchmark dataset containing 88 DNA-binding proteins and 233 non-DNA-binding proteins. These results show that DNA-Prot can be used efficiently to identify DNA-binding proteins from sequence information. The dataset and a standalone version of the DNA-Prot software can be obtained from http://www3.ntu.edu.sg/home/EPNSugan/index_files/dnaprot.htm.
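As an illustration of what a sequence-only feature can look like, the sketch below computes amino acid composition, the feature type the abstract attributes to one DNAbinder configuration. It is not the DNA-Prot feature set, which the abstract does not specify; the function name `amino_acid_composition` and the choice to ignore non-standard residues are assumptions made for this example.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (alphabetical order).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in a protein sequence.

    Hypothetical helper for illustration only. Characters outside the 20
    standard residues (for example X or gaps) are ignored, which is an
    assumption of this sketch.
    """
    seq = sequence.upper()
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    # One value per residue, in the fixed order given by AMINO_ACIDS.
    return [counts[aa] / total for aa in AMINO_ACIDS]


if __name__ == "__main__":
    features = amino_acid_composition("MKRAAC")
    for aa, frac in zip(AMINO_ACIDS, features):
        if frac > 0:
            print(aa, round(frac, 3))
```

A 20-dimensional vector of this kind could be passed to any standard classifier, including a random forest, although the abstract does not state which features DNA-Prot actually uses.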
Real-world datasets are commonly imbalanced. Several approaches exist for handling such data, including weighting, sub-sampling, and data modeling, but learning in the presence of class imbalance remains a great challenge for machine learning. Techniques such as support-vector machines perform very well on balanced data but may fail when applied to imbalanced datasets. In this paper, we propose a new undersampling technique for selecting instances from the majority class. The performance of this approach was evaluated on several real, imbalanced biological datasets in which the ratio of negative to positive samples varies from ~9:1 to ~100:1. Useful classifiers have both high sensitivity and high specificity. Our results demonstrate that the proposed selection technique improves sensitivity compared with a weighted support-vector machine and with results available in the literature for the same datasets.
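The abstract does not describe how the proposed technique selects majority-class instances, so the sketch below substitutes plain random undersampling as a baseline stand-in, together with the sensitivity and specificity measures mentioned above. The function names, the label convention (1 for positive, 0 for negative), and the fixed seed are all assumptions of this example.

```python
import random


def random_undersample(samples, labels, majority_label=0, seed=0):
    """Randomly drop majority-class instances until both classes are equal in size.

    Baseline stand-in only: this is NOT the selection technique proposed in
    the abstract, whose details are not given there.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y != majority_label]
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    # Keep as many majority instances as there are minority instances.
    kept = rng.sample(majority, min(len(minority), len(majority)))
    index = sorted(minority + kept)
    return [samples[i] for i in index], [labels[i] for i in index]


def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for binary predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


if __name__ == "__main__":
    # Toy data with a 9:1 negative-to-positive ratio.
    y = [1] * 10 + [0] * 90
    x = list(range(len(y)))
    x_bal, y_bal = random_undersample(x, y)
    print(len(x_bal), y_bal.count(1), y_bal.count(0))
```

Reporting sensitivity and specificity separately matters here because, at a 100:1 ratio, a classifier that predicts every sample as negative still reaches about 99% accuracy while detecting no positives.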
The emergence of resistance to last-resort antibiotics is a public health concern of global scale. Besides direct person-to-person propagation, environmental pathways might contribute to the dissemination of antibiotic-resistant bacteria and antibiotic resistance genes (ARGs). Here, we describe the incidence of blaNDM-1, a gene conferring resistance to carbapenems, in the wastewater of the city of Jeddah, Saudi Arabia, over a 1-year period. blaNDM-1 was detected at concentrations ranging from 10^4 to 10^5 copies/m^3 of untreated wastewater during the entire monitoring period. These results indicate the ubiquity and high incidence of blaNDM-1 in the local wastewater. To track the bacteria carrying blaNDM-1, we isolated Escherichia coli PI7, a strain of sequence type 101 (ST101), from wastewater around the Hajj event in October 2013. Genome sequencing of this strain revealed an extensive repertoire of ARGs as well as virulence and invasive traits. These traits were further confirmed by antibiotic resistance profiling and in vitro cell internalization in HeLa cell cultures. Given that this strain remains viable even after a certain duration in the sewerage, and that Jeddah lacks a robust sanitary infrastructure to fully capture all generated sewage, the presence of this bacterium in the untreated wastewater represents a potential hazard to the local public health. To the best of our knowledge, this is the first report of a blaNDM-1-positive E. coli strain isolated from a nonnosocomial environment in Saudi Arabia, and the finding may make improved surveillance for carbapenem-resistant E. coli a priority in the country and nearby regions.