Abstract. We study two fundamental problems concerning the search for interesting regions in sequences: (i) given a sequence of real numbers of length n and an upper bound U, find a consecutive subsequence of length at most U with the maximum sum and (ii) given a sequence of real numbers of length n and a lower bound L, find a consecutive subsequence of length at least L with the maximum average. We present an O(n)-time algorithm for the first problem and an O(n log L)-time algorithm for the second. The algorithms have potential applications in several areas of biomolecular sequence analysis including locating GC-rich regions in a genomic DNA sequence, post-processing sequence alignments, annotating multiple sequence alignments, and computing length-constrained ungapped local alignment. Our preliminary tests on both simulated and real data demonstrate that the algorithms are very efficient and able to locate useful (such as GC-rich) regions.
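To make the two problem statements concrete, the following is a minimal Python sketch, not the algorithms of the paper. For problem (i) it swaps in a standard prefix-sum plus monotonic-deque technique, which also runs in O(n) time. For problem (ii) it uses a simple O(nL) baseline that relies on the observation that some optimal segment has length at most 2L - 1 (a longer segment splits into two pieces of length at least L, one of which has an average at least as large); it does not reach the O(n log L) bound stated above. The function names `max_sum_at_most` and `max_avg_at_least`, the half-open index convention, and the restriction to non-empty segments are assumptions made for illustration.

```python
from collections import deque
from typing import List, Sequence, Tuple


def _prefix_sums(a: Sequence[float]) -> List[float]:
    """prefix[k] is the sum of the first k elements of a."""
    prefix = [0.0]
    for x in a:
        prefix.append(prefix[-1] + x)
    return prefix


def max_sum_at_most(a: Sequence[float], U: int) -> Tuple[float, int, int]:
    """Problem (i): best non-empty segment a[start:end] with end - start <= U.

    Illustrative sliding-window-minimum approach (an assumption, not the
    paper's method). Returns (sum, start, end) with a half-open range.
    """
    if U < 1 or len(a) == 0:
        raise ValueError("need a non-empty sequence and U >= 1")
    prefix = _prefix_sums(a)
    best = (float("-inf"), 0, 0)
    window = deque()  # indices into prefix; their prefix values are increasing
    for end in range(1, len(prefix)):
        i = end - 1  # newest admissible start index
        while window and prefix[window[-1]] >= prefix[i]:
            window.pop()  # dominated: larger prefix value and older
        window.append(i)
        while window[0] < end - U:
            window.popleft()  # start is too far back for length <= U
        total = prefix[end] - prefix[window[0]]
        if total > best[0]:
            best = (total, window[0], end)
    return best


def max_avg_at_least(a: Sequence[float], L: int) -> Tuple[float, int, int]:
    """Problem (ii): best segment a[start:end] with end - start >= L.

    Simple O(n * L) baseline: only lengths L .. 2L - 1 need to be checked.
    Returns (average, start, end) with a half-open range.
    """
    n = len(a)
    if L < 1 or L > n:
        raise ValueError("need 1 <= L <= len(a)")
    prefix = _prefix_sums(a)
    best = (float("-inf"), 0, 0)
    for start in range(n - L + 1):
        longest = min(2 * L - 1, n - start)
        for length in range(L, longest + 1):
            avg = (prefix[start + length] - prefix[start]) / length
            if avg > best[0]:
                best = (avg, start, start + length)
    return best
```

In a GC-content setting, one hypothetical encoding would map each G or C to 1 and each A or T to 0, so that `max_avg_at_least(encoded, L)` reports the window of at least L bases with the highest GC fraction.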
Periodontitis is an inflammatory disease involving complex interactions between oral microorganisms and the host immune response. Understanding the structure of the microbiota community associated with periodontitis is essential for improving classifications and diagnoses of various types of periodontal diseases and will facilitate clinical decision-making. In this study, we used a 16S rRNA metagenomics approach to investigate and compare the compositions of the microbiota communities from 76 subgingival plaque samples, including 26 from healthy individuals and 50 from patients with periodontitis. Furthermore, we propose a novel feature selection algorithm for selecting the more informative features from many variables; these features were then combined with machine learning methods to construct models for predicting the health status of patients with periodontal disease. We identified a total of 12 phyla, 124 genera, and 355 species and observed differences between health- and periodontitis-associated bacterial communities at all phylogenetic levels. We discovered that the genera Porphyromonas, Treponema, Tannerella, Filifactor, and Aggregatibacter were more abundant in patients with periodontal disease, whereas Streptococcus, Haemophilus, Capnocytophaga, Gemella, Campylobacter, and Granulicatella were found at higher levels in healthy controls. Using our feature selection algorithm, random forests performed better in terms of predictive power than other methods and consumed the least amount of computational time.
Indoor microbial communities have important implications for human health, especially in health-care institutes (HCIs). The factors that determine the diversity and composition of microbiomes in a built environment remain unclear. Herein, we used 16S rRNA amplicon sequencing to investigate the relationships between building attributes and surface bacterial communities among four HCIs located in three buildings. We examined the surface bacterial communities and environmental parameters in buildings supplied with different ventilation types and compared the results using a Dirichlet multinomial mixture (DMM)-based approach. A total of 203 samples from the four HCIs were analyzed. The DMM-based approach grouped the samples into four bacterial communities, which were highly similar to those in the four HCIs. The α-diversity and β-diversity in the naturally ventilated building differed from those in the conditioner-ventilated building. The bacterial source composition varied among the buildings. Nine genera were found to form the core microbiota shared by all the areas, of which Acinetobacter, Enterobacter, Pseudomonas, and Staphylococcus are regarded as healthcare-associated pathogens (HAPs). The observed relationship between environmental parameters such as core microbiota and surface bacterial diversity suggests that we might manage indoor environments by creating new sanitation protocols, adjusting the ventilation design, and further understanding the transmission routes of HAPs.
In this paper, we consider two distinct problems related to complexity aspects of the visibility graphs of simple polygons. Recognizing visibility graphs is a long-standing open problem. It is not even known whether visibility graph recognition is in NP. That visibility graph recognition is in NP would be established if we could demonstrate that any n-vertex visibility graph is realized by a polygon which can be drawn on an exponentially-sized grid. This motivates a study of the area requirements for realizing visibility graphs. In this paper, we prove:
• Θ(n^3) area is necessary and sufficient to realize the complete visibility graph K_n.
• There exist visibility graphs which require exponential area to realize.
• Any maximal outerplanar graph of diameter d can be realized in O(d^2 · 2^d) area, which can be as small as O(n log^2 n) for a balanced mop. Linear maximal outerplanar graphs can be realized in O(n^8) area.
The second part of this paper considers the complexity of specific optimization problems on visibility graphs. Given a polygon P, we show that finding a maximum independent set, minimum vertex cover, or maximum dominating set in the visibility graph of P are all NP-complete. Further, we show that for polygons P1 and P2, the problem of testing if they have isomorphic visibility graphs is isomorphism-complete. These problems remain hard when given the visibility graphs as input.