Traditionally, intra-sequence similarity is exploited for compressing a single DNA sequence. Recently, remarkable compression performance has been achieved for an individual DNA sequence from the same population by encoding its differences from a nearly identical reference sequence. Nevertheless, there is a lack of general algorithms that also allow less similar reference sequences. In this work, we extend intra-sequence similarity to inter-sequence similarity: approximate matches of subsequences are found between the DNA sequence and a set of reference sequences. Hence, a set of nearly identical DNA sequences from the same population, or a set of partially similar DNA sequences such as chromosome sequences and DNA sequences of related species, can be compressed together. For practical compressors, the compressed size is usually influenced by the order in which the sequences are compressed. Fast search algorithms for the optimal compression order are therefore developed for multiple-sequence compression. Experimental results on artificial and real datasets demonstrate that the proposed multiple-sequence compression methods with fast compression-order search achieve good compression performance under different levels of similarity among the DNA sequences.
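The dependence of compressed size on compression order can be illustrated with a toy sketch. It is not the authors' method: it assumes zlib primed with a preset dictionary as a stand-in for a reference-based DNA compressor, and the helper names `cost`, `greedy_order`, and `total_cost` are hypothetical. The greedy rule (each sequence is compressed against the one chosen just before it) is one simple heuristic among many, and zlib only sees the last 32 KiB of the reference.

```python
import zlib


def cost(target: bytes, reference: bytes = b"") -> int:
    """Compressed size of target in bytes, optionally primed with a reference.

    Assumption: zlib with a preset dictionary stands in for a real
    reference-based DNA compressor.
    """
    if reference:
        comp = zlib.compressobj(level=9, zdict=reference)
    else:
        comp = zlib.compressobj(level=9)
    return len(comp.compress(target) + comp.flush())


def greedy_order(seqs):
    """Return a compression order as a list of indices into seqs.

    Start with the sequence that is cheapest on its own, then repeatedly
    append the remaining sequence that is cheapest given the previous one.
    """
    remaining = set(range(len(seqs)))
    first = min(remaining, key=lambda i: (cost(seqs[i]), i))
    order = [first]
    remaining.remove(first)
    while remaining:
        prev = seqs[order[-1]]
        nxt = min(remaining, key=lambda i: (cost(seqs[i], prev), i))
        order.append(nxt)
        remaining.remove(nxt)
    return order


def total_cost(seqs, order):
    """Total size when each sequence is compressed against its predecessor."""
    total = cost(seqs[order[0]])
    for prev, cur in zip(order, order[1:]):
        total += cost(seqs[cur], seqs[prev])
    return total
```

With two near-identical sequences and one unrelated sequence, this heuristic places the similar pair next to each other, so the second of the pair costs far less than it would on its own.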
Current DNA compression algorithms rely on finding repetitions within a DNA sequence so that similar subsequences can be encoded by reference to each other. We explore similarities between different chromosomes of Saccharomyces cerevisiae. These similarities are characterised by the existence of similar subsequences among different chromosomes. The longer the similar subsequences, the higher the cross-similarity. Our study indicates that these cross-sequence similarities are often significant compared with self-sequence similarity. This implies that it would be advantageous to compress two or more chromosome sequences together so that similar subsequences found across multiple chromosome sequences can be encoded together.
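As a rough illustration of what a cross-sequence similarity measurement might look like, the sketch below counts how many distinct length-k substrings (k-mers) of one sequence also occur in another. This is an assumed proxy, not the measure used in the study: it only detects exact shared substrings, and the function names and the choice of k are hypothetical.

```python
def kmers(seq: str, k: int) -> set:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cross_similarity(target: str, reference: str, k: int) -> float:
    """Fraction of the distinct k-mers of target that also occur in reference."""
    target_kmers = kmers(target, k)
    if not target_kmers:
        return 0.0  # target is shorter than k
    return len(target_kmers & kmers(reference, k)) / len(target_kmers)


# Toy strings, not real chromosome data.
print(cross_similarity("ACGTACGT", "TTACGTAA", 4))  # 0.75
```

A larger k rewards longer shared subsequences, which matches the observation above that longer similar subsequences indicate higher cross-similarity.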