Abstract. The use of sequence alignments to establish protein homology relationships has a long tradition in bioinformatics, and there is a growing demand for statistical methods in the analysis of such data. We present statistical methods and algorithms that are useful when the protein alignments can be divided into two populations based on known features or traits. These algorithms are useful for discovering differences between populations at the molecular level. The approach is illustrated with examples from real biological data sets, and we present experimental results from applying our methods to bacterial populations of Vibrio, where the populations are defined by optimal growth temperature, T_opt.
Keywords: sequence analysis; structural analysis; physicochemical properties; extremophiles; Fisher's exact test; Wilcoxon test.
Biological Motivation
Extreme environments are those that fall outside the limited range in which we, and most other eukaryotes, can survive; they are inhabited by extremophiles. Among extremophiles, which include thermophiles, psychrophiles, acidophiles, alkalophiles, halophiles, barophiles and xerophiles, those that live at and prefer low temperatures form the largest and least studied group. Psychrophilic organisms live at temperatures close to the freezing point of water. It is of great interest to understand how these organisms can function at "the limits of life" [1].
Living at extreme temperatures requires a multiplicity of crucial adaptations, including preservation of membrane stability and maintenance of enzymatic activities at appropriate levels. At these temperatures a number of physiological factors change: for example, the solubility of gases differs, and the viscosity of water changes severalfold as the temperature approaches the extremes.
The number of characterized cold- or heat-adapted proteins, reported sequences and high-resolution structures is growing. The Vibrios are among the groups with the largest number of published genomes, with five completed genomes this year and seven ongoing whole-genome sequencing projects, including that of the cold-adapted Vibrio salmonicida.
Alignment-free analysis has been used previously to compare amino acid compositions in whole-genome and proteome datasets [2] [3]. In this study, we focus on a set of homologous protein data from a relatively narrow range of closely related