Abstract-The recent advent of high-throughput biological technologies has brought new challenges to the computer science community in terms of the amount and variety of biological data awaiting analysis. Computationally intensive techniques such as pattern recognition and machine learning algorithms have been applied to extract knowledge from several biological domains, ranging from genomics and proteomics to systems biology and evolutionary processes. Learning techniques applied to computational biology mostly fall into the category of classification. The sequence analysis problem therefore has to be formulated as a classification task, which is difficult because a one-to-one mapping between the problem and a classification task is not obvious. In this paper, we propose a different formulation of sequence analysis, based on nucleotide patterns and a constraint logic programming paradigm, in which sequence alignment can be performed through pattern matching techniques. With the knowledge available from the field of pattern mining, we can apply well-established techniques within the new framework of constraint programming. To make the system work efficiently, however, we need a new set of constraint solver algorithms designed specifically for the sequence analysis problem. The design and implementation of such algorithms are thus the main focus of our research project. We present the design of a constraint-based system for genomic sequence analysis, including the algorithm for the constraint solver, a major part of the proposed system.
Index Terms-Genomic sequence analysis, constraint-based system, constraint solver algorithm, constraint programming.
I. INTRODUCTION
Living organisms are composed of cells that perform different functions. There are two basic types of cells: prokaryotic cells (found in bacteria) and eukaryotic cells (found in plants and animals). Contained within the cell membrane are several organelles and thousands of different types of molecules; the most important is DNA (deoxyribonucleic acid), which carries the entire genetic inheritance, or genes, of the cell. DNA is a long polymer molecule that contains a sugar, a phosphate group, and a mixture of four different nucleotide bases: adenine (A), cytosine (C), guanine (G), and thymine (T).
Genetic information in biological systems (Fig. 1) flows as follows: first, DNA is copied to more DNA in the replication process; then DNA is transcribed into mRNA (messenger RNA) in the transcription process; and finally mRNA is translated (by the ribosome) into protein in the translation process.
This overall process of biological protein synthesis is known as gene expression. Understanding the process of gene expression in different types of cells and under different conditions is one of the fundamental research aspects of genomics, the field that covers all studies related to genes.
In prokaryotes, genetic information is encoded continuously on a DNA strand. In eukaryotes, however, the regions that code for protein (called exons) are interrupted by non-coding regions (called introns). During t...