In recent years, oligo microarrays, more commonly known as DNA chips, have had a major impact on disease diagnosis, drug discovery, and gene identification. Microarrays contain N-mer DNA fragments, or oligos, in a series of "wells" placed across the chip; each well contains thousands of copies of the same fragment and acts as a probe that detects the amount of a specific fragment. A recent use of microarrays is the identification of genomes, such as those of pathogens. In current techniques, probes that detect gene regions unique to particular species are selected for placement on the microarray, under the assumption that if one gene unique to a pathogen species can be detected, then the pathogen can be classified. This approach is useful, but it relies on finding gene sequences that are divergent enough to serve as a genomic identifier and that are robust to cross-hybridization. In our work, we present a method to choose the probes that best distinguish two organisms. We accomplish this by choosing the oligo probes that maximize the level of divergence between the genomes, calculated by three different information-theoretic measures. We show results for 12-mer and 25-mer oligo pathogen probe sets and show that our method chooses probes that are less likely to cross-hybridize.

Index Terms: DNA, Information Theory, Microorganisms
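As a rough illustration of divergence-based probe ranking, the sketch below scores each N-mer of a target sequence by its contribution to the Kullback-Leibler divergence from a background sequence. The three measures used in this work are not named in this passage, so the choice of KL divergence, the function names `kmer_frequencies` and `rank_probes`, and the `pseudo` floor for N-mers absent from the background are all assumptions made for illustration only.

```python
from collections import Counter
from math import log2


def kmer_frequencies(seq, k):
    """Relative frequency of every overlapping k-mer in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def rank_probes(target, background, k, pseudo=1e-6):
    """Rank the target's k-mers by their pointwise KL contribution.

    Assumption: score = p * log2(p / q), where p and q are the k-mer
    frequencies in the target and background. k-mers missing from the
    background get the small floor `pseudo` so the ratio stays finite.
    """
    p = kmer_frequencies(target, k)
    q = kmer_frequencies(background, k)
    scores = {
        kmer: p_i * log2(p_i / q.get(kmer, pseudo))
        for kmer, p_i in p.items()
    }
    # Highest score first: most over-represented in the target.
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Toy sequences, not real genomes.
    ranked = rank_probes("ACGTACGTGGGG", "ACGTACGTACGT", k=4)
    print(ranked[:4])
```

In this toy case the top-ranked 4-mers are those that occur only in the target sequence, which is the behavior one would want from a candidate probe. A real design would also have to account for hybridization chemistry, which this sketch ignores.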
MICROARRAYS AND PATHOGEN DETECTION

The number of published sequenced genomes has been increasing exponentially since 1995, with over 600 completed by 2007 [1]. This provides an expanding database for genome analysis, including phylogeny (evolutionary tree) studies and comparative gene analysis. A new application of such a priori genome databases is to characterize genomic fingerprints that identify a particular genome, especially that of a pathogen. In the field of metagenomics, the study of genetic material recovered from environmental samples, scientists now have the potential to classify a species by its genomic features, comparing them to signatures found in a database, rather than by taxonomy and classical phenotypic features.

The two major methods for identifying pathogens are PCR (polymerase chain reaction) assays and microarray chips. In PCR methods, slight variations unique to certain genomes, such as SNPs (single nucleotide polymorphisms), are examined to characterize particular pathogens. Unfortunately, for genomes with high mutation rates, such as those of HIV viruses, PCR needs more than a few identifiers [2]. Many pathogens contain mobile genetic elements that can interfere with proper identification by single-locus detection systems, and therefore many loci are needed.

With the advent of microarrays, several methods have been developed [3,4,5]. In [4], each selected probe is a substring of a gene, which acts as that gene's fingerprint. Their Findprobe program takes only genes as input and designs probes that satisfy homogeneity, sensitivity, and specificity constraints. In [5], a highly redundant probe pair is placed on the chip for each diagnostic region specified. Almost all previous micr...