Here we introduce a new method of detecting pattern in microarray data series which is independent of the nature of this pattern. Our approach provides a measure of the algorithmic compressibility of each data series. A series which is significantly compressible is much more likely to result from simple underlying mechanisms than a series which is incompressible. Accordingly, the gene associated with a compressible series is more likely to be biologically significant. We test our method on microarray time series of the yeast cell cycle and show that it blindly selects genes exhibiting the expected cyclic behaviour as well as detecting other forms of pattern. Our results successfully predict two independent non-microarray experimental studies.
The complete versions of Table 2 and Figure 4, as well as other material, can be found at http://www.lps.ens.fr/~willbran/up-down/ or http://www.tcm.phy.cam.ac.uk/~tmf20/up-down/
We study the simplest random landscape, the curve formed by joining consecutive data points $f_1, \ldots, f_{N+1}$ with line segments, where the $f_i$ are i.i.d. random numbers and $f_i \neq f_j$. We label each segment increasing (+) or decreasing (−) and call this string of +'s and −'s the up-down signature $\sigma$. We calculate the probability $P(\sigma(f))$ for a random curve and use it to bound the algorithmic information content of $f$. We show that $f$ can be compressed by $k = \log_2 [1/P(\sigma)] - N$ bits, where $k$ is a universal currency for comparing the amount of pattern in different curves. By applying our results to microarray time series data, we blindly identify regulatory genes.
Introduction - Identifying trends or pattern in a data series is the traditional basis of hypothesis formation in the physical sciences [1]. Typically, the pattern is incontrovertible and can be encapsulated by a concise mathematical relation between the data and the independent variable. However, many systems exhibiting collective behaviour (such as genetic networks, financial markets and social systems) exhibit weak pattern, that is, the pattern does not look significantly different from a random curve. Moreover, because the dynamics of collective systems are in general not understood (at most a statistical description is possible), it is not clear what kind of pattern to look for.
Random landscapes are central to the disciplines of spin glasses, drainage networks, protein folding, neural networks and combinatorial optimisation [2,3]. Properties of these systems are related to simple questions about their landscapes: How many minima are there? What is the size of their basins of attraction? What is the pattern of rises and falls?
In this Letter we show that there are fruitful underlying connections between the dynamical properties of a 1-D landscape and the presence of pattern in a series of data.
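The up-down signature $\sigma$ and the compression $k = \log_2 [1/P(\sigma)] - N$ introduced above can be illustrated with a short sketch. It is not the authors' derivation, which this excerpt does not show. It assumes only that the $f_i$ are i.i.d. with no ties, so that every ordering of the $N+1$ points is equally likely and $P(\sigma)$ equals the number of permutations with signature $\sigma$ divided by $(N+1)!$. The counting step uses a standard dynamic programme over the rank of the last point, and the function names are hypothetical.

```python
from fractions import Fraction
from math import factorial, log2


def updown_signature(f):
    """Return the string of '+' and '-' labels for consecutive points of f."""
    if any(a == b for a, b in zip(f, f[1:])):
        raise ValueError("consecutive equal values have no up-down label")
    return "".join("+" if b > a else "-" for a, b in zip(f, f[1:]))


def count_permutations(sigma):
    """Count orderings of len(sigma) + 1 distinct values with signature sigma."""
    # dp[j] = number of valid orderings of the points placed so far in which
    # exactly j of those points are smaller than the last one.
    dp = [1]
    for step in sigma:
        m = len(dp)
        new = [0] * (m + 1)
        running = 0
        if step == "+":
            # The new last point must exceed the old last point.
            for j in range(1, m + 1):
                running += dp[j - 1]
                new[j] = running
        else:
            # The new last point must lie below the old last point.
            for j in range(m - 1, -1, -1):
                running += dp[j]
                new[j] = running
        dp = new
    return sum(dp)


def signature_probability(sigma):
    """P(sigma) for i.i.d. data without ties (assumption stated in the lead-in)."""
    return Fraction(count_permutations(sigma), factorial(len(sigma) + 1))


def compressibility_bits(f):
    """k = log2(1 / P(sigma)) - N for a series f of N + 1 points."""
    sigma = updown_signature(f)
    n = len(sigma)
    return log2(factorial(n + 1)) - log2(count_permutations(sigma)) - n


if __name__ == "__main__":
    print(updown_signature([1.0, 3.0, 2.0, 5.0]))
    print(signature_probability("+-"))
    print(compressibility_bits([1, 2, 3, 4, 5]))
```

Under these assumptions a monotone series of five points has only one compatible ordering and gives $k = \log_2 120 - 4$, about 2.9 bits, whereas a strictly alternating series of the same length gives a negative $k$, since alternating signatures are the most probable ones for random data.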
Considering a series as a sequence of increases and decreases provides a method of compressing a curve, in the sense that the file needed to store instructions for generating the curve is smaller than the file needed to store the curve outright. We derive a formal relation between the up-down properties of a curve and the algorithmic information content (AIC) of the equivalent data series, that is, the size of the smallest file needed to store it, which is the ultimate test of pattern. As a demonstration of its efficacy, we use our method to blindly identify regulatory genes from a classic yeast cell cycle microarray data set.
Random data and permutations - We study the simplest form of random landscape, a sequence of $N+1$ independent and identically distributed random numbers. We connect pairs of consecutive data points with line segments to form a curve. If we assume that the probability that two points are identical is negligible, we can label these