We apply the VOBN model to a set of 238 experimentally verified sigma-70 binding sites in Escherichia coli. We find that the VOBN model can distinguish these 238 sites from a set of 472 intergenic 'non-promoter' sequences with a higher accuracy than fixed-order Markov models or Bayesian trees. We use a replicated stratified-holdout experiment having a fixed true-negative rate of 99.9%. We find that for a foreground inhomogeneous VOBN model of order 1 and a background homogeneous variable-order Markov (VOM) model of order 5, the obtained mean true-positive (TP) rate is 47.56%. In comparison, the best TP rate for the conventional models is 44.39%, obtained from a foreground PWM model and a background 2nd-order Markov model. As the standard deviation of the estimated TP rate is approximately 0.01%, this improvement is highly significant.
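The evaluation protocol above (fix the true-negative rate, then report the true-positive rate) can be illustrated with a minimal sketch. It assumes each sequence has already been reduced to a single score, for example a log-likelihood ratio between the foreground and background models; that scoring step, the function names, and the toy numbers are illustrative assumptions and are not taken from the paper.

```python
import math

def threshold_for_tn_rate(negative_scores, tn_rate):
    """Smallest observed score t such that at least `tn_rate` of the
    negatives have score <= t (and are therefore rejected)."""
    ordered = sorted(negative_scores)
    # Number of negatives that must fall at or below the threshold.
    needed = max(1, math.ceil(tn_rate * len(ordered)))
    return ordered[needed - 1]

def tp_rate(positive_scores, threshold):
    """Fraction of positives scoring strictly above the threshold."""
    hits = sum(1 for s in positive_scores if s > threshold)
    return hits / len(positive_scores)

# Toy scores (hypothetical): higher means "more promoter-like".
negatives = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
positives = [0.5, 0.65, 0.9, 1.2]

t = threshold_for_tn_rate(negatives, tn_rate=0.75)
rate = tp_rate(positives, t)
```

In a replicated holdout setting, this computation would be repeated on each held-out split and the TP rates averaged; the abstract reports the resulting mean and its standard deviation.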
As device geometry continues to shrink, micro-contaminants have an increasingly negative impact on yield. By diminishing the contamination problem, semiconductor manufacturers can significantly improve wafer yield. This paper presents a comprehensive and successful application of data mining methodologies to the refinement of a new dry cleaning technology that utilizes a laser beam for the removal of micro-contaminants. Experiments with three classification-based data mining methods (decision tree induction, neural networks, and composite classifiers) have been conducted. The composite classifier architecture has been shown to yield higher accuracy than any individual classifier on its own. The paper suggests that data mining methodologies may be particularly useful when data is scarce and the various physical and chemical parameters that affect the process exhibit highly complex interactions. Another implication is that on-line monitoring of the cleaning process using data mining may be highly effective.
Universal compression algorithms can detect recurring patterns in any type of temporal data, including financial data, for the purpose of compression. In doing so, the universal algorithms find a model of the data that can be used for either compression or prediction. We present a universal Variable Order Markov (VOM) model and use it to test the weak form of the Efficient Market Hypothesis (EMH). The EMH is tested for 12 pairs of international intra-day currency exchange rates, using one-year series at intervals of 1, 5, 10, 15, 20, 25 and 30 minutes. Statistically significant compression is detected in all the time series, and the high-frequency series are also predictable above random. However, the predictability of the model is not sufficient to generate a profitable trading strategy; thus, the Forex market turns out to be efficient, at least most of the time.
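The idea of predicting the next symbol from contexts of varying length can be sketched as follows. This is a simple back-off context predictor used as a stand-in, not the VOM construction from the paper; the encoding of price moves as "U" (up) and "D" (down), the `min_count` cutoff, and all function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_counts(symbols, max_order):
    """Count which symbol follows each context of length 0..max_order."""
    counts = defaultdict(Counter)
    for i, sym in enumerate(symbols):
        for k in range(max_order + 1):
            if i - k < 0:
                break
            context = tuple(symbols[i - k:i])
            counts[context][sym] += 1
    return counts

def predict(counts, history, max_order, min_count=2):
    """Use the longest context seen at least `min_count` times,
    backing off to shorter contexts otherwise."""
    for k in range(min(max_order, len(history)), -1, -1):
        context = tuple(history[len(history) - k:])
        seen = counts.get(context)
        if seen and sum(seen.values()) >= min_count:
            return seen.most_common(1)[0][0]
    return None  # nothing reliable to predict from

def hit_rate(counts, test_symbols, max_order):
    """Fraction of test symbols predicted correctly from their history."""
    hits = total = 0
    for i in range(max_order, len(test_symbols)):
        guess = predict(counts, test_symbols[:i], max_order)
        if guess is None:
            continue
        total += 1
        hits += guess == test_symbols[i]
    return hits / total if total else float("nan")

# Toy example: a perfectly alternating series is fully predictable.
model = train_counts(list("UD" * 10), max_order=2)
accuracy = hit_rate(model, list("UD" * 5), max_order=2)
```

For a binary up/down series, a hit rate that is significantly above 0.5 on held-out data is what "predictable above random" would mean in this simplified setting. Whether such an edge survives transaction costs is a separate question, which is the distinction the abstract draws.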
This paper presents a novel approach to monitoring the control performance of nonlinear processes that can be described as state-dependent models (SDMs). A discrete Kalman filter (KF) is established to estimate the SDM parameters. A covariance control formulation is introduced to split the system closed-loop variance/covariance into two terms: one term to account for the minimum expected quadratic loss bound (equivalent to the minimum variance performance bound, but in state-space formulation), and another to account for performance deviations from the minimum variance bound. Simulation studies have been conducted on several nonlinear process systems, including a cold rolling mill model with roll eccentricity and a steelmaking system with real-time oxyfuel slab reheating furnace control data. The case study results demonstrate the computational efficiency of the proposed strategy in real-time monitoring and control of systems with fast, nonlinear and time-varying dynamics.
In business applications such as direct marketing, decision-makers are required to choose the action that maximizes a utility function. Cost-sensitive learning methods can help them achieve this goal. In this paper, we introduce Pessimistic Active Learning (PAL). PAL employs a novel pessimistic measure, which relies on confidence intervals and is used to balance the exploration/exploitation trade-off. In order to acquire an initial sample of labeled data, PAL applies orthogonal arrays of fractional factorial design. PAL was tested on ten datasets using a decision tree inducer. A comparison of these results to those of other methods indicates PAL's superiority.
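The abstract does not specify PAL's pessimistic measure, so the following is only a generic illustration of how a confidence interval can yield a pessimistic estimate. It uses the lower end of a Wilson score interval for a response rate; the choice of the Wilson interval, the `z` value, and the function names are assumptions and should not be read as PAL's definition.

```python
import math

def wilson_lower_bound(successes, trials, z=1.96):
    """Lower end of the Wilson score interval for a binomial proportion.
    Returns 0.0 when there are no observations."""
    if trials == 0:
        return 0.0
    p = successes / trials
    z2 = z * z
    center = p + z2 / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials))
    return (center - margin) / (1 + z2 / trials)

def pessimistic_profit(successes, trials, revenue, cost, z=1.96):
    """Expected profit per contact using the pessimistic response rate."""
    return wilson_lower_bound(successes, trials, z) * revenue - cost

# Same observed rate (50%), different amounts of evidence.
small_sample = wilson_lower_bound(5, 10)
large_sample = wilson_lower_bound(50, 100)
```

With the same observed response rate, the bound from the smaller sample is lower, so a segment backed by little evidence is valued more cautiously. A measure of this kind gives one handle on the exploration/exploitation trade-off mentioned above, since labeling more examples in a segment tightens its interval.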