We provide a unified overview of methods currently in wide use for assessing the accuracy of prediction algorithms, ranging from raw percentages, quadratic error measures and other distances, and correlation coefficients to information-theoretic measures such as relative entropy and mutual information. We briefly discuss the advantages and disadvantages of each approach. For classification tasks, we derive new learning algorithms for the design of prediction systems by directly optimising the correlation coefficient. We observe and prove several results relating the sensitivity and specificity of optimal systems. While the principles are general, we illustrate their applicability on specific problems such as protein secondary structure and signal peptide prediction.
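As an illustration of the measures named above for a two-class prediction task, the following minimal sketch computes the raw percentage correct, sensitivity, specificity, and the correlation coefficient (in its common Matthews form) from confusion counts. The function names are hypothetical, and specificity is taken here as TN / (TN + FP), which is an assumption: some bioinformatics papers define it as TP / (TP + FP) instead, and the abstract does not say which convention it uses.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def percent_correct(tp, tn, fp, fn):
    """Raw fraction of correct predictions."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def sensitivity(tp, fn):
    """Fraction of actual positives that were predicted positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    """Fraction of actual negatives predicted negative (assumed convention)."""
    return tn / (tn + fp) if (tn + fp) else 0.0


def correlation_coefficient(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy labels, invented for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(percent_correct(tp, tn, fp, fn))
print(sensitivity(tp, fn), specificity(tn, fp))
print(correlation_coefficient(tp, tn, fp, fn))
```

On this toy data the counts are TP = 2, TN = 4, FP = 1, FN = 1, so the percentage correct is 0.75 while the correlation coefficient is 7/15, which shows how the two measures can rank the same predictor quite differently when classes are unbalanced.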
Hidden Markov model (HMM) techniques are used to model families of biological sequences. A smooth and convergent algorithm is introduced to iteratively adapt the transition and emission parameters of the models from the examples in a given family. The HMM approach is applied to three protein families: globins, immunoglobulins, and kinases. In all cases, the models derived capture the important statistical characteristics of the family and can be used for a number of tasks, including multiple alignments, motif detection, and classification. For $K$ sequences of average length $N$, this approach yields an effective multiple-alignment algorithm which requires $O(KN^2)$ operations, linear in the number of sequences.
Comparative analysis of primary sequence information is a major tool in the elucidation of the molecular mechanisms of replication and evolution of organisms and the structure and function of proteins. For the simple case of pairwise sequence comparison, good algorithms exist (see refs. 1 and 2 for recent reviews) that can align two sequences of length $N$ in roughly $O(N^2)$ steps. Most of these algorithms are based on dynamic programming (3), with location-independent substitution and gap penalties. Unfortunately, when dynamic programming is applied to a family of $K$ sequences its behavior scales like $O(N^K)$, exponentially in the number of sequences (4).
A number of algorithms have been devised to try to tackle the multiple alignment problem (see refs. 5-7 for some of the most recent ones). Most protein sequence relationships exhibiting >50% identical residues can be aligned by several of these algorithms.
Many of the most interesting protein families, however, exhibit conservation far below 50% identity. To date, alignment methods have not been developed that can correctly identify all the motifs that define each protein family (2).
Here, we apply a different approach, based on hidden Markov models (HMMs), to the problem of modeling and aligning a family by using primary structure information only. Initial results were presented (8). Markov models and the related expectation-maximization (EM) (9) algorithm in statistics have already been applied to biocomputational problems (10-13). Krogh et al. (14) were the first to demonstrate the power of a similar method on the globin family. Rather than starting from pairwise alignments, the approach seeks to take advantage of the massive amount of information typically present in a family with a flexible use of position-dependent parameters. A new algorithm is introduced for the iterative adjustments of the parameters of the models. The algorithm is used here to model three protein families: globins, immunoglobulins, and kinases.
HMMs and Learning
A first-order discrete HMM (15) is completely defined by a set of states $S$, an alphabet of $m$ symbols, a probability transition matrix $T = (t_{ij})$, and a probability emission matrix $E = (e_{ia})$. When the system is in state $i$, it has a probability $t_{ij}$ of moving to state $j$ and a probability $e_{ia}$ of emitting symbol $a$. Only the output s...
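To make the definition of a first-order discrete HMM concrete, here is a minimal sketch of how the transition matrix T and emission matrix E assign a likelihood to an observed sequence, using the standard forward recursion. This is not the new parameter-adaptation algorithm the excerpt mentions. The two-state model, the two-letter alphabet, all the numbers, and the initial state distribution `start` are assumptions made up for illustration; the excerpt does not specify how the first state is chosen.

```python
def forward_likelihood(seq, start, T, E):
    """Probability that the HMM emits `seq`, summed over all state paths.

    start[i] : probability of beginning in state i (assumed, not in the text)
    T[i][j]  : probability t_ij of moving from state i to state j
    E[i][a]  : probability e_ia of emitting symbol a while in state i
    """
    if not seq:
        return 1.0
    n = len(start)
    # alpha[i] = P(symbols seen so far, current state = i)
    alpha = [start[i] * E[i][seq[0]] for i in range(n)]
    for sym in seq[1:]:
        alpha = [
            sum(alpha[i] * T[i][j] for i in range(n)) * E[j][sym]
            for j in range(n)
        ]
    return sum(alpha)


# Hypothetical two-state model over the alphabet {A, B}.
start = [0.5, 0.5]
T = [[0.9, 0.1],
     [0.2, 0.8]]
E = [{"A": 0.7, "B": 0.3},
     {"A": 0.1, "B": 0.9}]

print(forward_likelihood("ABBA", start, T, E))
```

Each step of the recursion costs a number of operations proportional to the square of the number of states, so scoring one sequence is linear in its length. As a sanity check, the likelihoods of all sequences of a fixed length sum to 1 whenever the rows of T and E are proper probability distributions.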
After collecting a database of fingerprint images, we design a neural network algorithm for fingerprint recognition. When presented with a pair of fingerprint images, the algorithm outputs an estimate of the probability that the two images originate from the same finger. In one experiment, the neural network is trained using a few hundred pairs of images, and its performance is subsequently tested using several thousand pairs of images originating from a subset of the database corresponding to 20 individuals. The error rate currently achieved is less than 0.5%. Additional results, extensions, and possible applications are also briefly discussed.