Identifying splicing sites in eukaryotic RNA: support vector machine approach

Sun, Yingfei; Fan, Xiaodan; Li, Yanda

doi:10.1016/s0010-4825(02)00057-4

Cited by 55 publications

(32 citation statements)

References 16 publications


“…Our model is based on Support Vector Machines (SVMs), which are supervised learning algorithms that, given a set of features and a binary classification (e.g., positive and negative cases), find the combination of features that provides an optimal separation between the instances of the two classes (see, e.g., Ben-Hur et al 2008). SVMs are widely used in computational biology and have been shown to achieve high accuracy in a variety of problems, including the prediction of splice sites (Sun et al 2003; Yamamura and Gotoh 2003; Zhang et al 2003; Sonnenburg et al 2007) and alternative exons (Dror et al 2005).…”

Section: Introduction (mentioning)

confidence: 99%

RNA secondary structure mediates alternative 3′ss selection in Saccharomyces cerevisiae

Plass, Codony-Servat, Ferreira et al. 2012, RNA


Alternative splicing is the mechanism by which different combinations of exons in the pre-mRNA give rise to distinct mature mRNAs. This process is mediated by splicing factors that bind the pre-mRNA and affect the recognition of its splicing signals. Saccharomyces species lack many of the regulatory factors present in metazoans. Accordingly, it is generally assumed that the amount of alternative splicing is limited. However, there is recent compelling evidence that yeast have functional alternative splicing, mainly in response to environmental conditions. We have previously shown that sequence and structure properties of the pre-mRNA could explain the selection of 3′ splice sites (ss) in Saccharomyces cerevisiae. In this work, we extend our previous observations to build a computational classifier that explains most of the annotated 3′ss in the CDS and 5′ UTR of this organism. Moreover, we show that the same rules can explain the selection of alternative 3′ss. Experimental validation of a number of predicted alternative 3′ss shows that their usage is low compared to annotated 3′ss. The majority of these alternative 3′ss introduce premature termination codons (PTCs), suggesting a role in expression regulation. Furthermore, a genome-wide analysis of the effect of temperature, followed by experimental validation, yields only a small number of changes, indicating that this type of regulation is not widespread. Our results are consistent with the presence of alternative 3′ss selection in yeast mediated by the pre-mRNA structure, which can be responsive to external cues, like temperature, and is possibly related to the control of gene expression.


“…In this paper, we totally consider 8 encoding methods: MCM [4], MCM with DTF, MCM with UTF, WAM [4], WAM with DTF [2], WAM with UTF and 4-bit [7]-[8], 16-bit [6] binary vector encoding. MCM and WAM encoding method only consider the information contained in true donor (resp.…”

Section: Performance Comparison (mentioning)

confidence: 99%

“…Shortly after its introduction, its performance has already either matched or outperformed that of traditional machine learning approaches (e.g., NN) for a wide range of applications including splice sites prediction [2]-[7]. Currently, the SVM approach mainly deals with numerical data (with the exception of special kernel functions), so the DNA sequences must be encoded beforehand in some way.…”

Section: Introduction (mentioning)

confidence: 99%
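
The two statements above note that an SVM needs numerical input and name a 4-bit binary vector encoding among the options. A minimal sketch of such an encoding follows, assuming the common one-hot convention of four bits per nucleotide. The bit order, the function name `encode_4bit`, and the example window are illustrative assumptions and are not taken from the cited papers.

```python
# Assumed one-hot convention: one of four bits is set per nucleotide.
# The A/C/G/T bit order is an arbitrary choice made for this illustration.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def encode_4bit(seq):
    """Flatten a DNA window into a list of 0/1 features, four per base."""
    features = []
    for base in seq.upper():
        if base not in ONE_HOT:
            raise ValueError(f"unexpected base: {base!r}")
        features.extend(ONE_HOT[base])
    return features


# Hypothetical 8-base window around a candidate donor site.
window = "AGGTAAGT"
vector = encode_4bit(window)
print(len(vector))  # four features per base
```

A window of n bases yields a vector of 4n features, which can then be passed to any classifier that expects fixed-length numerical input.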

Learn from the Information Contained in the False Splice Sites as well as in the True Splice Sites using SVM

Xu, Ma, Tao 2007, Proceedings on Intelligent Systems and Knowledge Engineering (ISKE2007)


In splice site prediction, the information contained in false splice sites is often ignored, even though it has been recognized as very valuable. In this paper, three novel encoding approaches, MCM with DTF, MCM with UTF and WAM with UTF, are described, all of which consider the information in both true and false splice sites. From the comparison with 5 other encoding methods, we can conclude: (1) SVM can benefit from the information contained in false splice sites as well as in true splice sites. (2) The performance of MCM with DTF and WAM with DTF is comparable, and both give better performance in nearly all cases. (3) The performance of the binary vector encoding method is surprisingly good, and its potential needs to be further investigated.


“…A Markov model is a model of discrete stochastic process that evolves through the states from the set S = {s₁, s₂, …, sₙ}. The main assumption is that the probability of appearance of any future state depends only on the k preceding states, for some constant k. Given a learning set of sequences, a Markov model can be built by computing the probability that a certain nucleotide xᵢ appears after a sequence sᵢ, for example, GeneMark family detects genes by identifying open reading frames (the regions between start and stop codons) using precomputed species-specific gene models as training data to determine parameters of the protein-coding and non-coding regions. The [19]. More than 90% of nucleotides can be correctly identified as either coding, or non-coding.…”

mentioning

confidence: 99%
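
The statement above describes building a k-th order Markov model by counting how often each nucleotide follows each k-base context. A minimal sketch of that counting step follows, assuming plain relative frequencies with no smoothing. The function name `train_markov` and the toy training sequences are hypothetical and are not drawn from the cited paper or from GeneMark.

```python
from collections import Counter, defaultdict


def train_markov(sequences, k):
    """Estimate P(next base | previous k bases) by counting over the sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(k, len(seq)):
            context = seq[i - k:i]        # the k preceding bases
            counts[context][seq[i]] += 1  # the base that follows them
    model = {}
    for context, following in counts.items():
        total = sum(following.values())
        # Unsmoothed relative frequencies; unseen transitions are simply absent.
        model[context] = {base: n / total for base, n in following.items()}
    return model


# Toy learning set, made up for illustration only.
training = ["ACGTACGT", "ACGAACGT"]
model = train_markov(training, k=2)
print(model["CG"])
```

In practice some form of smoothing (for example, pseudocounts) is usually added so that contexts or transitions missing from the learning set do not receive zero probability.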

“…Machine learning and data mining methods have been successfully applied to various kinds of prediction problems such as exon prediction [16], start codon prediction [17], and splice site prediction [18], [19]. More than 90% of nucleotides can be correctly identified as either coding, or non-coding.…”

mentioning

confidence: 99%

Constraint-Based System for Genomic Analysis

Kerdprasop, Kerdprasop 2015, IJIET


Abstract: The recent advent of new high-throughput biological technologies has brought more challenges to the computer science community in terms of the amount and variety of biological data awaiting analysis. Computationally intensive techniques such as pattern recognition and machine learning algorithms have been applied to extract knowledge from several biological domains ranging from genomics and proteomics to systems biology and evolution. Learning techniques applied to computational biology are mostly in the category of classification. Therefore, the sequence analysis problem has to be formulated as a classification task, which is quite difficult because the problem has no obvious one-to-one mapping. In this paper, we propose a different setting of sequence analysis formulation based on nucleotide patterns using a constraint logic programming paradigm, in which the sequence alignment can be performed through pattern matching techniques. With available knowledge from the field of pattern mining, we can apply the well-established techniques within the new framework of constraint programming. However, to make the system work efficiently, we need a new set of constraint solver algorithms specifically designed for the sequence analysis problem. The design and implementation of such algorithms are thus the main focus of our research project. We propose in this paper the design of a constraint-based system for genomic sequence analysis including the algorithm for the constraint solver, a major part of the proposed system.

Index Terms: Genomic sequence analysis, constraint-based system, constraint solver algorithm, constraint programming.

I. INTRODUCTION

Living organisms contain multiple cells to perform different functions. There are two basic types of cells: prokaryote cells (found in bacteria) and eukaryote cells (found in plants and animals). Contained within the cell membrane are several organelles and thousands of different types of molecules; the most important one is DNA (deoxyribonucleic acid), which carries the entire genetic inheritance, or genes, of the cell. DNA is a long polymer molecule that contains sugar, a phosphate group, and a mixture of four different nucleotide bases: adenine (A), cytosine (C), guanine (G), and thymine (T).

In the flow of genetic information in biological systems (Fig. 1), DNA is first copied to more DNA in the replication process, then DNA is transcribed into mRNA (or messenger-RNA) in a transcription process, and finally mRNA is translated (by the ribosome) into protein in a translation process. This overall process of biological protein synthesis is known as gene expression. Understanding the process of gene expression in different types of cells and under different conditions is one of the fundamental research aspects of genomics, which covers all the studies related to genes.

In prokaryotes, genetic information is encoded continuously on a DNA strand. But in eukaryotes, regions that code for protein (called exons) are interrupted by non-coding regions (called introns). During t...
